├── .gitignore ├── CONTRIBUTING.md ├── ISSUE_TEMPLATE.md ├── R course survey responses ├── R course for journalists (Responses) - Survey Results.csv ├── R course survey responses.md ├── image_0.png ├── image_1.png ├── image_2.png ├── image_3.png ├── image_4.png ├── image_5.png └── image_6.png ├── README.md ├── blog_paragraph.md ├── data_sets_for_recipes ├── 2016-sqf-file-spec.xlsx ├── fd_endireh2016.xlsx ├── sfpd_stopfrisk_datarecipeskeleton.md └── sqf-2016.csv ├── docs └── index.md ├── licence ├── material ├── examples │ └── switzerland-dual-use-goods.Rmd ├── ideas │ ├── Intrductory_ideas.md │ ├── course_content.md │ ├── introductory_ideas_2.md │ └── rmarkdown_template.Rmd └── lessons │ ├── Camila-s docs │ ├── morosidad16.xlsx │ └── morosidad17.xlsx │ ├── Rplot.png │ ├── Tutorial R.Rmd │ ├── demographics1.xlsx │ ├── demographics2.dta │ ├── p1.png │ ├── p4.png │ ├── pipeline_1_ask.Rmd │ ├── polls.Rmd │ ├── polls.nb.html │ ├── results.csv │ ├── switzerland-dual-use │ ├── data_clean │ │ ├── elic_2016_1.RData │ │ ├── elic_2016_2.RData │ │ ├── elic_2016_3.RData │ │ └── elic_2016_4.RData │ ├── recipe_switzerland-dual-use.Rmd │ └── recipe_switzerland-dual-use.html │ └── why_use_R.md ├── proposal.Rmd ├── proposal.md ├── proposal.pdf ├── proposal_files └── figure-html │ └── unnamed-chunk-2-1.png ├── protocol_call_1.md ├── protocol_calls_2017_fall.md ├── protocol_calls_2017_spring.md ├── protocol_calls_2018_spring.md ├── r-consortium-proposal.Rproj ├── r-package ├── .Rbuildignore ├── DESCRIPTION ├── NAMESPACE ├── R │ └── raw_data_introduction.R ├── data │ ├── elic_2016_1.RData │ ├── elic_2016_2.RData │ ├── elic_2016_3.RData │ ├── elic_2016_4.RData │ ├── elic_2016_q1_raw.RData │ ├── elic_2016_q2_raw.RData │ ├── elic_2016_q3_raw.RData │ ├── elic_2016_q4_raw.RData │ ├── introduction_clean.RData │ └── introduction_raw.RData ├── inst │ └── tutorials │ │ ├── en-introduction │ │ ├── en-introduction.Rmd │ │ ├── en-introduction.html │ │ └── images │ │ │ ├── Wait_what.jpg │ │ │ └── 
install_pkgs.png │ │ ├── en-recipe-france-presidentialelec-polls │ │ ├── en-recipe-france-presidentialelec-polls.Rmd │ │ └── img │ │ │ ├── category.png │ │ │ ├── copy.gif │ │ │ ├── css.gif │ │ │ ├── francepolls.png │ │ │ ├── huffpo.png │ │ │ └── lemonde.png │ │ ├── en-recipe-switzerland-dual-use │ │ ├── en-recipe-switzerland-dual-use.Rmd │ │ └── en-recipe-switzerland-dual-use.html │ │ ├── en-recipe-template │ │ ├── en-recipe-template.Rmd │ │ ├── en-recipe-template.html │ │ └── images │ │ │ └── Data-pipeline-v2-EN.png │ │ ├── en-skills-template │ │ ├── en-skills-template.Rmd │ │ └── en-skills-template.html │ │ ├── sp-Skills-analisis │ │ ├── Analisis con R.Rmd │ │ └── encuesta.xlsx │ │ ├── sp-recipe-Voto evangelico │ │ ├── Voto evangelico.Rmd │ │ └── diputados.xlsx │ │ ├── sp-recipe-mecixo-gender-violence │ │ ├── sp-recipe-mexico-gender-violence.Rmd │ │ └── sp-recipe-mexico-gender-violence.html │ │ ├── sp-skills-Intro R │ │ └── Intro R.Rmd │ │ └── sp-skills-Limpieza │ │ ├── Limpieza.Rmd │ │ ├── morosidad16.xlsx │ │ └── morosidad17.xlsx └── man │ └── results.Rd └── submission_form.txt /.gitignore: -------------------------------------------------------------------------------- 1 | .Rproj.user 2 | .Rhistory 3 | .RData 4 | .Ruserdata 5 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | We are a group of people excited about data literacy and/or R. We want to teach journalists how to use R for their work. 2 | 3 | ### Want to help? 4 | Awesome! We need lots of help! 5 | We are very thankful for any contribution. You don't need to be an R specialist or know R at all. 
6 | Come talk to us to find out how you can help :cake: :smiley: :clap: 7 | 8 | Send an email to heidi@schoolofdata.ch (Heidi is very friendly and will help you figure out how you 9 | can best contribute to the project) or join our [gitter chat](https://gitter.im/school-of-data/r-consortium-proposal). 10 | 11 | We currently need help with: 12 | 13 | - Creating tutorials (data recipes and skills lessons) 14 | - Translation (no R knowledge required) 15 | - Beta testing (see below): go through the existing material and point out 16 | things you don't understand or points where you get stuck. 17 | - Planning of the website 18 | 19 | ### Creating tutorials 20 | If you want to add a data recipe or skills lesson, please use the templates to get started: 21 | 22 | - [data recipes](https://github.com/school-of-data/r-consortium-proposal/tree/master/r-package/inst/tutorials/en-recipe-template) 23 | - [skills lessons](https://github.com/school-of-data/r-consortium-proposal/tree/master/r-package/inst/tutorials/en-skills-template) 24 | 25 | You can look at the end result in R via: 26 | ``` 27 | devtools::install_github("school-of-data/r-consortium-proposal", 28 | subdir="r-package") 29 | learnr::run_tutorial("en-recipe-template", package = "ddj") 30 | learnr::run_tutorial("en-skills-template", package = "ddj") 31 | ``` 32 | 33 | #### Process of creating a data recipe 34 | 35 | 1. Find an interesting story for journalists based on a dataset. Note: the dataset should be publicly available. Datasets containing personal information will not be accepted. 36 | 2. Clone the repo or simply copy the R Markdown file. 37 | 3. Following the data pipeline described in the template, create your recipe so that one section of your recipe = one section of the data pipeline. Note: the recipe should be self-contained: readers should not have to look up external material to complete it. 38 | 4. 
Create a pull request on GitHub so that your recipe can be reviewed and added to the list. Alternatively, send an email to heidi [at] schoolofdata [dot] ch 39 | 40 | 41 | ### Beta testing 42 | 43 | Do you have R and RStudio installed and know how to run R code and install packages? 44 | Do you speak English? Then you are the right person to help us with beta testing! :raised_hands: 45 | 46 | 1. Find content you'd like to test. Currently the *Introduction* and the *Swiss dual use data recipe* are 47 | ready for beta testing. To take a look at them, open R and run: 48 | ``` 49 | devtools::install_github("school-of-data/r-consortium-proposal", 50 | subdir="r-package") 51 | ``` 52 | and then 53 | ``` 54 | learnr::run_tutorial("en-introduction", package = "ddj") 55 | ``` 56 | to view the *Introduction* or 57 | ``` 58 | learnr::run_tutorial("en-recipe-switzerland-dual-use", package = "ddj") 59 | ``` 60 | for the *Swiss dual use data recipe*. 61 | 62 | 2. Open an [issue](https://github.com/school-of-data/r-consortium-proposal/issues/new) and fill it out. You don't have to do it all at once. Go at your own pace and come back when you have time! The template shows you how to signal that you are done; do that at the end. 63 | 64 | 3. Be proud of yourself :cake: :clap: :smile: 65 | -------------------------------------------------------------------------------- /ISSUE_TEMPLATE.md: -------------------------------------------------------------------------------- 1 | *If you are testing our R learning content for Journalists, please use this template. 2 | Otherwise you can just delete this.* 3 | 4 | *Thank you for helping us with the testing of the R learning content for Journalists! 
Please check out the steps you need to follow [here](https://github.com/school-of-data/r-consortium-proposal/blob/master/CONTRIBUTING.md#beta-testing).* 5 | 6 | *Pro tip: click on "Preview" to see what your edits look like.* 7 | 8 | ### Name: 9 | ### R skills: 10 | 11 | - **Have you used R before?** 12 | 13 | 14 | - **Do you have RStudio and R already installed?** 15 | 16 | 17 | - **What is the most advanced thing/project/analysis that you have done with R?** 18 | 19 | 20 | 21 | ## Writing, tone: 22 | 23 | - **Is the text easy to follow?** 24 | 25 | 26 | - **Do the explanations come at the right moment?** 27 | 28 | 29 | - **Are there any leaps in logic?** 30 | 31 | 32 | - **Is the grammar/orthography correct?** 33 | 34 | 35 | - **Any other comments on the writing/tone?** 36 | 37 | 38 | 39 | ## Code: 40 | 41 | - **When you run the code inside the website, does it work correctly?** 42 | 43 | 44 | - **When you run all the code on your local machine, does everything work? (Doing this takes a while, but we would really appreciate it :cake:)** 45 | 46 | 47 | - **Is the data still available from where the content says it is? Does the data have the same name and columns?** 48 | 49 | 50 | - **Is there another way of doing a step that would be better for beginners or should be mentioned as an alternative?** 51 | 52 | 53 | 54 | ## Other: 55 | **Anything else you would like to mention?** 56 | 57 | 58 | *Almost done: to show that you have finished the testing, comment on this issue with the words "I finished beta testing. @HeidiSeibold please update learning content now.".* 59 | 60 | 61 | *Thank you so much for your help! 
:tada: :cake: :clap:* 62 | -------------------------------------------------------------------------------- /R course survey responses/R course survey responses.md: -------------------------------------------------------------------------------- 1 | # R survey responses 2 | 3 | 4 | 5 | School of Data conducted a survey to identify the needs of data journalists in terms of learning how to use R. Over a period of two weeks, we received 97 responses. 6 | 7 | ## Our audience 8 | 9 | Half of the respondents (51%) are data journalists with between 1 and 4 years of experience, a third (32%) have less than a year's experience working in data journalism and 17% are journalists with 5 years or more in the field. 10 | 11 | ![image alt text](image_0.png) 12 | 13 | Nevertheless, only 4 out of 10 work full time as data journalists in newsrooms, as freelancers or in other types of companies. A third don’t consider data journalism their primary employment, and the rest are either students or work part-time. 14 | 15 | Most of the journalists (84%) work in media organisations (print, online, radio or TV), while the others do research or work in other types of organisations. 16 | 17 | In terms of the tasks they perform on a data journalism project, 47% of them go through the whole data pipeline: data gathering, cleaning, analysis, visualisation and storytelling. 18 | 19 | ![image alt text](image_1.png) 20 | 21 | Data gathering, data analysis and writing the story are the most frequent tasks journalists perform in a project. This suggests that they tend to perform multiple elements of the process, instead of focusing on a single component. For our purposes, this means that a course that focuses on multiple skills rather than on a single process could be more appealing to journalists. 22 | 23 | ![image alt text](image_2.png) 24 | 25 | Additionally, most of the journalists (93%) use spreadsheets for their work, along with tools like Open Refine and Tableau. 
The use of statistical packages and programming languages is less frequent but still significant: 36% and 20%, respectively. This implies that there is a group of advanced users for whom it might be easier to learn R (if they don’t already use it). 26 | 27 | ![image alt text](image_3.png) 28 | 29 | 30 | Regarding their knowledge of certain topics, half of the respondents have moderate expertise in data analysis and statistics but very little experience with programming. Therefore, when designing an R course for this audience, the materials could focus more on the logic of R as a programming language and practical examples of it, rather than trying to teach basic statistical concepts using the software. 31 | 32 | ![image alt text](image_4.png) 33 | 34 | Finally, 51% work within a team. Of these, 63% said they collaborate to create data-driven stories. This collaboration happens across all the elements of the pipeline; rather than each member of the team being allocated an element, the whole team collaborates on each component. 35 | 36 | ![image alt text](image_5.png) 37 | 38 | On the other hand, 22% answered that their teammates do work with data, but on separate stories, and 14% of these respondents said they are the only ones working with data. This reinforces the fact that data journalists have to know the whole data pipeline and perform multiple tasks in their projects. 39 | 40 | ## How do they learn? 41 | 42 | ### R courses 43 | 44 | 35% of the respondents had taken an R course before. Of these, 70% had taken an online course and 64% managed to finish it. Those who didn’t finish the course cited lack of time as the main reason, along with not finding it useful for their work. 45 | 46 | Almost half the respondents who hadn’t taken an R course said the main reason for this was that they hadn’t found a course that addressed their specific needs. 
This strengthens our main hypothesis: there is a need for learning materials, specially designed for journalists, that can help them not only learn a new tool, but teach them how they can apply it in their daily work. 47 | 48 | ![image alt text](image_6.png) 49 | 50 | -------------------------------------------------------------------------------- /R course survey responses/image_0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_0.png -------------------------------------------------------------------------------- /R course survey responses/image_1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_1.png -------------------------------------------------------------------------------- /R course survey responses/image_2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_2.png -------------------------------------------------------------------------------- /R course survey responses/image_3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_3.png -------------------------------------------------------------------------------- /R course survey responses/image_4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course 
survey responses/image_4.png -------------------------------------------------------------------------------- /R course survey responses/image_5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_5.png -------------------------------------------------------------------------------- /R course survey responses/image_6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/R course survey responses/image_6.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # R course for data journalists from the School of Data 2 | 3 | In this project we want to create an online R course for journalists and data journalists. We focus on people who are used to working with spreadsheets (e.g. Excel). The course will be available in Spanish, French, German and English. 4 | 5 | We think R is great for journalists all over the world because it is free, open source and helps with cleaning data, analysing data and creating beautiful graphs and reports. So far, resources for journalists to learn R are very limited and we want to help make R one of journalists' favorite tools :tada: 6 | 7 | 8 | So far we have created... 9 | 10 | - The [proposal](https://github.com/school-of-data/r-consortium-proposal/blob/master/proposal.md) 11 | we wrote for the R Consortium 12 | - Templates for [data recipes](https://github.com/school-of-data/r-consortium-proposal/tree/master/r-package/inst/tutorials/en-recipe-template) 13 | and [skills lessons](https://github.com/school-of-data/r-consortium-proposal/tree/master/r-package/inst/tutorials/en-skills-template). 
You can also check them out in R via 14 | ``` 15 | devtools::install_github("school-of-data/r-consortium-proposal", 16 | subdir="r-package") 17 | learnr::run_tutorial("en-recipe-template", package = "ddj") 18 | learnr::run_tutorial("en-skills-template", package = "ddj") 19 | ``` 20 | - The [intro lesson](https://github.com/school-of-data/r-consortium-proposal/blob/master/material/lessons/Introduction.Rmd) 21 | - A [data journalist's view on R](https://github.com/school-of-data/r-consortium-proposal/blob/master/material/lessons/why_use_R.md) by Timo Grossenbacher 22 | - The following data recipes 23 | + [Switzerland dual use goods](https://github.com/school-of-data/r-consortium-proposal/blob/master/material/lessons/switzerland-dual-use/recipe_switzerland-dual-use.Rmd) 24 | + [WIP - Visualisation of poll results during the presidential election in France](https://github.com/school-of-data/r-consortium-proposal/blob/master/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/en-recipe-france-presidentialelec-polls.Rmd) 25 | - The following skills lessons 26 | + NA 27 | 28 | But there is much more work to do and we need help! 29 | 30 | 31 | ### Wanna get involved? 32 | 33 | We are very thankful for any contributions. You don't need to be an R specialist or know R at all. 34 | 35 | Check out [CONTRIBUTING.md](https://github.com/school-of-data/r-consortium-proposal/blob/master/CONTRIBUTING.md) for information on how you can help :cake: :smiley: :clap: 36 | -------------------------------------------------------------------------------- /blog_paragraph.md: -------------------------------------------------------------------------------- 1 | School of Data is a network of data literacy practitioners, both organizations and individuals, implementing training and other data literacy activities in their respective countries and regions. 
Members of School of Data work to empower civil society organizations (CSOs), journalists, civil servants and citizens with the skills they need to use data effectively in their efforts to create better, more equitable and more sustainable societies. 2 | 3 | Our R Consortium grant project focuses on developing learning materials about R for journalists, with an emphasis on making them accessible and relevant to journalists from various countries. As a consequence, our content will use country-relevant examples and will be translated into several languages (English, French, Spanish, German). 4 | -------------------------------------------------------------------------------- /data_sets_for_recipes/2016-sqf-file-spec.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/data_sets_for_recipes/2016-sqf-file-spec.xlsx -------------------------------------------------------------------------------- /data_sets_for_recipes/fd_endireh2016.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/data_sets_for_recipes/fd_endireh2016.xlsx -------------------------------------------------------------------------------- /data_sets_for_recipes/sfpd_stopfrisk_datarecipeskeleton.md: -------------------------------------------------------------------------------- 1 | # Stop, Question, and Frisk data for the NYPD from 2003 - 2016 2 | ## Data recipe skeleton 3 | 4 | The link below contains various zipped CSVs of data records from the New York Police Department Stop, Question, and Frisk database. Each year is in its own zip file. 5 | - http://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page 6 | - Just 1 year can be used, or multiple years can be downloaded and joined to look at trends across years. 
7 | - `sqf-2016.csv` is also in this folder, in case the website goes down or changes URL. 8 | 9 | The 2016 CSV is a quick download, and unzipped it's a CSV of just under 4 MB. A very manageable size with only ~12,000 rows. I suspect the other years are small as well. 10 | 11 | This website also contains an XLSX file that has the database file specifications. Here you can find what each of the columns in the database dump stands for. 12 | 13 | ## Ask 14 | So many questions! I have not validated whether or not all of these present some interesting findings, but they're things to look into. 15 | - Under what scenarios do police officers *not* explain a reason for stopping a suspect? 16 | - Are suspects more likely to be frisked or searched if there was a summons? 17 | - Can we predict whether a suspect is likely to be frisked or searched based on their sex? Race? Age? A combination of these factors? 18 | - Are stops more likely to happen in certain zip codes? 19 | 20 | ## Find 21 | - http://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page 22 | 23 | ## Get 24 | - Download and unzip the file. It contains only one CSV. 25 | `sqf_data_raw <- read.csv('sqf-2016.csv', header = TRUE, na.strings = c(""))` 26 | - will probably do the trick: these files have headers and empty fields. 27 | 28 | ## Verify 29 | - Show the names of the columns 30 | - Trim columns that we might not need, like any column with location data that isn't a zip code, or all of the "reason for stop"/"reason for frisk" etc. columns 31 | - Summary to briefly review suspect information (age, sex, height, etc.) 32 | - Columns with too many empties? `sapply(sqf_data_raw, function(x) sum(is.na(x)))` 33 | 34 | ## Clean 35 | - ?? 36 | 37 | ## Analyze 38 | - Linear, logistic, or Poisson regressions!! 
39 | - http://www.statmethods.net/advstats/glm.html 40 | - If that's too advanced, 41 | - Graphing something like suspect gender against frisked/not frisked 42 | - Histogram of the zip codes where stops happen 43 | 44 | ## Present 45 | - Linear/logistic regressions can be shown as scatterplots with lines of best fit overlaid. 46 | - Histograms can have prettified axes etc. 47 | -------------------------------------------------------------------------------- /docs/index.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/docs/index.md -------------------------------------------------------------------------------- /licence: -------------------------------------------------------------------------------- 1 | Content: CC BY SA 2 | Code: MIT 3 | -------------------------------------------------------------------------------- /material/examples/switzerland-dual-use-goods.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Data set: Swiss exports of dual-use goods' 3 | author: "Heidi Seibold" 4 | date: "April 24, 2017" 5 | output: html_document 6 | --- 7 | 8 | Dual-use goods are goods that can be used for both civil and military purposes. 9 | In Switzerland, these goods are governed by special legislation – 10 | unlike in other countries, where they are treated as conventional arms exports. 11 | 12 | Thanks to [SRF data](https://github.com/srfdata/2017-01-dual-use) we have 13 | access to the raw dual-use data in Switzerland (2012 - 2016) as well as data 14 | cleaning procedures in R. 15 | 16 | ## What can be learned using this data? 
17 | 18 | - Reading Excel files (data are given to the user in xlsx format); package `xlsx` 19 | - Data cleaning; packages `tidyr`, `dplyr` 20 | - Automation (data is released every year and can simply be added to the script) 21 | - Plausibility checks 22 | - Saving data as CSV so it can be passed on 23 | - Reproducibility by using GitHub, thorough documentation, `rmarkdown` 24 | - Visualisation with `ggplot2` 25 | - Publishing with GitHub Pages 26 | 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /material/ideas/Intrductory_ideas.md: -------------------------------------------------------------------------------- 1 | ## In what contexts would a journalist use R instead of Excel? 2 | 3 | ### Some resources 4 | 5 | * https://www.r-bloggers.com/using-excel-versus-using-r/ 6 | * https://www.quora.com/Why-is-SAS-and-R-programming-used-instead-of-MS-excel 7 | * http://blog.revolutionanalytics.com/2014/10/why-r-is-better-than-excel.html 8 | * https://r-dir.com/blog/2013/11/r-vs-excel-for-data-analysis.html 9 | * https://www.rforexcelusers.com/excel-vs-r-when-to-use-what/ 10 | 11 | ### When Excel cannot do it 12 | 13 | * Working with large volumes of data, for example: electoral data, household surveys, census data. 14 | * Performing advanced statistical analyses. 15 | * Creating less common visualisations quickly (violin chart, slope graph) to explore the data 16 | * Keeping track of changes and making the content reproducible 17 | 18 | ### When Excel cannot do it as well 19 | 20 | * Transforming variables: creating new variables, transforming variables, deleting values or observations. (Even though you can do this in Excel, it’s easier to do it in R with a few lines of code) 21 | * Creating publishable data visualizations without needing another program after doing the analysis. 22 | * Easily automating some processes (VBA vs R code) 23 | 24 | ## In what context would Excel make sense instead of R? 
25 | 26 | ... 27 | 28 | ## Which dataset can be used to quickly guide a learner through the R vs Excel differences 29 | 30 | * electoral data (most likely available in all countries where we want to localise the content) 31 | * 32 | 33 | ## Introductory steps - phase 1 34 | 35 | The introductory steps are a guided process through which: 36 | 37 | * we spell out clearly how R can be used for a basic data story 38 | * we show how Excel compares, for each step 39 | * we introduce the reader to the basics of R, iterating on each notion learned in the previous step 40 | * we contextualise the step within a data-driven story writing workflow. 41 | 42 | ### Main intro 43 | _interactive quiz used as a way to start engaging the visitor right away, and to get statistics from various populations_ 44 | 45 | Are you a journalist? 46 | -> Yes -> No 47 | 48 | |____(if no) This tutorial is deliberately, intensely, exclusively focused on journalists. If you want to continue you'll have to pretend to be a journalist and, believe us, it's not pleasant. Do you still want to continue? 49 | 50 | ____ -> Yes -> No 51 | 52 | | 53 | |____ (if yes) Have you ever heard about this thing we call R? 54 | 55 | ____-> Yes -> No 56 | | 57 | |____ (if no) Well, despite its mysterious and concise name, R is an open source programming 58 | language and environment for statistical computing, widely used among data analysts, statisticians, 59 | programmers and, more recently, data journalists. Our mission today will be to convince you 60 | that R is worth your precious time and will help you improve your work in the newsroom. 61 | 62 | | 63 | |____ (if yes) Do you already use R in your work? 64 | 65 | ____-> Yes -> No 66 | 67 | | 68 | |____________ (if yes) What follows is a beginner tutorial for R newbies. 69 | You could jump directly to [our list of R recipes]() or still follow the introduction tutorial 70 | to steal ideas for your own R trainings (we know who you are!). 
71 | 72 | | 73 | |____________ (if no) So an R atheist? Or maybe agnostic? 74 | Our mission will be to convince you to actually start using R in your workflow. 75 | If we fail, we allow you the privilege to send us an angry tweet. 76 | 77 | -------------------- 78 | | Challenge accepted | 79 | -------------------- 80 | 81 | ### The context 82 | _Here we set up the story that will be used to contextualise all actions in the tutorial_ 83 | 84 | The presidential election is over and you are tasked with doing a post-election analysis. You have data about the number of votes, demographics and the geographic distribution of the votes. In order to see what would make a good story, you decide to analyse and visualise the data. 85 | Even though you normally work with spreadsheets, you want to try R. 86 | 87 | 88 | ### Getting the data 89 | 90 | You stretch your arms, grab a coffee, and off you go. First, the data. The Ministry of the Interior publishes the detailed election data in XML format {what is XML?}. Ugh, Excel doesn't read that easily. 91 | 92 | You also want to align it with a dataset which gives various information about regions, in order to try various analyses. The file you use is published by the National Statistics Institute in CSV format. Excel reads those. Good. 93 | 94 | Now let's get to work: 95 | 96 | Step | Excel | R 97 | --- | --- | --- 98 | 1 | You download the two data files from their respective websites | You write a function directly in the R interface which downloads and loads the data, allowing you to change only the URL to update this part of the process next year. 99 | 2 | You opt to convert the XML data using an XML converter. The result is not perfect, but you can work with it now. | 100 | 3 | You load the files in your Excel workbook, in two different sheets. | 101 | 102 | Steps saved using R: 2. Winner? R! 
103 | 104 | * A simple line of code can replace cumbersome manual operations 105 | * R loads all sorts of file formats, and doesn't complain 106 | * Write once, reuse several times: not only can you update your code easily, but you can easily share it with other projects. 107 | * Your intern doesn't know where to find the relevant data? No problem, the links are all in the R code! 108 | 109 | > **Wait, what?** 110 | > 111 | > How do I even know how to write this code? I'm not a programmer! 112 | > 113 | > Well, you don't need to: as long as you understand what you want to do, you will probably find an answer in the [documentation](), [tutorials]() or [existing projects](). 114 | 115 | **Are you convinced yet that R is worth your time?** 116 | ____ not really 117 | ____ hmmm, I need to see more to decide 118 | ____ yes, R won my heart 119 | 120 | ### Verifying the data 121 | 122 | In Excel: creating sums of elements to check against totals, etc. 123 | In R: ... 124 | 125 | ### Cleaning the data 126 | 127 | ... 128 | 129 | ### Analysing the data 130 | 131 | ... 132 | 133 | ### Showing analysis to your colleagues 134 | 135 | (in order to choose one or several angles, for example). 136 | 137 | ### Try various visualisations 138 | 139 | ... 140 | 141 | ### Edit the visualisations based on feedback from your editor 142 | 143 | ... 144 | 145 | ### Send the final visualisation and article to the online editor 146 | 147 | ... 148 | 149 | ### Send another version of the visualisation to the print editor 150 | 151 | ... 152 | 153 | This was the last step! Are you convinced yet? If not, check out [other scenarios]() or [send us an angry tweet]() 154 | 155 | ## Introductory steps - phase 2 156 | 157 | ### recipe1: more advanced electoral map 158 | 159 | **context**: we're going to use the same dataset, and rerun the programme until the cleaning phase. 160 | **relevant advantage**: keep track of changes and make the content reproducible 161 | steps: .... 
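The download-and-load step described in the "Getting the data" table could be sketched in R roughly as follows. This is a minimal sketch only: the URLs, file names and the `region_id` join key are placeholder assumptions, not a real data source.

```r
# Sketch of the "write once, reuse" idea: one function downloads and loads
# a dataset, so next year only the URL needs to change.
# URLs and the "region_id" column below are hypothetical placeholders.

fetch_data <- function(url, dest = basename(url)) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # skip the download if a local copy exists
  }
  read.csv(dest, stringsAsFactors = FALSE)
}

# results <- fetch_data("https://example.org/election_results_2017.csv")
# regions <- fetch_data("https://example.org/region_info.csv")
# joined  <- merge(results, regions, by = "region_id")  # align the two datasets
```

The point is not the specific calls but the workflow: the data sources are documented in the code itself, and rerunning the script reproduces every step from download to merge.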
162 | 163 | ### recipe2: analysing electoral map against other demographic data 164 | 165 | **context**: we're going to use a functionality of R which allows you to easily merge datasets, and automate the download of two different datasets to be merged! 166 | **relevant advantage**: ... 167 | 168 | ### recipe3: when something goes wrong 169 | 170 | **context**: we're going to add another layer of information and use R code from your colleague, but we'll also have to fix a mistake in the code. 171 | **relevant advantage**: ... 172 | -------------------------------------------------------------------------------- /material/ideas/course_content.md: -------------------------------------------------------------------------------- 1 | # What content should the course cover? 2 | 3 | ## Ideas 4 | (add all ideas that come to mind, we can throw them out later) 5 | 6 | - Reading csv files, spreadsheets, survey data in Stata or SPSS format (.dta, .sav) + 7 | - Scraping websites 8 | - Finding and reading open government data 9 | - Cleaning data (create new variables, delete observations, transform variables) + 10 | - Summarizing data: tabulate, descriptive statistics, filters. + 11 | - Making beautiful graphs (ggplot2). Also focus on interactive/js graphs. 12 | - Adding your own theme to a graph + 13 | - Adding images to a graph (ggimage) 14 | - Making maps with R 15 | - Reproducible workflow with Rmarkdown 16 | - Working with datetime data (POSIXct etc.) --> this could be part of the cleaning data module + 17 | - Regular expressions + 18 | - How to find help with R (Stack Overflow, ...) 19 | - data applications/Shiny 20 | 21 | # Modules (based on the data pipeline) 22 | 23 | ## 1. Data Gathering 24 | Objectives/Contents: 25 | 1. Teach journalists the basics of R (how does the program/console work, how to write commands, installing packages...) so that they become familiar with the programming language. 26 | 2.
Learn how to import and export data in different formats: flat files (.txt, .csv), Excel files, statistical package files (from Stata or SPSS). 27 | 28 | ## 2. Data Cleaning 29 | Objectives/Contents: 30 | 1. Learn how to clean data in R using the tidyr and dplyr packages and other commands: transform variables, delete observations, create new variables, change formats, rename variables, label variables, recode variables 31 | 32 | ## 3. Data analysis 33 | Objectives/Contents: 34 | 1. Learn how to tabulate data: summarize data, frequency tables, crosstabs. 35 | 2. Basic descriptive statistics with R 36 | 3. Run regressions 37 | 38 | ## 4. Data visualization 39 | Objectives/Contents: 40 | 1. Learn how to visualize data using ggplot2 41 | 42 | ## 5. Present 43 | 44 | 1. How to concretely integrate my work? Integration with various CMS, etc. 45 | -------------------------------------------------------------------------------- /material/ideas/introductory_ideas_2.md: -------------------------------------------------------------------------------- 1 | ## Introductory steps - phase 1 2 | 3 | The introductory steps are a guided process through which: 4 | 5 | * we spell out clearly how R can be used for a basic data story 6 | * we show how Excel compares, for each step 7 | * we introduce the reader to the basics of R, iterating on each notion learned in the previous step 8 | * we contextualise the step within a data-driven story writing workflow. 9 | 10 | ### Main intro 11 | _interactive quiz used as a way to start engaging the visitor right away, and to get statistics from various populations_ 12 | 13 | Are you a journalist? 14 | -> Yes -> No 15 | 16 | |____(if no) this tutorial is deliberately, intensely, exclusively focused on journalists. If you want to continue you'll have to pretend to be a journalist and, believe us, it's not pleasant. Do you still want to continue?
17 | 18 | ____ -> Yes -> No 19 | 20 | | 21 | |____ (if yes) Have you ever heard about this thing we call R? 22 | 23 | ____-> Yes -> No 24 | | 25 | |____ (if no) Well, despite its mysterious and concise name, R is an open source programming 26 | language and environment for statistical computing, widely used among data analysts, statisticians, 27 | programmers and, more recently, data journalists. Our mission today will be to convince you 28 | that R is worth your precious time and will help you improve your work in the newsroom. 29 | 30 | | 31 | |____ (if yes) Do you already use R in your work? 32 | 33 | ____-> Yes -> No 34 | 35 | | 36 | |____________ (if yes) What follows is a beginner tutorial for R newbies. 37 | You could jump directly to [our list of R recipes]() or still follow the introduction tutorial 38 | to steal ideas for your own R trainings (we know who you are!). 39 | 40 | | 41 | |____________ (if no) So an R atheist? Or maybe agnostic? 42 | Our mission will be to convince you to actually start using R in your workflow. 43 | If we fail, we allow you the privilege to send us an angry tweet. 44 | 45 | -------------------- 46 | | Challenge accepted | 47 | -------------------- 48 | 49 | ### The context 50 | _Here we set up the story that will be used to contextualise all actions in the tutorial_ 51 | 52 | The presidential election is over and you are tasked with doing a post-election analysis. You have data about the number of votes, demographics and the geographic distribution of the votes. To see what would make a good story, you decide to analyse and visualise the data. 53 | Even though you normally work with spreadsheets, you want to try R. Here are a few reasons why: 54 | 55 | Excel | R 56 | --- | --- 57 | It's a "point and click" program | You write functions directly in the R interface and R does all the magic 58 | If you want to repeat a process you have to do it every time | R allows for automation of processes that have to be repeated.
Write once, reuse several times: not only can you update your code easily, you can also share it easily with other projects. 59 | Only allows some types of files | Can load different kinds of files that can be complicated or impossible to work with in Excel 60 | .... | ..... 61 | 62 | > **Wait, what?** 63 | > 64 | > How do I even know how to write R code? I'm not a programmer! 65 | > 66 | > Well, you don't need to: as long as you understand what you want to do, you will probably find an answer in the [documentation](), [tutorials]() or [existing projects](). 67 | 68 | We are going to guide you through each step. Don't worry, it won't be that hard. Let's start with getting the data. 69 | 70 | ### Getting the data 71 | 72 | You stretch your arms, grab a coffee, and off you go! First, you have to find the electoral data. The Ministry of the Interior publishes the detailed election data in XML format. 73 | 74 | You also want to align it with a dataset which gives various information about regions, in order to try various analyses. The file you use is published by the National Statistics Institute in CSV format and other files are in XLSX. 75 | 76 | (_the idea is to use different data formats, we have to select the final databases, for this example I'm using fictional files_) 77 | 78 | Let's download those 2 files and store them in a folder called `electoral-data`. 79 | You must now have 2 files in your folder: 80 | `electoral2.csv` 81 | `electoral3.xlsx` 82 | And a link for the XML file. 83 | 84 | Now let's try reading in the data: 85 | 1. Open your R editor (e.g. RStudio) 86 | 2. Set your working directory to `electoral-data`. To do this enter 87 | ```{r, eval=FALSE} 88 | setwd("/PATH/electoral-data") 89 | ``` 90 | in the console. 91 | Then, we are going to need two specialised packages to read XML and Excel files into R: `XML` and `readxl`.
92 | To install the packages run 93 | ```{r, eval=FALSE} 94 | install.packages("readxl") 95 | install.packages("XML") 96 | ``` 97 | To load the packages run 98 | ```{r} 99 | library("readxl") 100 | library("XML") 101 | ``` 102 | Now we can read the data: 103 | ```{r} 104 | data1 <- xmlTreeParse("link-to-your-XML-file") 105 | data2 <- read.csv("electoral2.csv", header = FALSE) 106 | data3 <- read_excel(path = "electoral3.xlsx") 107 | 108 | ``` 109 | 110 | See? That was easy! Now you have your data loaded into the program and can start working. 111 | Are you seeing the magic yet? 112 | * A simple line of code can replace the cumbersome manual operations that you have to do in Excel to import the data 113 | * R loads all sorts of file formats, and doesn't complain 114 | * Write once, reuse several times: not only can you update your code easily, you can also share it easily with other projects. 115 | * Your intern doesn't know where to find the relevant data? No problem, the links are all in the R code! 116 | 117 | > **Let's review the commands we learned in this step:** 118 | > 119 | > Set your working directory: `setwd()` 120 | > Install packages: `install.packages()` 121 | > Load libraries: `library()` 122 | > Read files: `read.csv()` `read_excel()` `xmlTreeParse()` 123 | 124 | 125 | > **Want to learn what other types of data can be imported into R?** 126 | > 127 | > Check this! https://www.datacamp.com/community/tutorials/r-data-import-tutorial#gs.oZSkvFo 128 | > 129 | 130 | Now let's move on with the analysis! 131 | 132 | ### Verifying the data 133 | Now that you have imported your data into R, it's time to start exploring it.
Here are some useful commands: 134 | 135 | - `names()` shows the column names of the data frame 136 | ```{r} 137 | names(data2) 138 | ``` 139 | - `str()` shows the type of each variable (here `num` for numeric and `chr` for character) and the 140 | first few values 141 | ```{r} 142 | str(data2) 143 | ``` 144 | Now you can see that your database has X variables, Y of which are numeric while the rest are strings. 145 | 146 | 147 | ### Cleaning the data 148 | 149 | ... 150 | 151 | ### Analysing the data 152 | 153 | ... 154 | 155 | ### Showing analysis to your colleagues 156 | 157 | (in order to choose one or several angles, for example). 158 | 159 | ### Try various visualisations 160 | 161 | ... 162 | 163 | ### Edit the visualisations based on feedback from your editor 164 | 165 | ... 166 | 167 | ### Send the final visualisation and article to the online editor 168 | 169 | ... 170 | 171 | ### Send another version of the visualisation to the print editor 172 | 173 | ... 174 | 175 | This was the last step! Are you convinced yet? If not, check out [other scenarios]() or [send us an angry tweet]() 176 | 177 | ## Introductory steps - phase 2 178 | 179 | ### recipe1: more advanced electoral map 180 | 181 | **context**: we're going to use the same dataset, and rerun the programme until the cleaning phase. 182 | **relevant advantage**: keep track of the changes and allow reproducibility of the contents 183 | steps: .... 184 | 185 | ### recipe2: analysing electoral map against other demographic data 186 | 187 | **context**: we're going to use a functionality of R which allows you to easily merge datasets, and automate the download of two different datasets to be merged! 188 | **relevant advantage**: ... 189 | 190 | ### recipe3: when something goes wrong 191 | 192 | **context**: we're going to add another layer of information and use R code from your colleague, but we'll also have to fix a mistake in the code. 193 | **relevant advantage**: ...
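The dataset merge mentioned in recipe 2 can be sketched as follows (a minimal sketch with made-up example data, not the real electoral files):

```r
# Two tiny made-up data frames sharing a common "region" column
votes <- data.frame(region = c("A", "B", "C"), votes = c(1200, 800, 950))
demographics <- data.frame(region = c("A", "B", "C"), population = c(5000, 3000, 4100))

# merge() joins the two datasets on their common column
electoral_map <- merge(votes, demographics, by = "region")
```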
194 | 195 | -------------------------------------------------------------------------------- /material/ideas/rmarkdown_template.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'R course template' # I would suggest using something like 'Data pipeline step 1: Ask' 3 | author: "Heidi Seibold" 4 | date: "April 21, 2017" 5 | output: html_document 6 | --- 7 | 8 | # Howto 9 | 10 | If you work with RStudio: 11 | 12 | - Go to File -> New File -> R Markdown 13 | - Enter title, e.g. *Data pipeline step 1: Ask* 14 | - Enter your name as author 15 | - Click *OK* 16 | - Save file in the [GitHub repo](https://github.com/school-of-data/r-consortium-proposal) 17 | under `/r-consortium-proposal/material/lessons`; Naming should be sensible, such as 18 | `pipeline_1_ask.Rmd` for the document that explains the *ASK* step of the data pipeline. 19 | - Add your text 20 | - Push the *knit* button (or use `rmarkdown::render("rmarkdown_template.Rmd")`) 21 | 22 | If you need help with RMarkdown, check out [this page](http://rmarkdown.rstudio.com/lesson-1.html). 
23 | 24 | -------------------------------------------------------------------------------- /material/lessons/Camila-s docs/morosidad16.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/Camila-s docs/morosidad16.xlsx -------------------------------------------------------------------------------- /material/lessons/Camila-s docs/morosidad17.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/Camila-s docs/morosidad17.xlsx -------------------------------------------------------------------------------- /material/lessons/Rplot.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/Rplot.png -------------------------------------------------------------------------------- /material/lessons/Tutorial R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tutorial R para periodistas" 3 | author: "Camila Salazar" 4 | output: 5 | learnr::tutorial: 6 | progressive: true 7 | allow_skip: true 8 | runtime: shiny_prerendered 9 | --- 10 | 11 | ```{r setup, include=FALSE} 12 | library(learnr) 13 | library(tidyverse) 14 | ``` 15 | 16 | ## R para periodistas: Cómo iniciar 17 | R es un entorno y lenguaje de programación para analizar datos. R es una herramienta muy útil para hacer periodismo de datos, ya que nos permite realizar todo el proceso de obtener, limpiar, analizar y visualizar la información desde el mismo lugar. En estos tutoriales aprenderá cómo utilizar el programa para sus proyectos periodísticos.
18 | 19 | ### Contenidos 20 | 21 | En este tutorial aprenderá: 22 | 23 | * Cómo instalar R 24 | * Cómo instalar RStudio 25 | * Cómo instalar paquetes en R 26 | * Generalidades del programa 27 | 28 | Empecemos 29 | 30 | ## ¿Cómo instalar R? 31 | 32 | ![](https://vimeo.com/203516510) 33 | 34 | Tal como se observa en el video, para instalar R vamos a la dirección [cloud.r-project](https://cloud.r-project.org) y seleccionamos R para Linux, Mac o Windows. Una vez descargado, se siguen las instrucciones de instalación. 35 | 36 | Algunas de las ventajas de R: 37 | 38 | * Es gratuito 39 | * Es colaborativo, por lo que constantemente surgen nuevas actualizaciones según las necesidades de los usuarios. 40 | 41 | ## ¿Cómo instalar RStudio? 42 | 43 | RStudio es una interfaz desarrollada para R. ¿Qué significa esto? RStudio nos ayuda a escribir código, visualizar resultados y, en general, trabajar con el lenguaje de programación R de una forma más fácil. Como recomendación, es mucho más sencillo trabajar con RStudio y es la herramienta que vamos a utilizar en los siguientes tutoriales. 44 | 45 | Algo a tomar en cuenta es que es necesario tener instalado R para poder utilizar RStudio. Veamos cómo instalarlo: 46 | 47 | ![](https://vimeo.com/203516968) 48 | 49 | Tal como se observa en el video, para instalar RStudio vamos a la dirección [rstudio.com](https://rstudio.com/download) y seleccionamos la versión para nuestro sistema operativo. 50 | 51 | Una vez descargado, lo abrimos y estamos listos para comenzar. 52 | 53 | ### Quiz #1: Instalar R y RStudio 54 | 55 | ```{r quiz3, echo=FALSE} 56 | quiz(caption = "Instalar R y RStudio", 57 | question("¿Qué es RStudio?", 58 | answer("Una aplicación que nos ayuda a usar de forma más fácil R.", correct = TRUE, message = "RStudio tiene una interfaz amigable, que nos facilita escribir, usar y salvar código de R."), 59 | answer("Una aplicación para usar R sin escribir código", message = "¡No!
El código es una de las grandes ventajas que tiene R frente a otros programas como Excel. El código nos permite llevar un registro del trabajo que estamos haciendo y permite la reproducibilidad de los contenidos."), 60 | answer("Un programa de hojas de cálculo como Excel"), 61 | answer("Es lo mismo que R", message = "No. Como ya vimos anteriormente son dos cosas diferentes. R es un lenguaje, como el español. RStudio es un programa que nos permite usar ese lenguaje, de la misma forma que un programa como Word nos permite escribir textos en español."), 62 | allow_retry = TRUE 63 | ), 64 | question("¿RStudio es gratuito?", 65 | answer("Sí", correct = TRUE, message = "Al igual que R, RStudio es gratis y open-source."), 66 | answer("No.") 67 | ), 68 | question("¿Es necesario instalar R si ya tengo RStudio?", 69 | answer("Sí.", correct = TRUE, message = "R no viene con RStudio; hay que instalarlos de forma separada."), 70 | answer("No.", message = "R no viene con RStudio; hay que instalarlos de forma separada.") 71 | ) 72 | ) 73 | ``` 74 | 75 | ## Instalar y utilizar paquetes 76 | Un paquete de R es un conjunto de funciones, datasets y documentación que permite ampliar la funcionalidad de R. Por ejemplo, existen paquetes para hacer gráficos, importar archivos de Excel, hacer análisis estadístico, entre otros. En la actualidad existen cerca de 10.500 paquetes en R. 77 | 78 | ¿Cómo instalarlos? 79 | 80 | ![](https://vimeo.com/203516241) 81 | 82 | ### Instalar paquetes 83 | 84 | Como vimos en el video, para instalar un paquete escribimos en la consola `install.packages()`, y ponemos entre __paréntesis y con comillas__ el nombre del paquete que queremos instalar. Por ejemplo, si queremos instalar el paquete `tidyverse`, escribimos `install.packages("tidyverse")` en la consola.
85 | Si queremos instalar varios paquetes a la vez lo hacemos así: `install.packages(c("tidyverse", "ggplot2", "xlsx"))`, es decir, ponemos los diferentes nombres de los paquetes en un vector (c) y los separamos por comas. 86 | 87 | Es importante considerar que los paquetes se instalan una sola vez en la computadora, por lo que no hay necesidad de instalarlos cada vez que abrimos RStudio. 88 | 89 | ### Cargar paquetes al programa 90 | Una vez que hemos instalado un paquete, tenemos que cargarlo a RStudio para poder utilizarlo. Para hacer esto usamos el comando `library(nombre del paquete)`. De esta forma, si quisiéramos utilizar el paquete `ggplot2`, el orden sería el siguiente: 91 | `install.packages("ggplot2")` 92 | `library(ggplot2)` 93 | 94 | También existe otro comando para cargar paquetes: `require(nombre del paquete)`. Funciona de forma similar a `library()`, pero devuelve `FALSE` (en lugar de un error) si el paquete no está instalado. 95 | 96 | ¿Cuáles paquetes instalar? Como se mencionó al inicio, R tiene cerca de 10.500 paquetes. Según las tareas que queramos realizar vamos a necesitar diferentes paquetes. A lo largo de estos materiales vamos a ir conociendo algunos de ellos. 97 | 98 | Para iniciar es recomendable instalar el paquete `tidyverse`, que contiene un conjunto de paquetes de uso frecuente.
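La secuencia completa se vería así (un ejemplo mínimo):

```r
# Instalar el paquete (solo una vez por computadora)
install.packages("tidyverse")

# Cargar el paquete (en cada sesión de R)
library(tidyverse)

# require() carga el paquete y devuelve TRUE o FALSE
# (en lugar de un error) si el paquete no está instalado
if (require(tidyverse)) {
  message("tidyverse está listo para usar")
}
```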
99 | 100 | 101 | ### Quiz: Instalar paquetes 102 | 103 | ```{r names, echo = FALSE} 104 | quiz(caption = "Quiz - Paquetes en R", 105 | question("¿Cuál comando se utiliza para instalar paquetes?", 106 | answer("`library()`", message = "No, library se utiliza luego de instalar un paquete"), 107 | answer("`install.packages()`", correct = TRUE), 108 | answer("`install_packages()`"), 109 | answer("No hay ningún comando: se tiene que ir al sitio [cran.r-project.org](http://cran.r-project.org) y bajar los paquetes manualmente.", message = "R permite descargar los paquetes desde el programa, solo se necesita conexión a internet"), 110 | allow_retry = TRUE 111 | ), 112 | question("¿Cada cuánto hay que instalar un paquete?", 113 | answer("Cada vez que abrimos R"), 114 | answer("Cada vez que reseteamos nuestra computadora"), 115 | answer("Solamente una vez. Una vez instalado, el paquete queda almacenado en nuestra computadora.", correct = TRUE), 116 | answer("No es necesario instalar paquetes para usar R.", message = "Aunque algunas funciones se pueden realizar sin el uso de paquetes, estos son los que nos permiten ampliar las posibilidades de R y las tareas que podemos hacer con el programa"), 117 | allow_retry = TRUE 118 | ), 119 | question("¿Cuál comando usamos para cargar los paquetes ya instalados?", 120 | answer("`library.load()`"), 121 | answer("`require()`", message = "Este comando también carga paquetes, pero devuelve FALSE en lugar de un error si el paquete no está instalado"), 122 | answer("`library()`", correct = TRUE), 123 | answer("No es necesario un comando, una vez instalado el paquete se puede utilizar."), 124 | allow_retry = TRUE 125 | ) 126 | ) 127 | ``` 128 | 129 | ## Antes de comenzar 130 | Una vez que ya hemos instalado R, RStudio y algunos paquetes, podemos comenzar a trabajar. Lo primero que tenemos que hacer es elegir un directorio de trabajo, donde vamos a guardar nuestros archivos y de dónde vamos a cargarlos.
131 | 132 | Para saber en cuál directorio estamos trabajando, usamos el comando `getwd()`. Si queremos cambiar el directorio de trabajo usamos `setwd("directorio de trabajo")` ¡y listo! 133 | 134 | Lo ideal es tener todos los archivos que vayamos a necesitar en esa carpeta. 135 | 136 | ## Importar archivos 137 | 138 | Cuando trabajamos con datos nos enfrentamos a diferentes tipos de archivos: .csv, .xlsx, .dta, .sav, entre otros. En este segmento vamos a ver cómo importar archivos a R para poder analizarlos. 139 | 140 | ### Importar .csv 141 | 142 | -------------------------------------------------------------------------------- /material/lessons/demographics1.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/demographics1.xlsx -------------------------------------------------------------------------------- /material/lessons/demographics2.dta: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/demographics2.dta -------------------------------------------------------------------------------- /material/lessons/p1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/p1.png -------------------------------------------------------------------------------- /material/lessons/p4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/p4.png --------------------------------------------------------------------------------
/material/lessons/pipeline_1_ask.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Data pipeline step 1: Ask' 3 | author: "Heidi Seibold" 4 | date: "April 21, 2017" 5 | output: html_document 6 | --- 7 | 8 | 9 | 10 | You can find the template for lessons 11 | [here](https://github.com/school-of-data/r-consortium-proposal/blob/master/material/ideas/rmarkdown_template.Rmd). 12 | -------------------------------------------------------------------------------- /material/lessons/polls.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Recipe: visualize trends in electoral polls" 3 | author: Samuel Goëta 4 | output: html_notebook 5 | --- 6 | # Define 7 | - issue with surveys: blurring signals; we want to know the trends rather than a daily update on where each poll stands 8 | --- 9 | # Find 10 | --- 11 | # Get 12 | 13 | ```{r} 14 | #install.packages("rvest") 15 | library(rvest) 16 | polls <- read_html("https://en.wikipedia.org/wiki/Opinion_polling_for_the_French_presidential_election,_2017") 17 | polls_round1 <- polls %>% 18 | # html_node("#mw-content-text > table:nth-child(30)") %>% 19 | html_node(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "wikitable", " " ))]') %>% 20 | html_table(header = TRUE, fill = TRUE) 21 | ``` 22 | 23 | 24 | 25 | --- 26 | # Verify 27 | --- 28 | # Clean 29 | ```{r} 30 | polls_round1 <- polls_round1[-c(1, 2, 3, 4, 5, 94), ] # drop the useless rows 31 | colnames(polls_round1) <- c("Poll source", "Fieldwork date", "Sample size", "Abstention", "Arthaud", "Poutou", "Mélenchon", "Hamon", "Macron", "Lassalle", "Fillon", "Dupont-Aignan", "Asselineau", "Le Pen", "Cheminade") # remove the parties from the column names 32 | save(polls_round1, file = "./polls_round1.rdata") # save, because this may crash 33 | load(file = "./polls_round1.rdata") 34 | library(tidyverse) 35 | polls_round1 <- polls_round1 %>%
mutate_at(vars(`Abstention`:`Cheminade`), funs(parse_number), na = character()) # strip the percent signs and the abstention zeros 36 | 37 | #install.packages("lubridate") 38 | #install.packages("rex") 39 | library(lubridate) 40 | library(rex) 41 | rex_mode() # loads the rex helpers into memory (see the help) 42 | get_date <- rex( # builds the regex from a readable, understandable language 43 | maybe(numbers %>% between(1, 2), # optional: one or two digits followed by a dash 44 | "–"), 45 | capture( 46 | name = "date", # the group we want to extract 47 | numbers %>% between(1, 2), # one or two digits 48 | space, # then a space 49 | letters %>% n_times(3), # 3 letters for the month 50 | space, 51 | "20", one_of(1:5), number # year 52 | ) 53 | ) 54 | 55 | date <- re_matches(polls_round1$`Fieldwork date`, get_date)$date # create the date vector 56 | 57 | polls_round1 <- polls_round1 %>% 58 | mutate(date = date %>% dmy()) # add the column 59 | 60 | polls_round1_long <- polls_round1 %>% gather(Candidates, Percentage, Abstention:Cheminade) # gather to long format 61 | 62 | save(polls_round1_long, file = "./polls_round1_long.Rdata") 63 | 64 | 65 | ``` 66 | 67 | 68 | --- 69 | # Visualize 70 | 71 | ```{r, eval=FALSE} 72 | # devtools::install_github("hrbrmstr/hrbrthemes") 73 | library(hrbrthemes) 74 | 75 | polls_round1_long %>% 76 | filter(Candidates != "Abstention") %>% 77 | ggplot() + 78 | geom_point(mapping = aes(x = date, y = Percentage, color = Candidates), alpha = 0.7) + 79 | geom_smooth(aes(date, y = Percentage, color = Candidates)) + 80 | labs(x = "Date of poll", 81 | y = "Percentage of candidate in poll", 82 | title = "Polls during the 2017 Presidential election in France", 83 | subtitle = "Evolution of polls during the month preceding the first round of the 2017 Presidential election in France", 84 | caption = "Source: Wikipedia") + 85 | theme_ipsum() + 86 | theme(legend.position = "right") # reminder: add the theme after the other layers 87 | ``` 88 | --- 89 | # Present 90 |
-------------------------------------------------------------------------------- /material/lessons/switzerland-dual-use/data_clean/elic_2016_1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/switzerland-dual-use/data_clean/elic_2016_1.RData -------------------------------------------------------------------------------- /material/lessons/switzerland-dual-use/data_clean/elic_2016_2.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/switzerland-dual-use/data_clean/elic_2016_2.RData -------------------------------------------------------------------------------- /material/lessons/switzerland-dual-use/data_clean/elic_2016_3.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/switzerland-dual-use/data_clean/elic_2016_3.RData -------------------------------------------------------------------------------- /material/lessons/switzerland-dual-use/data_clean/elic_2016_4.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/material/lessons/switzerland-dual-use/data_clean/elic_2016_4.RData -------------------------------------------------------------------------------- /material/lessons/why_use_R.md: -------------------------------------------------------------------------------- 1 | Hello! 
2 | 3 | I am a data journalist working for Swiss Radio and Television ([SRF Data](https://srf.ch/data)) and I have been using R for almost three years – as a means for researching and publishing countless investigative and explanatory pieces. 4 | In addition to the reporting, we have set new standards in (European) data journalism by publishing most of our [data and code on GitHub pages](https://srfdata.github.io), which got us a nomination for the ["data journalism website of the year"](https://www.datajournalismawards.org/shortlist/) and has inspired other outlets [to do the same](https://br-data.github.io/diskriminierung-mietmarkt-analyse/analyse.nb.html). R and RMarkdown enabled and empowered us to do this. 5 | Besides that, R has many other advantages for journalistic work, among which are: 6 | 1. **R is good at almost everything**: Getting data from a website, transposing a spreadsheet, combining multiple tables, converting JSON to CSV and vice versa, filtering and sorting data, drawing some exploratory plots, preparing data for further use in an interactive data visualization, creating GIFs, you name it. Of course, there are good tools out there that can do all of that – but with R you can combine everything in one script, which gives you a true efficiency boost. 7 | 2. **R is a language, not a tool**: When talking to colleagues in journalism, this is the most often heard argument against using R. In fact, it’s R’s biggest asset, especially in an environment where methods, and not just results, matter. Which leads to the next point. 8 | 3. **R supports a transparent and reproducible workflow**: With R, lying is difficult. Once you’re ready to publish your script, and not just your results, everyone will know what you did – and people will point out your methodological flaws or even errors. But data journalism [needs to be open](http://www.brianckeegan.com/2014/04/the-need-for-openness-in-data-journalism/).
And this is hard or even impossible without using a scripting language. 9 | 4. **R is easy to learn and lets you get started in 5 minutes**: While not all will agree with me on this, and while it depends on the definition of "learn", getting started in R is pretty straightforward. Nowadays there are so many [good tutorials](https://rddj.info) and, with the [`tidyverse`](https://www.tidyverse.org/) becoming more popular, R is getting more and more accessible for people without prior programming experience. 10 | 5. **R is free and open source.** 11 | 12 | 13 | Now, enjoy your journey into R – trust me, it's worth it. 14 | 15 | Timo Grossenbacher 16 | 17 | Adapted from the blog post https://timogrossenbacher.ch/2015/12/why-data-journalists-should-start-using-r-in-2016/. 18 | -------------------------------------------------------------------------------- /proposal.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'R Consortium Proposal: School of Data Material Development' 3 | output: 4 | html_document: 5 | keep_md: yes 6 | pdf_document: default 7 | --- 8 | 9 | 10 | 11 | 12 | ```{r, echo=FALSE, message=FALSE} 13 | library("knitr") 14 | library("googlesheets") 15 | library("ggplot2") 16 | ``` 17 | 18 | 19 | 20 | ## The context 21 | 22 | The field of data journalism has seen an exponential rise in activity over the past 10 years, in conjunction with the open data movement and the multiplication of user-friendly tools aimed at journalists and other similar data users. With its annual [international conferences](https://www.ire.org/nicar), the growth of university classes and [online courses](http://schoolofdata.org), and sustained funding from traditional journalism supporters and new open data funders, data journalism has become a leading topic with regard to innovation in journalism.
23 | 24 | Defined as a practice [allowing journalists to combine their ability to tell a compelling story with the sheer scale and range of digital information now available](http://datajournalismhandbook.org/1.0/en/introduction_0.html), data journalism is also seen as a new approach to improve the role of media as a government watchdog. Concrete examples of the impact of data journalism are plentiful ([Panama Papers](https://panamapapers.icij.org/), [Migrant Files](http://www.themigrantsfiles.com), [World Bank eviction scandal](https://www.icij.org/project/world-bank), ...) and they all reinforce the urgency to increase the adoption of data journalism methodologies, techniques and tools. 25 | 26 | R is one of them. 27 | 28 | ## The Problem 29 | R has great tools for accessing and analyzing open data sources, which can be hugely beneficial to journalists. Some leading newsrooms, such as ProPublica and the New York Times, do use R in their data journalism workflow, to great effect. However, beyond these obvious names, there is a low level of awareness of R in the journalism community. If we also take into account the low availability of specialized R learning content -- especially in languages other than English -- it appears that a substantial effort is needed to reach more journalists across the world. 30 | 31 | This problem is compounded by several other factors that we explored in a small worldwide survey conducted in preparation for this proposal. It appeared that, among the journalists who had never taken an R course, the main reason given was the lack of learning content addressing their specific needs. Another substantial portion had never heard of R (see [our summary on Github](https://github.com/school-of-data/r-consortium-proposal/blob/master/R%20course%20survey%20responses/R%20course%20survey%20responses.md)). This strengthens our main hypothesis: there is a need for learning content specially designed for journalists.
32 | 33 | 34 | 35 | ## The Plan 36 | 37 | With the help of the survey results (see [our summary](https://github.com/school-of-data/r-consortium-proposal/blob/master/R%20course%20survey%20responses/R%20course%20survey%20responses.md)), we refined our initial idea of creating journalism-specific learning content into a series of learning modules organised into tracks and based on real projects. While the early introductory modules will be fully written by our team in order to fully tailor them to journalists, we plan to use the tracks as a way to curate existing R learning content available online, making it contextually useful for journalists. When necessary we will also translate this content to make it useful for non-English speakers and adapt the data examples for national contexts. All our content will be published on GitHub and the School of Data website, under Creative Commons and MIT licences. 38 | 39 | The four languages/countries (French/France, German/Switzerland, English/Ghana and Spanish/Costa Rica) were chosen because they are the countries where the proposal writers live. In the future we plan to extend the materials to other languages/countries using the School of Data network. 40 | 41 | 42 | #### Goals 43 | 44 | - Produce introductory learning content about R tailored for journalists in four languages: English, French, Spanish and German 45 | - Design a content format that encourages journalists to learn without feeling overwhelmed 46 | - Select and organize existing learning resources about R to direct journalists to them 47 | - Optional: Organize workshops in four countries: Ghana, France, Costa Rica and Switzerland 48 | 49 | #### Deliverables 50 | 51 | - Several modules about learning R for journalists, from "why would I want to use R?"
to "using R in a newsroom workflow", organised in learning tracks 52 | - A simple GitHub-based website displaying the content in an attractive format 53 | - Optional: Training plan for in-person workshops 54 | - Optional: Slides for in-person workshops 55 | 56 | 57 | #### Timeline 58 | 59 | ```{r, echo=FALSE, message=FALSE} 60 | library(lubridate) 61 | costs_sheet <- gs_title("ScoDa R consortium proposal: Costs & Timeline") 62 | timel <- as.data.frame(gs_read(costs_sheet, ws = 2), 63 | stringsAsFactors = FALSE) 64 | names(timel) <- gsub(pattern = " |/", ".", names(timel)) 65 | timel$Start.Date <- strptime(timel$Start.Date, format = "%d.%m.%Y") 66 | timel$End.Date <- strptime(timel$End.Date, format = "%d.%m.%Y") 67 | timel$Work.Activity <- factor(timel$Work.Activity, levels = rev(timel$Work.Activity)) 68 | 69 | ggplot(data = timel) + 70 | geom_segment(aes(x = as_date(Start.Date), xend = as_date(End.Date), 71 | y = Work.Activity, yend = Work.Activity)) + 72 | ylab("Work / Activity") + xlab("Date") 73 | 74 | ``` 75 | 76 | 77 | 78 | #### Likely failure modes and how to recover 79 | As we will prepare the material in a modular way, it will not be a big problem if we do not manage to prepare as much material as we now anticipate. In this case we will have fewer modules, but their quality will be ensured. Our content being open and reusable, anyone from the School of Data community will be able to extend the content beyond our initial work. 80 | 81 | 82 | ## How Can The ISC Help?
83 | 84 | ```{r, echo=FALSE, message=FALSE} 85 | costs <- gs_read(costs_sheet, ws = 1) 86 | costs$Notes[is.na(costs$Notes)] <- "" 87 | 88 | material <- subset(costs, Info == "material", 89 | select = c("Work", "Funding requested", 90 | "School of Data co-funding")) 91 | workshop <- subset(costs, Info == "workshop", 92 | select = c("Work", "Funding requested", 93 | "School of Data co-funding")) 94 | ``` 95 | 96 | We would like to request financial support of `r sum(material["Funding requested"])` USD from the ISC to develop material for journalists who would like to learn R. 97 | The School of Data can support the material development by providing guidance for the writing, as well as communication and maintenance. 98 | The development costs can be split up as follows: 99 | ```{r, echo=FALSE} 100 | kable(material, format = "markdown") 101 | ``` 102 | 103 | 104 | If additional support is possible, we would like to organize workshops in Ghana, Costa Rica, France and Switzerland. The financial support needed is 105 | `r sum(workshop["Funding requested"])` USD and can be split up as follows: 106 | ```{r, echo=FALSE} 107 | kable(workshop, format = "markdown") 108 | ``` 109 | 110 | 111 | ## Dissemination 112 | 113 | The School of Data has a broad community with many members working in journalism. 114 | 115 | We will promote the content within and beyond our community, by sending out newsletters, [tweets](https://twitter.com/SchoolOfData) and information on the main [School of Data website](https://schoolofdata.org/) as well as the websites of the chapters in the different countries. We will publish the materials under a CC-BY license and make it easy for others to build upon the content and extend it. 116 | 117 | ## The team 118 | 119 | #### The people 120 | 121 | - [Heidi Seibold](http://www.ebpi.uzh.ch/en/aboutus/departments/biostatistics/teambiostats/seibold.html) is a PhD student in Biostatistics in Switzerland and co-organizer of the Zurich R User Group.
122 | - [Camila Salazar]() is a Costa Rican data journalist working at the leading newspaper La Nación. 123 | - [David Opoku]() is the Africa Lead at Open Knowledge International, focused on capacity building. David is from Ghana and is a proficient R user. 124 | - [Cedric Lombion](http://schoolofdata.org/team) is a member of School of Data's International Coordination Team, responsible for Program Design and Implementation. 125 | - [Samuel Goëta]() is a French sociology researcher, data trainer and co-founder of Open Knowledge France, specializing in open data. 126 | - [Joel Gombin]() is a French political scientist known for his detailed analysis of the political party National Front using R. 127 | 128 | #### The organisation 129 | 130 | School of Data is a network of data literacy practitioners, both organizations and individuals, implementing training and other data literacy activities in their respective countries and regions. Members of School of Data work to empower civil society organizations (CSOs), journalists, civil servants and citizens with the skills they need to use data effectively in their efforts to create better, more equitable and more sustainable societies. Over the past four years, School of Data has succeeded in developing and sustaining a thriving and active network of data literacy practitioners in partnership with our implementing partners across Europe, Latin America, Asia and Africa.
They are published on our website and are widely reused both by our network and beyond, benefiting thousands of people around the world. Additionally, we have trained over 6500 people through our tailored training events and mentored dozens of organisations to become tech-savvy and data-driven. 133 | 134 | ---- 135 | This proposal and further details are available at https://github.com/school-of-data/r-consortium-proposal -------------------------------------------------------------------------------- /proposal.md: -------------------------------------------------------------------------------- 1 | # R Consortium Proposal: School of Data Material Development 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ## The context 11 | 12 | The field of data journalism has seen an exponential rise in activity during the past 10 years, in conjunction with the open data movement and the multiplication of user-friendly tools aimed at journalists and other similar data users. With its annual [international conferences](https://www.ire.org/nicar), the growth of university classes and [online courses](http://schoolofdata.org), and sustained funding from traditional journalism supporters and new open data funders, data journalism has become a leading topic in regard to innovation in journalism. 13 | 14 | Defined as a practice [allowing journalists to combine their ability to tell a compelling story with the sheer scale and range of digital information now available](http://datajournalismhandbook.org/1.0/en/introduction_0.html), data journalism is also seen as a new approach to improve the role of media as a government watchdog. Concrete examples of the impact of data journalism are plentiful ([Panama Papers](https://panamapapers.icij.org/), [Migrant Files](http://www.themigrantsfiles.com), [World Bank eviction scandal](https://www.icij.org/project/world-bank), ...) and they all reinforce the urgency to increase the adoption of data journalism methodologies, techniques and tools.
15 | 16 | R is one of them. 17 | 18 | ## The Problem 19 | R has great tools for accessing and analyzing open data sources, which can be hugely beneficial to journalists. Some leading newsrooms, such as ProPublica and the New York Times, do use R in their data journalism workflow, to great effect. However, beyond these obvious names, there is a low level of awareness of R in the journalism community. If we also take into account the low availability of specialized R learning content -- especially in languages other than English -- it appears that a substantial effort is needed to reach more journalists across the world. 20 | 21 | This problem is compounded by several other factors that we explored in a small worldwide survey conducted in preparation for this proposal. It appeared that, among the journalists who had never taken an R course, the main reason given was the lack of learning content addressing their specific needs. Another substantial portion had never heard of R (see [our summary on Github](https://github.com/school-of-data/r-consortium-proposal/blob/master/R%20course%20survey%20responses/R%20course%20survey%20responses.md)). This strengthens our main hypothesis: there is a need for learning content specially designed for journalists. 22 | 23 | 24 | 25 | ## The Plan 26 | 27 | With the help of the survey results (see [our summary](https://github.com/school-of-data/r-consortium-proposal/blob/master/R%20course%20survey%20responses/R%20course%20survey%20responses.md)), we refined our initial idea of creating journalism-specific learning content into a series of learning modules organised into tracks and based on real projects. While the early introductory modules will be fully written by our team in order to fully tailor them to journalists, we plan to use the tracks as a way to curate existing R learning content available online, making it contextually useful for journalists.
When necessary we will also translate this content to make it useful for non-English speakers and adapt the data examples for national contexts. All our content will be published on GitHub and the School of Data website, under Creative Commons and MIT licences. 28 | 29 | The four languages/countries (French/France, German/Switzerland, English/Ghana and Spanish/Costa Rica) were chosen because they are the countries where the proposal writers live. In the future we plan to extend the materials to other languages/countries using the School of Data network. 30 | 31 | 32 | #### Goals 33 | 34 | - Produce introductory learning content about R tailored for journalists in four languages: English, French, Spanish and German 35 | - Design a content format that encourages journalists to learn without feeling overwhelmed 36 | - Select and organize existing learning resources about R to direct journalists to them 37 | - Optional: Organize workshops in four countries: Ghana, France, Costa Rica and Switzerland 38 | 39 | #### Deliverables 40 | 41 | - Several modules about learning R for journalists, from "why would I want to use R?" to "using R in a newsroom workflow", organised in learning tracks 42 | - A simple GitHub-based website displaying the content in an attractive format 43 | - Optional: Training plan for in-person workshops 44 | - Optional: Slides for in-person workshops 45 | 46 | 47 | #### Timeline 48 | 49 | ![](README_files/figure-html/unnamed-chunk-2-1.png) 50 | 51 | 52 | 53 | #### Likely failure modes and how to recover 54 | As we will prepare the material in a modular way, it will not be a big problem if we do not manage to prepare as much material as we now anticipate. In this case we will have fewer modules, but their quality will be ensured. Our content being open and reusable, anyone from the School of Data community will be able to extend the content beyond our initial work. 55 | 56 | 57 | ## How Can The ISC Help?
58 | 59 | 60 | 61 | We would like to request financial support of 11200 USD from the ISC to develop material for journalists who would like to learn R. 62 | The School of Data can support the material development by providing guidance for the writing, as well as communication and maintenance. 63 | The development costs can be split up as follows: 64 | 65 | |Work | Funding requested| School of Data co-funding| 66 | |:-----------------------------------------------------|-----------------:|-------------------------:| 67 | |Writing sprint | 5000| 1000| 68 | |Translation | 1500| 0| 69 | |Integration sprint (interactive learning environment) | 1500| 0| 70 | |Test workshop | 1000| 0| 71 | |Iteration workshop | 1000| 0| 72 | |Development and design sprint (website) | 1200| 0| 73 | |Communication and outreach | 0| 1000| 74 | |Website maintenance | 0| 1000| 75 | 76 | 77 | If additional support is possible, we would like to organize workshops in Ghana, Costa Rica, France and Switzerland. The financial support needed is 78 | 7000 USD and can be split up as follows: 79 | 80 | |Work | Funding requested| School of Data co-funding| 81 | |:----------------------------------------------------|-----------------:|-------------------------:| 82 | |Workshop room | 800| 0| 83 | |Food | 200| 0| 84 | |Communication and outreach | 2000| 1000| 85 | |Workshop content planning | 1000| 0| 86 | |Workshop material (slides, etc) | 1000| 0| 87 | |Time on workshop | 1000| 0| 88 | |Workshop documentation based on the learning content | 1000| 0| 89 | 90 | 91 | ## Dissemination 92 | 93 | The School of Data has a broad community with many members working in journalism. 94 | 95 | We will promote the content within and beyond our community, by sending out newsletters, [tweets](https://twitter.com/SchoolOfData) and information on the main [School of Data website](https://schoolofdata.org/) as well as the websites of the chapters in the different countries.
We will publish the materials under a CC-BY license and make it easy for others to build upon the content and extend it. 96 | 97 | ## The team 98 | 99 | #### The people 100 | 101 | - [Heidi Seibold](http://www.ebpi.uzh.ch/en/aboutus/departments/biostatistics/teambiostats/seibold.html) is a PhD student in Biostatistics in Switzerland and co-organizer of the Zurich R User Group. 102 | - [Camila Salazar]() is a Costa Rican data journalist working at the leading newspaper La Nación. 103 | - [David Opoku]() is the Africa Lead at Open Knowledge International, focused on capacity building. David is from Ghana and is a proficient R user. 104 | - [Cedric Lombion](http://schoolofdata.org/team) is a member of School of Data's International Coordination Team, responsible for Program Design and Implementation. 105 | - [Samuel Goëta]() is a French sociology researcher, data trainer and co-founder of Open Knowledge France, specializing in open data. 106 | - [Joel Gombin]() is a French political scientist known for his detailed analysis of the political party National Front using R. 107 | 108 | #### The organisation 109 | 110 | School of Data is a network of data literacy practitioners, both organizations and individuals, implementing training and other data literacy activities in their respective countries and regions. Members of School of Data work to empower civil society organizations (CSOs), journalists, civil servants and citizens with the skills they need to use data effectively in their efforts to create better, more equitable and more sustainable societies. Over the past four years, School of Data has succeeded in developing and sustaining a thriving and active network of data literacy practitioners in partnership with our implementing partners across Europe, Latin America, Asia and Africa.
111 | 112 | Our network includes [13 organisations across the world and a total of 101 active individuals](https://schoolofdata.carto.com/viz/20b9844e-7c7b-11e6-afd1-0e05a8b3e3d7/public_map), which all contribute to School of Data key programs: the Fellowship, the Curriculum and Member Support. Together, we have produced or are in the process of producing dozens of articles, lessons and hands-on tutorials on how to work with data. They are published on our website and are widely reused both by our network and beyond, benefiting thousands of people around the world. Additionally, we have trained over 6500 people through our tailored training events and mentored dozens of organisations to become tech-savvy and data-driven. 113 | 114 | ---- 115 | This proposal and further details are available at https://github.com/school-of-data/r-consortium-proposal 116 | -------------------------------------------------------------------------------- /proposal.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/proposal.pdf -------------------------------------------------------------------------------- /proposal_files/figure-html/unnamed-chunk-2-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/proposal_files/figure-html/unnamed-chunk-2-1.png -------------------------------------------------------------------------------- /protocol_call_1.md: -------------------------------------------------------------------------------- 1 | # Protocol of the call January 3 2017 2 | 3 | People: 4 | - Camila Salazar 5 | - David Opoku 6 | - Joel Gombin 7 | - Malick Lingani 8 | - Samuel Goeta 9 | - Heidi Seibold 10 | 11 | Pad: https://edupad.ch/Bj59GTUv4G 12 | 13 | 14 | ## Goals 15 | 16 | - Target journalists
(maybe CSOs / people who work on advocacy issues, depending on similarity of wishes to journalists) 17 | who have some experience in working with data. 18 | - Create contextualised resources, specific to journalists' needs (adapt datasets and examples) 19 | - Languages: French, Spanish, German, English 20 | - Material should be modular, so it can be reused 21 | - Material should be focused on self-learning (possibly also usable in workshops) 22 | - We want to reuse existing material/infrastructure, if possible 23 | - First section of material should convince journalists to use R (low entry barriers, no installations at first) 24 | - Test the material in workshops (in different countries around the world) 25 | 26 | 27 | ## TODOs and responsibilities: 28 | 29 | Before next call (end of January): 30 | 31 | - Samuel / Joel: Check out existing material (https://rddj.info/, datactivist, Joel => mapping capabilities) 32 | - Camila / David / Joel: What do DDJs need? 33 | - Heidi: Costs, what do we need money for?
34 | - David: Timeline for material creation 35 | - Joel: check out possible forms of the material: 36 | RStudio instance (cost) / [datacamp light](https://github.com/datacamp/datacamp-light); 37 | consider limited internet access in some countries 38 | (avoid heavy files and the need to stay connected, for developing countries) 39 | 40 | In February: 41 | 42 | - Heidi / Malick: Writing the proposal 43 | 44 | 45 | 46 | ## R consortium proposal info 47 | 48 | - ~10k USD 49 | - Use the [proposal from Software Carpentry](https://github.com/lgatto/SC-ICS-Proposal/blob/master/SC-ISC-proposal.md) as a guide 50 | - Deadline: End of February 51 | - More [here](https://www.r-consortium.org/projects/call-for-proposals) 52 | 53 | 54 | 55 | *Next call: Doodle* 56 | 57 | 58 | -------------------------------------------------------------------------------- /protocol_calls_2017_fall.md: -------------------------------------------------------------------------------- 1 | # 31.10.2017 2 | 3 | ## What is our progress on the project? 4 | See [progress document](https://docs.google.com/document/d/1Q_i7b16EZrtadMW4Cx1csgCxGNm24Putf_u-kMu6aQc/edit?usp=sharing). 5 | 6 | 7 | ## Writing sprint / hackathon 8 | - Heidi will redo the doodle: only weekends ([link](https://doodle.com/poll/7dc4cecct83e3ei5)) 9 | - OpenPizza? 10 | - Let's try to figure out how to pay people 11 | - Heidi sends out an email with the doodle 12 | 13 | ## Organisation of the project next year 14 | - Heidi will start a postdoc and cannot do organisation + helping 15 | - Does anyone want to lead the project starting 2018? 16 | - Peter would be interested after solving his primary employment problem.
17 | - If we don't find anyone, Cedric and David would also take over (this is not our preferred solution) 18 | 19 | # 17.10.2017 20 | 21 | ## Reorganisation of the project 22 | 23 | ### [Communication in gitter](https://gitter.im/mozilla/open-leadership-training) 24 | 25 | ### [Organisation Document](https://docs.google.com/document/d/1Q_i7b16EZrtadMW4Cx1csgCxGNm24Putf_u-kMu6aQc/edit?usp=sharing) 26 | 27 | ### [GitHub issues](https://github.com/school-of-data/r-consortium-proposal/issues) 28 | 29 | 30 | ## Moving forward 31 | 32 | 1. Where are we on the data recipes? 33 | 2. Where are we on the skills lessons? 34 | 35 | 36 | # 03.10.2017 37 | 38 | ## Report from summer camp (Peter) 39 | Very educational, 30 people from 20 countries 40 | 41 | Camila, an R expert from Turkey, and Peter met 42 | 43 | - Keep in mind that data journalists are not very computer literate 44 | - Not advanced Excel users 45 | - Need to explain each step in depth; much more than now (maybe not all of the material) 46 | - Peter got the names of some people who would be interested in testing the material 47 | 48 | 49 | ## Discussion based on Peter's report 50 | 51 | - Recipes should be self-consistent and not refer to other recipes. 52 | - Referring to the very short skills lessons should be o.k., but a recipe should mostly be self-consistent. 53 | - Idea: use RStudio cheatsheets in skills lessons 54 | 55 | 56 | ## Progress 57 | 58 | ### Making it easy for people to navigate the project (Heidi) 59 | #### README and CONTRIBUTING 60 | Heidi created a [README.md](https://github.com/school-of-data/r-consortium-proposal/blob/master/README.md) and a [CONTRIBUTING.md](https://github.com/school-of-data/r-consortium-proposal/blob/master/CONTRIBUTING.md) to help interested people figure out how to help.
61 | 62 | #### Templates 63 | The templates should make it easy to contribute: 64 | ``` 65 | devtools::install_github("school-of-data/r-consortium-proposal", 66 | subdir = "r-package") 67 | learnr::run_tutorial("en-recipe-template", package = "ddj") 68 | learnr::run_tutorial("en-skills-template", package = "ddj") 69 | ``` 70 | 71 | ### General intro lesson (Camila) 72 | - Intro lesson on data analysis (es) 73 | - Intro lesson on data viz (es) 74 | 75 | 76 | ## TODOs until next Tuesday 77 | 78 | - Samuel: works on his recipe, asks Joel if he still wants to do a recipe on data from his article, puts RStartHere in "further reading", sets up a call for next week 79 | - Heidi: create some issues for hacktoberfest https://hacktoberfest.digitalocean.com/ 80 | - Peter: contact data journalist from Turkey, asking for data and/or code 81 | 82 | 83 | # 05.09.2017 84 | 85 | ## Contract 86 | The contract with the R consortium asks in exhibit A to define deliverables/milestones and their due dates to get payments. As it's a legally binding document that I will sign on behalf of OK France, I just want to make sure we agree on it. 87 | We agreed that, as the project is moving on slowly (but still moving on), we should not have deadlines that are not reachable. We need to take time to do it properly, so here is the proposed schedule: 88 | Writing of the materials: end-2017 89 | Translation of the materials: March 2018 90 | Design, development and launch of the website: June 2018 91 | If it's ok with you guys by the end of the week, I will send it to the R consortium. 92 | 93 | ## Guidelines 94 | Peter suggested that we have guidelines for harmonising the materials and setting the expectations of what would be a good recipe in terms of intended audience, expectations, structure, presentation… 95 | We also mentioned that recipes should have a further reading section (instead of the learning tracks in the initial proposal) 96 | Heidi will start drafting the guidelines with my help.
97 | 98 | ## Basic skills 99 | Peter suggested that we should start with a basic skills lesson, something very basic to bring people on board with R. 100 | To be explored further 101 | 102 | ## Call for proposals 103 | After the guidelines are agreed, we might send out a call for proposals to get maybe 2 additional recipes. 104 | To be explored further 105 | 106 | 107 | -------------------------------------------------------------------------------- /protocol_calls_2017_spring.md: -------------------------------------------------------------------------------- 1 | # [04.04.2017](https://hackmd.io/CwEwDATBDGFgtAZgOwDNj2I4BDeAjADlT0TBwE4xhp8cBGEVIA==#) 2 | 3 | 4 | ### Who receives the money? 5 | Samuel: **OK France** should a priori not take a fee. I asked for confirmation. If no answer within the next 24h, we should consider it agreed. 6 | 7 | **-> Transfer money to OK France, Heidi makes sure the money gets there** 8 | 9 | 10 | ### What tools should we use to track time and progress? 11 | http://www.toggl.com/ 12 | 13 | 14 | ### Content planning 15 | - First ideas in GitHub [document](https://github.com/school-of-data/r-consortium-proposal/blob/master/material/ideas/course_content.md). 16 | - Data pipeline 17 | - What content around R does exist (we can then sort it along the data pipeline) 18 | - What are the key questions that matter for journalists: it's not just 'how to analyse data with R' but 'when do I use R in my workflow', 'how can I make R work with other tools used by my colleagues' etc.
19 | - Make sure we ask journalists about their opinions along the way 20 | (Joel will give a talk at the French R conference and present our project there) 21 | 22 | **-> Heidi talks to Timo about what we can reuse from the RDDJ website** 23 | **-> Camila adds the general learning objectives** 24 | **-> R users add details on how to achieve objectives** 25 | 26 | 27 | 28 | ### Write blog post announcing the project 29 | - Include a summary of the results from the previous survey 30 | 31 | **-> Camila sets up the blog post** 32 | 33 | 34 | # [21.03.2017](https://edupad.ch/MmwmFyiZ6z) 35 | ### Paragraph for R consortium blog 36 | Everyone likes it so far :) Comments still welcome. 37 | Heidi will send it out tomorrow. 38 | 39 | ### Who receives the money? 40 | OK France would receive it but take management fees (amount unclear). 41 | **-> Samuel will ask Pierre** 42 | 43 | What happens with the co-funding from School of Data? 44 | **-> Samuel will ask Cedric** 45 | 46 | ### How to move forward? / Next steps? / Responsibilities? 47 | Start with thinking about content now 48 | What R content do we teach in the course? 49 | What are interesting databases for different countries (538: https://github.com/rudeboybert/fivethirtyeight & https://github.com/fivethirtyeight/data)? 50 | 51 | Gather somewhere (summer camp?) or work online? 52 | Work at the same time -> 2 days working 53 | Writing sprint (remote/online): 24.4. -- 28.4. 54 | France, Switzerland - 17 h 00 CET 55 | Ouagadougou, Burkina Faso - 16 h 00 GMT 56 | San Jose, Costa Rica - 10 h 00 CST 57 | 58 | Until the end of the project: Quick calls at 17 h CET on Tuesdays + update each other on Slack 59 | 60 | ### Who gets paid for what? 61 | Everyone writes down their hours of work. 62 | At the end we divide the budget by the total number of hours worked and distribute accordingly.
63 | **-> Heidi and Samuel check if there are good tools to track time.** 64 | 65 | 66 | # [14.02.2017](https://edupad.ch/n9Qxo2U9Xe) 67 | ### Report on TODOs / what has been achieved 68 | 69 | - Samuel / Joel: Check out existing material 70 | + https://rddj.info 71 | + Great but it's a catalogue 72 | + Not everything is openly licensed (e.g. the R for Excel users book is not open) 73 | + We need to create tracks, but which tracks (e.g. https://bento.io/)? Tracks are contextualised learning: they give you an angle (for example R in the radio). 74 | + Find angles that are relevant to journalists 75 | + Examples are not specifically for journalists 76 | + Usable 77 | + Use other people's stuff, if there is already something 78 | - Camila / David / Joel: What do DDJs need? 79 | + https://docs.google.com/forms/d/1IO3BFrZFb8C1QcaaKvCbo902D-9J2xHbmmCqf_ceIoQ/prefill 80 | + Send out this week 81 | + Summarize until Friday February 3 82 | - Heidi: Costs, what do we need money for? 83 | + https://docs.google.com/spreadsheets/d/1I_EgWyiLUpF9-RpF5hlJUFiXgy1Wsunj5KWFpu2qhtk/edit?usp=sharing 84 | + Test workshop at Open Knowledge festival? 85 | - David: Timeline for material creation 86 | + Info about material needed 87 | - Joel: check out possible forms of the material: 88 | + RStudio instance (cost) / datacamp light; 89 | + consider limited internet access in some countries (avoid heavy files and the need to stay connected, for developing countries)? 90 | - Cedric: draft proposal 91 | + https://schoolofdata.slack.com/archives/r-projects/p1483521500000030 92 | 93 | 94 | ### Next steps 95 | 96 | Send survey 97 | Summarize survey 98 | Update timeline -> update costs 99 | Check if journalists' wishes can already be addressed with existing material 100 | Which tracks are needed?
101 | Write the outline of the proposal (pinned stuff from Cedric)
102 | 
103 | 
104 | Deadline: end of February
105 | 
106 | 
--------------------------------------------------------------------------------
/protocol_calls_2018_spring.md:
--------------------------------------------------------------------------------
1 | # 15.01.2018
2 | 
3 | ### Who takes responsibility for what?
4 | - Heidi will still be involved (she will keep maintaining the technical documentation) but can no longer do all of the organisation.
5 | - General coordination (organising calls, writing reports, clarifying the timeline, ...): ScoDa can handle this; someone might be recruited (probably end of January).
6 | 
7 | ### How to finish the recipes and skills lessons?
8 | - Stick to the naming guidelines for [recipes](https://github.com/school-of-data/r-consortium-proposal/blob/master/r-package/inst/tutorials/en-recipe-template/en-recipe-template.Rmd#naming-conventions) and [skills lessons](https://github.com/school-of-data/r-consortium-proposal/blob/master/r-package/inst/tutorials/en-skills-template/en-skills-template.Rmd#naming-conventions).
9 | - Please use no capital letters, spaces or special characters in file names.
10 | - Add an RData version of all raw data [here](https://github.com/school-of-data/r-consortium-proposal/tree/master/r-package/data). We need this for learnr to work without problems. Also, we only "pretend" to load the actual data in whatever format it comes in, but really load only the RData files (this makes everything MUCH faster). For an example, check the [intro](https://github.com/school-of-data/r-consortium-proposal/blob/master/r-package/inst/tutorials/en-introduction/en-introduction.Rmd). You can also ask Peter or Heidi for help.
11 | 
12 | ### How do we organize the testing phase?
13 | - Testing will first focus on individual recipes.
14 | - The [info for contributors](https://github.com/school-of-data/r-consortium-proposal/blob/master/CONTRIBUTING.md#beta-testing) explains what to do as a tester.
15 | - The [issue templates](https://github.com/school-of-data/r-consortium-proposal/issues/new) help with getting the feedback in an ordered way. To change the template, edit [this document](https://github.com/school-of-data/r-consortium-proposal/blob/master/ISSUE_TEMPLATE.md).
16 | :::info
17 | 💡 [shinyapps.io](https://www.shinyapps.io) should easily host learnr tutorials.
18 | More info: https://rstudio.github.io/learnr/publishing.html#shiny_server
19 | :::
20 | 
21 | - TODO: update the info for contributors so that people understand what types of testers we are looking for.
22 |   - Basic R users (who can already run a bit of R code).
23 |   - People who have R + RStudio installed.
24 |   - We can get feedback from complete beginners later, once the material is on a website.
25 | - Note: the tutorials are in different languages; say which tutorial is in which language.
26 | - TODO: update the issue template with the question "What is your R skill level so far?"
27 | - TODO: schedule a call on "*How to get the learnr tutorials to work in the R package*" with Camila and Gibran.
28 | 
29 | 
30 | 
31 | 
--------------------------------------------------------------------------------
/r-consortium-proposal.Rproj:
--------------------------------------------------------------------------------
1 | Version: 1.0
2 | 
3 | RestoreWorkspace: Default
4 | SaveWorkspace: Default
5 | AlwaysSaveHistory: Default
6 | 
7 | EnableCodeIndexing: Yes
8 | UseSpacesForTab: Yes
9 | NumSpacesForTab: 2
10 | Encoding: UTF-8
11 | 
12 | RnwWeave: Sweave
13 | LaTeX: pdfLaTeX
14 | 
--------------------------------------------------------------------------------
/r-package/.Rbuildignore:
--------------------------------------------------------------------------------
1 | ^.*.rdx
2 | ^.*.rdb
3 | inst/tutorials/en-introduction/en-introduction_cache
4 | inst/tutorials/en-introduction/en-introduction_files
5 | 
--------------------------------------------------------------------------------
/r-package/DESCRIPTION:
--------------------------------------------------------------------------------
1 | Package: ddj
2 | Title: Data Journalism R Learning Materials
3 | Date: 2017-09-01
4 | Version: 0.0-1
5 | Authors@R: c(person(given = "Heidi", family = "Seibold",
6 |        role = c("aut", "cre"), email = "Heidi.Seibold@uzh.ch"),
7 |        person(given = "Peter", family = "Pearman",
8 |        role = c("aut"), email = "pbpearman@gmail.com"))
9 | Description: R learning materials created with learnr. The material focuses on
10 |     teaching journalists and is available in English, German, French and
11 |     Spanish.
12 | Depends: 13 | R (>= 3.1.0), 14 | learnr 15 | Imports: 16 | ggplot2, 17 | dplyr 18 | License: GPL-2 | GPL-3 19 | RoxygenNote: 6.0.1 20 | -------------------------------------------------------------------------------- /r-package/NAMESPACE: -------------------------------------------------------------------------------- 1 | # Generated by roxygen2: do not edit by hand 2 | 3 | -------------------------------------------------------------------------------- /r-package/R/raw_data_introduction.R: -------------------------------------------------------------------------------- 1 | #' Raw data used in introduction data recipe 2 | #' 3 | #' For further details see \code{learnr::run_tutorial("en-introduction", package = "ddj")} 4 | #' @docType data 5 | #' 6 | #' @usage data(introduction_raw) 7 | #' 8 | #' @format Three objects of type \code{data.frame} named results, demographic1 and demographic2. 9 | #' 10 | #' @keywords datasets 11 | #' 12 | #' @aliases demographic1 demographic2 gbr_sp 13 | #' 14 | #' @examples 15 | #' data(introduction_raw) 16 | "results" 17 | 18 | -------------------------------------------------------------------------------- /r-package/data/elic_2016_1.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_1.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_2.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_2.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_3.RData: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_3.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_4.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_4.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_q1_raw.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_q1_raw.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_q2_raw.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_q2_raw.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_q3_raw.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_q3_raw.RData -------------------------------------------------------------------------------- /r-package/data/elic_2016_q4_raw.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/elic_2016_q4_raw.RData -------------------------------------------------------------------------------- 
/r-package/data/introduction_clean.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/introduction_clean.RData -------------------------------------------------------------------------------- /r-package/data/introduction_raw.RData: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/data/introduction_raw.RData -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-introduction/images/Wait_what.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-introduction/images/Wait_what.jpg -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-introduction/images/install_pkgs.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-introduction/images/install_pkgs.png -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/en-recipe-france-presidentialelec-polls.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introduction" 3 | output: learnr::tutorial 4 | runtime: shiny_prerendered 5 | author: Samuel Goëta 6 | --- 7 | 8 | # Introduction 9 | --- 10 | # DEFINE 11 | Polls have become the main instrument to measure the pulse of an electoral campaign. 
They work somewhat like weather forecasts: the closer we get to the moment we are trying to predict, the more precise they become. But they can also be misleading; think of the 2016 Brexit referendum or Trump's unpredicted election.
12 | 
13 | Polls can therefore send blurry signals, information that can confuse the voter or the analyst. This is especially true during the campaign itself, when journalists report several new polls every day, sometimes with conflicting conclusions. For instance, during the US presidential campaign, [RealClearPolitics](https://www.realclearpolitics.com/epolls/latest_polls/president/) referenced about 30 new polls a day.
14 | 
15 | In this complex landscape, visualisations can bring some clarity and help citizens understand the larger trends beyond each individual poll. During the French presidential campaign, several news websites produced data visualisations to show the trends behind all the polls published daily by the main polling institutes. For instance, [Le Monde](http://abonnes.lemonde.fr/election-presidentielle-2017/visuel/2017/04/12/que-disent-les-sondages-de-la-presidentielle-2017_5110324_4854003.html) produced this line graph, with a background showing the variation between polling institutes and the larger trends in the electorate's voting intentions.
16 | ![](./img/lemonde.png)
17 | The [Huffington Post](http://elections.huffingtonpost.com/pollster/france-presidential-election-round-1) produced a similar visualisation, with a dot for each data point in a poll and a trend line.
18 | ![](img/huffpo.png)
19 | Since this kind of visualisation is very useful for understanding the trend in an election, we as journalists want to be able to produce this graph and adapt it to contexts other than the French election.
20 | This is what we'll do in this recipe! We'll show you step by step how to produce the above visualisations.
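To set expectations, here is a bare-bones preview of the chart type we are after: one point per poll plus a smoothed trend line per candidate. The candidate names and numbers below are invented purely for illustration; the real poll data is scraped later in this recipe.

```r
# Preview of the target chart, using made-up data (illustration only).
library(ggplot2)

set.seed(1)
toy_polls <- data.frame(
  date       = rep(seq(as.Date("2017-03-01"), by = "week", length.out = 8), times = 2),
  candidate  = rep(c("Candidate A", "Candidate B"), each = 8),
  percentage = c(24 + rnorm(8), 20 + rnorm(8))
)

ggplot(toy_polls, aes(x = date, y = percentage, colour = candidate)) +
  geom_point(alpha = 0.7) +   # one dot per poll
  geom_smooth(se = FALSE) +   # smoothed trend line per candidate
  labs(x = "Date of poll", y = "Voting intention (%)")
```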
21 | 
22 | 
23 | ---
24 | # FIND
25 | 
26 | To find this data, a journalist might record every newly published poll in a spreadsheet, but this could prove quite burdensome. Fortunately for us, the Wikipedia community has done truly amazing work reporting every new poll on a single page with a normalised template, covering the primary elections and both rounds of the presidential campaign.
27 | ![](./img/francepolls.png)
28 | Even better, there is an [entire category dedicated to opinion polling for elections](https://en.wikipedia.org/wiki/Category:Opinion_polling_for_elections) in which you can find opinion polling data for elections in the UK, Poland, Peru, Kenya, India, Iran… This is a global effort to put opinion polling in perspective.
29 | 
30 | 
31 | You don't see the election going on in your country? Why don't you start reporting the latest opinion polls on Wikipedia? It's easy to do, very useful for the public, and you might get help from other Wikipedia contributors.
32 | 
33 | ---
34 | # GET
35 | 
36 | Now that we know where we can find the data, we need to actually be able to use it. Note that the data sits on a webpage, which means you need to extract it from the webpage into a dataset.
37 | 
38 | One straightforward way would be to simply copy the table into a spreadsheet. This actually works:
39 | 
40 | ![](./img/copy.gif)
41 | 
42 | But, while this approach is convenient once the campaign is over, it can be overwhelming during an ongoing race. As a journalist, whenever you want to update your graph, you will need to redo the whole process outside of R and import the data back.
43 | 
44 | An alternative is to practice web scraping, which is the fancy name data scientists gave to extracting data from websites. RStudio has developed a package named `rvest`, which allows users to extract data from a webpage and import it directly into a data frame. So let's install and launch this package!
45 | ```{r}
46 | install.packages("rvest")
47 | library(rvest)
48 | ```
49 | 
50 | OK! The first thing to do with `rvest` is to import the whole HTML page in which your data is nested. This is done with `<-` as the assignment operator, as when you generally import data in R, and with `read_html`, which reads the HTML source of the URL you specify.
51 | ```{r}
52 | polls <- read_html("https://en.wikipedia.org/wiki/Opinion_polling_for_the_French_presidential_election,_2017")
53 | ```
54 | 
55 | The entire source code of the Wikipedia page on the French presidential election is now imported into R. In this visualisation, however, we will focus on the first-round data, similarly to what Le Monde and the Huffington Post did above. This is still done with the `rvest` package and its `html_node` function, which lets you easily extract pieces out of HTML documents using XPath (a query language for selecting nodes from an XML document such as a webpage) and CSS selectors (a way of targeting specific HTML elements based on the CSS styling applied to them).
56 | 
57 | To identify the XPath, you can install a Chrome extension named [SelectorGadget](http://selectorgadget.com/), which will help you find the XPath and the CSS selector of the specific element we want to extract.
58 | 
59 | But there's an even easier way: you can just use your browser. Right-click on the table you want to extract (for us, the first-round opinion polls with verified candidates, i.e. those who were admitted to run) and select Inspect in Chrome or Inspect Element in Firefox. Below, you will see the elements in which the table is nested in the HTML source code. Select `table.wikitable`, which contains the entire table. This will highlight a line in blue in the source code: right-click on it, select Copy, then Copy selector.
60 | 
61 | ![](./img/css.gif)
62 | 
63 | You should get the following CSS selector: `#mw-content-text > div > table:nth-child(66)`.
Note that this might change if the page structure is updated.
64 | 
65 | Then you need to tell `rvest` about your discovery by passing the CSS selector to the `html_node` function, which extracts specific elements defined by their position. The `html_table` function will then import the data into a data frame we will name `polls_round1`.
66 | 
67 | ```{r}
68 | library(tidyverse)
69 | polls_round1 <- polls %>% 
70 |   html_node("#mw-content-text > div > table:nth-child(66)") %>% 
71 |   html_table(header = TRUE, fill = TRUE) # header = TRUE uses the first row as a header and fill = TRUE automatically fills rows with NAs when there are fewer than the maximum number of columns.
72 | ```
73 | 
74 | 
75 | ---
76 | # Verify
77 | It's always important to check the quality of the data. In this example, questions you can ask yourself include: who are the listed polling institutes? Who funded these polls? What were the polling methods? Is the sample large enough to represent the country's population?
78 | 
79 | ---
80 | # Clean
81 | Once you have checked the quality of the data, you will need to clean it before using it. Let's have a look at `polls_round1`, the data frame containing the poll results. You can see that the first five rows are redundant with the column header and the last row contains methodological information. Let's remove them to make the data processable, by creating a vector (a sequence of data elements) with `c` and dropping those rows with `-`:
82 | ```{r}
83 | polls_round1 <- polls_round1[-c(1, 2, 3, 4, 5, 94), ]
84 | ```
85 | This is better, but we lost the column names. So let's set all the column names with the `colnames` function.
86 | ```{r}
87 | colnames(polls_round1) <- c("Poll source", "Fieldwork date", "Sample size", "Abstention", "Arthaud", "Poutou", "Melenchon", "Hamon", "Macron", "Lassalle", "Fillon", "Dupont-Aignan", "Asselineau", "Le Pen", "Cheminade") # remove the parties from the column names
88 | ```
89 | Let's look again at our data frame in the "Environment" pane of RStudio. If you look at the values for each candidate (Arthaud, Lassalle or Macron, for instance), you will see `chr`, which means R stores them as character strings and does not treat them as numbers. The reason is that the values still contain percentage signs, which R cannot process as numbers, so let's remove them.
90 | 
91 | To do this, we will use `mutate_at`, which is [one of the verbs in `dplyr`](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf) for wrangling data. We will use `vars` to select the columns containing the data for abstention and each candidate, then call the function `parse_number` via `funs` to keep only the number that comes before the percentage sign. Finally, we will store the data back in `polls_round1`.
92 | ```{r}
93 | polls_round1 <- polls_round1 %>% mutate_at(vars(`Arthaud`:`Cheminade`), funs(parse_number), na = character()) # strip the percentage signs so the values become numbers
94 | ```
95 | We also need to parse the messy "Fieldwork date" column into proper dates and reshape the data into long format:
96 | ```{r}
97 | #install.packages("lubridate")
98 | #install.packages("rex")
99 | library(lubridate)
100 | library(rex)
101 | rex_mode() # load the rex shortcuts into memory (see the help page)
102 | get_date <- rex( # build the regex from a readable, understandable language
103 |   maybe(numbers %>% between(1, 2), # optional: one or two digits followed by a dash
104 |   "–"),
105 |   capture(
106 |     name = "date", # the group we want to extract
107 |     numbers %>% between(1, 2), # one or two digits
108 |     space, # then a space
109 |     letters %>% n_times(3), # three letters for the month
110 |     space,
111 |     "20", one_of(1:5), number # the year
112 |   )
113 | )
114 | 
115 | date <- re_matches(polls_round1$`Fieldwork date`, get_date)$date # create the date vector
116 | 
117 | polls_round1 <- polls_round1 %>% 
118 |   mutate(date = date %>% dmy()) # add the date column
119 | 
120 | polls_round1_long <- polls_round1 %>% gather(Candidates, Percentage, Abstention:Cheminade) # reshape to long format with gather
121 | 
122 | save(polls_round1_long, file = "./polls_round1_long.Rdata")
123 | 
124 | ```
125 | 
126 | 
127 | ---
128 | # Visualize
129 | ```{r}
130 | #devtools::install_github("hrbrmstr/hrbrthemes")
131 | library(hrbrthemes)
132 | 
133 | polls_round1_long %>% 
134 |   filter(Candidates != "Abstention") %>% 
135 |   ggplot() +
136 |   geom_point(mapping = aes(x = date, y = Percentage, color = Candidates), alpha = 0.7) +
137 |   geom_smooth(aes(date, y = Percentage, color = Candidates)) +
138 |   labs(x = "Date of poll",
139 |   y = "Percentage of candidate in poll",
140 |   title = "Polls during the 2017 Presidential election in France",
141 |   subtitle = "Evolution of polls during the month preceding the first round of the 2017 Presidential election in France",
142 |   caption = "Source : Wikipedia")+
143 |   theme_ipsum() +
144 | 
theme(legend.position = "right") # note: theme() adjustments must come after a +
145 | ```
146 | ---
147 | # Present
148 | Maybe we can make it interactive? The `plotly` package, for example, can turn a ggplot into an interactive chart with `ggplotly()`.
--------------------------------------------------------------------------------
/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/category.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/category.png
--------------------------------------------------------------------------------
/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/copy.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/copy.gif
--------------------------------------------------------------------------------
/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/css.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/css.gif
--------------------------------------------------------------------------------
/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/francepolls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/francepolls.png
--------------------------------------------------------------------------------
/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/huffpo.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/huffpo.png -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/lemonde.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-france-presidentialelec-polls/img/lemonde.png -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-recipe-template/en-recipe-template.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Recipe: Data Recipe Name" 3 | output: learnr::tutorial 4 | runtime: shiny_prerendered 5 | --- 6 | 7 | ```{r setup, include=FALSE} 8 | library("learnr") 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | 13 | ## Introduction 14 | 15 | This template shows you how to create a data recipe. You can also look at 16 | existing recipes for further help. 17 | 18 | ### Naming conventions 19 | Recipe files should be named `language-recipe-name.Rmd` (example `de-recipe-dual-use.Rmd`) 20 | 21 | - `language` should be `en` for English, `de` for German, `es` for Spanish and `fr` for French. File names should all be in English in order to know which lessons belong together. 22 | - `recipe` is the same word for all recipes (look at the skills lesson template for skills lesson naming conventions). 23 | - `name` should be a descriptive name (e.g. `brexit` for analysis of brexit data). 
24 | 
25 | Titles of recipes should be `Data Recipe: Data Recipe Name` with upper case letters in English,
26 | `Datenrezept: Name` in German, ??? in Spanish and ??? in French.
27 | 
28 | ### What happens in the following?
29 | The following chapters are the steps of
30 | the School of Data Pipeline (see https://schoolofdata.org/methodology/).
31 | In this template the steps of the pipeline are explained in the respective
32 | sections. Please explain what a journalist would do for the
33 | given example in the given step.
34 | 
35 | ![](images/Data-pipeline-v2-EN.png)
36 | 
37 | ## DEFINE
38 | 
39 | You can add interactive components such as questions
40 | (for help, see https://rstudio.github.io/learnr/questions.html)
41 | 
42 | ```{r letter-a, echo=FALSE}
43 | question("Did you get what I was saying?",
44 |   answer("No"),
45 |   answer("Maybe"),
46 |   answer("Yes", correct = TRUE)
47 | )
48 | ```
49 | 
50 | 
51 | > Define: Data-driven projects always have a "define the problem you're trying to solve" component. It's in this stage that you start asking questions and come across the issues that will matter in the end. Defining your problem means going from a theme (e.g. air pollution) to one or multiple specific questions (has bikesharing reduced air pollution?). Being specific forces you to formulate your question in a way that hints at what kind of data will be needed. Which in turn helps you scope your project: is the data needed easily available? Or does it sound like some key datasets will probably be hard to get?
52 | 
53 | 
54 | ## FIND
55 | 
56 | > Find: While the problem definition phase hints at what data is needed, finding the data is another step, of varying difficulty.
There are a lot of tools and techniques to do that, ranging from a simple question on your social network, to using the tools provided by a search engine (such as Google search operators), open data portals or a Freedom of Information request querying about what data is available in that branch of government. This phase can make or break your project, as you can't do much if you can't find the data! But this is also where creativity can make a difference: using proxy indicators, searching in non-obvious locations... don't give up too soon!
57 | 
58 | ## GET
59 | 
60 | > Get: Getting the data from its initial location to your computer can be short and easy or long and painful. Luckily, there are plenty of ways of doing that. You can crowdsource using online forms, you can perform offline data collection, you can use some crazy web scraping skills, or you could simply download the datasets from government sites, using their data portals or through a Freedom of Information request.
61 | 
62 | We try to make our examples as close to reality as possible. Thus we use data that people can find online.
63 | They can download it and read in the funny file formats that the data comes in.
64 | Doing that in the learnr environment, however, is not a good idea. It is best to show the learners
65 | what to do, but actually do something else in the background.
66 | 
67 | #### Example:
68 | What the students see:
69 | ```{r}
70 | # read data into R
71 | results <- read.csv("https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/master/material/lessons/results.csv", header = TRUE)
72 | ```
73 | 
74 | What you do in the background (use `echo=FALSE` to hide it; note that this does not show in the html, so **look into the Rmd file** to understand what is happening here!):
75 | ```{r read, echo=FALSE}
76 | data(introduction_raw, package = "ddj")
77 | ```
78 | 
79 | 
80 | You can store the data as `RData` in the R package in the `data` folder and
81 | document it in the `R` folder.
See as an example: `data/introduction_raw.RData` and
82 | `R/raw_data_introduction.R`. I created the `RData` file via:
83 | ```{r, eval=FALSE}
84 | results <- read.csv("https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/master/material/lessons/results.csv", header = TRUE)
85 | 
86 | # download data
87 | download.file("https://github.com/school-of-data/r-consortium-proposal/blob/master/material/lessons/demographics1.xlsx?raw=true", destfile = "demographics1.xlsx")
88 | download.file("https://github.com/school-of-data/r-consortium-proposal/blob/master/material/lessons/demographics2.dta?raw=true", destfile = "demographics2.dta")
89 | library("xlsx"); library("foreign") # packages for reading Excel and Stata files
90 | # read demographic data
91 | demographic1 <- read.xlsx("demographics1.xlsx", sheetIndex = 1)
92 | demographic2 <- read.dta("demographics2.dta")
93 | 
94 | # spatial info
95 | library("raster")
96 | gbr_sp <- getData("GADM", country = "GBR", level = 2)
97 | 
98 | # save intermediate results
99 | save(results, demographic1, demographic2, gbr_sp, file = "../../../data/introduction_raw.RData")
100 | # Note that my working directory is the directory of this template, so I need to go a couple of folders back with ../
101 | ```
102 | 
103 | 
104 | ## VERIFY
105 | 
106 | > Verify: We got our hands on the data, but that doesn't mean it's the data we need. We have to check whether the details are valid, such as the metadata, the methodology of collection, whether we know who organised the dataset and whether it's a credible source. We heard a joke once, but it's only funny because it's true: all data is bad, we just need to find out how bad it is!
107 | 
108 | ## CLEAN
109 | 
110 | The CLEAN and ANALYSE steps will likely be the longest parts with the most R code.
111 | Make the R code interactive by including exercises
112 | (for help, see https://rstudio.github.io/learnr/exercises.html).
113 | 
114 | #### Example:
115 | 
116 | Try printing the names of the `results` data.frame.
117 | ```{r ex_look1, exercise=TRUE, exercise.setup = "read"}
118 | 
119 | ```
120 | 
121 | ```{r ex_look1-solution}
122 | names(results)
123 | ```
124 | Note that you need to add `exercise.setup = "read"` to reference the R chunk
125 | that reads in the data.
126 | 
127 | 
128 | 
129 | > Clean: It's often the case that the data we get and validate is messy. Duplicated rows, column names that don't match the records, values that contain characters which will make it difficult for a computer to process, and so on. In this step, we need skills and tools that will help us get the data into a machine-readable format, so that we can analyse it. We're talking about tools like OpenRefine or LibreOffice Calc and concepts like relational databases.
130 | 
131 | 
132 | ## ANALYSE
133 | 
134 | If the code is more complex, enter the solution in advance.
135 | By clicking a button, the learner can then run the code but still
136 | has the possibility to change it.
137 | 
138 | Example:
139 | ```{r first_plot, exercise = TRUE}
140 | library("ggplot2")
141 | p <- ggplot(mtcars, aes(wt, mpg))
142 | p + geom_point()
143 | ```
144 | 
145 | 
146 | > Analyse: This is it! This is where we get insights about the problem we defined in the beginning. We're gonna use our mad mathematical and statistical skills to interview a dataset like any good journalist. But we won't be using a recorder and a notebook. We can analyse datasets using many, many skills and tools. We can use visualisations to get insights into different variables, we can use programming-language packages, such as pandas (Python), or simply R, we can use spreadsheet processors, such as LibreOffice Calc, or even statistical suites like PSPP.
147 | 
148 | ## PRESENT
149 | 
150 | > Present: And, of course, you will need to present your data. Presenting it is all about thinking of your audience, the questions you set out to answer and the medium you select to convey your message or start your conversation.
You don't have to do it all by yourself; it's good practice to get support from professional designers and storytellers, who are experts at understanding the best ways to present data visually and with words. 151 | 152 | 153 | ## Summary and further reading 154 | 155 | |In this lesson we learned | Further reading | 156 | | ---------------------------------- | -------------------------------------------------------| 157 | |How to do a |[RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/) on **a**. | 158 | | |The [dplyr website](http://dplyr.tidyverse.org/). | 159 | |How to do b |The [ggplot2 website](http://ggplot2.tidyverse.org/) | 160 | |How to do c |...| 161 | 162 | -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-recipe-template/images/Data-pipeline-v2-EN.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/en-recipe-template/images/Data-pipeline-v2-EN.png -------------------------------------------------------------------------------- /r-package/inst/tutorials/en-skills-template/en-skills-template.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Skills Lesson: Skills Lesson Name" 3 | output: learnr::tutorial 4 | runtime: shiny_prerendered 5 | --- 6 | 7 | ```{r setup, include=FALSE} 8 | library("learnr") 9 | knitr::opts_chunk$set(echo = FALSE) 10 | ``` 11 | 12 | ## Introduction 13 | 14 | This is where a very short introduction goes. 15 | 16 | Possible skills lessons could be: 17 | 18 | - Skills Lesson: Reading Data into R 19 | - Skills Lesson: Data Preparation 20 | - Skills Lesson: Visualising Data 21 | 22 | Skills lessons should be very short and simple. 23 | 24 | - Show important packages with 1 or 2 simple examples.
25 | - Link to other pages for *further reading*. 26 | 27 | ### Naming conventions 28 | Skills lesson files should be named `language-skills-name.Rmd` (example `de-skills-reading-data.Rmd`) 29 | 30 | - `language` should be `en` for English, `de` for German, `es` for Spanish and `fr` for French. File names should all be in English in order to know which lessons belong together. 31 | - `skills` is the same word for all skills lessons (look at the recipe template for recipe naming conventions). 32 | - `name` should be a descriptive name (e.g. `visualisation` for a data visualisation skills lesson). 33 | 34 | Titles of skills lessons should be `Skills Lesson: Skills Lesson Name` with upper case letters in English, 35 | `Kompetenzlektion: Name` in German, ??? in Spanish and ??? in French. 36 | 37 | 38 | ## Important packages and functions 39 | 40 | For this skill, the packages `ggplot2` and `dplyr` are especially important. 41 | With them you can do things like 42 | ```{r plot1, exercise = TRUE} 43 | library("ggplot2") 44 | d <- ggplot(diamonds, aes(carat, price)) 45 | d + geom_point(alpha = 1/10) 46 | ``` 47 | and 48 | ```{r summarise, exercise = TRUE, message=FALSE} 49 | library("dplyr") 50 | iris %>% 51 | group_by(Species) %>% 52 | summarise_all(mean) 53 | ``` 54 | 55 | 56 | ## Further reading 57 | 58 | - [RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/) 59 | on **this** and **that**. 60 | - The [dplyr website](http://dplyr.tidyverse.org/).
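A reading-data skills lesson could follow the same pattern. As a minimal sketch (assuming the `readr` package is installed; the CSV is the example results file used elsewhere in this repository):

```r
library("readr")

# read a CSV directly from a URL into a tibble
results <- read_csv("https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/master/material/lessons/results.csv")

# take a quick look at what was loaded
head(results)
```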
61 | 62 | 63 | 64 | -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-Skills-analisis/Analisis con R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Introducción a análisis exploratorio de datos con R" 3 | author: "Camila Salazar" 4 | date: "October 9, 2017" 5 | output: html_document 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | ## ¿Qué vamos a aprender? 13 | 14 | En este tutorial vamos a aprender cómo realizar análisis exploratorio de datos con R. El primer paso para analizar datos es comenzar a explorar las variables de nuestra base de datos para: 15 | 16 | * Encontrar patrones 17 | * Identificar errores 18 | * Plantear nuevas hipótesis o preguntas 19 | * Identificar relaciones entre variables 20 | * Empezar a encontrar respuestas a nuestras preguntas de investigación 21 | 22 | Es importante considerar que este es un primer paso y no corresponde a un análisis estadístico riguroso, pero puede permitirnos encontrar respuestas o guiarnos en el tipo de análisis que queremos realizar. 23 | 24 | ## ¿Qué es análisis exploratorio? 25 | 26 | Lectura recomendada: [Exploratory Data Analysis-Howard Seltman](http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf) 27 | El análisis exploratorio puede ser **gráfico** o **no gráfico** y **univariado** o **multivariado** (normalmente de dos variables). 28 | 29 | * *No gráfico*: Calcula estadísticas descriptivas de las variables 30 | * *Gráfico*: Calcula estadísticas de forma gráfica 31 | * *Univariado*: Analiza una sola variable a la vez 32 | * *Multivariado*: Analiza dos o más variables 33 | 34 | A su vez, cada una de esas divisiones puede subdividirse según los tipos de datos con los que trabajemos: **categóricos** o **numéricos**.
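Como boceto mínimo de esta clasificación (usando el conjunto de datos `iris`, incluido en R base; este ejemplo no forma parte de la lección original):

```r
# Univariado NO gráfico: estadísticas descriptivas de una variable numérica
summary(iris$Sepal.Length)

# Univariado gráfico: distribución de la misma variable
hist(iris$Sepal.Length,
     main = "Distribución de Sepal.Length",
     xlab = "Largo del sépalo (cm)")

# Univariado no gráfico con una variable categórica: tabla de frecuencias
table(iris$Species)
```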
35 | 36 | ## Cargar paquetes 37 | 38 | Lo primero que tenemos que hacer es cargar los paquetes que vamos a utilizar para el análisis. En este caso vamos a usar: 39 | ```{r echo=T, message=FALSE, warning=FALSE} 40 | library(dplyr) 41 | library(ggplot2) 42 | library(readxl) 43 | library(gmodels) 44 | ``` 45 | 46 | (Recordar que si no ha instalado estos paquetes debe correr primero el comando: `install.packages("nombre del paquete")`) 47 | 48 | ## Importar archivos 49 | Antes de comenzar, hay que cambiar el directorio de trabajo y seleccionar la carpeta en donde tenemos nuestros archivos. Esto se hace con el comando `setwd()`. 50 | 51 | Ahora podemos importar los archivos de varias formas. Podemos hacerlo desde el menú de arriba, usando: File -> Import Dataset y seleccionando el tipo de archivo que queremos importar. 52 | 53 | También podemos hacerlo escribiendo el código. Es siempre recomendable asignar el archivo que importamos a un objeto con el símbolo `<-`. Inicialmente vamos a importar unos archivos .xlsx, por lo que usamos el paquete `readxl`, que ya instalamos. 54 | 55 | Los archivos que vamos a usar inicialmente son: 56 | ```{r echo=T, message=FALSE, warning=FALSE} 57 | #Recuerden cambiar el directorio de trabajo 58 | setwd("/Users/Camila/Documents/Curso periodismo datos/2017/Clases/Analisis_R") 59 | encuesta <- read_xlsx("encuesta.xlsx") 60 | encuesta <- tbl_df(encuesta) 61 | ``` 62 | 63 | El archivo contiene información sobre las personas que viven solas en Costa Rica.
Los datos se extrajeron de la Encuesta Nacional de Hogares del 2016 y contienen las siguientes variables: 64 | 65 | * `id`: identificador único 66 | * `REGION`: Región de planificación donde vive la persona (Central, Chorotega, Pacífico Central, Brunca, Huetar Atlántica, Huetar Norte) 67 | * `ZONA`: Zona en la que vive la persona (urbana o rural) 68 | * `Tipo_vivienda`: Tenencia de la vivienda (Propia totalmente pagada, Propia pagando a plazos, Alquilada, En precario, Otra tenencia) 69 | * `Mensualidad`: Mensualidad pagada por la vivienda, cuando esta no es propia pagada 70 | * `M2`: Metros cuadrados de construcción de la vivienda 71 | * `Sexo`: Sexo (Hombre o Mujer) 72 | * `Edad`: Edad de la persona 73 | * `Estado_civil`: Estado civil 74 | * `NivInst`: Nivel de escolaridad: 75 | *0- Sin nivel de instrucción:Persona con ninguna educación, con preparatoria o educación especial sin aprobación de niveles* 76 | *1- Primaria incompleta:Persona de primero hasta quinto grado y primaria con año aprobado no declarado* 77 | *2- Primaria completa:Persona con sexto grado* 78 | *3- Secundaria académica incompleta :Persona de primer año hasta cuarto año de secundaria y secundaria con año aprobado no declarado* 79 | *4- Secundaria académica completa:Persona con quinto año de secundaria, con o sin título de bachiller* 80 | *5- Secundaria técnica incompleta:Persona con primer año hasta quinto año en secundaria técnica y secundaria técnica con año aprobado no declarado.* 81 | *6- Secundaria técnica completa:Persona con secundaria técnica concluida con o sin título de bachiller.* 82 | *7- Educación superior de pregrado y grado:Persona que tiene desde un año hasta tres años en para-universitaria, incluyendo año no declarado, y desde un año hasta seis años de universidad, incluyendo año no declarado* 83 | *8- Educación superior de posgrado:Persona con estudios universitarios de especialidad, maestría o doctorado desde un año hasta seis años, incluyendo año no declarado.* 84 | * 
`Escolari`: Años de escolaridad 85 | * `CondAct`: Condición de actividad (Ocupado, Desempleado, Fuera de la fuerza de trabajo) 86 | * `ingreso`: Ingreso total neto del Hogar 87 | * `np`: Nivel de pobreza (Pobreza no extrema, Pobreza extrema, No pobre) 88 | * `quintil`: Quintil de ingreso per cápita del hogar 89 | 90 | 91 | 92 | ## Explorar los datos 93 | 94 | Una vez que cargamos el archivo podemos comenzar a explorar los datos. (Recordar comandos de [Tutorial de limpieza con R](http://rpubs.com/camilamila/limpieza)). 95 | 96 | ```{r echo=T, message=FALSE, warning=FALSE} 97 | glimpse(encuesta) 98 | ``` 99 | 100 | Con esto podemos ver que tenemos cuatro variables numéricas (dbl) y el resto con caracteres. 101 | 102 | También hay otros comandos exploratorios como: `str()`, `head()`, `tail()`, `class()` 103 | 104 | **Tip** 105 | Podemos convertir las variables de texto a factores (variables con categorías) con el comando `as.factor()`. Por ejemplo, si queremos convertir la variable REGION, el comando sería: 106 | 107 | ```{r echo=T, message=FALSE, warning=FALSE} 108 | encuesta$REGION<-as.factor(encuesta$REGION) 109 | ``` 110 | 111 | Note que el `$` se utiliza para llamar a variables de un dataframe. En el ejemplo anterior, el comando se lee: guarde la variable REGION del dataframe encuesta como un factor. 112 | 113 | ## Variables categóricas 114 | Para las variables categóricas podemos calcular tablas de frecuencia, es decir, ver el número de ocurrencias de cada categoría de la variable. Esto lo hacemos con el comando `table()`. 115 | 116 | ### Frecuencias simples 117 | Entonces, si quisiéramos calcular la frecuencia de la variable ZONA, el comando sería: 118 | ```{r echo=T, message=FALSE, warning=FALSE} 119 | table(encuesta$ZONA) 120 | ``` 121 | Y podemos ver que 447 personas viven en zona rural y 1014 en urbana. 122 | 123 | ### Tablas de contingencia 124 | Ahora, si queremos tabular dos variables, simplemente las separamos por coma.
125 | 126 | Por ejemplo, la frecuencia de Tipo de vivienda según Zona: 127 | ```{r echo=T, message=FALSE, warning=FALSE} 128 | table(encuesta$Tipo_vivienda, encuesta$ZONA) 129 | ``` 130 | 131 | ### Proporciones 132 | Los números absolutos a veces no son útiles para entender los datos, por lo que es mejor utilizar proporciones. Para ello usamos el comando `prop.table()`. 133 | 134 | Por ejemplo, si quisiéramos mostrar la tabla anterior como proporciones, lo que hacemos es ingresar ese comando dentro del comando `prop.table()`. 135 | ```{r echo=T, message=FALSE, warning=FALSE} 136 | prop.table(table(encuesta$Tipo_vivienda, encuesta$ZONA)) 137 | ``` 138 | 139 | En este caso nos muestra los datos como proporciones totales, pero ¿cómo hacemos si queremos ver porcentajes por fila o columna? 140 | 141 | Esto lo hacemos poniendo una coma y luego 1 (filas) o 2 (columnas). 142 | 143 | ```{r echo=T, message=FALSE, warning=FALSE} 144 | #Filas 145 | prop.table(table(encuesta$Tipo_vivienda, encuesta$ZONA),1) 146 | #Columnas 147 | prop.table(table(encuesta$Tipo_vivienda, encuesta$ZONA),2) 148 | ``` 149 | 150 | ### CrossTable() 151 | 152 | Un comando muy útil para simplificar los pasos es `CrossTable()` del paquete `gmodels`. El comando nos permite presentar en una misma tabla los porcentajes por fila o columna y el total de la tabla. 153 | 154 | ```{r echo=T, message=FALSE, warning=FALSE} 155 | CrossTable(encuesta$Tipo_vivienda, encuesta$ZONA) 156 | ``` 157 | 158 | En este caso, el cuadro de arriba nos dice qué es lo que se muestra en la tabla: N (número de observaciones), Chi-square (estadístico, no interesa por ahora), N/Row total (porcentaje por fila), N/Col Total (porcentaje por columna), N/Table Total (porcentaje total). 159 | 160 | También podemos simplificar la tabla para que nos muestre menos resultados, cambiando las opciones: 161 | `prop.r=TRUE`: porcentaje por filas. Si lo ponemos = F no lo muestra. 162 | `prop.c=TRUE`: porcentaje por columnas.
Si lo ponemos = F no lo muestra 163 | `prop.t=TRUE`: porcentaje total de la tabla. Si lo ponemos = F no lo muestra 164 | `prop.chisq=TRUE`: Chi-square. Si lo ponemos = F no lo muestra 165 | 166 | Entonces por ejemplo si solo queremos el porcentaje por fila y columna, la tabla sería: 167 | ```{r echo=T, message=FALSE, warning=FALSE} 168 | CrossTable(encuesta$Tipo_vivienda, encuesta$ZONA, prop.t=F, prop.chisq = F) 169 | ``` 170 | -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-Skills-analisis/encuesta.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/sp-Skills-analisis/encuesta.xlsx -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-recipe-Voto evangelico/Voto evangelico.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Ejercicio analisis" 3 | author: Camila Salazar 4 | output: html_document 5 | self_contained: false 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | knitr::opts_chunk$set(echo = TRUE) 10 | ``` 11 | 12 | # ¿Cuánto ha variado el voto evangélico en CR en las últimas 5 elecciones? 13 | 14 | Este análisis se realizó para el trabajo ["Voto por diputados evangélicos se triplicó en cinco elecciones"](http://www.nacion.com/gnfactory/investigacion/2017/partidos-evangelicos/index.html?desktop=true), publicado en el diario La Nación el 26 de junio de 2017. 15 | 16 | Para ello se construyó una base, con datos del TSE, de las elecciones de 1998, 2002, 2006, 2010 y 2014. Se usaron los datos de votación para diputados de partidos evangélicos: Alianza Nacional Cristiana, Renovación Costarricense y Restauración Nacional. 
17 | 18 | ## Paquetes y carga de archivos 19 | Lo primero que hacemos es cargar el archivo de Excel con todas sus hojas. Luego la función `read_excel_allsheets` recorre el archivo para leer todas las hojas y, finalmente, las guardamos en una lista (un objeto de R). 20 | 21 | Con la función `join_all` del paquete `plyr` unimos todas las hojas en un mismo archivo. Finalmente reemplazamos los blancos con 0, para facilitar el análisis. 22 | 23 | ```{r echo=T, message=FALSE, warning=FALSE} 24 | library(tidyr) 25 | library(plyr) 26 | library(dplyr) 27 | library(readxl) 28 | library(ggplot2) 29 | library(ggthemes) 30 | 31 | #Listar los nombres de las hojas del archivo 32 | diputados <- excel_sheets("diputados.xlsx") 33 | 34 | #leer todas las hojas del archivo 35 | read_excel_allsheets <- function(diputados) { 36 | sheets <- readxl::excel_sheets(diputados) 37 | x <- lapply(sheets, function(X) readxl::read_excel(diputados, sheet = X)) 38 | names(x) <- sheets 39 | x 40 | } 41 | 42 | #crear lista con archivos 43 | mysheets <- read_excel_allsheets("diputados.xlsx") 44 | 45 | #Unir bases 46 | evangelicos <- join_all(mysheets, by = "codigod", type = "left", match="all") 47 | 48 | #reemplazar blancos con cero 49 | evangelicos[is.na(evangelicos)] <- 0 50 | ``` 51 | 52 | ## Limpieza 53 | La base de datos contiene la cantidad de votos válidos por distrito, total y para los partidos Renovación Costarricense (rc), Restauración Nacional (rn) y Alianza Nacional Cristiana (anc). Lo primero que tenemos que hacer es crear nuevas variables que sumen la cantidad de votos válidos para esos partidos por cada distrito. Eso lo hacemos con el paquete `dplyr`.
54 | 55 | ```{r echo=T, message=FALSE, warning=FALSE} 56 | evangelicos <- evangelicos %>% 57 | mutate(tot14=rc14+rn14, 58 | tot10=rc10+rn10, 59 | tot06=rc06+rn06, 60 | tot02=rc02+anc02, 61 | tot98=rc98+anc98) %>% 62 | select(provincia, codigoc, canton, HASC, codigod, distrito, starts_with("valido"), starts_with("tot")) 63 | ``` 64 | 65 | ### Agrupar por cantones 66 | El análisis se va a enfocar en la votación por cantones y no por distritos, por lo que es necesario agrupar los datos por cantones. Además se crean variables de porcentaje de voto evangélico por cantón. 67 | ```{r echo=T, message=FALSE, warning=FALSE} 68 | cantones <- evangelicos %>% 69 | group_by(codigoc, canton, HASC) %>% 70 | summarise(validos14=sum(validos14), 71 | tot14=sum(tot14), 72 | validos10=sum(validos10), 73 | tot10=sum(tot10), 74 | validos06=sum(validos06), 75 | tot06=sum(tot06), 76 | validos02=sum(validos02), 77 | tot02=sum(tot02), 78 | validos98=sum(validos98), 79 | tot98=sum(tot98)) %>% 80 | mutate(por14=tot14/validos14*100, 81 | por10=tot10/validos10*100, 82 | por06=tot06/validos06*100, 83 | por02=tot02/validos02*100, 84 | por98=tot98/validos98*100) %>% 85 | select(codigoc, canton, HASC, starts_with("por")) 86 | ``` 87 | Ahora podemos empezar a analizar los datos; por ejemplo, preguntarnos **¿en cuáles cantones el porcentaje de voto evangélico superó el 10% en la última elección?** 88 | ```{r echo=T, message=FALSE, warning=FALSE} 89 | cantones %>% 90 | select(canton, por14)%>% 91 | filter(por14>=10) %>% 92 | arrange(desc(por14)) 93 | ``` 94 | Los datos muestran que en 2014, 17 cantones tuvieron un porcentaje de votación a partidos evangélicos superior al 10%. Matina fue el cantón con mayor porcentaje, con un 24%. Al compararlo con las elecciones anteriores: en 2010 solo 2 cantones mostraron ese comportamiento, en 2006 fueron 5 cantones, en 2002 cuatro y en 1998 solamente 2.
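Como apunte al margen (no forma parte del análisis original): los filtros que se repiten a continuación para cada elección podrían escribirse una sola vez con una función auxiliar hipotética, por ejemplo `top_cantones()` (asumiendo dplyr >= 1.0 por el uso de `all_of()`):

```r
library(dplyr)

# Función auxiliar (hipotética): cantones con 10% o más de voto evangélico
# en la columna de porcentaje de un año dado
top_cantones <- function(datos, columna) {
  datos %>%
    ungroup() %>%
    select(canton, valor = all_of(columna)) %>%
    filter(valor >= 10) %>%
    arrange(desc(valor))
}

# Las cinco elecciones en un solo bucle
for (anio in c("por14", "por10", "por06", "por02", "por98")) {
  print(top_cantones(cantones, anio))
}
```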
95 | 96 | ```{r echo=T, message=FALSE, warning=FALSE} 97 | cantones %>% 98 | select(canton, por10)%>% 99 | filter(por10>=10) %>% 100 | arrange(desc(por10)) 101 | 102 | cantones %>% 103 | select(canton, por06)%>% 104 | filter(por06>=10)%>% 105 | arrange(desc(por06)) 106 | 107 | cantones %>% 108 | select(canton, por02)%>% 109 | filter(por02>=10)%>% 110 | arrange(desc(por02)) 111 | 112 | cantones %>% 113 | select(canton, por98)%>% 114 | filter(por98>=10)%>% 115 | arrange(desc(por98)) 116 | ``` 117 | 118 | ###Agrupar por provincia 119 | También podemos agrupar los datos por provincia y ver cómo ha cambiado el voto evangélico. 120 | ```{r echo=T, message=FALSE, warning=FALSE} 121 | provincia <- evangelicos %>% 122 | group_by(provincia) %>% 123 | summarise(validos14=sum(validos14), 124 | tot14=sum(tot14), 125 | validos10=sum(validos10), 126 | tot10=sum(tot10), 127 | validos06=sum(validos06), 128 | tot06=sum(tot06), 129 | validos02=sum(validos02), 130 | tot02=sum(tot02), 131 | validos98=sum(validos98), 132 | tot98=sum(tot98)) %>% 133 | mutate(Y2014=tot14/validos14*100, 134 | Y2010=tot10/validos10*100, 135 | Y2006=tot06/validos06*100, 136 | Y2002=tot02/validos02*100, 137 | Y1998=tot98/validos98*100) %>% 138 | select(provincia, starts_with("Y")) 139 | provincia 140 | ``` 141 | Se observa que Limón es la provincia con mayor porcentaje de votación para estos partidos. Podemos hacer un gráfico para ver cómo ha evolucionado el voto evangélico. Para ello, lo primero que debemos hacer es reestructurar la base, de forma que tengamos en una sola variable el año y en la otra el porcentaje. Esto lo hacemos con la función `gather` del paquete `tidyr`. 142 | 143 | ```{r echo=T, message=FALSE, warning=FALSE} 144 | provincia2 <- provincia%>% 145 | gather(anio, porcentaje,-provincia) %>% 146 | arrange(anio) 147 | head(provincia2, 10) 148 | ``` 149 | 150 | Ahora vamos a usar el paquete `ggplot2` y `ggthemes` para graficar. `ggplot2` funciona agregando elementos. 
En la primera línea: `ggplot(provincia2, aes(x=anio, y=porcentaje, group=provincia, colour=provincia))`, lo primero es seleccionar el dataframe del cual vamos a tomar los datos. Luego viene el apartado de **aesthetics**, donde especificamos qué variables queremos graficar. 151 | 152 | En este caso queremos la variable de año en el eje X y el porcentaje en el eje Y. Además queremos que nos agrupe los datos por provincia y que cada provincia sea de un color específico. 153 | 154 | Luego podemos ir agregando elementos con el signo **+**. `geom_line(size=1)` nos dice qué tipo de gráfico queremos, en este caso un gráfico de líneas; por ejemplo, si fuera de barras, pondríamos `geom_bar()`. 155 | 156 | Los demás elementos se agregaron para darle estilo, por ejemplo un título, un tema (en este caso para que se viera similar a los gráficos del medio FiveThirtyEight), etcétera. 157 | 158 | ```{r echo=T, message=FALSE, warning=FALSE} 159 | graficoprovincias<- ggplot(provincia2, aes(x=anio, y=porcentaje, group=provincia, colour=provincia)) + 160 | geom_line(size=1) + 161 | ggtitle("Evolucion del voto evangelico") + 162 | theme_fivethirtyeight() + 163 | ylab("Porcentaje de votos a partidos evangelicos") + 164 | theme(axis.title = element_text()) + 165 | theme(axis.title.x = element_blank()) 166 | 167 | graficoprovincias 168 | ``` 169 | 170 | 171 | En estos enlaces puede ver más detalles de cómo hacer gráficos con ggplot2: 172 | 173 | [ggplot2 Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) 174 | [Bar and line graphs](http://www.cookbook-r.com/Graphs/Bar_and_line_graphs_(ggplot2)/) 175 | [Tutorial Harvard](http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html) 176 | 177 | ## Ver el total del país 178 | Ahora calculemos cómo ha variado el voto evangélico a nivel nacional.
179 | ```{r echo=T, message=FALSE, warning=FALSE} 180 | total<- evangelicos %>% 181 | summarise(validos14=sum(validos14), 182 | tot14=sum(tot14), 183 | validos10=sum(validos10), 184 | tot10=sum(tot10), 185 | validos06=sum(validos06), 186 | tot06=sum(tot06), 187 | validos02=sum(validos02), 188 | tot02=sum(tot02), 189 | validos98=sum(validos98), 190 | tot98=sum(tot98)) %>% 191 | mutate(Y2014=tot14/validos14*100, 192 | Y2010=tot10/validos10*100, 193 | Y2006=tot06/validos06*100, 194 | Y2002=tot02/validos02*100, 195 | Y1998=tot98/validos98*100) %>% 196 | select(starts_with("Y")) 197 | total 198 | #Cuánto varió 199 | total$Y2014/total$Y1998 200 | ``` 201 | Los resultados muestran que entre 1998 y el 2014, el voto evangélico a nivel nacional se triplicó, al pasar de 2,6 a 8,1. Podemos ver la evolución en un gráfico: 202 | 203 | ```{r echo=T, message=FALSE, warning=FALSE} 204 | total <- total%>% 205 | gather(anio, porcentaje) %>% 206 | arrange(desc(anio)) 207 | 208 | graficoevolucion <- ggplot(total, aes(x=anio, y=porcentaje, group=1)) + 209 | geom_line(size=1.5, color="#00cef6") + 210 | ggtitle("Evolucion del voto evangelico") + 211 | theme_fivethirtyeight() + 212 | ylab("Porcentaje de votos a partidos evangélicos") + 213 | theme(axis.title = element_text()) + 214 | theme(axis.title.x = element_blank()) 215 | graficoevolucion 216 | ``` 217 | 218 | ##Mapas 219 | Una forma visual efectiva para ver la votación es por medio de mapas. Para hacer mapas en R vamos a requerir de varios paquetes: 220 | 221 | ```{r echo=T, message=FALSE, warning=FALSE} 222 | library("gpclib") 223 | library("raster") 224 | library("maptools") 225 | library("broom") 226 | library(mapproj) 227 | gpclibPermit() 228 | ``` 229 | 230 | Ahora es necesario importar geodatos para Costa Rica de la web. Luego con el comando `fortify` transformamos esos datos en polígonos que nos permiten mapear la información. 
Finalmente, con el comando `merge`, unimos los geodatos con la base que habíamos creado previamente sobre el porcentaje de votación por cantones y ordenamos los polígonos para que el programa sepa en qué orden debe dibujar el mapa. 231 | 232 | ```{r echo=T, message=FALSE, warning=FALSE} 233 | #Importar geodatos 234 | cr <- getData("GADM", country = "CRI", level = 2) 235 | #transformar geodatos 236 | cr2<- fortify(cr, region = "HASC_2") 237 | #Unir bases 238 | cr_mapa <- merge(cr2, cantones, by.x= "id", by.y="HASC", all.x = TRUE) 239 | #ordenar polígonos 240 | ord2<- order(cr_mapa$order) 241 | cr_mapa <- cr_mapa[ord2, ] 242 | ``` 243 | 244 | Ahora estamos listos para dibujar los mapas. ¿Cómo leemos el código? 245 | * `geom_polygon()` es el tipo de gráfico para hacer mapas. Lo primero es especificar los datos, que están en la base de datos que creamos, `cr_mapa`. Luego especificamos los aesthetics. Siempre que hacemos un mapa vamos a poner en el eje X la longitud y en el Y la latitud. Luego agrupamos por la variable `group` y finalmente rellenamos con la variable de porcentaje de votos. 246 | * `coord_map()` especifica que es un mapa de coordenadas. 247 | * `ylim()` sirve para decirle al gráfico dónde centrar las coordenadas. Si es un mapa de CR, lo recomendable es poner `ylim(8, NA)`. 248 | * `scale_fill_gradient()` indica que queremos colorear el mapa con una escala. En este caso, los colores se seleccionaron de forma manual, poniendo el valor más bajo y el más alto, y R colorea según esa escala. Además se establecen los límites, que van de 0 a 25%. Esto es muy importante porque, como año a año los porcentajes cambian, si no establecemos un límite los colores cambiarían, y lo que necesitamos es que la escala de color se mantenga igual a lo largo de los años para ver la evolución. 249 | * `labs` agrega etiquetas al mapa. 250 | * `theme_void()` se usa para que el fondo salga en blanco.
251 | 252 | ```{r echo=T, message=FALSE, warning=FALSE} 253 | m_1998 <-ggplot() + 254 | geom_polygon(data = cr_mapa, aes(x = long, y = lat, group = group, fill = por98)) + 255 | coord_map() + ylim(8, NA) + 256 | scale_fill_gradient(low = "#c6c6c6", high = "#1f4263", limits = c(0, 25)) + 257 | labs(x = NULL, 258 | y = NULL, 259 | title = "Voto a partidos cristianos", 260 | subtitle = "Elecciones 1998") + 261 | theme_void() 262 | 263 | m_2002 <-ggplot() + 264 | geom_polygon(data = cr_mapa, aes(x = long, y = lat, group = group, fill = por02)) + 265 | coord_map() + ylim(8, NA) + 266 | scale_fill_gradient(low = "#c6c6c6", high = "#1f4263", limits = c(0, 25)) + 267 | labs(x = NULL, 268 | y = NULL, 269 | title = "Voto a partidos cristianos", 270 | subtitle = "Elecciones 2002") + 271 | theme_void() 272 | 273 | m_2006 <-ggplot() + 274 | geom_polygon(data = cr_mapa, aes(x = long, y = lat, group = group, fill = por06)) + 275 | coord_map() + ylim(8, NA) + 276 | scale_fill_gradient(low = "#c6c6c6", high = "#1f4263", limits = c(0, 25)) + 277 | labs(x = NULL, 278 | y = NULL, 279 | title = "Voto a partidos cristianos", 280 | subtitle = "Elecciones 2006") + 281 | theme_void() 282 | m_2010 <-ggplot() + 283 | geom_polygon(data = cr_mapa, aes(x = long, y = lat, group = group, fill = por10)) + 284 | coord_map() + ylim(8, NA) + 285 | scale_fill_gradient(low = "#c6c6c6", high = "#1f4263", limits = c(0, 25)) + 286 | labs(x = NULL, 287 | y = NULL, 288 | title = "Voto a partidos cristianos", 289 | subtitle = "Elecciones 2010")+ 290 | theme_void() 291 | m_2014 <-ggplot() + 292 | geom_polygon(data = cr_mapa, aes(x = long, y = lat, group = group, fill = por14)) + 293 | coord_map() + ylim(8, NA) + 294 | scale_fill_gradient(low = "#c6c6c6", high = "#1f4263", limits = c(0, 25)) + 295 | labs(x = NULL, 296 | y = NULL, 297 | title = "Voto a partidos cristianos", 298 | subtitle = "Elecciones 2014")+ 299 | theme_void() 300 | ``` 301 | ```{r echo=F, message=FALSE, warning=FALSE} 302 | m_1998 303 | 
m_2002 304 | m_2006 305 | m_2010 306 | m_2014 307 | ``` -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-recipe-Voto evangelico/diputados.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/sp-recipe-Voto evangelico/diputados.xlsx -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-recipe-mecixo-gender-violence/sp-recipe-mexico-gender-violence.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Data Recipe: Incorporar el diseño de muestra para conocer la dimensión de la violencia de género en la población mexicana" 3 | output: learnr::tutorial 4 | runtime: shiny_prerendered 5 | author: Gibran Mena 6 | --- 7 | 8 | ```{r setup, include=FALSE} 9 | library("learnr") 10 | knitr::opts_chunk$set(echo = TRUE) 11 | ``` 12 | 13 | 14 | ## Introduction 15 | 16 | En una encuesta de muestreo estratificado, cada una de las observaciones de la base de datos tiene un "peso" determinado, que le es asignado de acuerdo con el diseño muestral de la encuesta. Factores como la correlación de las observaciones disminuyen el tamaño efectivo de la muestra. Incorporar los elementos del diseño muestral, tales como el estrato, el factor de expansión y el conglomerado, hace la diferencia entre tener un análisis con consistencia interna, es decir, que represente sólo al número mismo de personas encuestadas, o con consistencia externa, es decir, que represente a la totalidad de la población.
17 | En esta lección aprenderás a incorporar los elementos del diseño muestral en una encuesta cuyo tema son las dinámicas de las relaciones en México, y que incluye temas como los de violencia de género, para tener un retrato fiel de la población objetivo a partir de una muestra estratificada. 18 | 19 | ```{r, eval=FALSE} 20 | # install.packages("foreign") 21 | library(foreign) 22 | # install.packages("survey") 23 | library(survey) 24 | ``` 25 | 26 | 27 | ## DEFINE 28 | 29 | Desde 2003, con frecuencia inconstante, el Instituto Nacional de Estadística y Geografía (INEGI) ha desplegado la Encuesta Nacional sobre las Dinámicas de las Relaciones en los Hogares (ENDIREH) en México, en todas las entidades de la república. Los cuestionarios versan sobre las relaciones entre personas no sólo en el hogar, sino también en centros de estudio y trabajo. Esta encuesta ha sido utilizada por colectivos y organizaciones para detectar diversas formas de violencia de género en contra de la mujer. 30 | 31 | Los resultados de la encuesta de 2016 fueron publicados en agosto de 2017. Sus resultados son tomados por organizaciones y periodistas con frecuencia, pero son raras las ocasiones en que se incorporan los elementos del diseño de la misma. Esta receta presenta el uso apropiado de la función `svyby` (del paquete `survey`) para corregir este error común, contestando la pregunta de cuál es la proporción de mujeres que han sido objeto de golpes o agresiones físicas atendiendo a su estado civil; es decir, según sean solteras con novio o pareja (o exnovio o expareja), casadas o unidas con alguna pareja, o separadas, divorciadas o viudas.
32 | 33 | 34 | Para trabajar con esta receta te sugiero: 35 | 36 | - Crear una carpeta llamada `ENDIREH-violencia-género` 37 | - Dentro de esta carpeta, crear otras dos: `Inp` y `Out` 38 | 39 | Para obtener una estructura de carpetas como esta: 40 | 41 | ```{} 42 | └── ENDIREH-violencia-género 43 | ├── Inp 44 | └── Out 45 | ``` 46 | 47 | ## FIND 48 | 49 | La base de datos es pública y se encuentra en el sitio del Instituto Nacional de Estadística y Geografía (INEGI), el principal órgano estadístico público mexicano: 50 | 51 | http://www.beta.inegi.org.mx/proyectos/enchogares/especiales/endireh/2016/default.html 52 | 53 | Para descargar, simplemente haz clic en `descargar todos los archivos`. 54 | 55 | De entre todas las bases, utilizaremos TB_SEC_XII.dbf. 56 | Te sugiero que pongas los archivos TB_SEC_XII.dbf y el diccionario de datos fd_endireh2016.xlsx en la carpeta `Inp` creada anteriormente. 57 | 58 | 59 | ## GET 60 | 61 | Establecemos el directorio de trabajo en la carpeta `Inp`: 62 | > setwd("~/Desktop/ENDIREH-violencia-género/Inp") 63 | 64 | Usaremos la función `read.dbf` de la biblioteca `foreign` que antes instalamos. 65 | 66 | ```{r, eval=FALSE} 67 | data <- read.dbf("TB_SEC_XII.dbf", as.is = T) 68 | save(data, file = "../../../data/mexico-gender-violence_raw.RData") 69 | ``` 70 | 71 | ## ANALYSE 72 | 73 | Comencemos por explorar la base. Las variables con que contamos son: 74 | 75 | ```{r, eval=FALSE} 76 | names(data) 77 | ``` 78 | 79 | Una de las variables, T_INSTRUM, se refiere al tipo de cuestionario usado por la encuesta. Existen distintos tipos de cuestionario para mujeres en distintos estados civiles. Una ojeada al diccionario de datos nos da la siguiente información sobre esta variable: 80 | 81 | A1. Mujer casada o unida con pareja residente 82 | A2. Mujer casada o unida con pareja ausente temporal 83 | B1. Mujer separada o divorciada 84 | B2. Mujer viuda 85 | C1. Mujer soltera con novio o pareja o exnovio o expareja 86 | C2.
Mujer soltera que nunca ha tenido novio 87 | 88 | Una tabla nos dice cuántas mujeres en cada estado civil entrevistó la encuesta. 89 | 90 | Hagamos un primer ejercicio de análisis, en el que a propósito calcularemos las proporciones SIN hacer uso del diseño muestral. 91 | 92 | Para ello cruzaremos la variable del tipo de cuestionario (referida al estado civil de la mujer en cuestión) y la variable P12_17_1, que indica si la golpeó alguna de sus parejas o esposos anteriores, y para la que encontramos los siguientes casos: 93 | 94 | 1 - Sí 95 | 2 - No 96 | 9 - No especificado 97 | b - blanco 98 | 99 | Elaboramos una tabla de frecuencia considerando nuestras dos variables: 100 | 101 | ```{r, eval=FALSE} 102 | tab <- table(data$P12_17_1, data$T_INSTRUM) 103 | ``` 104 | 105 | Eliminamos los registros que no nos son útiles: la fila 3 (No especificado) y la columna C2 (mujeres que nunca han tenido una pareja y que, por lo tanto, jamás han sido violentadas por esta): 106 | 107 | ```{r, eval=FALSE} 108 | tab <- tab[-3, -6] 109 | ``` 110 | 111 | Una tabla de frecuencia por sí misma no es utilizable porque nos da números absolutos. Requerimos números relativos, proporciones o porcentajes. Para generar esta tabla de proporciones usamos el argumento `margin` de la función `prop.table`: margin=1 señala que la variable independiente son las filas, mientras que margin=2 indica que la variable independiente son las columnas. En este caso queremos averiguar la probabilidad o proporción de que una mujer sea violentada DADO su estado civil, por lo que la variable independiente será dicho estado (margin=2). El último argumento de la función `round` (3) indica el redondeo a tres decimales.
112 | 113 | ```{r, eval=FALSE} 114 | round(prop.table(tab, margin=2), 3) 115 | ``` 116 | 117 | Aparentemente, las mujeres solteras que tienen un novio o pareja o exnovio o expareja tienen una probabilidad radicalmente menor (10%) de recibir golpes o agresiones físicas que las mujeres casadas, unidas o divorciadas (la probabilidad ronda el 40%) y, en menor medida, que las mujeres viudas (con un 35% de probabilidad). 118 | 119 | Por último, hacemos una prueba de hipótesis. Para ello aplicamos la prueba de la Chi cuadrada, que nos dice la probabilidad de que se presenten estas mismas proporciones (esta misma distribución de datos) bajo la hipótesis nula (suponiendo que nuestra hipótesis de relación entre estado civil y violencia fuese falsa). 120 | 121 | ```{r, eval=FALSE} 122 | chisq.test(tab) 123 | ``` 124 | 125 | Como puede observarse, el valor p o `p value` es de 2.2e-16, una probabilidad muy baja de que la misma distribución de datos se diera en caso de ser cierta la hipótesis nula (ninguna relación entre estado civil y violencia física). 126 | 127 | ## VERIFY 128 | 129 | No obstante, veamos si cambia la probabilidad de estas relaciones cuando, de hecho, incorporamos el diseño muestral de la encuesta, lo cual nos da un análisis correcto de la misma. 130 | 131 | Para ello usaremos la biblioteca de R `survey` que instalamos y requerimos o "activamos" previamente.
132 | 133 | Los elementos del diseño de muestra de la ENDIREH son los siguientes, y pueden encontrarse en el diccionario de datos: 134 | 135 | Factor vivienda 136 | Factor de la mujer 137 | Estrato 138 | UPM de diseño: conglomerado 139 | Estrato de diseño 140 | 141 | Vamos a incorporarlos y almacenarlos en un objeto llamado "disenio": 142 | 143 | ```{r, eval=FALSE} 144 | disenio <- svydesign(ids=~UPM_DIS, strata=~EST_DIS, weights=~FAC_MUJ, data=data) 145 | ``` 146 | 147 | Ahora almacenamos en un objeto temporal las medias ponderadas con los factores del diseño para cada uno de los estados civiles de las mujeres con las que hemos estado trabajando: 148 | 149 | ```{r, eval=FALSE} 150 | tmp <- svymean(~as.factor(T_INSTRUM), disenio, na.rm=TRUE) 151 | ``` 152 | 153 | Almacenamos igualmente en un objeto de nombre cualquiera el resultado de calcular la ponderación con la función `svyby`, considerando los subconjuntos que necesitamos. En este caso, queremos la media ponderada de P12_17_1 para cada tipo de cuestionario (estado civil): 154 | 155 | ```{r, eval=FALSE} 156 | base <- svyby(~P12_17_1, ~T_INSTRUM, disenio, svymean, na.rm=TRUE) 157 | ``` 158 | 159 | 160 | 161 | 162 | ## Summary and further reading 163 | 164 | |In this lesson we learned | Further reading | 165 | | ---------------------------------- | -------------------------------------------------------| 166 | |How to build frequency and proportion tables with `table` and `prop.table` |[RStudio cheatsheets](https://www.rstudio.com/resources/cheatsheets/). | 167 | | |The [dplyr website](http://dplyr.tidyverse.org/).
| 168 | |How to compute weighted estimates with `svydesign`, `svymean` and `svyby` |The [ggplot2 website](http://ggplot2.tidyverse.org/) | 169 | |How to run a chi-squared test with `chisq.test` |...| 170 | -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-recipe-mecixo-gender-violence/sp-recipe-mexico-gender-violence.html: --------------------------------------------------------------------------------
-------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-skills-Intro R/Intro R.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Tutorial R para periodistas" 3 | author: "Camila Salazar" 4 | output: 5 | learnr::tutorial: 6 | progressive: true 7 | allow_skip: true 8 | df_print: default 9 | runtime: shiny_prerendered 10 | --- 11 | 12 | ```{r setup, include=FALSE} 13 | library(learnr) 14 | library(tidyverse) 15 | library(nycflights13) 16 | ``` 17 | 18 | ## Inicio 19 | R es un entorno y lenguaje de programación para analizar datos. R es una herramienta muy útil para hacer periodismo de datos, ya que nos permite realizar todo el proceso de obtener, limpiar, analizar y visualizar la información desde el mismo lugar. En estos tutoriales aprenderá cómo utilizar el programa para sus proyectos periodísticos. 20 | 21 | __¿Por qué usar R en periodismo?__ 22 | 23 | * Nos permite hacer las diferentes tareas del data pipeline desde un solo programa. 24 | * Reproducibilidad del flujo de trabajo. 25 | * R es gratuito y colaborativo, por lo que constantemente surgen nuevas actualizaciones según las necesidades de los usuarios. 26 | 27 | ### Contenidos 28 | 29 | En este tutorial aprenderá: 30 | 31 | * Cómo instalar R 32 | * Cómo instalar RStudio 33 | * Cómo instalar paquetes en R 34 | * Generalidades del programa 35 | * Importar datos a R 36 | * Tipos de datos 37 | 38 | Empecemos. 39 | 40 | ## ¿Cómo instalar R? 41 | 42 | ![](https://vimeo.com/203516510) 43 | 44 | Tal como se observa en el video, para instalar R vamos a la dirección [cloud.r-project](https://cloud.r-project.org) y seleccionamos R para Linux, Mac o Windows. Una vez descargado, se siguen las instrucciones de instalación. 45 | 46 | 47 | ## ¿Cómo instalar RStudio? 48 | 49 | RStudio es una interfaz desarrollada para R.
¿Qué significa esto? RStudio nos ayuda a escribir código, visualizar resultados y, en general, trabajar con el lenguaje de programación R de una forma más fácil. Como recomendación, es mucho más sencillo trabajar con RStudio y es la herramienta que vamos a utilizar en los siguientes tutoriales. 50 | 51 | Algo a tomar en cuenta es que es necesario tener instalado R para poder utilizar RStudio. Veamos cómo instalarlo: 52 | 53 | ![](https://vimeo.com/203516968) 54 | 55 | Tal como se observa en el video, para instalar RStudio vamos a la dirección [rstudio.com](https://rstudio.com/download) y seleccionamos la versión para nuestro sistema operativo. 56 | 57 | Una vez descargado, lo abrimos y estamos listos para comenzar. 58 | 59 | ### Quiz #1: Instalar R y RStudio 60 | 61 | ```{r quiz3, echo=FALSE} 62 | quiz(caption = "Instalar R y RStudio", 63 | question("¿Qué es RStudio?", 64 | answer("Una aplicación que nos ayuda a usar de forma más fácil R.", correct = TRUE, message = "RStudio tiene una interfaz amigable, que nos facilita escribir, usar y salvar código de R."), 65 | answer("Una aplicación para usar R sin escribir código", message = "¡No! El código es una de las grandes ventajas que tiene R frente a otros programas como Excel. El código nos permite llevar un registro del trabajo que estamos haciendo y permite la reproducibilidad de los contenidos."), 66 | answer("Un programa de hojas de cálculo como Excel"), 67 | answer("Es lo mismo que R", message = "No. Como ya vimos anteriormente son dos cosas diferentes. R es un lenguaje, como el español.
RStudio es un programa que nos permite usar ese lenguaje, de la misma forma que un programa como Word nos permite escribir textos en español."), 68 | allow_retry = TRUE 69 | ), 70 | question("¿RStudio es gratuito?", 71 | answer("Sí", correct = TRUE, message = "Al igual que R, RStudio es gratis y open-source."), 72 | answer("No.") 73 | ), 74 | question("¿Es necesario instalar R si ya tengo RStudio?", 75 | answer("Sí.", correct = TRUE, message = "R no viene con RStudio; hay que instalarlos de forma separada."), 76 | answer("No.", message = "R no viene con RStudio; hay que instalarlos de forma separada.") 77 | ) 78 | ) 79 | ``` 80 | 81 | ## Instalar y utilizar paquetes 82 | Un paquete de R es un conjunto de funciones, datasets y documentación que permite ampliar la funcionalidad de R. Por ejemplo, existen paquetes para hacer gráficos, importar archivos de Excel, hacer análisis estadístico, entre otros. En la actualidad existen cerca de 10.500 paquetes en R. 83 | 84 | ¿Cómo instalarlos? 85 | 86 | ![](https://vimeo.com/203516241) 87 | 88 | ### Instalar paquetes 89 | 90 | Como vimos en el video, para instalar un paquete escribimos en la consola `install.packages()`, y ponemos entre __paréntesis y con comillas__ el nombre del paquete que queremos instalar. Por ejemplo, si queremos instalar el paquete `tidyverse`, escribimos `install.packages("tidyverse")` en la consola. 91 | Si queremos instalar varios paquetes a la vez lo hacemos así: `install.packages(c("tidyverse", "ggplot2", "xlsx"))`, es decir, ponemos los diferentes nombres de los paquetes en un vector (`c()`) y los separamos por comas. 92 | 93 | Es importante considerar que los paquetes se instalan una sola vez en la computadora, por lo que no hay necesidad de instalarlos cada vez que abrimos RStudio. 94 | 95 | ### Cargar paquetes al programa 96 | Una vez que hemos instalado un paquete, tenemos que cargarlo en RStudio para poder utilizarlo. Para hacer esto usamos el comando `library(nombre del paquete)`.
De esta forma, si quisiéramos utilizar el paquete `ggplot2`, el orden sería el siguiente: 97 | `install.packages("ggplot2")` 98 | `library(ggplot2)` 99 | 100 | También existe el comando `require(nombre del paquete)`, que igualmente carga un paquete ya instalado; la diferencia con `library()` es que, si el paquete no está instalado, devuelve `FALSE` con una advertencia en lugar de detenerse con un error. 101 | 102 | ¿Cuáles paquetes instalar? Como se mencionó al inicio, R tiene cerca de 10.500 paquetes. Según las tareas que queramos realizar vamos a necesitar diferentes paquetes. A lo largo de estos materiales vamos a ir conociendo algunos de ellos. 103 | 104 | Para iniciar es recomendable instalar el paquete `tidyverse`, que contiene un conjunto de paquetes de uso frecuente. 105 | 106 | 107 | ### Quiz #2: Instalar paquetes 108 | 109 | ```{r names, echo = FALSE} 110 | quiz(caption = "Quiz - Paquetes en R", 111 | question("¿Cuál comando se utiliza para instalar paquetes?", 112 | answer("`library()`", message = "No, library se utiliza para cargar un paquete luego de instalarlo"), 113 | answer("`install.packages()`", correct = TRUE), 114 | answer("`install_packages()`"), 115 | answer("No hay ningún comando; se tiene que ir al sitio [cran.r-project.org](http://cran.r-project.org) y bajar los paquetes manualmente.", message = "R permite descargar los paquetes desde el programa, solo se necesita conexión a internet"), 116 | allow_retry = TRUE 117 | ), 118 | question("¿Cada cuánto hay que instalar un paquete?", 119 | answer("Cada vez que abrimos R"), 120 | answer("Cada vez que reseteamos nuestra computadora"), 121 | answer("Solamente una vez.
Una vez instalado, el paquete queda almacenado en nuestra computadora.", correct = TRUE), 122 | answer("No es necesario instalar paquetes para usar R.", message = "Aunque algunas funciones se pueden realizar sin el uso de paquetes, estos son los que nos permiten ampliar las posibilidades de R y las tareas que podemos hacer con el programa"), 123 | allow_retry = TRUE 124 | ), 125 | question("¿Cuál comando usamos para cargar los paquetes ya instalados?", 126 | answer("`library.load()`"), 127 | answer("`require()`", message = "`require()` también carga un paquete ya instalado, pero devuelve FALSE con una advertencia, en lugar de un error, si el paquete no existe; en estos tutoriales usaremos `library()`"), 128 | answer("`library()`", correct = TRUE), 129 | answer("No es necesario un comando, una vez instalado el paquete se puede utilizar."), 130 | allow_retry = TRUE 131 | ) 132 | ) 133 | ``` 134 | 135 | ## Antes de comenzar 136 | Una vez que ya hemos instalado R, RStudio y algunos paquetes, podemos comenzar a trabajar. Lo primero que tenemos que hacer es elegir un directorio de trabajo, donde vamos a guardar y de donde vamos a cargar nuestros archivos. 137 | 138 | Para saber en cuál directorio estamos trabajando, usamos el comando `getwd()`. Si queremos cambiar el directorio de trabajo usamos `setwd("directorio de trabajo")` ¡y listo! 139 | 140 | Lo ideal es tener todos los archivos que vayamos a necesitar en esa carpeta. 141 | 142 | ### ¿Cómo pedir ayuda? 143 | Siempre que vamos a aprender a usar un programa es muy importante saber cómo acceder a la ayuda. En R esto es muy fácil: solamente se debe escribir `?` delante del comando o el objeto. Por ejemplo, si quisiéramos saber más sobre el comando `setwd()`, simplemente escribimos `?setwd()`. 144 | 145 | Pruebe ese comando en la consola: 146 | 147 | ```{r help, exercise = TRUE} 148 | 149 | ``` 150 | 151 |
152 | **Pista:** Escriba `?setwd()` y dé clic al botón azul. 153 |
154 | 155 | ```{r help-check, echo = FALSE} 156 | # checking code 157 | ``` 158 | 159 | ## Importar archivos 160 | 161 | Cuando trabajamos con datos nos enfrentamos a diferentes tipos de archivos: .csv, .xlsx, .dta, .sav, entre otros. En este segmento vamos a ver cómo importar archivos a R para poder analizarlos. 162 | 163 | ### Importar .csv 164 | Para importar archivos .csv usamos el comando `read.csv()`. En el argumento de la función, se debe poner la dirección de donde está el archivo, ya sea en la computadora o en línea; se debe indicar si la primera fila corresponde a los nombres de las variables y detallar el tipo de separador de los datos y, preferiblemente, la codificación del archivo, para que lea los caracteres extraños correctamente. Por ejemplo se vería así: `read.csv("archivo.csv", header = TRUE, sep=",", fileEncoding = "UTF-8")` 165 | 166 | #### Ejercicio 167 | En la consola de abajo corra el código para importar un archivo csv que se encuentra en la siguiente dirección `http://bit.ly/2vsPsRZ` 168 | ```{r import, exercise=TRUE} 169 | base <- read.csv("http://bit.ly/2vsPsRZ", header = TRUE, sep=",", fileEncoding = "UTF-8") 170 | 171 | base 172 | ``` 173 | 174 | ¡La base está importada! 175 | 176 | __Importante:__ 177 | 178 | Como puede ver en el código, antes del comando escribimos `base <- ` ¿Por qué? Para almacenar elementos (bases de datos, gráficos, vectores, matrices) en R como objetos usamos el símbolo `<-`. En el ejemplo anterior, queríamos importar un archivo .csv y almacenarlo como un objeto en R (en este caso un objeto llamado `base`), para poder utilizarlo luego. Entonces, una vez importado el archivo, escribimos el nombre del objeto `base` para que se muestre en la consola. 179 | 180 | 181 | #### Otros comandos para importar archivos delimitados 182 | 183 | Con el comando `read.delim()` se pueden importar `.csv` u otros archivos delimitados como `.txt`.
El comando sería `read.delim("nombrearchivo", sep="separador")`, donde en lugar de separador pondríamos el carácter por el que están separados los datos, por ejemplo: `,` `:` `;` `|` `$`, entre otros. 184 | 185 | ### Importar archivos de Excel 186 | 187 | Para importar archivos de Excel, existen diferentes paquetes: 188 | 189 | * `readxl` 190 | * `xlsx` 191 | 192 | ```{r, eval=FALSE} 193 | #Con xlsx 194 | base <- read.xlsx("nombre del archivo", 195 | sheetName= "nombre de la hoja") 196 | 197 | #Con readxl 198 | base <- read_excel("nombre del archivo") 199 | ``` 200 | 201 | ### Importar archivos de Stata y SPSS 202 | 203 | Se puede utilizar el paquete `foreign`: 204 | ```{r, eval=FALSE} 205 | #Archivo de SPSS 206 | archivo <- read.spss("nombre del archivo.sav", use.value.labels = TRUE, to.data.frame = TRUE) 207 | 208 | ##Se puede poner la opción use.value.labels = FALSE, si se quiere obtener los códigos de las variables y no las etiquetas. 209 | 210 | #Archivo de Stata, también con foreign 211 | archivo <- read.dta("nombre del archivo") 212 | ``` 213 | 214 | ## Tipos de datos 215 | 216 | Ahora que aprendimos cómo importar archivos, podemos empezar a trabajar. En este segmento vamos a aprender sobre dos estructuras de datos muy comunes en R, los **data frame** y las **tibbles**. Estas son estructuras para almacenar datos tabulares, como por ejemplo los archivos que importamos desde Excel. Básicamente estos objetos tienen variables (que pueden ser de diferente tipo) y filas, que corresponden a cada una de las observaciones. 217 | 218 | Para estos ejemplos vamos a utilizar la base del paquete `nycflights13`, que contiene la base de datos `flights`. Esta base describe a cada uno de los vuelos que salieron de Nueva York durante el 2013. Los datos vienen del US [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DatabaseInfo.asp?DB_ID=120&Link=0), y están documentados en `?flights`.
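Para ilustrar la idea anterior (columnas como variables, filas como observaciones), un boceto mínimo: el data frame `vuelos_demo` y sus valores son inventados solo para este ejemplo y no provienen de `flights`.

```r
# Data frame inventado solo para ilustrar la estructura:
# cada columna es una variable, cada fila una observación
vuelos_demo <- data.frame(
  origen    = c("JFK", "LGA", "EWR"),  # variable de tipo character
  retraso   = c(12L, 0L, 35L),         # variable de tipo integer
  distancia = c(1400, 760, 2475)       # variable de tipo double
)

str(vuelos_demo)   # muestra el tipo de cada columna
nrow(vuelos_demo)  # número de observaciones
```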
219 | 220 | Ya el paquete está cargado en el tutorial, pero si lo quisiera instalar desde su computadora recuerde los pasos: 221 | 222 | 1. `install.packages('nycflights13')` para instalar el paquete 223 | 2. Cargue el paquete con `library(nycflights13)` 224 | 225 | ### Ver la base de datos 226 | Para ver la base de datos, puede escribir en la consola de abajo el nombre de la base: `flights`. 227 | 228 | ```{r flights, exercise = TRUE} 229 | 230 | ``` 231 | 232 |
233 | **Pista:** Escriba `flights` y luego dé clic al botón azul. 234 |
235 | 236 | ```{r flights-check} 237 | # checking code 238 | ``` 239 | 240 | Para ver la base de datos completa y no solamente las primeras filas usamos el comando `View()`. En este caso `View(flights)`. Este comando nos permite ver la base completa en otra ventana. 241 | 242 | ### Tipos de variables 243 | 244 | ```{r flights3, exercise = TRUE, exercise.eval = TRUE} 245 | flights 246 | ``` 247 | 248 | Como puede observar, R muestra debajo del nombre de cada columna abreviaciones de letras. Estas describen el tipo de variable que está almacenada en cada columna de `flights`: 249 | 250 | * `int` se refiere a "integers" o números enteros. 251 | 252 | * `dbl` se refiere a doubles, o números reales. 253 | 254 | * `chr` se refiere a caracteres, también conocidos como "strings". 255 | 256 | * `dttm` se refiere a date-times (una fecha + una hora). 257 | 258 | Otros tipos de variables comunes que no contiene esta base de datos pero que es frecuente encontrar son: 259 | 260 | * `lgl` se refiere a logical, vectores que solo contienen `TRUE` o `FALSE`. 261 | 262 | * `fctr` se refiere a factors, que R utiliza para representar variables categóricas. 263 | 264 | * `date` se refiere a dates, o fechas. 265 | 266 | 267 | ### Quiz #3: Tipos de datos 268 | 269 | ```{r exercises1, exercise = TRUE} 270 | 271 | ``` 272 | 273 | ```{r quiz1, echo = FALSE} 274 | quiz(caption = "Use la consola de arriba para poder responder las preguntas.", 275 | question("¿Qué información contiene la variable `air_time`?
Lea la ayuda de la base de datos usando `?flights` para averiguarlo.", 276 | answer("Hora de salida del avión"), 277 | answer("Tiempo que el avión está en el aire, en minutos", correct= TRUE), 278 | answer("Distancia entre aeropuertos"), 279 | answer("Tiempo de retraso del avión"), 280 | allow_retry = TRUE 281 | ), 282 | question("¿Cuántas filas y variables tiene `flights`?", 283 | answer("19 filas y 336.776 variables"), 284 | answer("336.776 filas y 7 variables"), 285 | answer("336.776 filas y 19 variables", correct = TRUE), 286 | answer("No se sabe"), 287 | incorrect = "Pista: R enumera las filas y columnas cuando escribimos el nombre de la base en la consola. Examine el contenido de `flights` en la consola de arriba.", 288 | allow_retry = TRUE 289 | ), 290 | question("¿Qué tipos de variables hay en `flights`? Seleccione todas las que apliquen.", 291 | type = "multiple", 292 | allow_retry = TRUE, 293 | incorrect = "Revise de nuevo `flights`.", 294 | answer("integers", correct = TRUE), 295 | answer("doubles", correct = TRUE), 296 | answer("factors"), 297 | answer("characters", correct = TRUE) 298 | ) 299 | ) 300 | ``` 301 | -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-skills-Limpieza/ Limpieza.Rmd: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Clase 6 - Limpieza de datos" 3 | author: Camila Salazar 4 | output: html_document 5 | --- 6 | 7 | ```{r setup, include=FALSE} 8 | knitr::opts_chunk$set(echo = TRUE) 9 | ``` 10 | 11 | ## ¿Qué vamos a aprender? 12 | 13 | En esta clase vamos a aprender a usar algunos paquetes para limpiar datos y estructurar información en R. En las clases anteriores vimos algunas características que deben tener los datos limpios: 14 | 15 | 1. Cada columna es una variable: una variable contiene todos los valores que miden el atributo (altura, peso, temperatura, etc.) 16 | 2.
Cada fila es una observación 17 | 18 | ## ¿Qué paquetes vamos a necesitar? 19 | 20 | ```{r echo=T, message=FALSE, warning=FALSE} 21 | 22 | library(dplyr) 23 | library(tidyr) 24 | library(readxl) 25 | ``` 26 | 27 | ## Importar los datos 28 | Antes de comenzar, hay que cambiar el directorio de trabajo y seleccionar el folder en donde tenemos nuestros archivos. Esto se hace con el comando `setwd()`. 29 | 30 | Ahora podemos importar los archivos de varias formas. Podemos hacerlo desde el menú de arriba, usando: File -> Import Dataset y seleccionando el tipo de archivo que queremos importar. 31 | 32 | También podemos hacerlo escribiendo el código. Es siempre recomendable asignar el archivo que importamos a un objeto con el símbolo `<-`. Inicialmente vamos a importar unos archivos .xlsx, por lo que usamos el paquete `readxl`, que ya instalamos. 33 | 34 | Los archivos que vamos a usar inicialmente son: 35 | ```{r echo=T, message=FALSE, warning=FALSE} 36 | #Recuerden cambiar el directorio de trabajo 37 | setwd("/Users/Camila/Documents/Curso periodismo datos/2017/Clases/Limpieza_R") 38 | morosidad16 <- read_xlsx("morosidad16.xlsx") 39 | morosidad17 <- read_xlsx("morosidad17.xlsx") 40 | ``` 41 | 42 | ## Explorar los datos y unir las bases 43 | Antes de comenzar a limpiar o analizar los datos tenemos que explorarlos. Para estos primeros ejercicios vamos a usar las bases de datos de morosidad con la CCSS. 44 | 45 | Ambas bases de datos tienen las siguientes variables: 46 | 47 | * `id`: cédula del deudor ya sea cédula física o jurídica. 48 | * `nombre`: Nombre de la persona física o jurídica. 49 | * `deuda16` y `deuda17` : monto de la deuda en colones. 50 | * `situación`: situación en la que se encuentra la deuda 51 | * `lugar.pago`: sucursal donde se tiene que cancelar la deuda 52 | * `Estado`: Estado de la deuda 53 | 54 | Como primer paso vamos a unir ambas bases de datos.
55 | 56 | ### Unir bases de datos 57 | 58 | Para unir bases de datos necesitamos una o más variables en común entre las dos bases de datos. Existen varias funciones para unir bases de datos. Una de ellas es `merge`, que sirve para unir columnas de dos bases de datos diferentes. Para unir bases es clave hacerse la pregunta: **¿Queremos conservar todas las observaciones o solo las que hacen match?**. De la respuesta depende la función que usemos. 59 | 60 | La sintaxis de `merge()` es simple: 61 | `merge(base1, base2, by.x="nombre variable base 1", by.y="nombre variable base 2")` 62 | 63 | En el caso en el que la variable se llame igual en las dos bases: 64 | `merge(base1, base2, by="nombre variable")` 65 | 66 | En esos ejemplos, el comando une solamente los casos en común entre las dos bases. 67 | 68 | Si queremos que se unan **todos** los casos, usamos la opción `all=TRUE`: 69 | `merge(base1, base2, by="nombre variable", all=TRUE)` 70 | 71 | Pueden ver este documento donde se muestra cómo funciona esta función con detalle. [link](http://www.princeton.edu/~otorres/Merge101R.pdf) 72 | 73 | Para este ejercicio queremos unir las dos bases de morosidad en una sola, dejando todas las observaciones de ambas bases, entonces usamos el comando: 74 | 75 | ```{r echo=T, message=FALSE, warning=FALSE} 76 | morosidad <- merge(morosidad17, morosidad16, by = "id", all = TRUE) 77 | ``` 78 | 79 | Ahora podemos convertir la base en una tibble para facilitar la lectura: 80 | ```{r echo=T, message=FALSE, warning=FALSE} 81 | morosidad <- tbl_df(morosidad) 82 | ``` 83 | 84 | ### Explorar los datos 85 | Ahora que tenemos una nueva base de datos, podemos explorar sus contenidos.
Usemos las siguientes funciones: 86 | 87 | Si queremos imprimir la base de datos, nada más ponemos el nombre del objeto: 88 | ```{r echo=T, message=FALSE, warning=FALSE} 89 | morosidad 90 | ``` 91 | 92 | `dim()` Esta función nos permite ver la dimensión de la base de datos; en este caso tenemos 10 variables y 78.703 observaciones. 93 | ```{r echo=T, message=FALSE, warning=FALSE} 94 | dim(morosidad) 95 | ``` 96 | 97 | `head()` Nos permite ver las primeras filas de la base de datos. Incluso podemos seleccionar la cantidad de filas que queremos ver, por ejemplo `head(base, n=20)`, lo cual nos muestra las primeras 20 filas. También podemos ver que nos muestra el tipo de variable. 98 | ```{r echo=T, message=FALSE, warning=FALSE} 99 | head(morosidad) 100 | ``` 101 | 102 | `tail()` Nos permite ver las últimas filas. Tiene la misma sintaxis que head() 103 | ```{r echo=T, message=FALSE, warning=FALSE} 104 | tail(morosidad) 105 | ``` 106 | 107 | `glimpse()` nos permite explorar las variables. Nos dice, al lado de cada variable, cuál es el tipo. Por ejemplo nos dice que nombre.x es carácter y que deuda17 es "double", el cual es un formato numérico. 108 | ```{r echo=T, message=FALSE, warning=FALSE} 109 | glimpse(morosidad) 110 | ``` 111 | 112 | 113 | # dplyr 114 | Uno de los paquetes más útiles para manipular datos de forma fácil es `dplyr`. Este paquete tiene, entre otras, cinco funciones para manipular datos: `select()`, `filter()`, `arrange()`, `mutate()` y `summarize()`. 115 | 116 | **Veamos cómo se usan** 117 | 118 | ### select() 119 | Select nos permite seleccionar columnas. La sintaxis sería: `select(dataframe, col1, col2)`, donde col1 y col2 se refieren a los nombres de las columnas que queramos seleccionar.
120 | 121 | ![](/Users/Camila/Desktop/select.png) 122 | 123 | Por ejemplo, supongamos que queremos seleccionar únicamente las columnas de id y deuda: 124 | ```{r echo=T, message=FALSE, warning=FALSE} 125 | select(morosidad, id, deuda17, deuda16) 126 | ``` 127 | 128 | También podemos seleccionar todas las columnas menos algunas; esto lo hacemos poniendo `-` antes del nombre de la columna que no queremos seleccionar. Por ejemplo, si queremos todas las columnas menos `Estado`: 129 | 130 | ```{r echo=T, message=FALSE, warning=FALSE} 131 | select(morosidad, -Estado) 132 | ``` 133 | 134 | Si queremos seleccionar un rango de columnas, por ejemplo de `id` a `situacion.x`, usamos `:`. 135 | 136 | ```{r echo=T, message=FALSE, warning=FALSE} 137 | select(morosidad, id:situacion.x) 138 | ``` 139 | 140 | Si queremos guardar el resultado de esa función en un nuevo objeto, debemos asignarlo con `<-`. Por ejemplo, para este ejercicio **nos interesa quedarnos únicamente con columnas que no estén repetidas**. Si observamos la base, nos damos cuenta de que las columnas de nombre, lugar de pago y situación se repiten, entonces vamos a deseleccionar esas columnas y crear un nuevo objeto que se llame `morosidad1`. 141 | 142 | ```{r echo=T, message=FALSE, warning=FALSE} 143 | morosidad1 <- select(morosidad, -nombre.y, -situacion.y, -lugar.pago.y) 144 | morosidad1 145 | ``` 146 | 147 | ### filter() 148 | La función filter nos permite filtrar filas. 149 | 150 | ![](/Users/Camila/Desktop/filter.png) 151 | 152 | La sintaxis es simple: `filter(base, condicion)`, donde condición es la condición lógica por la que queremos filtrar datos. Para ello usamos operadores lógicos: 153 | 154 | * `>`: mayor que 155 | * `<`: menor que 156 | * `>=`: mayor o igual que 157 | * `<=`: menor o igual que 158 | * `==`: igual que (se ponen **dos** signos de igual) 159 | * `!=`: diferente 160 | * `&`: y 161 | * `|`: o 162 | * `is.na(variable)`: filtra los valores en blanco de la variable seleccionada.
163 | * `!is.na(variable)`: filtra los valores que **no** están en blanco de la variable. 164 | 165 | Por ejemplo, si queremos filtrar solamente las deudas superiores a un millón: 166 | 167 | ```{r echo=T, message=FALSE, warning=FALSE} 168 | filter(morosidad1, deuda17>1000000) 169 | ``` 170 | 171 | O las deudas que crecieron entre 2016 y 2017: 172 | 173 | ```{r echo=T, message=FALSE, warning=FALSE} 174 | filter(morosidad1, deuda17>deuda16) 175 | ``` 176 | 177 | O solamente las deudas mayores a un millón y de difícil cobro: 178 | 179 | ```{r echo=T, message=FALSE, warning=FALSE} 180 | filter(morosidad1, deuda17>1000000 & situacion.x=="DIFICIL COBRO") 181 | ``` 182 | 183 | En este ejemplo, tenemos deudas del 2016 y del 2017, y **nos interesa analizar únicamente los casos de las empresas o personas que han estado morosas por los dos años**. Para ello podemos usar la función filter: 184 | 185 | ```{r echo=T, message=FALSE, warning=FALSE} 186 | morosidad1 <- filter(morosidad1, !is.na(deuda17), !is.na(deuda16)) 187 | ``` 188 | El código de arriba lo que hace es filtrar la base por todos aquellos registros que no tengan valores vacíos en 2017 y luego por todos los que no tienen registros vacíos en 2016. Como podemos ver, esto nos da como resultado menos registros en nuestra base de datos. 189 | 190 | ### mutate() 191 | Mutate nos permite crear nuevas columnas de forma fácil. 192 | 193 | Podemos crear una variable que nos diga cuánto cambió la deuda, que es la diferencia entre `deuda17` y `deuda16`: 194 | 195 | ```{r echo=T, message=FALSE, warning=FALSE} 196 | morosidad1 <- mutate(morosidad1, cambio.deuda=deuda17-deuda16) 197 | ``` 198 | 199 | Ahora podemos crear una nueva variable que categorice el cambio de la deuda según si aumentó o no. Esto podemos hacerlo con la función `if_else()` o `ifelse()` (funcionan igual). La sintaxis es: `ifelse(condición, valor cierto, valor falso)`. (Es similar a la función if en Excel).
```{r echo=T, message=FALSE, warning=FALSE}
morosidad1 <- mutate(morosidad1, tipo.cambio=ifelse(cambio.deuda<0,"disminuyó", "aumentó"))
```

With `mutate()` we can create several variables at once, separating them with commas, for example:
```{r echo=T, message=FALSE, warning=FALSE}
morosidad1 <- mutate(morosidad1, cambio.deuda=deuda17-deuda16,
                     tipo.cambio=ifelse(cambio.deuda<0,"disminuyó", "aumentó"))
```

### arrange()

`arrange()` lets us sort the data set by one or more columns.

For example, to sort the data in ascending order by deuda17 and then by cambio.deuda:
```{r echo=T, message=FALSE, warning=FALSE}
morosidad1 <- arrange(morosidad1, deuda17, cambio.deuda)
```

If we want descending order, we use `desc()`:
```{r echo=T, message=FALSE, warning=FALSE}
morosidad1 <- arrange(morosidad1, desc(deuda17), desc(cambio.deuda))
```

### Simplifying the work: %>%
A very useful operator when working with `dplyr` is the pipe operator, which looks like this: `%>%`. It makes working with functions much easier and lets us write commands with fewer lines of code.

**How does %>% work?**
We put the object (table or data frame) we want to apply the operations to first, in the form `data %>% function()`. This saves us from having to pass the object as the first argument of every dplyr function.
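As a minimal sketch with an invented mini table (assuming `dplyr` is loaded; `deudas` is a made-up example, not the course data), these two calls do exactly the same thing:

```r
library(dplyr)

# Invented mini data set, only to illustrate the pipe
deudas <- data.frame(id = 1:3, deuda = c(2000000, 500000, 1500000))

# Without the pipe: the data is the first argument
filter(deudas, deuda > 1000000)

# With the pipe: %>% passes the object on the left as the first argument
deudas %>% filter(deuda > 1000000)
```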
For example, let's recap all the lines of code we used above to clean the data set:
```{r echo=T, message=FALSE, warning=FALSE, eval=FALSE}
morosidad1 <- select(morosidad, -nombre.y, -situacion.y, -lugar.pago.y)
morosidad1 <- filter(morosidad1, !is.na(deuda17), !is.na(deuda16))
morosidad1 <- mutate(morosidad1, cambio.deuda=deuda17-deuda16)
morosidad1 <- mutate(morosidad1, tipo.cambio=ifelse(cambio.deuda<0,"disminuyó", "aumentó"))
morosidad1 <- arrange(morosidad1, desc(deuda17), desc(cambio.deuda))
```

All of the steps above could have been done more simply using `%>%`:
```{r echo=T, message=FALSE, warning=FALSE}
morosidad2 <- morosidad %>%
  select(-nombre.y, -situacion.y, -lugar.pago.y) %>%
  filter(!is.na(deuda17), !is.na(deuda16)) %>%
  mutate(cambio.deuda=deuda17-deuda16,
         tipo.cambio=ifelse(cambio.deuda<0,"disminuyó", "aumentó")) %>%
  arrange(desc(deuda17), desc(cambio.deuda))
```

### Exporting the clean data to other formats
Now that we have the clean data set, we can export it to other formats, for example CSV:
```{r echo=T, message=FALSE, warning=FALSE}
write.csv(morosidad2, "baselimpia.csv")
```

# tidyr
`tidyr` is a package designed to produce "tidy" data. Tidy data follows two principles we have already seen repeatedly: each variable is in a single column, and each row is an observation.

![](/Users/Camila/Desktop/tidy.png)

To better understand what "tidy data" is, you can read this article: [Tidy data - Hadley Wickham](http://vita.had.co.nz/papers/tidy-data.pdf)

This package has four main functions: `gather()`, `spread()`, `separate()` and `unite()`.
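To make the two principles concrete before we load the course files, here is a small invented example (not the course data): the same counts stored in a wide layout, where the values male/female appear as column headers, and in a tidy layout, where sex is a column of its own:

```r
# Wide layout: values of the variable "sexo" are used as column names
ancha <- data.frame(grade = c("A", "B"), male = c(3, 5), female = c(4, 1))

# Tidy layout: each variable (grade, sexo, frecuencia) is one column
# and each row is one observation
ordenada <- data.frame(grade      = c("A", "B", "A", "B"),
                       sexo       = c("male", "male", "female", "female"),
                       frecuencia = c(3, 5, 4, 1))
```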
Before looking at the functions, let's import the working files:
```{r echo=T, message=FALSE, warning=FALSE}
# We wrap the result in tbl_df() to convert it to a tibble right away
# (in newer versions of dplyr, as_tibble() replaces the deprecated tbl_df())
estudiantes <- tbl_df(read.csv("students.csv", header = T, sep= ","))
estudiantes2 <- tbl_df(read.csv("students2.csv", header = T, sep= ","))
```

Let's look at `estudiantes`:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes
```

**What is the problem?**
male and female are values of the variable sex, so the structure of the data set needs to change. This structure is known as "wide format", and we have to convert it to "long format". For that we use the gather() function.

### gather()
The function takes multiple columns, collapses them into a single one, and creates a new column holding the corresponding values.

The syntax is `gather(data, key, value, columns)` or `data %>% gather(key, value, columns)`, where:

* `data`: the table or data frame.
* `key`: the name we give the new variable that holds the collapsed column names.
* `value`: the name of the new variable that holds the values.
* `columns`: the columns we want to collapse. We can list them separated by commas, or use the `-` operator to select all columns except one.
In this example, the command to restructure `estudiantes` would be:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes_long <- gather(estudiantes, sexo, frecuencia, -grade)
estudiantes_long
```

### spread()
It is the inverse of `gather()` and would take us back to the original table:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes_wide <- spread(estudiantes_long, sexo, frecuencia)
estudiantes_wide
```

Now let's see what happens with `estudiantes2`:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes2
```

In this case we have a double problem: **values of the same variable sit in different columns, and different variables sit in a single column**. Here males and females are split according to the class they are in: 1 and 2.

So we need two steps. First we use the `gather()` function:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes2_long <- gather(estudiantes2, sexo_clase, frecuencia, -grade)
estudiantes2_long
```

And then we use the `separate()` function.

### separate()
This function lets us split columns. The syntax is:

`separate(data, col, into, sep)`, where:

* `data`: the table or data frame.
* `col`: the column to split.
* `into`: the columns to split into. They can be given as a vector of the form c("col1", "col2").
* `sep`: the separator, for example commas, periods, underscores or other characters, written as sep="_" (the character used as separator goes between the quotes). If the argument is not given, R tries to detect the pattern to split the data on.

In this example, we want to separate sex from class.
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes2_long2 <- separate(estudiantes2_long, sexo_clase, c("sexo", "clase"))
# Here R detected the separator character; the result is the same as if we had set sep="_"
estudiantes2_long2
```

We can simplify these steps into a single one using `%>%`:

```{r echo=T, message=FALSE, warning=FALSE}
estudiantes2_long <- estudiantes2 %>%
  gather(sexo_clase, frecuencia, -grade) %>%
  separate(sexo_clase, c("sexo", "clase")) %>%
  print
# The print command prints the result
```

### unite()
It is the opposite of separate:
`unite(data, col, ... , sep)`, where:

* `data`: the table or data frame.
* `col`: the new column with the united values.
* `...`: the list of columns we want to unite.
* `sep`: the separator placed between the united values, for example _.

In this case, to go back to the original table:
```{r echo=T, message=FALSE, warning=FALSE}
estudiantes2_unida <- estudiantes2_long %>%
  unite(sexo_clase, sexo, clase, sep="-") %>%
  print
```

#### Recommended links

[Data Processing with dplyr & tidyr - RPubs](https://rpubs.com/bradleyboehmke/data_wrangling)
[DataCamp](https://www.datacamp.com)

-------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-skills-Limpieza/morosidad16.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/sp-skills-Limpieza/morosidad16.xlsx -------------------------------------------------------------------------------- /r-package/inst/tutorials/sp-skills-Limpieza/morosidad17.xlsx:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/school-of-data/r-consortium-proposal/77d32189753755bb30ed7e2ddb0d6937fb45ac68/r-package/inst/tutorials/sp-skills-Limpieza/morosidad17.xlsx -------------------------------------------------------------------------------- /r-package/man/results.Rd: --------------------------------------------------------------------------------
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/raw_data_introduction.R
\docType{data}
\name{results}
\alias{results}
\alias{demographic1}
\alias{demographic2}
\alias{gbr_sp}
\title{Raw data used in introduction data recipe}
\format{Three objects of type \code{data.frame} named results, demographic1 and demographic2.}
\usage{
data(introduction_raw)
}
\description{
For further details see \code{learnr::run_tutorial("en-introduction", package = "ddj")}
}
\examples{
data(introduction_raw)
}
\keyword{datasets}
-------------------------------------------------------------------------------- /submission_form.txt: --------------------------------------------------------------------------------
Heidi Seibold

University of Zurich / School of Data

heidi@schoolofdata.ch

0041 78 20 33 44 0

School of Data Material Development

11200 (+ 7000)

We would like to develop R learning material for journalists in four different languages.

We are aware that we are submitting this proposal after the deadline and have been in contact with Gabor Csardi about this. We would be very grateful if you would consider our proposal.
--------------------------------------------------------------------------------