├── .github
│   ├── .gitignore
│   └── workflows
│       └── bookdown.yaml
├── .Rbuildignore
├── under_construction.jpg
├── .gitignore
├── resources
│   ├── file_ingestions
│   │   ├── fig1.png
│   │   ├── fig2.jpeg
│   │   ├── fig2.png
│   │   ├── sample.xml
│   │   ├── sample.json
│   │   ├── Iris.csv
│   │   └── Iris.xlsx
│   ├── mapcharts
│   │   └── religions.xlsx
│   ├── r_neural_network
│   │   ├── loss.png
│   │   └── nn.png
│   ├── sample_project
│   │   └── election.jpg
│   ├── wordcloud2_tutorial
│   │   ├── sun.jpg
│   │   ├── HTML.png
│   │   ├── star.png
│   │   ├── tips.png
│   │   ├── word0.png
│   │   ├── word1.png
│   │   ├── word2.png
│   │   ├── Wordcloud.png
│   │   └── wordcloud_sun.png
│   ├── ggplot_stat_and_theme
│   │   ├── change.png
│   │   └── change2.png
│   ├── color_deficiency_images
│   │   ├── new-1.png
│   │   ├── new-2.png
│   │   ├── new-3.png
│   │   ├── new-4.png
│   │   ├── new-5.png
│   │   ├── new-6.png
│   │   ├── new-7.png
│   │   ├── new-8.png
│   │   ├── original-1.png
│   │   ├── original-2.png
│   │   ├── original-3.png
│   │   ├── original-4.png
│   │   ├── original-5.png
│   │   ├── original-6.png
│   │   ├── original-7.png
│   │   └── original-8.png
│   ├── edav_history_of_charts
│   │   ├── EDAV_CC.pdf
│   │   ├── History_Of_Charts.png
│   │   └── History_Of_Charts_Purple.png
│   ├── machine_learning_tutorial
│   │   ├── KNN.png
│   │   ├── Kfold.png
│   │   ├── NaiveBayes.png
│   │   ├── R-Squared.jpg
│   │   ├── Roccurves.png
│   │   └── ConfusionMatrix.png
│   ├── tutorial_pull_request_mergers
│   │   ├── 1.png
│   │   ├── 2.png
│   │   ├── 3.png
│   │   ├── 4.png
│   │   ├── chap_1.png
│   │   ├── yml_1.png
│   │   ├── check_format.png
│   │   ├── files_changed.png
│   │   ├── conflicted_file.png
│   │   ├── resolve_conflicts.png
│   │   ├── request_change_files.png
│   │   ├── request_change_lines.png
│   │   ├── delete_most_of_bookdown_yml.png
│   │   └── request_change_submit_review.png
│   ├── wordclouds_video_tutorial
│   │   ├── cover.png
│   │   ├── slides.pptx
│   │   └── Apple_logo_black.png.png
│   ├── edavsurveyandanalysis
│   │   ├── EDAV CC (1).pdf
│   │   └── EDAV COMMUNITY CONTRIBUTION (Responses).xlsx
│   ├── ggvis_cheatsheet_image
│   │   └── ggvis_cheatsheet.pdf
│   ├── ml_interview_concept_mindmap
│   │   └── ml_mindmap.pdf
│   ├── rstudio_editor_theme_setup_imgs
│   │   ├── 1.2
│   │   │   └── try.png
│   │   ├── 2.1
│   │   │   ├── css.png
│   │   │   └── rstheme.png
│   │   ├── 2.3
│   │   │   └── global.png
│   │   └── 1.1
│   │       ├── appearance.png
│   │       └── preference.png
│   ├── tidytext_cheatsheet
│   │   └── tidytext_cheatsheet.pdf
│   ├── hypothesis_testing_guide_in_r
│   │   └── hypothesis_testing.png
│   ├── plotly_python_3d_surface_images
│   │   ├── 3d_surface_plot1.png
│   │   ├── 3d_surface_plot2.png
│   │   └── maximum_likelihood_function.JPG
│   ├── data_explorer_cheatsheet
│   │   └── data_explorer_cheat_sheet.pdf
│   ├── cheatsheet_GGally_plots
│   │   └── cheatsheet_for_ggally-page-001.jpg
│   └── Visualization_in_python_vs_R
│       ├── tophit.csv
│       └── vaccinations.csv
├── _output.yml
├── sample_project.Rmd
├── _common.R
├── index.Rmd
├── cctemplate.Rproj
├── README.md
├── python_seaborn.Rmd
├── api_requests_and_web_scraping.Rmd
├── ggvis_cheatsheet.Rmd
├── machine_learning_r.Rmd
├── appendix_initial_setup.Rmd
├── edavsurveyandanalysis.Rmd
├── two_fun_r_packages.Rmd
├── ml_interview_concept_mindmap.Rmd
├── bioconductor.Rmd
├── cheatsheet_datascience.Rmd
├── become_better_business_analysts.Rmd
├── ggplot2_matplotlib_cheatsheet.Rmd
├── tidytext_cheatsheet.Rmd
├── edav_data_importance.Rmd
├── ggplotscheatsheet.Rmd
├── bestparcoords.Rmd
├── cheatsheet_of_ggally_plots.Rmd
├── plotly_cheatsheet.Rmd
├── ggmap_cheatsheet.Rmd
├── googleVis.Rmd
├── ggstatsplot.Rmd
├── DESCRIPTION
├── network_visualization.Rmd
├── misleading_visualizations.Rmd
├── cheatsheet_patchwork.Rmd
├── playful_r_packages.rmd
├── _bookdown.yml
├── resparking_creativity_for_visualizations.Rmd
├── learning_echarts4r.Rmd
├── assignment.Rmd
├── covid_shiny_dashboard.Rmd
├── python_geospatial_visualization.Rmd
├── data_explorer_cheatsheet.Rmd
├── plotly_python_surface.Rmd
├── mapcharts.Rmd
├── r_neural_network.Rmd
├── color_deficiency.Rmd
├── file_ingestions.Rmd
├── appendix_pull_request_tutorial.Rmd
├── gradient_descent.Rmd
├── wordclouds_video_tutorial.Rmd
├── sandbox.Rmd
├── ggfortify_and_autoplotly_tutorial.Rmd
├── introduction_to_plotly.Rmd
├── dygraph_tutorial.Rmd
├── bubble_chart_application.Rmd
└── tutorial_of_making_different_types_of_interactive_graphs.Rmd
/.github/.gitignore:
--------------------------------------------------------------------------------
*.html
--------------------------------------------------------------------------------
/.Rbuildignore:
--------------------------------------------------------------------------------
^.*\.Rproj$
^\.Rproj\.user$
--------------------------------------------------------------------------------
/under_construction.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/under_construction.jpg
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rhistory
.RData
.Ruserdata
*.Rproj
**/.DS_Store
--------------------------------------------------------------------------------
/resources/file_ingestions/fig1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/file_ingestions/fig1.png
--------------------------------------------------------------------------------
/resources/file_ingestions/fig2.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/file_ingestions/fig2.jpeg
--------------------------------------------------------------------------------
/resources/file_ingestions/fig2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/file_ingestions/fig2.png
--------------------------------------------------------------------------------
/resources/mapcharts/religions.xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/mapcharts/religions.xlsx
--------------------------------------------------------------------------------
/resources/r_neural_network/loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/r_neural_network/loss.png
--------------------------------------------------------------------------------
/resources/r_neural_network/nn.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/r_neural_network/nn.png
--------------------------------------------------------------------------------
/resources/sample_project/election.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/sample_project/election.jpg
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/sun.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/sun.jpg
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/HTML.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/HTML.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/star.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/star.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/tips.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/tips.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/word0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/word0.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/word1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/word1.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/word2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/word2.png
--------------------------------------------------------------------------------
/resources/ggplot_stat_and_theme/change.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/ggplot_stat_and_theme/change.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-1.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-2.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-3.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-4.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-5.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-6.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-7.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/new-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/new-8.png
--------------------------------------------------------------------------------
/resources/edav_history_of_charts/EDAV_CC.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/edav_history_of_charts/EDAV_CC.pdf
--------------------------------------------------------------------------------
/resources/ggplot_stat_and_theme/change2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/ggplot_stat_and_theme/change2.png
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/KNN.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/KNN.png
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/Kfold.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/Kfold.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/1.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/2.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/3.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/4.png
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/Wordcloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/Wordcloud.png
--------------------------------------------------------------------------------
/resources/wordclouds_video_tutorial/cover.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordclouds_video_tutorial/cover.png
--------------------------------------------------------------------------------
/resources/edavsurveyandanalysis/EDAV CC (1).pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/edavsurveyandanalysis/EDAV CC (1).pdf
--------------------------------------------------------------------------------
/resources/wordcloud2_tutorial/wordcloud_sun.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordcloud2_tutorial/wordcloud_sun.png
--------------------------------------------------------------------------------
/resources/wordclouds_video_tutorial/slides.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordclouds_video_tutorial/slides.pptx
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-1.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-2.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-3.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-4.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-5.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-6.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-7.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-7.png
--------------------------------------------------------------------------------
/resources/color_deficiency_images/original-8.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/color_deficiency_images/original-8.png
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/NaiveBayes.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/NaiveBayes.png
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/R-Squared.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/R-Squared.jpg
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/Roccurves.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/Roccurves.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/chap_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/chap_1.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/yml_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/yml_1.png
--------------------------------------------------------------------------------
/_output.yml:
--------------------------------------------------------------------------------
bookdown::bs4_book:
  repo:
    base: https://github.com/jtr13/cc22mw
    branch: main
    icon: "fa-solid fa-chart-scatter"
--------------------------------------------------------------------------------
/resources/edav_history_of_charts/History_Of_Charts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/edav_history_of_charts/History_Of_Charts.png
--------------------------------------------------------------------------------
/resources/ggvis_cheatsheet_image/ggvis_cheatsheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/ggvis_cheatsheet_image/ggvis_cheatsheet.pdf
--------------------------------------------------------------------------------
/resources/machine_learning_tutorial/ConfusionMatrix.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/machine_learning_tutorial/ConfusionMatrix.png
--------------------------------------------------------------------------------
/resources/ml_interview_concept_mindmap/ml_mindmap.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/ml_interview_concept_mindmap/ml_mindmap.pdf
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/1.2/try.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/1.2/try.png
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/2.1/css.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/2.1/css.png
--------------------------------------------------------------------------------
/resources/tidytext_cheatsheet/tidytext_cheatsheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tidytext_cheatsheet/tidytext_cheatsheet.pdf
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/2.1/rstheme.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/2.1/rstheme.png
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/2.3/global.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/2.3/global.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/check_format.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/check_format.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/files_changed.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/files_changed.png
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/1.1/appearance.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/1.1/appearance.png
--------------------------------------------------------------------------------
/resources/rstudio_editor_theme_setup_imgs/1.1/preference.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/rstudio_editor_theme_setup_imgs/1.1/preference.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/conflicted_file.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/conflicted_file.png
--------------------------------------------------------------------------------
/resources/wordclouds_video_tutorial/Apple_logo_black.png.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/wordclouds_video_tutorial/Apple_logo_black.png.png
--------------------------------------------------------------------------------
/resources/edav_history_of_charts/History_Of_Charts_Purple.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/edav_history_of_charts/History_Of_Charts_Purple.png
--------------------------------------------------------------------------------
/resources/hypothesis_testing_guide_in_r/hypothesis_testing.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/hypothesis_testing_guide_in_r/hypothesis_testing.png
--------------------------------------------------------------------------------
/resources/plotly_python_3d_surface_images/3d_surface_plot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/plotly_python_3d_surface_images/3d_surface_plot1.png
--------------------------------------------------------------------------------
/resources/plotly_python_3d_surface_images/3d_surface_plot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/plotly_python_3d_surface_images/3d_surface_plot2.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/resolve_conflicts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/resolve_conflicts.png
--------------------------------------------------------------------------------
/resources/data_explorer_cheatsheet/data_explorer_cheat_sheet.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/data_explorer_cheatsheet/data_explorer_cheat_sheet.pdf
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/request_change_files.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/request_change_files.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/request_change_lines.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/request_change_lines.png
--------------------------------------------------------------------------------
/resources/cheatsheet_GGally_plots/cheatsheet_for_ggally-page-001.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/cheatsheet_GGally_plots/cheatsheet_for_ggally-page-001.jpg
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/delete_most_of_bookdown_yml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/delete_most_of_bookdown_yml.png
--------------------------------------------------------------------------------
/resources/tutorial_pull_request_mergers/request_change_submit_review.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/tutorial_pull_request_mergers/request_change_submit_review.png
--------------------------------------------------------------------------------
/resources/plotly_python_3d_surface_images/maximum_likelihood_function.JPG:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/plotly_python_3d_surface_images/maximum_likelihood_function.JPG
--------------------------------------------------------------------------------
/resources/edavsurveyandanalysis/EDAV COMMUNITY CONTRIBUTION (Responses).xlsx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/jtr13/cc22mw/main/resources/edavsurveyandanalysis/EDAV COMMUNITY CONTRIBUTION (Responses).xlsx
--------------------------------------------------------------------------------
/sample_project.Rmd:
--------------------------------------------------------------------------------
# Sample project

Joe Biden and Donald Trump

This chapter gives a sample layout of your `Rmd` file.

![Test Photo](resources/sample_project/election.jpg)
--------------------------------------------------------------------------------
/_common.R:
--------------------------------------------------------------------------------
# from: https://github.com/hadley/r4ds/blob/master/_common.R
knitr::opts_chunk$set(
  cache = TRUE,
  fig.align = 'center',
  out.width = '80%',
  message = FALSE,
  warning = FALSE
)
--------------------------------------------------------------------------------
/index.Rmd:
--------------------------------------------------------------------------------
---
title: "Community Contributions for EDAV Fall 2022 Mon/Wed"
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
github-repo: jtr13/cc22mw
description: "This book contains community contributions for EDAV Fall 2022 Mon/Wed section"
---

# Welcome!
--------------------------------------------------------------------------------
/cctemplate.Rproj:
--------------------------------------------------------------------------------
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: Sweave
LaTeX: pdfLaTeX

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# cc22mw

Community contributions for STAT GR 5702 (EDAV) Fall 2022 Mon/Wed

The class projects will be collected as `.Rmd` files and rendered into a bookdown book: jtr13.github.io/cc22mw

Contributors (everyone), see: https://jtr13.github.io/cc22mw/github-submission-instructions.html

Collaborators, see: https://jtr13.github.io/cc22mw/tutorial-for-pull-request-mergers.html
--------------------------------------------------------------------------------
/python_seaborn.Rmd:
--------------------------------------------------------------------------------
# Data Visualization with Seaborn

Yingjie Qu (yq2350) & Liwen Zhu (lz2512)

We created an introduction to data visualization with Seaborn, a Python package. The introduction covers importing the package and visualizing typical graphs, such as histograms and scatterplots. We also show how to draw more complex diagrams, such as biplots and ridge plots, with the help of other packages.

Please see the files at https://github.com/A1anZhu/python_seaborn.
-------------------------------------------------------------------------------- /resources/file_ingestions/sample.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 1 4 | Alia 5 | 620 6 | IT 7 | 8 | 9 | 2 10 | Brijesh 11 | 440 12 | Commerce 13 | 14 | 15 | 3 16 | Yash 17 | 600 18 | Humanities 19 | 20 | 21 | 4 22 | Mallika 23 | 660 24 | IT 25 | 26 | 27 | 5 28 | Zayn 29 | 560 30 | IT 31 | 32 | -------------------------------------------------------------------------------- /api_requests_and_web_scraping.Rmd: -------------------------------------------------------------------------------- 1 | # Python tutorial: gather data through API and web scraping 2 | 3 | Yaohong Liang 4 | 5 | I wrote a Medium blog about how to get data through API requests and web scraping using Python. Since getting data is always the first thing we do before any data analysis, I think this could be helpful for anyone who wants to know how to get data themselves. The blog introduces two methods of extracting data from the Internet: the first is through API requests, and the other is through web scraping. I provide step-by-step examples for each method, and I also discuss the pros and cons of each at the end of the article. My blog can be found at the following link: 6 | 7 | [Python tutorial: gather data through API and web scraping](https://medium.com/@yaohong010/python-tutorial-gather-data-through-api-and-web-scraping-f55de6000cac) -------------------------------------------------------------------------------- /ggvis_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # ggvis cheat sheet 2 | 3 | P. Spencer Davis 4 | 5 | For my community contribution, I made a cheatsheet for ggvis. I did this because ggvis did not yet have a proper cheat sheet. While ggvis is no longer being developed, it has an advantage in the interactivity of the graphs that can be made with the package.
The process of creating this cheatsheet was difficult, but I managed to figure out the most efficient way to create the cheatsheet in time, and I feel that, if done again, I could do so faster and better. 6 | 7 | The following link leads to a pdf of the cheat sheet: https://github.com/psd2126/myrepo/blob/main/P_Spencer_Davis_ggvis_cheatsheet_final.pdf 8 | 9 | ![ggvis cheat sheet](resources/ggvis_cheatsheet_image/ggvis_cheatsheet.pdf) 10 | 11 | sources: 12 | https://ggvis.rstudio.com/layers.html 13 | https://github.com/rstudio/cheatsheets/blob/main/.github/CONTRIBUTING.md -------------------------------------------------------------------------------- /machine_learning_r.Rmd: -------------------------------------------------------------------------------- 1 | # Tutorial on Machine Learning in R 2 | 3 | Priyanka Balakumar and Hao Pan 4 | 5 | ## Introduction 6 | Machine learning is a branch of computer science that studies the design of algorithms that can learn. Within machine learning, there exist three main categories: supervised learning, unsupervised learning, and reinforcement learning. This ‘Machine Learning in R’ cheat sheet explores some basic supervised and unsupervised machine learning techniques. 7 | 8 | ## Motivation 9 | There are several advantages to implementing machine learning using R. Firstly, R provides easily explainable code, which is especially helpful for those starting out in machine learning who need to explain their code. In addition, R is a great language for easy data visualization. There are corresponding functions for different machine learning techniques that allow a user to visualize and understand performance results.
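As a quick illustration of the kind of supervised-learning workflow the cheat sheet covers, here is a minimal sketch (not taken from the cheat sheet itself) of fitting a simple model in base R and reading off a performance measure:

```{r}
# Fit a linear regression on the built-in mtcars data
model <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared: proportion of variance in mpg explained by the model
summary(model)$r.squared
```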
10 | 11 | Here is the link to the cheatsheet: 12 | https://github.com/Stephen-Pan30/EDAV-CommunityContribution/blob/main/Machine%20Learning%20Cheatsheet.pdf 13 | -------------------------------------------------------------------------------- /appendix_initial_setup.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Appendices {-} 2 | 3 | This chapter contains additional information regarding how to set up the GitHub repo. 4 | 5 | # Github initial setup 6 | 7 | Joyce Robbins 8 | 9 | ## Create new repo 10 | 11 | Create a new repository by copying this template: http://www.github.com/jtr13/cctemplate and following instructions in the README. 12 | 13 | ## Pages in repo settings 14 | 15 | Change source to gh-pages 16 | 17 | May have to trigger GHA to get it to work 18 | 19 | ## Add packages to DESCRIPTION file 20 | 21 | *Need a better process...* 22 | 23 | Downloaded submissions from CourseWorks 24 | 25 | Create a DESCRIPTION file. Add dependencies with `projthis::proj_update_deps()` 26 | 27 | https://twitter.com/ijlyttle/status/1370776366585614342 28 | 29 | Add these Imports to the real DESCRIPTION file.
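The dependency workflow in these notes can be sketched as follows (a sketch only; it assumes the `projthis`, `devtools`, and `pak` packages are installed):

```{r, eval=FALSE}
# Scan the project's .Rmd files and update the DESCRIPTION dependencies
projthis::proj_update_deps()

# Investigate reverse dependencies of a package that failed to install
devtools::revdep("magick")

# Inspect the dependency tree of a problematic package (qdap pulls in rJava)
pak::pkg_deps_tree("qdap")
```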
30 | 31 | Found problematic packages by looking at reverse dependencies of the packages that failed to install: 32 | 33 | `devtools::revdep()` 34 | 35 | Also used `pak::pkg_deps_tree()` 36 | 37 | Problems: 38 | 39 | `magick` 40 | 41 | `rJava` dependency of `qdap` 42 | -------------------------------------------------------------------------------- /resources/file_ingestions/sample.json: -------------------------------------------------------------------------------- 1 | { 2 | "quiz": { 3 | "sport": { 4 | "q1": { 5 | "question": "Which one is correct team name in NBA?", 6 | "options": [ 7 | "New York Bulls", 8 | "Los Angeles Kings", 9 | "Golden State Warriros", 10 | "Huston Rocket" 11 | ], 12 | "answer": "Huston Rocket" 13 | } 14 | }, 15 | "maths": { 16 | "q1": { 17 | "question": "5 + 7 = ?", 18 | "options": [ 19 | "10", 20 | "11", 21 | "12", 22 | "13" 23 | ], 24 | "answer": "12" 25 | }, 26 | "q2": { 27 | "question": "12 - 8 = ?", 28 | "options": [ 29 | "1", 30 | "2", 31 | "3", 32 | "4" 33 | ], 34 | "answer": "4" 35 | } 36 | } 37 | } 38 | } -------------------------------------------------------------------------------- /edavsurveyandanalysis.Rmd: -------------------------------------------------------------------------------- 1 | # EDAV Survey and Analysis 2 | 3 | Varalika Mahajan and Vrinda Bhat 4 | 5 | ### Project Proposal 6 | 7 | #### Project Group: CC5 8 | 9 | Varalika Mahajan: vm2695 10 | 11 | Vrinda Bhat: vgb2113 12 | 13 | #### Visualization Preference Analysis 14 | 15 | ##### Overview 16 | 17 | Form Link: https://forms.gle/SYKeS6fqGktr5qrN8 18 | 19 | No of responses: 80 20 | 21 | Data Collected: https://docs.google.com/spreadsheets/d/1apWvWCVmk4donteDaWEuMjDvHc1gJZxfrbSXeILNJyE/edit?usp=sharing 22 | 23 | Full Document Link: https://docs.google.com/document/d/1RKa9HXl6icDS5ci4K7oLATZjFOp4oWNzELslTATMJL8/edit?usp=sharing 24 | 25 | ##### Goals: 26 | 27 | To understand people’s opinions on different visualizations: Different views on graphs and visualizations 
help us to understand their importance and the structural necessity of existing and most-used graphs. 28 | 29 | Validate if factors like age, work experience, and gender impact their opinion: Usually applying the learned classroom concepts to real-life work-related problems gives us a better understanding of what is more meaningful for business insights. Thus, we checked if factors like experience, age, and gender showed a huge impact on choices. 30 | 31 | -------------------------------------------------------------------------------- /two_fun_r_packages.Rmd: -------------------------------------------------------------------------------- 1 | # Two Interesting R packages 2 | 3 | Shiyuan Xu and Ying Hong 4 | 5 | As our community contribution, we made a demonstration of two exciting R packages called "cowsay" and "fortunes". 6 | 7 | The link to the Shiny app can be found here: 8 | https://shiyuan-xu.shinyapps.io/cc_xu_ying/ 9 | 10 | The link to the Rmd file can be found here: 11 | https://github.com/ShiyuanXu/Stats5702_CC 12 | 13 | ## Motivation 14 | As we learned in this class, R is a powerful tool to analyze and visualize data. However, sometimes when we get bored with R, we start to question what else R can do and what interesting packages R has. Therefore, we explored a few fun packages in R: “cowsay” and “fortunes”. 15 | In the project, we included some detailed explanations of what the packages do and how to use the arguments to control output. Afterward, we made a fun interactive tool using Shiny that combines the two packages; the link is attached above (as an interactive app cannot be rendered into meaningful static HTML). 16 | 17 | ## Evaluation 18 | From this project, besides the contents of the packages, we learned that coding can be fun and fascinating. Next time we would like to introduce more intriguing packages into our demo and create a more interesting interactive game.
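As a small taste of what the demonstration covers, here is a minimal sketch (assuming both packages are installed; the exact arguments we used are in the Rmd file linked above):

```{r, eval=FALSE}
library(fortunes)
library(cowsay)

fortune()                              # print a random R fortune
say(what = "Hello EDAV!", by = "cat")  # an ASCII cat says hello
say(what = "fortune", by = "random")   # a random animal recites a random fortune
```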
-------------------------------------------------------------------------------- /ml_interview_concept_mindmap.Rmd: -------------------------------------------------------------------------------- 1 | # Machine Learning Interview mindmap and commonly asked questions 2 | 3 | Haolong Liu 4 | 5 | As my community contribution, [the mindmap can be found here](resources/ml_interview_concept_mindmap/ml_mindmap.pdf) 6 | 7 | Description: I created a mind map for machine learning interviews and included many commonly asked questions, so that it gives people a broad understanding of many machine learning concepts and how to articulate this knowledge during an interview. 8 | 9 | Motivation: I am preparing for 2023 summer machine learning engineer intern technical interviews. Therefore, it is very beneficial to organize important machine learning knowledge and commonly asked interview questions so that I can perform better. 10 | 11 | The need it addresses: Machine learning is a really broad subject and can go very deep. This mind map is certainly not comprehensive. For example, it does not include RNN or CNN. However, machine learning intern jobs most likely will not ask very deep questions. My mind map gives people directions on what subjects they need to know, plus some specific questions to help people better understand and prepare. 12 | What I might do differently: Include more industrial application related questions if I have more experience. -------------------------------------------------------------------------------- /.github/workflows/bookdown.yaml: -------------------------------------------------------------------------------- 1 | # Workflow derived from https://github.com/r-lib/actions/tree/master/examples 2 | # Need help debugging build failures?
Start at https://github.com/r-lib/actions#where-to-find-help 3 | on: 4 | push: 5 | branches: [main, master] 6 | 7 | name: bookdown 8 | 9 | jobs: 10 | bookdown: 11 | runs-on: ubuntu-latest 12 | env: 13 | GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} 14 | steps: 15 | - uses: actions/checkout@v2 16 | 17 | - uses: r-lib/actions/setup-pandoc@v2 18 | 19 | - uses: r-lib/actions/setup-r@v2 20 | with: 21 | use-public-rspm: true 22 | 23 | - uses: r-lib/actions/setup-r-dependencies@v2 24 | with: 25 | pak-version: devel 26 | 27 | - name: Cache bookdown results 28 | uses: actions/cache@v2 29 | with: 30 | path: _bookdown_files 31 | key: bookdown-${{ hashFiles('**/*Rmd') }} 32 | restore-keys: bookdown- 33 | 34 | - name: Build site 35 | env: 36 | API_TOKEN: ${{ secrets.API_TOKEN }} 37 | run: Rscript -e 'bookdown::render_book("index.Rmd", quiet = TRUE)' 38 | 39 | - name: Deploy to GitHub pages 🚀 40 | uses: JamesIves/github-pages-deploy-action@4.1.4 41 | with: 42 | branch: gh-pages 43 | folder: _book 44 | -------------------------------------------------------------------------------- /bioconductor.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet of RNA-seq workflow in Bioconductor 2 | 3 | Yusong Zhao 4 | 5 | ### Motivation 6 | 7 | Bioconductor is an R-based project that specializes in processing biological data, and many specially designed packages are included in it. It clearly defines multiple workflows aiming to solve specific bioinformatics problems, such as RNA-seq analysis, gene annotation, etc. 8 | 9 | However, some of the workflows contain too many packages and cover too many questions, which makes it difficult for users to understand. Therefore, for the RNA-seq data analysis problem in bioinformatics, I made a cheatsheet covering some core analyses in the whole workflow, which may provide some help to beginners who are unfamiliar with Bioconductor as well as RNA-seq analysis.
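As a hedged illustration of the kind of core step the cheatsheet covers, here is a minimal sketch of a DESeq2 differential-expression run (illustrative only; `counts` is an assumed count matrix and `coldata` an assumed sample table with a `condition` column):

```{r, eval=FALSE}
library(DESeq2)

# Build the dataset object from a count matrix and sample metadata
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)

dds <- DESeq(dds)    # normalization, dispersion estimation, testing
res <- results(dds)  # log2 fold changes and adjusted p-values
head(res)
```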
10 | 11 | 12 | 13 | ### What I have learned & Evaluation 14 | 15 | I learned how to get biological files from relevant databases and how to perform basic analysis with Bioconductor packages and workflows. I also learned how to make a simple cheatsheet with PowerPoint. 16 | 17 | The workload is large, but its application scenario is relatively narrow (because it is not a common problem). In the future, I want to simplify more workflows in Bioconductor. Also, explanations of some biological concepts can be added to the cheatsheet so that people without prior knowledge in the field can use Bioconductor too. 18 | 19 | **File link:** https://github.com/Zhao-YS/22fall_EDAV_CC/blob/main/cheetsheet_yz4406.pdf 20 | 21 | **Citations:** 22 | 1. https://www.bioconductor.org/ 23 | 2. https://bioconductor.org/packages/release/workflows/html/rnaseqGene.html 24 | 3. https://bioconductor.org/packages/3.16/bioc/html/DESeq2.html -------------------------------------------------------------------------------- /cheatsheet_datascience.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet for data science 2 | 3 | Liri Chen and Shuo Liu 4 | 5 | ## What We Learned 6 | 7 | For our community contribution project, we made a data science cheatsheet, which provides an overall review of data science knowledge structure, applications, and opportunities. Our motivation for making this cheatsheet is that it will help us prepare for data science job interviews, as well as help us study for the finals of several courses. The cheatsheet includes knowledge in the areas of probability and statistics, hypothesis testing, and machine learning models. In the statistics section, we included probability functions and distributions. In the hypothesis testing section, we included knowledge of sample and population distributions, as well as ways of using sample statistics and confidence intervals to decide whether to reject the null hypothesis.
In the machine learning section, we incorporated models of linear regression, logistic regression, decision tree, random forest and KNN. We also covered knowledge of model evaluation, dimension reduction and neural networks. 8 | 9 | From this community contribution project, we organized and reviewed data science concepts and knowledge. Our future work includes expanding the existing knowledge points, adding more detailed descriptions, and creating front-end display pages, such as Github pages. 10 | 11 | ## Link 12 | 13 | Link to the cheatsheet: 14 | [https://drive.google.com/file/d/1gfHUbs1YxJ-_hpP4-jVsmw772312Cor4/view?usp=share_link]{target="_blank"} 15 | 16 | ## Reference 17 | 18 | This cheatsheet cites the template and some contents by Aaron Wang: [https://github.com/aaronwangy/Data-Science-Cheatsheet]{target="_blank"} 19 | 20 | -------------------------------------------------------------------------------- /become_better_business_analysts.Rmd: -------------------------------------------------------------------------------- 1 | # Becoming a better business analyst 2 | 3 | Jinyu Wang and Shuai Yuan 4 | 5 | As our community contribution, we wrote an article on becoming a better business analyst based on our own experience in the industry. The full article can be found here: 6 | 7 | Main Goal: 8 | 9 | The main goal of the project is to share some core information about business analytics and data analysis. Also, we include some advice on how to become a better business analyst based on our own experience. Since the job market is pretty bad these days, we think we should share some decent advice and help all of us improve together. We included three tips in detail: 10 | 11 | 1\. Proactive application of data science methods helps me to be more productive, giving me more time to work on things that matter. 12 | 13 | 2\. Applying some data science methods can help to get more accurate analysis results and create higher business value.
14 | 15 | 3\. When you practice your coding skills, outperform the job requirements, and use proactive data analysis to optimize daily jobs, you will stand out in the interview process. 16 | 17 | Evaluation: 18 | 19 | By completing this article, we fully summarized our internship experiences and rethought the good and bad steps we took. We learned from each other; it helped us see what each of us currently lacks, and we now have better plans for becoming better data analysts and for using our data analysis skills. For the evaluation, we think it would be better if we could share more details and visualize the projects we have done, so that other people can see them more clearly. 20 | -------------------------------------------------------------------------------- /ggplot2_matplotlib_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # Data Visualization with GGPlot2 (R) and MatPlotLib (Python) Cheat Sheet 2 | 3 | Kylie Brothers 4 | 5 | Motivation 6 | 7 | Throughout undergraduate, graduate, and professional careers, the two most widely used statistical languages are Python and R. Both of these languages have many similarities, but can have very differing syntaxes and coding implementations. 8 | 9 | For this project, it was important to create a cheat sheet so an individual can easily navigate both languages and transfer knowledge learned in this course to Python. Therefore, I created a cheat sheet for GGPlot2 in R, which we learned extensively in class, and MatPlotLib in Python, which has been the primary visualization package used in my undergraduate and graduate careers. Both of these packages have their own cheat sheets, respectively, but they are formatted differently and it can be hard to parse through both of them to gain the exact line(s) of code you need when making comparisons or migrating code to another programming language. 10 | 11 | Link to cheat sheet: 12 | 13 | If I Had More Time...
14 | 15 | If I had more time, or the next time I create a cheat sheet to navigate these packages, I would include faceting, subplots, and tick markers. In addition, I think it would be useful to also have a comparison for GGPlot2 (R) and Seaborn (Python); or MatPlotLib (Python) and Seaborn (Python). All three of these packages are heavily used for visualization, and having an easy-to-access cheat sheet for all of these languages and packages would make it a lot quicker and easier to traverse any code one is trying to implement. 16 | -------------------------------------------------------------------------------- /tidytext_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # Text data visualization by tidytext 2 | 3 | Xiaolin Sima and Yanni Chen 4 | 5 | As our community contribution, we created a cheatsheet about text data visualization by tidytext. 6 | [cheatsheet can be found here](resources/tidytext_cheatsheet/tidytext_cheatsheet.pdf) 7 | 8 | ## Motivation for the project and need it addressed 9 | 10 | Text mining and natural language processing are an essential but complicated field with a lot of tools and ways to analyze. However, we found tidytext to be a useful door into the world of text mining. When we encounter text data, using the tidytext package can make many text analysis tasks easier and more effective. Much of the infrastructure needed for text mining with tidy data frames already exists in other widely used packages like dplyr, tidyr and ggplot2. With that in mind, we created this cheatsheet, providing functions with examples that show how to use the tidytext package combined with this existing infrastructure to do some basic text analysis work.
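For a flavor of the workflow the cheatsheet covers, here is a minimal sketch of the basic word-frequency pipeline (a sketch under assumptions: `reviews` is a hypothetical data frame with a character column `text`):

```{r, eval=FALSE}
library(dplyr)
library(tidytext)

reviews %>%
  unnest_tokens(word, text) %>%  # split text into one word per row
  anti_join(stop_words) %>%      # drop common stop words
  count(word, sort = TRUE)       # word frequencies, most common first
```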
11 | 12 | ## Own evaluation of the project 13 | 14 | By creating the cheatsheet, we have not only learned how to use the tidytext package to do text frequency and sentiment analysis, but also become familiar with how to combine the tidytext package with other data visualization tools, for example ggplot and wordcloud. The cheatsheet is generally useful if someone wants to visualize text data analysis, specifically word frequency and sentiment analysis, using R. 15 | 16 | However, there are also places where we could improve. Adding more examples, not limited to the wine tasting reviews, would make the cheatsheet more general. Also, deeper sentiment analysis for each category could be listed for further analysis if more room were allowed. 17 | -------------------------------------------------------------------------------- /edav_data_importance.Rmd: -------------------------------------------------------------------------------- 1 | # Data importance & visualization 2 | 3 | Utsav Vachhani 4 | 5 | ## Description: 6 | This document is a collection of my thoughts on the importance of data visualization in a blog post format. In the article, I go through a brief history of how data visualization came to be and what purpose it has served. Additionally, I connect it to how it is utilized today and present some use cases. The blog is intended for non-technical users, so I also introduce some fundamental graphs that are common in data visualizations. Finally, I touch on some negative uses of data visualization in recent times and analyze instances to alert the audience to be on the lookout. 7 | 8 | ## Motivation: 9 | Data visualization has always been interesting to me as I come from a business background. It serves as a great complement to the rigorous analysis and experimentation done in data science and gives us a concise way to show insights through appealing visualizations.
I also like that it introduces a creative and artistic component to an otherwise analysis-heavy workload. I noticed that sometimes professionals, especially ones interested in data science or related fields, focus a lot on getting results but do not spend enough time translating them properly. I want this blog to showcase that data visualization is very influential today, since a lot of people will believe everything they see online without proper research. Therefore, it is all the more imperative to at least know the basics to identify inaccuracies when shown. When I have a completed personal website, I am planning to include this blog as an article to highlight the significance of data visualization. 10 | 11 | 12 | [Link to pdf](https://github.com/utsavkv/EDAV_CC/blob/main/EDAV_FinalCC.pdf){target="_blank"} -------------------------------------------------------------------------------- /ggplotscheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet of ggplot2 (ggplot2 & alluvial & heatmap) 2 | 3 | Yuehan Hu and Wen Chen 4 | 5 | ## The Cheatsheet Link: 6 | 7 | https://github.com/yuehanhu/ggplot2cheatsheet/blob/main/Cheatsheet.pdf 8 | 9 | ## Introduction to ggplot2 10 | 11 | ggplot2 is a plotting package that creates complex plots from data based on the grammar of graphics. Moreover, ggplot2 provides a more programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. This package provides a very convenient way for us to create many different kinds of visualizations. 12 | 13 | ## Summary of the Cheatsheet 14 | 15 | Our cheatsheet includes three parts. The first part includes the basic definition, some essential grammar, and the most common visualizations of ggplot2, which provides a clear framework for using the standard format and an easy way to distinguish the subtle differences between these visualizations.
The second part includes a brief description, essential grammar, an example, different curve types, and color and legend customizations of ggalluvial. Alluvial diagrams are very useful for visualizing frequency distributions over time or frequency tables involving several categorical variables. The third part includes a short introduction, an example, and color and legend customizations of heatmap. A heatmap is good for showing the relationship between two variables, one plotted on each axis. 16 | 17 | ## The evaluation of the Cheatsheet 18 | 19 | Although there are many similarities in each part of the grammar when creating different visualizations, we marked the differences in grey, which helps us distinguish and remember them. This cheatsheet summarizes some important kinds of visualizations in ggplot2, and making it also led us to review the ggplot2 content of this class in general. 20 | 21 | ```{r, include=FALSE} 22 | knitr::opts_chunk$set(echo = TRUE) 23 | ``` 24 | 25 | -------------------------------------------------------------------------------- /bestparcoords.Rmd: -------------------------------------------------------------------------------- 1 | # Best Parallel Coordinates R Package 2 | 3 | Shubham Kaushal and Daniel Young 4 | 5 | ## Introduction 6 | 7 | GGally has a ggparcoord function, but the user is in charge of what order the features go in along the x axis. There are some built-in functions to arrange them automatically, but we propose our own method. We find the two highest correlated features, then greedily continue appending the next highest correlated feature until we have an ordering. 8 | 9 | ## Example 10 | 11 | ### Installation 12 | 13 | We install from a GitHub repository. 14 | 15 | ```{r} 16 | #devtools::install_github("ShubhamKaushal15/bestparcoords") 17 | library("bestparcoords") # Must be installed from source 18 | ``` 19 | 20 | ### Data 21 | 22 | We demo our package with the mtcars dataset.
Any preprocessing must be done before we use the package, so, for example, we will remove the last column. 23 | 24 | ```{r} 25 | data(mtcars) 26 | mtcars <- mtcars[,-11] 27 | head(mtcars) 28 | ``` 29 | 30 | ### Default 31 | 32 | The default ordering shown here isn't the best; it just leaves the columns in order regardless of their correlation. 33 | 34 | ```{r} 35 | GGally::ggparcoord(mtcars, splineFactor=10, alphaLines=0.5) 36 | ``` 37 | 38 | ### bestparcoords 39 | 40 | Here we show the bestparcoords graph, which is better than the default version because it accentuates the alternating trends between pairs of features, which shows the underlying pattern in the data much better. 41 | 42 | ```{r} 43 | cols <- bestparcoords::bestparcoord(mtcars) 44 | ``` 45 | 46 | bestparcoords also outputs the features in the order it found, which we can use for other purposes, for example, to display the plot with a different spline factor. 47 | 48 | ```{r} 49 | print(cols) 50 | indices <- match(cols, colnames(mtcars)) 51 | print(indices) 52 | ``` 53 | 54 | ```{r} 55 | GGally::ggparcoord(mtcars, columns=indices) 56 | ``` 57 | -------------------------------------------------------------------------------- /cheatsheet_of_ggally_plots.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet of GGally (including ggcoef_model and ggpairs) 2 | 3 | Xinhao Dai 4 | 5 | ### Introductions of GGally 6 | 7 | ggplot2 is an R package for plotting based on the grammar of graphics. GGally is an extension of ggplot2. It adds several helpful functions to reduce the difficulties of combining different geoms. To enable novices to quickly start using the package, I created a cheatsheet for GGally. 8 | 9 | This cheatsheet includes several essential and efficient plots for exploratory data analysis and visualization. First is the ggcoef_model() function, which helps visualize the coefficients of a fitted model. It is more intuitive and clear.
Moreover, it enables us to compare several models in one plot so we can choose the best model among them. Second is the ggpairs() function, which shows the relationships between every two variables in a dataset. It helps us identify which variables are relevant to a response variable. Moreover, the cheatsheet displays several high-level plots that can be used in ggpairs in order to give users a clearer intuition. 10 | 11 | In this project, I have learned how to do a quick read through the documentation of a package and distinguish which functions are the most essential and helpful ones. Moreover, I have learned how to create a tidy and beautiful cheatsheet for the users. And if I have another chance, I will try to create my own template rather than use the RStudio template. 12 | 13 | You can find a pdf version of the cheatsheet on [https://github.com/Russell-A/cheatsheet-for-ggally](https://github.com/Russell-A/cheatsheet-for-ggally/blob/main/cheatsheet_for_ggally.pdf). Below is an enclosed image of the cheatsheet. 14 | 15 | ```{r, echo=FALSE} 16 | knitr::include_graphics("resources/cheatsheet_GGally_plots/cheatsheet_for_ggally-page-001.jpg") 17 | ``` 18 | 19 | Reference: [https://ggobi.github.io/ggally/index.html](https://ggobi.github.io/ggally/index.html). -------------------------------------------------------------------------------- /plotly_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet for Plotly 2 | 3 | Taichen Zhou, Yichen Huang 4 | 5 | For the community contribution for the EDAV class, we researched Plotly, an R graphing library that can create interactive, publication-quality graphs, and made a cheat sheet for it. In this cheat sheet, we provide examples of how to create basic charts, statistical charts, scientific charts, etc. with Plotly. 6 | The link to the cheat sheet is [here](https://yichuang25.github.io/Plotly-Cheatsheet/).
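As a minimal taste of the library (a sketch, not taken from the cheat sheet itself):

```{r, eval=FALSE}
library(plotly)

# Interactive scatterplot: hover shows the car name; zoom and pan come for free
plot_ly(mtcars, x = ~wt, y = ~mpg,
        type = "scatter", mode = "markers",
        text = rownames(mtcars))
```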
7 | 8 | ## Motivation 9 | Data visualization is a crucial skill for data scientists. As the tech industry matures, the demand for better visualizations increases. Each type of visualization has its specialty, just as a line chart describes a trend better than a bar plot. For displaying data in an extraordinary fashion, Plotly is extremely powerful at explaining and exploring data. Here is why. 10 | 11 | ### Interactivity 12 | Plotly provides a feature that few other visualization libraries offer: interactivity. Plotly allows users to further improve the visualization experience by using customizable interactive tools. When we visualize data with Plotly, we can add interactive features like buttons, sliders, and dropdowns to display different perspectives of graphs. 13 | 14 | ### Customization and Flexibility 15 | Plotly is like a piece of paper: you can draw whatever you want on it. Compared to traditional visualization tools like ggplot2, Plotly allows full control over what is being plotted. 16 | 17 | ## Evaluation and Future Improvement 18 | By making this Plotly cheat sheet, we learned not only how to make the graphs we talked about with ggplot, but also graphs used in machine learning, for example ROC and PCA visualizations. Also, we got more familiar with writing R Markdown files. 19 | There are still things we can further improve: we could add more types of graphs to our cheat sheet, for example alluvial plots. Moreover, the parameters of the plotting functions could be discussed a little bit more. 20 | -------------------------------------------------------------------------------- /ggmap_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Cheatsheets {-} 2 | 3 | # Geographic visualization by ggmap 4 | 5 | Yuta Adachi 6 | 7 | As my community contribution for the EDAV class, I created a cheat sheet about geographic visualization using ggmap.
8 | 9 | The link for the cheat sheet is [here](https://github.com/Yuta555/ggmap_cheatsheet/blob/main/ggmap_cheatsheet.pdf). 10 | 11 | ## Motivation for this project 12 | 13 | We sometimes want to describe a fact using geographic visualization. For example, if I'm looking for the shortest path from Columbia University to Central Park, I would like to see the route on a road-map background in which I can tell which streets I should walk. However, it seems to be difficult to create a route map or other geographic visualizations with only the original ggplot2 package, as far as I know. 14 | 15 | An R package, "ggmap", makes it easy to download and choose various kinds of map tiles such as Google Maps and Stamen Maps. In addition, plotting data is easy since you can use the ggplot2 framework you are already familiar with. 16 | 17 | That's why I was eager to learn how to use the ggmap package and create this cheat sheet for everyone. 18 | 19 | ## My own evaluation/Room for improvement 20 | 21 | Through creating the cheat sheet, I learned not only how to utilize the ggmap package, but also what types of plots I can choose for geographic visualization. Specifically, in order to introduce three examples so that readers can learn how to use the package in a practical way, I researched and implemented several types of plots related to geographic visualization, such as a route map and a scatter plot on a map. 22 | 23 | Though I learned many things from the project, I can improve this cheat sheet by adding examples of other types of data and visualization. For example, I may create another density plot using a continuous variable, and I will also add a flow map to describe transitions between several locations if I have the opportunity next time.
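A minimal sketch of the basic pattern the cheat sheet teaches: fetch a map tile, then layer ordinary ggplot2 geoms on top. This is a sketch under assumptions: `some_points` is a hypothetical data frame with `lon` and `lat` columns, and Google tiles would additionally require `register_google()` with an API key.

```{r, eval=FALSE}
library(ggmap)

# Fetch Stamen tiles for a bounding box around upper Manhattan
nyc <- get_stamenmap(bbox = c(left = -74.03, bottom = 40.70,
                              right = -73.93, top = 40.82),
                     zoom = 12, maptype = "toner-lite")

# Layer points on the map with the usual ggplot2 grammar
ggmap(nyc) +
  geom_point(aes(x = lon, y = lat), data = some_points, color = "red")
```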
24 | -------------------------------------------------------------------------------- /googleVis.Rmd: -------------------------------------------------------------------------------- 1 | # googleVis 2 | 3 | Wenqi Sun 4 | 5 | The googleVis package provides an interface between R and Google's chart tools. It allows users to create web pages with interactive charts based on R data frames. Charts are displayed locally via the R HTTP help server. A modern browser with an Internet connection is required. The data remains local and is not uploaded to Google. 6 | 7 | I created a cheat sheet for googleVis. It covers most charts and some options. 8 | 9 | Link: https://github.com/Nntraveler/googleVis_cheatsheet/blob/main/googleVis.pdf 10 | 11 | ## Motivation for this project 12 | 13 | Google Charts is a powerful tool for visualization on the web. It provides an easy and quick way to generate interactive charts. 14 | 15 | The advantages include the following: 16 | 17 | - A variety of charts 18 | 19 | - An extensive set of options for customization 20 | 21 | - Cross-browser compatibility 22 | 23 | - Dashboards and controls for user interaction 24 | 25 | - Support for dynamic data 26 | 27 | The googleVis package provides a convenient way to use Google Charts from R. Although it has detailed documentation, users may be unsure which charts and options work for their use cases, since there is no cheat sheet. 28 | 29 | Therefore, by creating a cheat sheet for googleVis, I hope to make googleVis easier for everyone to use.
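To give a flavor of the workflow (this snippet is illustrative and not taken from the cheat sheet itself):

```r
library(googleVis)

df <- data.frame(country = c("US", "GB", "BR"),
                 val1 = c(10, 13, 14),
                 val2 = c(23, 12, 32))

# Build a chart object from a data frame; options mirror the Google Charts API
chart <- gvisColumnChart(df, xvar = "country", yvar = c("val1", "val2"),
                         options = list(title = "A sample column chart"))

# plot() serves the generated HTML/JS locally and opens it in the browser
plot(chart)
```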
30 | 31 | ## My own evaluation of the project 32 | 33 | After creating the cheatsheet, I 34 | 35 | - have a better understanding of Google Charts and the googleVis package 36 | 37 | - experienced the joy of contributing something to open source projects 38 | 39 | - explored a variety of cheatsheets and found an ideal layout for my project 40 | 41 | What's more, I think there is some room for improvement: 42 | 43 | - For the options part, there might be a better way to show vividly how to use these options. Due to limited space, I only included a table. 44 | 45 | - Some charts, such as the Anno, Cal, and Timeline charts, look bad when compressed. I could assign them more space or adjust the data. 46 | -------------------------------------------------------------------------------- /ggstatsplot.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet of ggstatsplot in R 2 | 3 | Shaonan Wang, Baochan Jiang 4 | 5 | This cheatsheet is for ggstatsplot. Use the link to check the cheatsheet: 6 | https://github.com/anna-shaonanw/ggstatsplot-cheatsheet/blob/main/ggstatsplot-cheatsheet.pdf 7 | 8 | When we do data analysis, we usually need to visualize the basic characteristics of the data, including the mean, median, mode, etc. Based on this, we also test and analyze the distribution of the data to see whether it is normally distributed. If there are two groups of data, we may need to perform a t-test to compare them. Sometimes we also build a model of the data and perform an ANOVA to learn about it. 9 | 10 | {ggplot2} is a good tool for visualization. However, if you want to display more statistical results in a graph, the process becomes complicated: you have to perform the statistical calculations and then use {ggplot2} to draw the results one by one. In this case, we found a more convenient R package, {ggstatsplot}, for visualizing and displaying statistical results.
11 | 12 | {ggstatsplot} is an extension of the {ggplot2} package for creating graphics with details from statistical tests. It can easily add statistical tests to graphs, supports multiple kinds of tests, and displays p-values and other statistical indicators for data subsets in the graphs. 13 | 14 | {ggstatsplot} = "data visualization" + "statistical modeling" 15 | 16 | In the process of preparing this cheatsheet, we learned about the basic types of plots and multiple statistical approaches and tests. {ggstatsplot} helps us make data mining easy and fast. It visually shows the relationships between data sets and the results of statistical tests, which helps us better understand the data, the differences between datasets, and the statistical test results. 17 | 18 | This package contains more than we covered in the cheatsheet. Next time, we would like to explore more deeply how each parameter of a function contributes to the outcome, and we want to learn further how the grouped functions can show the similarities, differences, and relationships between multiple data sets.
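As a small illustration of the "data visualization + statistical modeling" idea (the dataset and arguments here are just an example, not taken from the cheatsheet):

```r
library(ggstatsplot)

# Box/violin plot of Sepal.Length across the three iris species,
# annotated automatically with the ANOVA result, effect size, and
# pairwise comparisons -- no manual statistical calculation needed
ggbetweenstats(
  data = iris,
  x    = Species,
  y    = Sepal.Length,
  type = "parametric"  # also "nonparametric", "robust", or "bayes"
)
```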
19 | -------------------------------------------------------------------------------- /DESCRIPTION: -------------------------------------------------------------------------------- 1 | Package: placeholder 2 | Type: Book 3 | Title: Does not matter 4 | Version: 0.0.0.9000 5 | Imports: 6 | emo, 7 | tidyverse 8 | Suggests: 9 | AppliedPredictiveModeling, 10 | autoplotly, 11 | bench, 12 | bestparcoords, 13 | bookdown, 14 | BSDA, 15 | bslib, 16 | carData, 17 | caret, 18 | caTools, 19 | changepoint, 20 | chorddiag, 21 | circlize, 22 | cluster, 23 | collapsibleTree, 24 | d3heatmap, 25 | devtools, 26 | downlit, 27 | dplyr, 28 | dygraphs, 29 | extrafont, 30 | fpp2, 31 | gapminder, 32 | gbm, 33 | gcookbook, 34 | ggalluvial, 35 | ggfortify, 36 | ggmap, 37 | ggokabeito, 38 | ggplot2, 39 | ggrepel, 40 | ggridges, 41 | ggtext, 42 | grid, 43 | gridExtra, 44 | heatmaply, 45 | highcharter, 46 | httr, 47 | igraph, 48 | keras, 49 | kernlab, 50 | KFAS, 51 | lattice, 52 | leaflet, 53 | lubridate, 54 | MASS, 55 | mlbench, 56 | naivebayes, 57 | networkD3, 58 | parallel, 59 | patchwork, 60 | plotly, 61 | profvis, 62 | quantmod, 63 | randomForest, 64 | raster, 65 | RColorBrewer, 66 | readr, 67 | readxl, 68 | recipes, 69 | reshape2, 70 | reticulate, 71 | rjson, 72 | rpart, 73 | RSelenium, 74 | rsthemes, 75 | rvest, 76 | scales, 77 | sf, 78 | showtext, 79 | showtextdb, 80 | spData, 81 | spDataLarge, 82 | splines, 83 | strucchange, 84 | tensorflow, 85 | tibble, 86 | tidyquant, 87 | tidyr, 88 | timelineS, 89 | tmap, 90 | treemap, 91 | TSstudio, 92 | vcd, 93 | vistime, 94 | withr, 95 | wordcloud2, 96 | XML, 97 | xml2, 98 | xts 99 | Remotes: 100 | hadley/emo, 101 | ShubhamKaushal15/bestparcoords, 102 | Nowosad/spDataLarge, 103 | rstudio/d3heatmap, 104 | mattflor/chorddiag, 105 | daheelee/timelineS, 106 | rstudio/tensorflow, 107 | gadenbuie/rsthemes, 108 | lchiffon/wordcloud2 109 | -------------------------------------------------------------------------------- /network_visualization.Rmd: 
-------------------------------------------------------------------------------- 1 | # Visualizing graphs and networks 2 | 3 | Vritansh Kamal 4 | 5 | Link to presentation and resources :- 6 | 7 | ## Motivations 8 | 9 | For the community contribution, I have decided to give a presentation on visualizing networks for EDAV. I've explored some of these open-source tools as part of my research in industry. 10 | 11 | Graphs can reveal highly complex relationships when explored using the right kind of tools. This involves visualizing graphs (networks), which is different from our traditional visualization methods. From these visualizations, we should be able to understand the clusters that graphs form, find shortest paths, and extract relevant information about how our data is structured. I discovered it has many more use cases in the ever-growing industries of industrial IoT, bank transactions, and blockchain. There are server-side tools that perform really well, but the challenge is to create visualizations in the browser using JavaScript to build scalable and tangible products in a SaaS environment. 12 | 13 | These visualizations make use of the user's CPU/GPU to perform the computation for visualizing data. There are libraries like CytoscapeJS, the D3.js force graph, etc., but these don't scale very well as the number of nodes and edges in a graph increases. They perform slowly in the browser or crash frequently for large graphs. 14 | 15 | I found a library from Vasco Asturiano which is open source under the MIT License and supports rendering WebGL-based graphs. Its ability to use the GPU solves most of these problems: it can easily visualize graphs on the order of tens of thousands of nodes and edges. It also includes customizations such as adding images to links and nodes, adding particle emission, etc. This helps bring virtual graphs as close as possible to the real world.
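The presentation itself focuses on browser-side JavaScript tooling, but for a quick sense of interactive network rendering without leaving R, a small-scale sketch with the networkD3 package (a D3.js force-layout wrapper, not one of the WebGL libraries discussed above) looks like this:

```r
library(networkD3)

# A toy edge list; each row is one link in the network
edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "C", "D", "D")
)

# simpleNetwork() reads the first two columns as source and target
# and renders an interactive force-directed graph as an htmlwidget
simpleNetwork(edges)
```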
16 | 17 | ## Learning 18 | This is an efficient library, which forms an abstraction on top of d3.js components and provides scalability. If I had more time, I would have explored it further and built a prototype using a network dataset. The key to analyzing networks is to identify relationships and display them accordingly. 19 | 20 | -------------------------------------------------------------------------------- /misleading_visualizations.Rmd: -------------------------------------------------------------------------------- 1 | # Is the data visualization misleading or not - Survey 2 | 3 | Ritvik Khandelwal (rk3213), Sanket Bhandari (sb4719) 4 | 5 | ## Overview and motivation 6 | 7 | In today's digital world, everything around us involves data, but not all of it is accurate. We use this data to gauge whether something is true or false, but we rarely see it in its raw form. The many rows and columns of numbers can be very confusing to interpret, hence the need for data visualizations. 8 | However, as beneficial as data visualization is for interpreting data, it can also be used to bend the truth and misrepresent trends. Misleading visualizations have been used time and time again by reputed media companies to push a particular narrative. At times, they do convey the required information and are factually correct; however, the way they are visually represented misleads the audience. There are also cases where these media companies show only part of the data, thus conveying the wrong information with the goal of stirring up a particular popular opinion among the public. 9 | 10 | ## Goal 11 | 12 | We conducted this survey to understand whether different segments of the population (in terms of age group) are even aware that such misleading visualizations exist in the media. This survey serves as a way of spreading awareness that such things happen in real life.
Whether a person is aware or not prior to this survey, we believe the survey serves as a good way to inform people of the different practices that media companies adopt to mislead people by pushing a particular agenda, thus inducing unnecessary bias among the population. 13 | In our survey, we have taken five specific examples previously published by the media. The graphs in our survey cover the vast majority of the kinds of misleading graphs that media agencies propagate, and they effectively summarize what media companies generally do to mislead people. We are confident that anyone who takes this survey will be able to notice such irregularities in the future and thus will not fall victim to these practices. 14 | 15 | The following link leads to the report of the survey we conducted: 16 | 17 | [https://drive.google.com/file/d/18ISbr8DNUWgRwvAxjDntb91_NONzLvr5/view?usp=sharing](https://drive.google.com/file/d/18ISbr8DNUWgRwvAxjDntb91_NONzLvr5/view?usp=sharing){target="_blank"} 18 | 19 | Link to the Survey: 20 | 21 | [https://forms.gle/VXCpJBduiFnaE7mn7](https://forms.gle/VXCpJBduiFnaE7mn7) 22 | 23 | 24 | 25 | -------------------------------------------------------------------------------- /cheatsheet_patchwork.Rmd: -------------------------------------------------------------------------------- 1 | # Cheatsheet for 'patchwork' 2 | 3 | Maria Pratyusha 4 | 5 | Link to presentation and resources :- 6 | 7 | *Note: although this code works fine locally, it is causing issues when rendered through GitHub Actions and therefore the chunks that involve `patchwork` will not be evaluated.* 8 | 9 | 10 | Import the dataset 11 | 12 | ```{r} 13 | data("iris") 14 | head(iris) 15 | ``` 16 | 17 | Load the package (install it first with `install.packages("patchwork")` if needed) 18 | ```{r} 19 | library("patchwork") 20 | ``` 21 | 22 | patchwork is used in combination with ggplot2, so load that as well 23 | ```{r} 24 | library("ggplot2") 25 | ``` 26 | 27 | Let's plot a few graphs using 'ggplot2' 28 | 29 | Scatterplot 30 | 31 | ```{r} 32 | ggp1 <- ggplot(iris, 33 | aes(x = Sepal.Length, 34 | y = Sepal.Width, 35 | col =
Species)) + 36 | geom_point() 37 | ggp1 38 | ``` 39 | 40 | Bar chart 41 | 42 | ```{r} 43 | ggp2 <- ggplot(iris, 44 | aes(x = Species, 45 | y = Sepal.Width, 46 | fill = Species)) + 47 | geom_bar(stat = "identity") 48 | ggp2 49 | ``` 50 | 51 | Boxplot 52 | 53 | ```{r} 54 | ggp3 <- ggplot(iris, 55 | aes(x = Species, 56 | y = Sepal.Width, 57 | col = Species)) + 58 | geom_boxplot() 59 | ggp3 60 | ``` 61 | 62 | Composition plots using plot arithmetic 63 | 64 | ```{r} 65 | ggp1 + ggp2 + ggp3 + plot_layout(ncol = 1) 66 | 67 | ggp1 + ggp2 - ggp3 + plot_layout(ncol = 1) 68 | 69 | ggp_sbs <- (ggp1 + ggp2) / ggp3 70 | ggp_sbs 71 | ``` 72 | 73 | area() 74 | 75 | ```{r} 76 | layout <- c( 77 | area(1, 1), 78 | area(1, 3, 3), 79 | area(3, 1, 3, 2) 80 | ) 81 | plot(layout) 82 | 83 | ggp1 + ggp2 + ggp3 + plot_layout(design = layout) 84 | ``` 85 | 86 | Collecting guides with plot_layout(guides = 'collect') 87 | 88 | ```{r, eval = FALSE} 89 | ggp1 + ggp2 + ggp3 + 90 | plot_layout(guides = 'collect', ncol = 2) 91 | ``` 92 | 93 | inset_element() 94 | 95 | ```{r} 96 | ggp1 + 97 | inset_element(ggp2, 0.01, 0.01, 0.7, 0.5) + 98 | inset_element(ggp3, 0.4, 0.6, 0.99, 0.99) 99 | ``` 100 | 101 | plot_annotation() 102 | 103 | ```{r} 104 | ggp1 / (ggp2 | ggp3) + 105 | plot_annotation(tag_levels = 'A') 106 | 107 | ggp1 / ((ggp2 | ggp3) + plot_layout(tag_level = 'new')) + 108 | plot_annotation(tag_levels = c('A', '1')) 109 | ``` 110 | 111 | -------------------------------------------------------------------------------- /playful_r_packages.rmd: -------------------------------------------------------------------------------- 1 | # Playful R Packages for Kids 2 | 3 | Hannah Yang 4 | 5 | For my Community Contribution project, I wrote a blog summarizing a few interesting R packages that could spark kids' interest.
6 | 7 | The blog can be found [here](https://hannahyaang.notion.site/Playful-R-packages-for-kids-ed278dd2f5cf4414a9202d037ed1df68) 8 | 9 | ## Motivations 10 | 11 | In the past decade, there has been a wave of interest in teaching pre-school kids programming skills. When it comes to primary school education, it is especially important to motivate students' curiosity while they learn. R is a good entry-level programming language because of its relatively straightforward syntax and powerful capabilities. During my research, however, I didn't find many R tutorials dedicated to these younger generations, so I decided to contribute to this field. 12 | 13 | My blog addresses a few needs. 14 | Firstly, it provides a centralized resource for kids to explore interesting packages. 15 | Secondly, it draws attention to educating pre-school kids in programming. Coding is hard, and learning to code as a kid is even harder. We should collectively translate tough, complex documentation into creative and encouraging content that kids can experiment with and have fun with. 16 | 17 | ## Evaluation and Improvement 18 | Finding these interesting packages surprised me at the beginning of my research, as I observed many people using their knowledge to do such innovative things. When we mention programming and the R language, we tend to think of the workplace or academia, which are serious and rigorous, so it is not often that we see vibrant packages that can be used to promote an interest in learning. It is also surprising how scarce R programming resources for kids are. 19 | I think what kids learn at a young age is really crucial and can even impact their career decisions. Thus, I believe it is definitely important to pay more attention to this field.
Based on what I learned, there is no package designed specifically for kids so far: one that is easy to understand and install, with simple functionality, well-documented explanations in plain words, and encouraging interactive messages. I think 20 | such a package is really worth a thought, and we can all contribute a little with our knowledge. 21 | 22 | As for my blog, if I were to improve it with more time, I would summarize more of the packages available and include more examples for each function, since examples are the most helpful and straightforward illustration of complicated concepts. -------------------------------------------------------------------------------- /_bookdown.yml: -------------------------------------------------------------------------------- 1 | book_filename: "cc22mw" 2 | new_session: false 3 | before_chapter_script: _common.R 4 | delete_merged_file: true 5 | language: 6 | ui: 7 | chapter_name: "Chapter " 8 | 9 | 10 | # Chapter ordering 11 | rmd_files: [ 12 | # Part: Introduction 13 | 'index.Rmd', # must be first chapter 14 | 'assignment.Rmd', 15 | 'instructions.Rmd', 16 | 'sample_project.Rmd', 17 | 18 | # Part: Cheatsheets 19 | 'ggmap_cheatsheet.Rmd', 20 | 'googleVis.Rmd', 21 | 'tidytext_cheatsheet.Rmd', 22 | 'cheatsheet_of_ggally_plots.Rmd', 23 | 'cheatsheet_datascience.Rmd', 24 | 'plotly_cheatsheet.Rmd', 25 | 'data_explorer_cheatsheet.Rmd', 26 | 'ggplot2_matplotlib_cheatsheet.Rmd', 27 | 'cheatsheet_patchwork.Rmd', 28 | 29 | 30 | # Part: Tutorials 31 | 'mapcharts.Rmd', 32 | 'machine_learning_tutorial.Rmd', 33 | 'quantmod_and_tidyquant_tutorials_and_comparison.Rmd', 34 | 'learning_echarts4r.Rmd', 35 | 'inflation_web_scraping.Rmd', 36 | 'dygraph_tutorial.Rmd', 37 | 'two_fun_r_packages.Rmd', 38 | 'wordcloud_tutorial.Rmd', 39 | 'efficient_r.Rmd', 40 | 'color_deficiency.Rmd', 41 | 'ggfortify_and_autoplotly_tutorial.Rmd', 42 | 'become_better_business_analysts.Rmd', 43 | 'ggstatsplot.Rmd', 44 | 'bestparcoords.Rmd', 45 |
'hypothesis_testing_guide_in_r.Rmd', 46 | 'api_requests_and_web_scraping.Rmd', 47 | 'covid_shiny_dashboard.Rmd', 48 | 'interactive_plots_in_r.Rmd', 49 | 'introduction_to_plotly.Rmd', 50 | 'ml_interview_concept_mindmap.Rmd', 51 | 'python_seaborn.Rmd', 52 | 'tutorial_of_making_different_types_of_interactive_graphs.Rmd', 53 | 'ggplotscheatsheet.Rmd', 54 | 'chord_diagrams.Rmd', 55 | 'resparking_creativity_for_visualizations.Rmd', 56 | 'machine_learning_r.Rmd', 57 | 'edav_data_importance.Rmd', 58 | 'time_series_visualization_r_tutorial.Rmd', 59 | 'bubble_chart_application.Rmd', 60 | 'edav_history_of_charts.Rmd', 61 | 'python_geospatial_visualization.Rmd', 62 | #'r_neural_network.Rmd', 63 | 'bioconductor.Rmd', 64 | 'playful_r_packages.rmd', 65 | #'geocomputationwithr.Rmd', 66 | 'ggvis_cheatsheet.Rmd', 67 | #'plotly_python_surface.Rmd', 68 | 'rstudio_editor_theme_setup_tutorial.Rmd', 69 | 'gradient_descent.Rmd', 70 | #'ggplot_stat_and_theme.Rmd', 71 | #'file_ingestions.Rmd', 72 | 'vis_in_python_vs_r.Rmd', 73 | 'wordclouds_video_tutorial.Rmd', 74 | 75 | 76 | # Part: Workshops 77 | 'edavsurveyandanalysis.Rmd', 78 | 'network_visualization.Rmd', 79 | 'misleading_visualizations.Rmd', 80 | 81 | # Part: Appendices 82 | 'appendix_initial_setup.Rmd', # contains PART header 83 | 'appendix_pull_request_tutorial.Rmd' 84 | 85 | ] 86 | -------------------------------------------------------------------------------- /resparking_creativity_for_visualizations.Rmd: -------------------------------------------------------------------------------- 1 | # Video guide to reigniting your creative spark for visualizations 2 | Imani Oluwafumilayo Maliti 3 | 4 | ## Video 5 | 6 | 7 | 8 |
9 | If the embedded video does not work on your device, here is a link to the same video: https://youtu.be/Q4CwEbKsBsQ 10 |
11 | 12 | Note: For those who are hard of hearing or simply prefer subtitles, both the embedded video and the YouTube video have been equipped with closed captioning to make your viewing experience more accessible. 13 | 14 | ## Motivation & Explanation 15 | 16 | My original motivation for this project was to brainstorm viable activities that people could partake in to reignite their creative spark while creating data visualizations. By culture, I am a storyteller, and I have always viewed data visualizations as a prime, modern form of storytelling. By trade, I am a content creator. I specialize in editing videos and creating graphics that not only convey a story, but also inspire people to do something great with their lives. Naturally, when I visualize data, I try to incorporate both the artistic and cultural aspects of my training. I strive to make accessible visualizations that are also fun, engaging, and intentional; however, I have been in many spaces with data scientists who do not prioritize such creativity. Although I understand the importance of practicality and "professionalism", I strongly believe that having a creative mindset is a major strength. As a result, I created a video in which I explain three activities that can aid in one's journey to making more creative visualizations.
17 |
18 | Originally, I wanted to make the video in the form of a TikTok. My hope was to share it on the platform to make this information more accessible to the community who needs it and to boost its impact. However, I found myself with over 20 minutes of content that I had to cut down, edit, and piece together. Ultimately, I was left with my final 6-minute video, which is too long to post on TikTok. If I had to redo this project, I would make a shorter video; in turn, this would have enabled me to post it. To compensate for the missing information and explanations, I would pair that video with a blog post that goes into further detail about each activity. Other than that, I feel the video came out pretty well, and I am proud of the final product. 19 | 20 | 21 | ## Resources 22 | Du Bois, W. E. B., et al. W. E. B. Du Bois's Data Portraits: Visualizing Black America. W.E.B. Du Bois Center at the University of Massachusetts Amherst, 2018.
23 |
“THE BOOK.” Dear Data, http://www.dear-data.com/thebook. -------------------------------------------------------------------------------- /learning_echarts4r.Rmd: -------------------------------------------------------------------------------- 1 | # Learning echarts4r with shiny 2 | 3 | Ethan Evans 4 | 5 | For my Community Contribution project, I created a Shiny web-app tutorial of an interactive visualization package called echarts4r. 6 | 7 | The link for my tutorial site can be found [here](https://etevans.shinyapps.io/CommunityContribution/). 8 | 9 | ## Motivations 10 | 11 | I have explored this package at a basic level in my past R experiences but not enough to appreciate its true value. In the times I have used other interactive visualization packages like plotly and htmlwidgets, I have encountered far more coding syntax issues and plot formatting issues than when I have used echarts4r. I also just think echarts4r is the most visually appealing set of graphics I have worked with across any language or software. Thus, I felt like sharing this package with a class that has some new R users would be a great opportunity for both myself and my classmates. 12 | 13 | My tutorial addresses a few needs. First, using a Shiny web-based platform shows 14 | new R developers one of the ways you can deploy and view interactive visualizations. Second, this tutorial gives a simpler, more manageable overview of the package than other resources. While the documentation and vignettes online are sufficiently comprehensive, they assume a baseline level of R and visualization knowledge that my audience in our class might not have. My tutorial walks through how to build a plot from start to finish, draws comparisons to the familiar ggplot2, and then details a breadth of plots that we have learned about in class. 
Online echarts4r tutorials might be a bit overwhelming as well, so if viewers are interested after my baseline tutorial, they can go on to check out the vast resources online. 15 | 16 | ## Evaluation and Improvement 17 | 18 | Creating this tutorial taught me a lot more about this package and reiterated how 19 | easy and intuitive it is to learn. Also, through my deep research dive into the 20 | package before starting the project, I realized just how powerful and unique this 21 | package is. I had thought that everything the package had to offer was detailed 22 | in the main package website; however, because it is built on Apache ECharts, all methods and functions from that library translate directly to echarts4r. Overall, I think that my tutorial covers the package well and gives 23 | the viewer enough knowledge to start exploring on their own. 24 | 25 | If I were to design a tutorial for echarts4r again, I would focus on depth rather than breadth. Overall, I could have spent more time giving a better explanation of the fundamentals of a plot and how to apply certain options to it, rather than showing a dozen different types of plots. While both have their own value, learning the fundamentals of a plot seems better for a beginner tutorial. 26 | -------------------------------------------------------------------------------- /assignment.Rmd: -------------------------------------------------------------------------------- 1 | # Community Contribution 2 | 3 | This fairly open-ended assignment provides an opportunity to receive credit for contributing to the collective learning of the class, and perhaps beyond. It should reflect a minimum of 3 hours of work. To complete the assignment you must submit a short description of your contribution. If appropriate, attach relevant files.
4 | 5 | There are many ways in which you can contribute: 6 | 7 | * organize and lead a workshop on a particular topic (the date may be after the assignment due date but you need to schedule it before) 8 | * help students find final project partners 9 | * give a well-rehearsed 5 minute lightning talk in class on a datavis topic (theory or tool) (email me to set the date -- it may be after the assignment due date but you need to schedule it before) 10 | * create a video tutorial (any length) 11 | * create a cheatsheet or other resource 12 | * write a tutorial for a tool that's not well documented 13 | * build a viz product (ex. htmlwidget or RStudio add-in) for class use 14 | * design home page (images, text and/or artwork) of this web site 15 | * [your own idea] 16 | * (Note: translations are not allowed) 17 | 18 | You may draw on and expand existing resources. When doing so, it is critical that you cite your sources. 19 | 20 | ## IMPORTANT LOGISTICS 21 | 22 | ### Groups 23 | 24 | You may work on your own or with a partner of your choosing. If you work alone, you do not need to join a group of 1, you simply submit your work on CourseWorks as with any other solo assignment. 25 | 26 | If you work with a partner, add yourselves to a group on the CC page on the People tab. Ed Discussion can be used to find partners with similar interests. 27 | 28 | ### What to submit 29 | 30 | 1. In most cases you should have something tangible to upload, such as the tutorial, cheatsheet, etc. Alternatively you may submit a link to material online (YouTube video, etc.) If there's nothing tangible then include a longer description (see 2.). 31 | 32 | 2. An explanation of the motivation for the project, the need it addresses, and your own evaluation of the project including what you learned / what you might do differently next time. 
(1/2 page) 33 | 34 | ### Submitting your assignment 35 | 36 | You must submit your assignment twice: once on CourseWorks (so it can be graded) and once to the class, details to follow. 37 | 38 | 1. CourseWorks submission (this assignment): submit your work as an .Rmd and rendered .pdf or .html file, just as with problem sets. If your work does not lend itself to that format, then write in the assignment text box what you did. 39 | 40 | 2. Class (GitHub) submission: detail will be provided in a separate assignment. 41 | 42 | ### Grading 43 | 44 | You will be graded on the quality of your work, originality, and effort invested. Any sources used must be cited. We are looking for value-added over existing resources. For example, an excellent cheatsheet already exists for ggplot2 so creating a new one would not be very useful unless it represents a significant improvement. 45 | -------------------------------------------------------------------------------- /covid_shiny_dashboard.Rmd: -------------------------------------------------------------------------------- 1 | # An Interactive Dashboard: COVID-19 Visualization with Shiny and Plotly 2 | 3 | Ning Kang 4 | 5 | I created the dashboard with Shiny and made the interactive plots with ggplot2 and Plotly. Users can choose from two geography levels: state or region. Users will be prompted to choose one or multiple regions among all eleven regions if the region is selected. Users can also decide the number of variables to visualize. If multiple variables are chosen, a facet grid plot will be displayed. There is a slider to change the date range, which will change the plot accordingly. 6 | 7 | Here is the link to the interactive dashboard: [COVID-19 Interactive Dashboard](https://lexkdev.shinyapps.io/cc22/){target="_blank"} 8 | 9 | ## Motivation for the Project 10 | 11 | For the community contribution, I made a dashboard to visualize New York State's Covid-19 testing from 2020-03-01 to 2022-11-13. 
The pandemic has significantly shaped the past three years, and many things have changed since 2020. Policies have changed compared to 2020, and I wondered to what degree the number and percentage of people testing positive changed and how different regions differed. The data source, New York State Health Data, enables users to visualize the data with Tableau. However, many of the commands are not very intuitive. Also, there are few interactive options available in R libraries for multiple selections and filters on data. I built an easier-to-use interactive visualization dashboard, where checking boxes and selecting from dropdowns yields informative plots. 12 | 13 | ## Evaluation of the Project 14 | 15 | Shiny is an intuitive framework for building web applications with R syntax; knowledge of HTML, CSS, and JavaScript is not required. It is powerful yet easy to learn. I hadn't used Shiny before this project. After reading through the tutorials and sample projects and practicing, I learned about its structure, its unique reactivity, and its resemblances to and differences from traditional web development tools. I implemented Plotly to add interactivity to the plots, enabling tooltips and zooming in and out. 16 | 17 | 18 | 19 | If I were to do this project a second time, I would consider adding choropleth maps. I would add more options to the sidebar panel for other graphs, such as side-by-side bar charts. I would also use self-created functions and improve efficiency. Since the data is updated daily, web scraping could also be used to keep the plot updated.
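A minimal sketch of the reactive pattern described above (assumed structure only, not the actual dashboard code; `covid_data`, `date`, and `positive_rate` are placeholder names):

```r
library(shiny)
library(ggplot2)

ui <- fluidPage(
  selectInput("level", "Geography level", choices = c("State", "Region")),
  sliderInput("dates", "Date range",
              min = as.Date("2020-03-01"), max = as.Date("2022-11-13"),
              value = as.Date(c("2020-03-01", "2022-11-13"))),
  plotOutput("tsplot")
)

server <- function(input, output) {
  output$tsplot <- renderPlot({
    # covid_data stands in for the cleaned testing data set;
    # the plot re-renders whenever the slider changes
    df <- subset(covid_data, date >= input$dates[1] & date <= input$dates[2])
    ggplot(df, aes(date, positive_rate)) + geom_line()
  })
}

# shinyApp(ui, server)  # run locally to launch the app
```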
20 | 21 | ## References 22 | 23 | 1. Data source: 24 | 25 | 2. 26 | 27 | 3. 28 | -------------------------------------------------------------------------------- /resources/Visualization_in_python_vs_R/tophit.csv: -------------------------------------------------------------------------------- 1 | "id","first","last","name","year","stint","team","lg","g","ab","r","h","2b","3b","hr","rbi","sb","cs","bb","so","ibb","hbp","sh","sf","gidp","avg" 2 | "walkela01","Larry","Walker","Larry Walker",2001,1,"COL","NL",142,497,107,174,35,3,38,123,14,5,82,103,6,14,0,8,9,0.3501 3 | "suzukic01","Ichiro","Suzuki","Ichiro Suzuki",2001,1,"SEA","AL",157,692,127,242,34,8,8,69,56,14,30,53,10,8,4,4,3,0.3497 4 | "giambja01","Jason","Giambi","Jason Giambi",2001,1,"OAK","AL",154,520,109,178,47,2,38,120,2,0,129,83,24,13,0,9,17,0.3423 5 | "alomaro01","Roberto","Alomar","Roberto Alomar",2001,1,"CLE","AL",157,575,113,193,34,12,20,100,30,6,80,71,5,4,9,9,9,0.3357 6 | "heltoto01","Todd","Helton","Todd Helton",2001,1,"COL","NL",159,587,132,197,54,2,49,146,7,5,98,104,15,5,1,5,14,0.3356 7 | "aloumo01","Moises","Alou","Moises Alou",2001,1,"HOU","NL",136,513,79,170,31,1,27,108,5,1,57,57,14,3,0,8,18,0.3314 8 | "berkmla01","Lance","Berkman","Lance Berkman",2001,1,"HOU","NL",156,577,110,191,55,5,34,126,7,9,92,121,5,13,0,6,8,0.331 9 | "boonebr01","Bret","Boone","Bret Boone",2001,1,"SEA","AL",158,623,118,206,37,3,37,141,5,5,40,110,5,9,5,13,11,0.3307 10 | "catalfr01","Frank","Catalanotto","Frank Catalanotto",2001,1,"TEX","AL",133,463,77,153,31,5,11,54,15,5,39,55,3,8,1,1,5,0.3305 11 | "jonesch06","Chipper","Jones","Chipper Jones",2001,1,"ATL","NL",159,572,113,189,33,5,38,102,9,10,98,82,20,2,0,5,13,0.3304 12 | "pujolal01","Albert","Pujols","Albert Pujols",2001,1,"SLN","NL",161,590,112,194,47,4,37,130,1,3,69,93,6,9,1,7,21,0.3288 13 | "bondsba01","Barry","Bonds","Barry Bonds",2001,1,"SFN","NL",153,476,129,156,32,2,73,137,13,3,177,93,35,9,0,2,5,0.3277 14 | "sosasa01","Sammy","Sosa","Sammy 
Sosa",2001,1,"CHN","NL",160,577,146,189,34,5,64,160,0,2,116,153,37,6,0,12,6,0.3276 15 | "pierrju01","Juan","Pierre","Juan Pierre",2001,1,"COL","NL",156,617,108,202,26,11,2,55,46,17,41,29,1,10,14,1,6,0.3274 16 | "gonzaju03","Juan","Gonzalez","Juan Gonzalez",2001,1,"CLE","AL",140,532,97,173,34,1,35,140,1,0,41,94,5,6,0,16,18,0.3252 17 | "gonzalu01","Luis","Gonzalez","Luis Gonzalez",2001,1,"ARI","NL",162,609,128,198,36,7,57,142,1,1,100,83,24,14,0,5,14,0.3251 18 | "aurilri01","Rich","Aurilia","Rich Aurilia",2001,1,"SFN","NL",156,636,114,206,37,5,37,97,1,3,47,83,2,0,3,3,14,0.3239 19 | "loducpa01","Paul","Lo Duca","Paul Lo Duca",2001,1,"LAN","NL",125,460,71,147,28,0,25,90,2,4,39,30,2,6,5,9,11,0.3196 20 | "vidrojo01","Jose","Vidro","Jose Vidro",2001,1,"MON","NL",124,486,82,155,34,1,15,59,4,1,31,49,2,10,2,2,18,0.3189 21 | "rodrial01","Alex","Rodriguez","Alex Rodriguez",2001,1,"TEX","AL",162,632,133,201,34,1,52,135,18,3,75,131,6,16,0,9,17,0.318 22 | "floydcl01","Cliff","Floyd","Cliff Floyd",2001,1,"FLO","NL",149,555,123,176,44,4,31,103,18,3,59,101,19,10,0,5,9,0.3171 23 | "stewash01","Shannon","Stewart","Shannon Stewart",2001,1,"TOR","AL",155,640,103,202,44,7,12,60,27,10,46,72,1,11,0,1,9,0.3156 24 | "cirilje01","Jeff","Cirillo","Jeff Cirillo",2001,1,"COL","NL",138,528,72,165,26,4,17,83,12,2,43,63,6,5,1,9,15,0.3125 25 | "coninje01","Jeff","Conine","Jeff Conine",2001,1,"BAL","AL",139,524,75,163,23,2,14,97,12,8,64,75,6,5,0,8,12,0.3111 26 | "jeterde01","Derek","Jeter","Derek Jeter",2001,1,"NYA","AL",150,614,110,191,35,3,21,74,27,3,56,99,3,10,5,1,13,0.3111 27 | -------------------------------------------------------------------------------- /python_geospatial_visualization.Rmd: -------------------------------------------------------------------------------- 1 | # Python: geospatial data visualization project guide 2 | 3 | Xuechun Yuan 4 | 5 | ## Introduction 6 | 7 | In this tutorial, we will go through the whole process of visualizing geospatial data with Python, starting from 
acquiring the needed data. The audience will be guided through every step of the process in the context of a real geospatial analysis project. 8 | 9 | The data used in this tutorial is the customer data of the Olist store (source: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce), the largest online store in the Brazilian market. The goal of this sample geospatial analysis project is to visualize the customer data over a map of Brazil in order to gain insights about the store's customer distribution in different geographical regions. Shape files of Brazil's states (source: http://diva-gis.org/datadown) and a dataset containing the Brazilian state codes (source: https://github.com/datasets-br/state-codes/blob/master/data/br-state-codes.csv) are used in this tutorial as well. 10 | 11 | More detailed information about the data can be found at the respective source links. 12 | 13 | 14 | ### Motivation for the tutorial 15 | 16 | This tutorial is motivated by my previous experience working with geospatial data. Geospatial analysis is an interesting subject that can surface extremely useful information in many fields, and the resulting graphs are often eye-catching and intriguing. Our course has just covered this topic in R; meanwhile, another frequently used tool for geospatial analysis is Python. However, in my experience, the tutorials available for the relevant Python packages are not very handy; they usually focus on demonstrating methods for plotting different types of graphs, while in reality we come across problems well before we are ready to plot. To name a few: 17 | 18 | - Visualizing geospatial data in Python requires a new data structure called "GeoDataFrame", which contains a new data type called "geometry".
19 | 20 | 21 | - Typically the information needed for plotting a geospatial graph rests in more than one dataset, so we need to be skilled at merging all the needed datasets while maintaining the validity of the data. 22 | 23 | 24 | - Some types of graphs require the data to be aggregated. 25 | 26 | 27 | - ... 28 | 29 | 30 | Such problems are usually not addressed by the available tutorials for the relevant Python packages. Therefore, I created this tutorial in an attempt to address these problems and help the audience's geospatial data visualization projects in Python go more smoothly. 31 | 32 | In addition, available tutorials often do not provide enough systematic knowledge for geospatial data visualization in Python. This tutorial also tries to fill that gap. 33 | 34 | 35 | ### The needs to be addressed 36 | 37 | - Introducing the required packages 38 | 39 | 40 | - Introducing the audience to the new data structure required 41 | 42 | 43 | - Guiding the audience through data preprocessing 44 | 45 | 46 | - Giving some tips and examples about the use of different graphing methods 47 | 48 | 49 | ### Own evaluation of the project 50 | 51 | I have consolidated my knowledge of the whole process of graphing geospatial data with Python and brushed up on many important points that are easy to miss. Next time, I could improve by including more plot types and going deeper into different details, such as introducing the available color scheme choices and modifying the legends.
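The dataset-merging pitfall mentioned above can be sketched in plain Python. All table, column, and key names below are illustrative, not taken from the actual Olist data: the point is that a join should flag keys that fail to match rather than dropping rows silently.

```python
# Illustrative sketch of a validity-preserving merge: join customer rows to a
# state-code lookup and flag unmatched keys instead of dropping them silently.
# All names and values here are made up for illustration.
customers = [
    {"customer_id": "c1", "state": "SP"},
    {"customer_id": "c2", "state": "RJ"},
    {"customer_id": "c3", "state": "XX"},  # a key with no match in the lookup
]
state_names = {"SP": "Sao Paulo", "RJ": "Rio de Janeiro"}

merged, unmatched = [], []
for row in customers:
    name = state_names.get(row["state"])
    if name is None:
        unmatched.append(row["customer_id"])  # surface the problem for inspection
    else:
        merged.append({**row, "state_name": name})

print(len(merged), unmatched)  # 2 ['c3']
```

With pandas, a similar safety check can be obtained by merging with `how="left", indicator=True` and inspecting the `_merge` column before proceeding.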
52 | 53 | The tutorial can be found here: https://github.com/xy2506/python_geospatial_visualization 54 | -------------------------------------------------------------------------------- /data_explorer_cheatsheet.Rmd: -------------------------------------------------------------------------------- 1 | # DataExplorer Cheatsheet, EDA made easier 2 | 3 | Ryan McNamara and Matthieu Schulz 4 | 5 | As our community contribution, we created a cheat sheet for DataExplorer, a package that automates the process of EDA. 6 | [Our cheat sheet can be found here](resources/data_explorer_cheatsheet/data_explorer_cheat_sheet.pdf){target="_blank"} 7 | 8 | ## Motivation for the contribution 9 | 10 | Exploratory data analysis, or EDA, is the necessary first step in diagnosing any dataset that one plans to work on. EDA can be used as a standalone practice to explore a dataset and report any observations/hypotheses, or it can be used as a precursor to the application of ML algorithms. In either case, it is crucial to explore one’s data before drawing any conclusions. Common examples of EDA, which are often repeated, include visualizing variable distributions, exploring the relationships between variables, and documenting missing values. Many of these explorations are done using graphs/charts such as histograms, box plots, and scatter plots. While EDA is a vital part of data analysis/predictive modeling, it is often tedious to write the repetitive code required to make such visualizations. 11 | 12 | An R package, DataExplorer, automates the process of EDA and makes it much simpler and faster to produce exploratory visuals. The package is easy to learn and is a great tool for any data scientist looking to speed up their analyses. While there are a few online tutorials explaining the basics of DataExplorer, we were unable to find any resource that allows for quick referencing of the package’s different functions.
Thus, we decided to construct a cheat sheet for DataExplorer as our Community Contribution. We believe our convenient cheat sheet will make it easy for anyone to familiarize themselves with the package and quickly reference its capabilities. 13 | 14 | ## Evaluation of DataExplorer and learnings 15 | 16 | While creating this cheat sheet, we learned a lot about the power of the DataExplorer package. We have seen how to quickly create figures that give us an overview of our dataset using the many plotting functions available. The most useful function of the DataExplorer package is the create_report function, which compiles an HTML report of the plots available through the package. This single function gives the user a complete overview of the dataset they are working with, including information about distributions, missing values, and correlations. The DataExplorer package essentially performs EDA for the user, expediting the process of exploring one’s data. We will certainly be using the create_report function for any data-driven project we work on in the future. 17 | 18 | We also learned a lot about the data.table data structure while working on this assignment. We have been using the data.frame data structure throughout the semester; however, the DataExplorer package is designed to work with data.table. The data.table has a few advantages over the data.frame, including speed and the ability to rename columns, group by columns, and sort by columns directly within the data.table syntax. We are interested in learning more about the data.table and will use it in our code moving forward. 19 | 20 | If we had to do this project again, we would start by exploring the data.table package in more depth. The DataExplorer package works with data.frames as well; however, it is designed to work with data.tables (a data.table will be updated in place while a data.frame will not).
Given the advantages the data.table has over the data.frame, it is worth learning and becoming familiar with. We would also create our cheat sheet using software other than Google Docs/PowerPoint, because it was somewhat difficult to create drawings/images. If we were to create another cheat sheet (e.g., for the data.table package), we would create it using Adobe Illustrator or a similar application. Overall, we learned a lot through the creation of our cheat sheet and hope that others find value in it as well. 21 | 22 | ## Sources: 23 | https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html 24 | https://cran.r-project.org/web/packages/DataExplorer/DataExplorer.pdf 25 | 26 | -------------------------------------------------------------------------------- /plotly_python_surface.Rmd: -------------------------------------------------------------------------------- 1 | # Tutorial on Plotly Python 3D Surface and its Use in Gradient Descent 2 | 3 | Yuxiao Liu, Fan Wu 4 | 5 | ### Description 6 | 7 | In this tutorial, we will show the application of the 3D surface plot ([Plotly 3D Surface Plots](https://plotly.com/python/3d-surface-plots/) tool) in gradient descent. 8 | 9 | In class, we covered techniques that show continuous data in 3 dimensions, e.g. heatmaps and contour plots. In those techniques, however, the third dimension is frequency rather than a variable per se. The 3D surface plot is like a dimensional expansion of the line chart - the most common and direct way to depict the relationship between continuous variables. 10 | 11 | In gradient descent, the objective is to minimize the objective function (the negative log-likelihood function in the maximum likelihood setting). When the number of parameters is 2, it becomes a perfect use case to visualize via a 3D surface plot, because there are two independent variables (the parameters) and one dependent variable (the negative log-likelihood).
12 | 13 | One of the common issues with gradient descent is that in each iteration, it always moves in the direction of the locally steepest decrease in the objective function, so the process may end up in a local minimum if there is one. A 3D surface plot can help us avoid this, and it can also provide insight into the choice of initial parameter values. 14 | 15 | Source: https://plotly.com/python/3d-surface-plots/ 16 | 17 | ### Gradient Descent Example 18 | 19 | Here we provide an example of a use case: 20 | 21 | * In this example, we try to fit a probability distribution to a 1-dimensional data set via the maximum likelihood method. 22 | * The probability distribution is a mixture of two unit-variance normal distributions whose means are unknown. 23 | * Gradient descent is the intended method to minimize the negative log-likelihood function. 24 | 25 | ```{r, echo=FALSE} 26 | knitr::include_graphics("resources/plotly_python_3d_surface_images/maximum_likelihood_function.JPG") 27 | ``` 28 | 29 | Any gradient descent problem with 2 parameters can be fit into the Python template below. The only thing that needs to be updated is the negative log-likelihood function.
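As a side note, the sensitivity to initial values described above can be reproduced with a tiny self-contained sketch. The toy objective, learning rate, and starting points below are made up for illustration and are independent of this tutorial's likelihood surface; the objective has two basins, and plain gradient descent with numerical gradients lands in whichever basin contains the starting point.

```python
# Toy 2-parameter objective with two minima, at (-1, 0) and (1, 0).
def f(x, y):
    return (x * x - 1.0) ** 2 + y * y

# Central-difference approximation of the gradient.
def grad(x, y, h=1e-6):
    gx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    gy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return gx, gy

def descend(x, y, lr=0.05, steps=500):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# The basin reached depends entirely on the starting point.
print(descend(0.5, 1.0))   # converges near (1, 0)
print(descend(-0.5, 1.0))  # converges near (-1, 0)
```

Inspecting the surface beforehand, as this tutorial does, is exactly what tells you which starting points to avoid.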
30 | 31 | ```{r, include=FALSE} 32 | # this prevents package loading messages from appearing in the rendered version of the book 33 | knitr::opts_chunk$set(warning = FALSE, message = FALSE) 34 | ``` 35 | 36 | ```{python, eval=FALSE} 37 | import numpy as np 38 | import plotly.graph_objects as go 39 | 40 | # Negative log-likelihood function 41 | def neg_log_likelihood_func(theta, data): 42 | N = data.shape[0] 43 | mu1, mu2 = theta 44 | if mu1 > mu2: 45 | return np.nan 46 | loglikelihood = 0 47 | for x in data: 48 | density1 = np.exp(- (x - mu1) ** 2 / 2) / (2 * np.pi) ** 0.5 49 | density2 = np.exp(- (x - mu2) ** 2 / 2) / (2 * np.pi) ** 0.5 50 | loglikelihood = loglikelihood + np.log(0.5 * density1 + 0.5 * density2) 51 | return - loglikelihood 52 | 53 | # Generate a random data set (which does not exactly follow the assumed model) 54 | data1 = np.random.normal(loc = 0, scale = 0.5, size = 20) 55 | data2 = np.random.normal(loc = 30, scale = 0.5, size = 20) 56 | data3 = np.random.normal(loc = 60, scale = 0.5, size = 40) 57 | data = np.concatenate([data1, data2, data3]) 58 | 59 | # (X, Y) represents the grid for (mu1, mu2), Z represents the negative log-likelihood values 60 | x = np.arange(-20, 80, 1) 61 | y = np.arange(-20, 80, 1) 62 | X, Y = np.meshgrid(x, y) 63 | Z = np.array([np.array([neg_log_likelihood_func((X_elem, Y_elem), data) for (X_elem, Y_elem) in zip(X_row, Y_row)]) 64 | for (X_row, Y_row) in zip(X, Y)]) 65 | 66 | # Plot using plotly 67 | # The plotly functions cannot run correctly in Rmd, so we comment them out here and attach the plots (generated in a Python environment) 68 | # fig = go.Figure(data=[go.Surface(z = Z, x = X, y = Y)]) 69 | # fig.update_layout(title='Log-Likelihood', autosize=False, 70 | # width=1000, height=1000) 71 | # fig.show() 72 | ``` 73 | 74 | ### 3D Surface Plots (negative log-likelihood) 75 | 76 | From the 3D surface plot, we clearly see that there is a local minimum (of the negative log-likelihood function) in addition to the global minimum.
The local minimum is located near $\mu_1 = 0, \mu_2 = 50$ (where $\mu_1$ is labeled as $x$ and $\mu_2$ is labeled as $y$). Therefore, before running the gradient descent algorithm, we gain the insight that we should not choose initial parameter values close to the local minimum. 77 | 78 | ```{r, echo=FALSE} 79 | knitr::include_graphics("resources/plotly_python_3d_surface_images/3d_surface_plot1.png") 80 | knitr::include_graphics("resources/plotly_python_3d_surface_images/3d_surface_plot2.png") 81 | ``` 82 | -------------------------------------------------------------------------------- /mapcharts.Rmd: -------------------------------------------------------------------------------- 1 | # (PART) Tutorials {-} 2 | 3 | ```{r, include=FALSE} 4 | knitr::opts_chunk$set(echo = TRUE) 5 | ``` 6 | 7 | # Map Plots with Highcharts 8 | 9 | Hugo Ginoux 10 | 11 | *Highcharts* is originally a very complete JavaScript library for data visualization; its R wrapper is the "highcharter" library. It is harder to learn than ggplot2 but provides very satisfying interactivity. 12 | 13 | In this chapter, we will explore how to create *map charts*, that is to say, heatmaps where the colored cells have the shape of countries, regions or states.
Here is a raw example, showing the GDP of each state in the US (with random data): 14 | 15 | ```{r load} 16 | library(highcharter) 17 | library(dplyr) 18 | library(readxl) 19 | 20 | ``` 21 | 22 | 23 | ```{r firstexample} 24 | mapdata <- get_data_from_map(download_map_data("custom/usa-and-canada")) 25 | 26 | fake_gdp <- data.frame(code=mapdata$`hc-a2`) %>% 27 | mutate(value = 1e5 * abs(rt(nrow(mapdata), df = 10))) 28 | 29 | hcmap( 30 | "custom/usa-and-canada", 31 | data = fake_gdp, 32 | value = "value", 33 | joinBy = c("hc-a2", "code"), 34 | dataLabels = list(enabled = TRUE, format = "{point.name}"), 35 | borderColor = "#FAFAFA", 36 | borderWidth = 0.1, 37 | tooltip = list( 38 | valueDecimals = 2, 39 | valuePrefix = "$", 40 | valueSuffix = "USD" 41 | ) 42 | ) %>% 43 | hc_title(text = "Fake GDP per State") %>% 44 | hc_add_theme(hc_theme_ffx()) 45 | 46 | ``` 47 | 48 | Spectacular, isn't it? Let's dive into the different parts of the code with another example, plotting the proportion of Christians in 2020 in the different European countries. 49 | 50 | ### Collect data 51 | 52 | You need a dataframe containing at least the name of the country and the value to plot in the heatmap. In our example, the data comes from https://www.worldreligiondatabase.org/. I downloaded it as an xlsx file and stored it on my personal server. 53 | 54 | ```{r collect} 55 | 56 | data = read_excel("resources/mapcharts/religions.xlsx") 57 | data = data[data[['Religion 1']]=='Christians', c('Country Name','Religion 1','Pct_2020')] 58 | 59 | head(data) 60 | ``` 61 | 62 | 63 | ### Download the map data and filter 64 | 65 | Then, you need to download the map info with get_data_from_map(download_map_data(name_of_geography)). The list of available geographies, with examples, is here: https://code.highcharts.com/mapdata/ (there are a lot!). In our example, we need a map of Europe.
66 | 67 | ```{r download} 68 | 69 | mapdata = get_data_from_map(download_map_data("custom/europe")) 70 | 71 | mapdata 72 | ``` 73 | 74 | Then, we only keep the rows where the country is in Europe: we filter data by mapdata. 75 | 76 | ```{r filter} 77 | data = data[data[['Country Name']]%in%mapdata$name,] 78 | ``` 79 | 80 | ### Plot the map 81 | 82 | We are now ready to plot a first version of the map! We simply have to call the function hcmap with these arguments: 83 | 84 | * which map to use: in our case, "custom/europe" (the same argument as inside download_map_data()) 85 | * data: the dataframe containing the names of the countries and the value to plot 86 | * value: the name of the column used in the heatmap: in our case, "Pct_2020" 87 | * joinBy: the names of the columns in the 2 dataframes corresponding to the names of the countries. The values must match: "UK" in the first one and "United Kingdom" in the other would not work 88 | 89 | ```{r map} 90 | hcmap( 91 | "custom/europe", 92 | data = data, 93 | value = "Pct_2020", 94 | joinBy = c("name", "Country Name") 95 | ) 96 | ``` 97 | 98 | The result is already interesting, but some parameters can be added to make the chart even more impressive. 99 | 100 | ### Parameters 101 | 102 | #### Add labels 103 | 104 | The parameter "dataLabels" can display the names of the countries. 105 | 106 | ```{r labels} 107 | hcmap( 108 | "custom/europe", 109 | data = data, 110 | value = "Pct_2020", 111 | joinBy = c("name", "Country Name"), 112 | dataLabels = list(enabled = TRUE, format = "{point.name}") 113 | ) 114 | ``` 115 | 116 | #### Plot in % 117 | 118 | The parameter "tooltip" allows us to modify what is displayed when the mouse hovers over a country. Here, I plotted the proportion in %, rounded it to 1 decimal place and added a "%" suffix.
119 | 120 | ```{r %} 121 | data['Pct_2020_%'] = data['Pct_2020']*100 122 | 123 | hcmap( 124 | "custom/europe", 125 | data = data, 126 | value = "Pct_2020_%", 127 | joinBy = c("name", "Country Name"), 128 | dataLabels = list(enabled = TRUE, format = "{point.name}"), 129 | tooltip = list( 130 | valueDecimals = 1, 131 | valueSuffix = '%' 132 | ) 133 | ) 134 | ``` 135 | 136 | #### Add title, theme, colors 137 | 138 | You can add a title with the function "hc_title"; it is also possible to add subtitles. The function "hc_add_theme" adds a theme. Finally, you can change the min and max colors of the heatmap by giving hexadecimal codes to the function "hc_colorAxis". Missing values always appear in white. 139 | 140 | ```{r title} 141 | 142 | hcmap( 143 | "custom/europe", 144 | data = data, 145 | value = "Pct_2020_%", 146 | joinBy = c("name", "Country Name"), 147 | dataLabels = list(enabled = TRUE, format = "{point.name}"), 148 | tooltip = list( 149 | valueDecimals = 1, 150 | valueSuffix = '%' 151 | ) 152 | ) %>% 153 | hc_title(text = "Proportion of Christians per country in 2020") %>% 154 | hc_add_theme(hc_theme_ffx()) %>% 155 | hc_colorAxis(minColor = "#4242f5", maxColor = "#f54242") 156 | ``` 157 | -------------------------------------------------------------------------------- /resources/file_ingestions/Iris.csv: -------------------------------------------------------------------------------- 1 | Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species 2 | 1,5.1,3.5,1.4,0.2,Iris-setosa 3 | 2,4.9,3.0,1.4,0.2,Iris-setosa 4 | 3,4.7,3.2,1.3,0.2,Iris-setosa 5 | 4,4.6,3.1,1.5,0.2,Iris-setosa 6 | 5,5.0,3.6,1.4,0.2,Iris-setosa 7 | 6,5.4,3.9,1.7,0.4,Iris-setosa 8 | 7,4.6,3.4,1.4,0.3,Iris-setosa 9 | 8,5.0,3.4,1.5,0.2,Iris-setosa 10 | 9,4.4,2.9,1.4,0.2,Iris-setosa 11 | 10,4.9,3.1,1.5,0.1,Iris-setosa 12 | 11,5.4,3.7,1.5,0.2,Iris-setosa 13 | 12,4.8,3.4,1.6,0.2,Iris-setosa 14 | 13,4.8,3.0,1.4,0.1,Iris-setosa 15 | 14,4.3,3.0,1.1,0.1,Iris-setosa 16 |
15,5.8,4.0,1.2,0.2,Iris-setosa 17 | 16,5.7,4.4,1.5,0.4,Iris-setosa 18 | 17,5.4,3.9,1.3,0.4,Iris-setosa 19 | 18,5.1,3.5,1.4,0.3,Iris-setosa 20 | 19,5.7,3.8,1.7,0.3,Iris-setosa 21 | 20,5.1,3.8,1.5,0.3,Iris-setosa 22 | 21,5.4,3.4,1.7,0.2,Iris-setosa 23 | 22,5.1,3.7,1.5,0.4,Iris-setosa 24 | 23,4.6,3.6,1.0,0.2,Iris-setosa 25 | 24,5.1,3.3,1.7,0.5,Iris-setosa 26 | 25,4.8,3.4,1.9,0.2,Iris-setosa 27 | 26,5.0,3.0,1.6,0.2,Iris-setosa 28 | 27,5.0,3.4,1.6,0.4,Iris-setosa 29 | 28,5.2,3.5,1.5,0.2,Iris-setosa 30 | 29,5.2,3.4,1.4,0.2,Iris-setosa 31 | 30,4.7,3.2,1.6,0.2,Iris-setosa 32 | 31,4.8,3.1,1.6,0.2,Iris-setosa 33 | 32,5.4,3.4,1.5,0.4,Iris-setosa 34 | 33,5.2,4.1,1.5,0.1,Iris-setosa 35 | 34,5.5,4.2,1.4,0.2,Iris-setosa 36 | 35,4.9,3.1,1.5,0.1,Iris-setosa 37 | 36,5.0,3.2,1.2,0.2,Iris-setosa 38 | 37,5.5,3.5,1.3,0.2,Iris-setosa 39 | 38,4.9,3.1,1.5,0.1,Iris-setosa 40 | 39,4.4,3.0,1.3,0.2,Iris-setosa 41 | 40,5.1,3.4,1.5,0.2,Iris-setosa 42 | 41,5.0,3.5,1.3,0.3,Iris-setosa 43 | 42,4.5,2.3,1.3,0.3,Iris-setosa 44 | 43,4.4,3.2,1.3,0.2,Iris-setosa 45 | 44,5.0,3.5,1.6,0.6,Iris-setosa 46 | 45,5.1,3.8,1.9,0.4,Iris-setosa 47 | 46,4.8,3.0,1.4,0.3,Iris-setosa 48 | 47,5.1,3.8,1.6,0.2,Iris-setosa 49 | 48,4.6,3.2,1.4,0.2,Iris-setosa 50 | 49,5.3,3.7,1.5,0.2,Iris-setosa 51 | 50,5.0,3.3,1.4,0.2,Iris-setosa 52 | 51,7.0,3.2,4.7,1.4,Iris-versicolor 53 | 52,6.4,3.2,4.5,1.5,Iris-versicolor 54 | 53,6.9,3.1,4.9,1.5,Iris-versicolor 55 | 54,5.5,2.3,4.0,1.3,Iris-versicolor 56 | 55,6.5,2.8,4.6,1.5,Iris-versicolor 57 | 56,5.7,2.8,4.5,1.3,Iris-versicolor 58 | 57,6.3,3.3,4.7,1.6,Iris-versicolor 59 | 58,4.9,2.4,3.3,1.0,Iris-versicolor 60 | 59,6.6,2.9,4.6,1.3,Iris-versicolor 61 | 60,5.2,2.7,3.9,1.4,Iris-versicolor 62 | 61,5.0,2.0,3.5,1.0,Iris-versicolor 63 | 62,5.9,3.0,4.2,1.5,Iris-versicolor 64 | 63,6.0,2.2,4.0,1.0,Iris-versicolor 65 | 64,6.1,2.9,4.7,1.4,Iris-versicolor 66 | 65,5.6,2.9,3.6,1.3,Iris-versicolor 67 | 66,6.7,3.1,4.4,1.4,Iris-versicolor 68 | 67,5.6,3.0,4.5,1.5,Iris-versicolor 69 | 
68,5.8,2.7,4.1,1.0,Iris-versicolor 70 | 69,6.2,2.2,4.5,1.5,Iris-versicolor 71 | 70,5.6,2.5,3.9,1.1,Iris-versicolor 72 | 71,5.9,3.2,4.8,1.8,Iris-versicolor 73 | 72,6.1,2.8,4.0,1.3,Iris-versicolor 74 | 73,6.3,2.5,4.9,1.5,Iris-versicolor 75 | 74,6.1,2.8,4.7,1.2,Iris-versicolor 76 | 75,6.4,2.9,4.3,1.3,Iris-versicolor 77 | 76,6.6,3.0,4.4,1.4,Iris-versicolor 78 | 77,6.8,2.8,4.8,1.4,Iris-versicolor 79 | 78,6.7,3.0,5.0,1.7,Iris-versicolor 80 | 79,6.0,2.9,4.5,1.5,Iris-versicolor 81 | 80,5.7,2.6,3.5,1.0,Iris-versicolor 82 | 81,5.5,2.4,3.8,1.1,Iris-versicolor 83 | 82,5.5,2.4,3.7,1.0,Iris-versicolor 84 | 83,5.8,2.7,3.9,1.2,Iris-versicolor 85 | 84,6.0,2.7,5.1,1.6,Iris-versicolor 86 | 85,5.4,3.0,4.5,1.5,Iris-versicolor 87 | 86,6.0,3.4,4.5,1.6,Iris-versicolor 88 | 87,6.7,3.1,4.7,1.5,Iris-versicolor 89 | 88,6.3,2.3,4.4,1.3,Iris-versicolor 90 | 89,5.6,3.0,4.1,1.3,Iris-versicolor 91 | 90,5.5,2.5,4.0,1.3,Iris-versicolor 92 | 91,5.5,2.6,4.4,1.2,Iris-versicolor 93 | 92,6.1,3.0,4.6,1.4,Iris-versicolor 94 | 93,5.8,2.6,4.0,1.2,Iris-versicolor 95 | 94,5.0,2.3,3.3,1.0,Iris-versicolor 96 | 95,5.6,2.7,4.2,1.3,Iris-versicolor 97 | 96,5.7,3.0,4.2,1.2,Iris-versicolor 98 | 97,5.7,2.9,4.2,1.3,Iris-versicolor 99 | 98,6.2,2.9,4.3,1.3,Iris-versicolor 100 | 99,5.1,2.5,3.0,1.1,Iris-versicolor 101 | 100,5.7,2.8,4.1,1.3,Iris-versicolor 102 | 101,6.3,3.3,6.0,2.5,Iris-virginica 103 | 102,5.8,2.7,5.1,1.9,Iris-virginica 104 | 103,7.1,3.0,5.9,2.1,Iris-virginica 105 | 104,6.3,2.9,5.6,1.8,Iris-virginica 106 | 105,6.5,3.0,5.8,2.2,Iris-virginica 107 | 106,7.6,3.0,6.6,2.1,Iris-virginica 108 | 107,4.9,2.5,4.5,1.7,Iris-virginica 109 | 108,7.3,2.9,6.3,1.8,Iris-virginica 110 | 109,6.7,2.5,5.8,1.8,Iris-virginica 111 | 110,7.2,3.6,6.1,2.5,Iris-virginica 112 | 111,6.5,3.2,5.1,2.0,Iris-virginica 113 | 112,6.4,2.7,5.3,1.9,Iris-virginica 114 | 113,6.8,3.0,5.5,2.1,Iris-virginica 115 | 114,5.7,2.5,5.0,2.0,Iris-virginica 116 | 115,5.8,2.8,5.1,2.4,Iris-virginica 117 | 116,6.4,3.2,5.3,2.3,Iris-virginica 118 | 
117,6.5,3.0,5.5,1.8,Iris-virginica 119 | 118,7.7,3.8,6.7,2.2,Iris-virginica 120 | 119,7.7,2.6,6.9,2.3,Iris-virginica 121 | 120,6.0,2.2,5.0,1.5,Iris-virginica 122 | 121,6.9,3.2,5.7,2.3,Iris-virginica 123 | 122,5.6,2.8,4.9,2.0,Iris-virginica 124 | 123,7.7,2.8,6.7,2.0,Iris-virginica 125 | 124,6.3,2.7,4.9,1.8,Iris-virginica 126 | 125,6.7,3.3,5.7,2.1,Iris-virginica 127 | 126,7.2,3.2,6.0,1.8,Iris-virginica 128 | 127,6.2,2.8,4.8,1.8,Iris-virginica 129 | 128,6.1,3.0,4.9,1.8,Iris-virginica 130 | 129,6.4,2.8,5.6,2.1,Iris-virginica 131 | 130,7.2,3.0,5.8,1.6,Iris-virginica 132 | 131,7.4,2.8,6.1,1.9,Iris-virginica 133 | 132,7.9,3.8,6.4,2.0,Iris-virginica 134 | 133,6.4,2.8,5.6,2.2,Iris-virginica 135 | 134,6.3,2.8,5.1,1.5,Iris-virginica 136 | 135,6.1,2.6,5.6,1.4,Iris-virginica 137 | 136,7.7,3.0,6.1,2.3,Iris-virginica 138 | 137,6.3,3.4,5.6,2.4,Iris-virginica 139 | 138,6.4,3.1,5.5,1.8,Iris-virginica 140 | 139,6.0,3.0,4.8,1.8,Iris-virginica 141 | 140,6.9,3.1,5.4,2.1,Iris-virginica 142 | 141,6.7,3.1,5.6,2.4,Iris-virginica 143 | 142,6.9,3.1,5.1,2.3,Iris-virginica 144 | 143,5.8,2.7,5.1,1.9,Iris-virginica 145 | 144,6.8,3.2,5.9,2.3,Iris-virginica 146 | 145,6.7,3.3,5.7,2.5,Iris-virginica 147 | 146,6.7,3.0,5.2,2.3,Iris-virginica 148 | 147,6.3,2.5,5.0,1.9,Iris-virginica 149 | 148,6.5,3.0,5.2,2.0,Iris-virginica 150 | 149,6.2,3.4,5.4,2.3,Iris-virginica 151 | 150,5.9,3.0,5.1,1.8,Iris-virginica 152 | -------------------------------------------------------------------------------- /resources/file_ingestions/Iris.xlsx: -------------------------------------------------------------------------------- 1 | Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species 2 | 1,5.1,3.5,1.4,0.2,Iris-setosa 3 | 2,4.9,3.0,1.4,0.2,Iris-setosa 4 | 3,4.7,3.2,1.3,0.2,Iris-setosa 5 | 4,4.6,3.1,1.5,0.2,Iris-setosa 6 | 5,5.0,3.6,1.4,0.2,Iris-setosa 7 | 6,5.4,3.9,1.7,0.4,Iris-setosa 8 | 7,4.6,3.4,1.4,0.3,Iris-setosa 9 | 8,5.0,3.4,1.5,0.2,Iris-setosa 10 | 9,4.4,2.9,1.4,0.2,Iris-setosa 11 | 
10,4.9,3.1,1.5,0.1,Iris-setosa 12 | 11,5.4,3.7,1.5,0.2,Iris-setosa 13 | 12,4.8,3.4,1.6,0.2,Iris-setosa 14 | 13,4.8,3.0,1.4,0.1,Iris-setosa 15 | 14,4.3,3.0,1.1,0.1,Iris-setosa 16 | 15,5.8,4.0,1.2,0.2,Iris-setosa 17 | 16,5.7,4.4,1.5,0.4,Iris-setosa 18 | 17,5.4,3.9,1.3,0.4,Iris-setosa 19 | 18,5.1,3.5,1.4,0.3,Iris-setosa 20 | 19,5.7,3.8,1.7,0.3,Iris-setosa 21 | 20,5.1,3.8,1.5,0.3,Iris-setosa 22 | 21,5.4,3.4,1.7,0.2,Iris-setosa 23 | 22,5.1,3.7,1.5,0.4,Iris-setosa 24 | 23,4.6,3.6,1.0,0.2,Iris-setosa 25 | 24,5.1,3.3,1.7,0.5,Iris-setosa 26 | 25,4.8,3.4,1.9,0.2,Iris-setosa 27 | 26,5.0,3.0,1.6,0.2,Iris-setosa 28 | 27,5.0,3.4,1.6,0.4,Iris-setosa 29 | 28,5.2,3.5,1.5,0.2,Iris-setosa 30 | 29,5.2,3.4,1.4,0.2,Iris-setosa 31 | 30,4.7,3.2,1.6,0.2,Iris-setosa 32 | 31,4.8,3.1,1.6,0.2,Iris-setosa 33 | 32,5.4,3.4,1.5,0.4,Iris-setosa 34 | 33,5.2,4.1,1.5,0.1,Iris-setosa 35 | 34,5.5,4.2,1.4,0.2,Iris-setosa 36 | 35,4.9,3.1,1.5,0.1,Iris-setosa 37 | 36,5.0,3.2,1.2,0.2,Iris-setosa 38 | 37,5.5,3.5,1.3,0.2,Iris-setosa 39 | 38,4.9,3.1,1.5,0.1,Iris-setosa 40 | 39,4.4,3.0,1.3,0.2,Iris-setosa 41 | 40,5.1,3.4,1.5,0.2,Iris-setosa 42 | 41,5.0,3.5,1.3,0.3,Iris-setosa 43 | 42,4.5,2.3,1.3,0.3,Iris-setosa 44 | 43,4.4,3.2,1.3,0.2,Iris-setosa 45 | 44,5.0,3.5,1.6,0.6,Iris-setosa 46 | 45,5.1,3.8,1.9,0.4,Iris-setosa 47 | 46,4.8,3.0,1.4,0.3,Iris-setosa 48 | 47,5.1,3.8,1.6,0.2,Iris-setosa 49 | 48,4.6,3.2,1.4,0.2,Iris-setosa 50 | 49,5.3,3.7,1.5,0.2,Iris-setosa 51 | 50,5.0,3.3,1.4,0.2,Iris-setosa 52 | 51,7.0,3.2,4.7,1.4,Iris-versicolor 53 | 52,6.4,3.2,4.5,1.5,Iris-versicolor 54 | 53,6.9,3.1,4.9,1.5,Iris-versicolor 55 | 54,5.5,2.3,4.0,1.3,Iris-versicolor 56 | 55,6.5,2.8,4.6,1.5,Iris-versicolor 57 | 56,5.7,2.8,4.5,1.3,Iris-versicolor 58 | 57,6.3,3.3,4.7,1.6,Iris-versicolor 59 | 58,4.9,2.4,3.3,1.0,Iris-versicolor 60 | 59,6.6,2.9,4.6,1.3,Iris-versicolor 61 | 60,5.2,2.7,3.9,1.4,Iris-versicolor 62 | 61,5.0,2.0,3.5,1.0,Iris-versicolor 63 | 62,5.9,3.0,4.2,1.5,Iris-versicolor 64 | 63,6.0,2.2,4.0,1.0,Iris-versicolor 65 | 
64,6.1,2.9,4.7,1.4,Iris-versicolor 66 | 65,5.6,2.9,3.6,1.3,Iris-versicolor 67 | 66,6.7,3.1,4.4,1.4,Iris-versicolor 68 | 67,5.6,3.0,4.5,1.5,Iris-versicolor 69 | 68,5.8,2.7,4.1,1.0,Iris-versicolor 70 | 69,6.2,2.2,4.5,1.5,Iris-versicolor 71 | 70,5.6,2.5,3.9,1.1,Iris-versicolor 72 | 71,5.9,3.2,4.8,1.8,Iris-versicolor 73 | 72,6.1,2.8,4.0,1.3,Iris-versicolor 74 | 73,6.3,2.5,4.9,1.5,Iris-versicolor 75 | 74,6.1,2.8,4.7,1.2,Iris-versicolor 76 | 75,6.4,2.9,4.3,1.3,Iris-versicolor 77 | 76,6.6,3.0,4.4,1.4,Iris-versicolor 78 | 77,6.8,2.8,4.8,1.4,Iris-versicolor 79 | 78,6.7,3.0,5.0,1.7,Iris-versicolor 80 | 79,6.0,2.9,4.5,1.5,Iris-versicolor 81 | 80,5.7,2.6,3.5,1.0,Iris-versicolor 82 | 81,5.5,2.4,3.8,1.1,Iris-versicolor 83 | 82,5.5,2.4,3.7,1.0,Iris-versicolor 84 | 83,5.8,2.7,3.9,1.2,Iris-versicolor 85 | 84,6.0,2.7,5.1,1.6,Iris-versicolor 86 | 85,5.4,3.0,4.5,1.5,Iris-versicolor 87 | 86,6.0,3.4,4.5,1.6,Iris-versicolor 88 | 87,6.7,3.1,4.7,1.5,Iris-versicolor 89 | 88,6.3,2.3,4.4,1.3,Iris-versicolor 90 | 89,5.6,3.0,4.1,1.3,Iris-versicolor 91 | 90,5.5,2.5,4.0,1.3,Iris-versicolor 92 | 91,5.5,2.6,4.4,1.2,Iris-versicolor 93 | 92,6.1,3.0,4.6,1.4,Iris-versicolor 94 | 93,5.8,2.6,4.0,1.2,Iris-versicolor 95 | 94,5.0,2.3,3.3,1.0,Iris-versicolor 96 | 95,5.6,2.7,4.2,1.3,Iris-versicolor 97 | 96,5.7,3.0,4.2,1.2,Iris-versicolor 98 | 97,5.7,2.9,4.2,1.3,Iris-versicolor 99 | 98,6.2,2.9,4.3,1.3,Iris-versicolor 100 | 99,5.1,2.5,3.0,1.1,Iris-versicolor 101 | 100,5.7,2.8,4.1,1.3,Iris-versicolor 102 | 101,6.3,3.3,6.0,2.5,Iris-virginica 103 | 102,5.8,2.7,5.1,1.9,Iris-virginica 104 | 103,7.1,3.0,5.9,2.1,Iris-virginica 105 | 104,6.3,2.9,5.6,1.8,Iris-virginica 106 | 105,6.5,3.0,5.8,2.2,Iris-virginica 107 | 106,7.6,3.0,6.6,2.1,Iris-virginica 108 | 107,4.9,2.5,4.5,1.7,Iris-virginica 109 | 108,7.3,2.9,6.3,1.8,Iris-virginica 110 | 109,6.7,2.5,5.8,1.8,Iris-virginica 111 | 110,7.2,3.6,6.1,2.5,Iris-virginica 112 | 111,6.5,3.2,5.1,2.0,Iris-virginica 113 | 112,6.4,2.7,5.3,1.9,Iris-virginica 114 | 
113,6.8,3.0,5.5,2.1,Iris-virginica 115 | 114,5.7,2.5,5.0,2.0,Iris-virginica 116 | 115,5.8,2.8,5.1,2.4,Iris-virginica 117 | 116,6.4,3.2,5.3,2.3,Iris-virginica 118 | 117,6.5,3.0,5.5,1.8,Iris-virginica 119 | 118,7.7,3.8,6.7,2.2,Iris-virginica 120 | 119,7.7,2.6,6.9,2.3,Iris-virginica 121 | 120,6.0,2.2,5.0,1.5,Iris-virginica 122 | 121,6.9,3.2,5.7,2.3,Iris-virginica 123 | 122,5.6,2.8,4.9,2.0,Iris-virginica 124 | 123,7.7,2.8,6.7,2.0,Iris-virginica 125 | 124,6.3,2.7,4.9,1.8,Iris-virginica 126 | 125,6.7,3.3,5.7,2.1,Iris-virginica 127 | 126,7.2,3.2,6.0,1.8,Iris-virginica 128 | 127,6.2,2.8,4.8,1.8,Iris-virginica 129 | 128,6.1,3.0,4.9,1.8,Iris-virginica 130 | 129,6.4,2.8,5.6,2.1,Iris-virginica 131 | 130,7.2,3.0,5.8,1.6,Iris-virginica 132 | 131,7.4,2.8,6.1,1.9,Iris-virginica 133 | 132,7.9,3.8,6.4,2.0,Iris-virginica 134 | 133,6.4,2.8,5.6,2.2,Iris-virginica 135 | 134,6.3,2.8,5.1,1.5,Iris-virginica 136 | 135,6.1,2.6,5.6,1.4,Iris-virginica 137 | 136,7.7,3.0,6.1,2.3,Iris-virginica 138 | 137,6.3,3.4,5.6,2.4,Iris-virginica 139 | 138,6.4,3.1,5.5,1.8,Iris-virginica 140 | 139,6.0,3.0,4.8,1.8,Iris-virginica 141 | 140,6.9,3.1,5.4,2.1,Iris-virginica 142 | 141,6.7,3.1,5.6,2.4,Iris-virginica 143 | 142,6.9,3.1,5.1,2.3,Iris-virginica 144 | 143,5.8,2.7,5.1,1.9,Iris-virginica 145 | 144,6.8,3.2,5.9,2.3,Iris-virginica 146 | 145,6.7,3.3,5.7,2.5,Iris-virginica 147 | 146,6.7,3.0,5.2,2.3,Iris-virginica 148 | 147,6.3,2.5,5.0,1.9,Iris-virginica 149 | 148,6.5,3.0,5.2,2.0,Iris-virginica 150 | 149,6.2,3.4,5.4,2.3,Iris-virginica 151 | 150,5.9,3.0,5.1,1.8,Iris-virginica 152 | -------------------------------------------------------------------------------- /r_neural_network.Rmd: -------------------------------------------------------------------------------- 1 | # Introduction to neural network in R 2 | 3 | Zhihang Xia 4 | 5 | ```{r, include=FALSE} 6 | knitr::opts_chunk$set(echo = TRUE, eval = FALSE) 7 | ``` 8 | 9 | ```{r, eval=FALSE} 10 | library(tensorflow) 11 | library(keras) 12 | ``` 13 | 14 | ## 
Motivation 15 | 16 | In our modern data-driven society, it is becoming easier and easier to obtain massive amounts of data, which has fueled the rapid development and wide application of machine learning. Machine learning is certainly one of the best approaches when faced with huge amounts of data, but it is undeniable that, as the technology has developed, some machine learning models, especially basic ones such as SVMs and decision trees, have been used less and less in industry. Since deep learning techniques appeared, neural network models have become more popular. Unlike classical machine learning, which relies heavily on hand-designed algorithms, deep learning uses neural networks modeled on the human brain. Therefore, deep learning has some advantages over classical machine learning. For example, neural networks work well with unstructured data, which means they perform better on messy data, the kind we are most likely to encounter in the real world. Here, I would like to introduce basic deep learning in R, to get you familiar with neural network training from end to end. 17 | 18 | ## Basic idea of Neural Network 19 | 20 | ![Basic Structure of Neural Network](resources/r_neural_network/nn.png) 21 | The basic idea of a neural network is to simulate the neural network of the human brain: it consists of an input layer, followed by several hidden layers and an output layer. Each unit of the current layer is a linear combination of the units from the previous layer. The linear result can also be transformed by an activation function, such as relu or softmax. 22 | 23 | ## Example 24 | 25 | ### Loading and processing data 26 | 27 | We are going to use the MNIST dataset in keras, which contains images of handwritten digits 0-9 and the corresponding labels.
28 | 29 | ```{r, eval = FALSE} 30 | c(c(X_train, y_train), c(X_test, y_test)) %<-% keras::dataset_mnist() 31 | ``` 32 | 33 | ```{r, eval = FALSE} 34 | dim(X_train) 35 | dim(y_train) 36 | dim(X_test) 37 | dim(y_test) 38 | ``` 39 | By printing out the dimensions, we can see that there are 60000 training images and 10000 test images in total. Also, each image is represented by 28x28 pixel values, and each label is the corresponding digit. 40 | 41 | ```{r, eval=FALSE} 42 | image(X_train[50,1:28,1:28]) 43 | y_train[50] 44 | ``` 45 | Here we check that the data loaded correctly: the 50th image is displayed, and its label is "3". 46 | 47 | Then we need to rescale the images so that each pixel value lies between 0 and 1. 48 | 49 | ```{r, eval=FALSE} 50 | X_train <- X_train / 255 51 | X_test <- X_test / 255 52 | y_train <- to_categorical(y_train, num_classes = 10) 53 | y_test <- to_categorical(y_test, num_classes = 10) 54 | ``` 55 | 56 | 57 | ### Build and compile our model 58 | 59 | Here, we sequentially build up the neural network layers in this way: 60 | 61 | 1. a layer that flattens the image input from the shape (28,28) to the one-dimensional shape (28*28) 62 | 63 | 2. a fully-connected layer with 128 hidden units, followed by a relu activation function 64 | 65 | 3. an output layer with unit size 10 66 | 67 | ```{r, eval=FALSE} 68 | model <- keras_model_sequential() %>% 69 | layer_flatten(input_shape = c(28,28)) %>% 70 | layer_dense(128, activation = "relu") %>% 71 | layer_dense(10) 72 | ``` 73 | 74 | Then we compile our model with the sgd optimizer and the categorical cross entropy loss function. 75 | 76 | The optimizer specifies how the gradient descent updates of the model parameters are applied, and the loss function is the objective function to minimize.
77 | 78 | ```{r, eval=FALSE} 79 | opt <- optimizer_sgd() 80 | loss_fn <- loss_categorical_crossentropy(from_logits = TRUE) 81 | ``` 82 | 83 | ```{r, eval=FALSE} 84 | model %>% compile( 85 | optimizer = opt, 86 | loss = loss_fn, 87 | metrics = c("accuracy") 88 | ) 89 | ``` 90 | 91 | ```{r, eval=FALSE} 92 | summary(model) 93 | ``` 94 | 95 | 96 | ### Train and test model 97 | 98 | Now we train our compiled model for 10 epochs with a batch size of 32. Also, we use 20% of the training set as our validation set. 99 | 100 | ```{r, eval=FALSE} 101 | history <- model %>% fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2) 102 | history 103 | ``` 104 | Here is the training/validation loss and accuracy during the training process. R will automatically plot the output. 105 | ![Training and validation loss and accuracy](resources/r_neural_network/loss.png) 106 | 107 | Finally, we use the test set to evaluate our model. 108 | 109 | ```{r, eval=FALSE} 110 | model %>% evaluate(X_test, y_test) 111 | ``` 112 | 113 | ### Change optimizer 114 | 115 | Now we try the adam optimizer instead of sgd. 116 | 117 | ```{r, eval=FALSE} 118 | model1 <- keras_model_sequential() %>% 119 | layer_flatten(input_shape = c(28,28)) %>% 120 | layer_dense(128, activation = "relu") %>% 121 | layer_dense(10) 122 | ``` 123 | 124 | ```{r, eval=FALSE} 125 | opt1 <- optimizer_adam() 126 | ``` 127 | 128 | ```{r, eval=FALSE} 129 | model1 %>% compile( 130 | optimizer = opt1, 131 | loss = loss_fn, 132 | metrics = c("accuracy") 133 | ) 134 | ``` 135 | 136 | ```{r, eval=FALSE} 137 | history1 <- model1 %>% fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2) 138 | history1 139 | ``` 140 | 141 | ```{r, eval=FALSE} 142 | model1 %>% evaluate(X_test, y_test) 143 | ``` 144 | 145 | We see that the model with the adam optimizer has a higher test accuracy than the one with the sgd optimizer.
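Once a model is trained, it can also be used for individual predictions rather than just aggregate metrics. A minimal sketch (assuming the fitted `model1` from above; since the last layer outputs logits, `which.max` on the raw outputs is enough to recover the predicted digit, because argmax is unchanged by softmax):

```{r, eval=FALSE}
# Predict the first five test images; predict() returns a 5 x 10 matrix of logits.
logits <- model1 %>% predict(X_test[1:5, , ])

# Subtract 1 to map R's 1-based column index back to the digits 0-9.
pred_digits <- apply(logits, 1, which.max) - 1
true_digits <- apply(y_test[1:5, ], 1, which.max) - 1  # y_test is one-hot encoded
data.frame(predicted = pred_digits, truth = true_digits)
```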
146 | 147 | ## Reference 148 | 149 | [https://www.datacamp.com/tutorial/keras-r-deep-learning](https://www.datacamp.com/tutorial/keras-r-deep-learning) 150 | 151 | [https://tensorflow.rstudio.com/tutorials/quickstart/beginner](https://tensorflow.rstudio.com/tutorials/quickstart/beginner) 152 | -------------------------------------------------------------------------------- /resources/Visualization_in_python_vs_R/vaccinations.csv: -------------------------------------------------------------------------------- 1 | "survey","freq","subject","response","start_date","end_date" 2 | "ms153_NSA",48,1,"Always",2010-09-22,2010-10-25 3 | "ms153_NSA",9,2,"Always",2010-09-22,2010-10-25 4 | "ms153_NSA",66,3,"Always",2010-09-22,2010-10-25 5 | "ms153_NSA",1,4,"Always",2010-09-22,2010-10-25 6 | "ms153_NSA",11,5,"Always",2010-09-22,2010-10-25 7 | "ms153_NSA",1,6,"Always",2010-09-22,2010-10-25 8 | "ms153_NSA",5,7,"Always",2010-09-22,2010-10-25 9 | "ms153_NSA",4,8,"Always",2010-09-22,2010-10-25 10 | "ms153_NSA",24,9,"Missing",2010-09-22,2010-10-25 11 | "ms153_NSA",4,10,"Missing",2010-09-22,2010-10-25 12 | "ms153_NSA",29,11,"Missing",2010-09-22,2010-10-25 13 | "ms153_NSA",1,12,"Missing",2010-09-22,2010-10-25 14 | "ms153_NSA",1,13,"Missing",2010-09-22,2010-10-25 15 | "ms153_NSA",9,14,"Missing",2010-09-22,2010-10-25 16 | "ms153_NSA",10,15,"Missing",2010-09-22,2010-10-25 17 | "ms153_NSA",14,16,"Missing",2010-09-22,2010-10-25 18 | "ms153_NSA",3,17,"Never",2010-09-22,2010-10-25 19 | "ms153_NSA",7,18,"Never",2010-09-22,2010-10-25 20 | "ms153_NSA",42,19,"Never",2010-09-22,2010-10-25 21 | "ms153_NSA",20,20,"Never",2010-09-22,2010-10-25 22 | "ms153_NSA",2,21,"Never",2010-09-22,2010-10-25 23 | "ms153_NSA",36,22,"Never",2010-09-22,2010-10-25 24 | "ms153_NSA",7,23,"Never",2010-09-22,2010-10-25 25 | "ms153_NSA",2,24,"Never",2010-09-22,2010-10-25 26 | "ms153_NSA",4,25,"Never",2010-09-22,2010-10-25 27 | "ms153_NSA",9,26,"Never",2010-09-22,2010-10-25 28 | 
"ms153_NSA",48,27,"Sometimes",2010-09-22,2010-10-25 29 | "ms153_NSA",11,28,"Sometimes",2010-09-22,2010-10-25 30 | "ms153_NSA",99,29,"Sometimes",2010-09-22,2010-10-25 31 | "ms153_NSA",3,30,"Sometimes",2010-09-22,2010-10-25 32 | "ms153_NSA",41,31,"Sometimes",2010-09-22,2010-10-25 33 | "ms153_NSA",168,32,"Sometimes",2010-09-22,2010-10-25 34 | "ms153_NSA",2,33,"Sometimes",2010-09-22,2010-10-25 35 | "ms153_NSA",23,34,"Sometimes",2010-09-22,2010-10-25 36 | "ms153_NSA",22,35,"Sometimes",2010-09-22,2010-10-25 37 | "ms153_NSA",38,36,"Sometimes",2010-09-22,2010-10-25 38 | "ms153_NSA",1,37,"Sometimes",2010-09-22,2010-10-25 39 | "ms153_NSA",15,38,"Sometimes",2010-09-22,2010-10-25 40 | "ms153_NSA",113,39,"Sometimes",2010-09-22,2010-10-25 41 | "ms432_NSA",48,1,"Always",2015-06-04,2015-10-05 42 | "ms432_NSA",9,2,"Always",2015-06-04,2015-10-05 43 | "ms432_NSA",66,3,"Missing",2015-06-04,2015-10-05 44 | "ms432_NSA",1,4,"Missing",2015-06-04,2015-10-05 45 | "ms432_NSA",11,5,"Missing",2015-06-04,2015-10-05 46 | "ms432_NSA",1,6,"Never",2015-06-04,2015-10-05 47 | "ms432_NSA",5,7,"Sometimes",2015-06-04,2015-10-05 48 | "ms432_NSA",4,8,"Sometimes",2015-06-04,2015-10-05 49 | "ms432_NSA",24,9,"Always",2015-06-04,2015-10-05 50 | "ms432_NSA",4,10,"Always",2015-06-04,2015-10-05 51 | "ms432_NSA",29,11,"Missing",2015-06-04,2015-10-05 52 | "ms432_NSA",1,12,"Missing",2015-06-04,2015-10-05 53 | "ms432_NSA",1,13,"Missing",2015-06-04,2015-10-05 54 | "ms432_NSA",9,14,"Missing",2015-06-04,2015-10-05 55 | "ms432_NSA",10,15,"Sometimes",2015-06-04,2015-10-05 56 | "ms432_NSA",14,16,"Sometimes",2015-06-04,2015-10-05 57 | "ms432_NSA",3,17,"Always",2015-06-04,2015-10-05 58 | "ms432_NSA",7,18,"Missing",2015-06-04,2015-10-05 59 | "ms432_NSA",42,19,"Missing",2015-06-04,2015-10-05 60 | "ms432_NSA",20,20,"Missing",2015-06-04,2015-10-05 61 | "ms432_NSA",2,21,"Never",2015-06-04,2015-10-05 62 | "ms432_NSA",36,22,"Never",2015-06-04,2015-10-05 63 | "ms432_NSA",7,23,"Never",2015-06-04,2015-10-05 64 | 
"ms432_NSA",2,24,"Sometimes",2015-06-04,2015-10-05 65 | "ms432_NSA",4,25,"Sometimes",2015-06-04,2015-10-05 66 | "ms432_NSA",9,26,"Sometimes",2015-06-04,2015-10-05 67 | "ms432_NSA",48,27,"Always",2015-06-04,2015-10-05 68 | "ms432_NSA",11,28,"Always",2015-06-04,2015-10-05 69 | "ms432_NSA",99,29,"Missing",2015-06-04,2015-10-05 70 | "ms432_NSA",3,30,"Missing",2015-06-04,2015-10-05 71 | "ms432_NSA",41,31,"Missing",2015-06-04,2015-10-05 72 | "ms432_NSA",168,32,"Missing",2015-06-04,2015-10-05 73 | "ms432_NSA",2,33,"Never",2015-06-04,2015-10-05 74 | "ms432_NSA",23,34,"Never",2015-06-04,2015-10-05 75 | "ms432_NSA",22,35,"Never",2015-06-04,2015-10-05 76 | "ms432_NSA",38,36,"Sometimes",2015-06-04,2015-10-05 77 | "ms432_NSA",1,37,"Sometimes",2015-06-04,2015-10-05 78 | "ms432_NSA",15,38,"Sometimes",2015-06-04,2015-10-05 79 | "ms432_NSA",113,39,"Sometimes",2015-06-04,2015-10-05 80 | "ms460_NSA",48,1,"Always",2016-09-27,2016-10-25 81 | "ms460_NSA",9,2,"Sometimes",2016-09-27,2016-10-25 82 | "ms460_NSA",66,3,"Always",2016-09-27,2016-10-25 83 | "ms460_NSA",1,4,"Never",2016-09-27,2016-10-25 84 | "ms460_NSA",11,5,"Sometimes",2016-09-27,2016-10-25 85 | "ms460_NSA",1,6,"Always",2016-09-27,2016-10-25 86 | "ms460_NSA",5,7,"Always",2016-09-27,2016-10-25 87 | "ms460_NSA",4,8,"Sometimes",2016-09-27,2016-10-25 88 | "ms460_NSA",24,9,"Always",2016-09-27,2016-10-25 89 | "ms460_NSA",4,10,"Sometimes",2016-09-27,2016-10-25 90 | "ms460_NSA",29,11,"Always",2016-09-27,2016-10-25 91 | "ms460_NSA",1,12,"Missing",2016-09-27,2016-10-25 92 | "ms460_NSA",1,13,"Never",2016-09-27,2016-10-25 93 | "ms460_NSA",9,14,"Sometimes",2016-09-27,2016-10-25 94 | "ms460_NSA",10,15,"Always",2016-09-27,2016-10-25 95 | "ms460_NSA",14,16,"Sometimes",2016-09-27,2016-10-25 96 | "ms460_NSA",3,17,"Always",2016-09-27,2016-10-25 97 | "ms460_NSA",7,18,"Always",2016-09-27,2016-10-25 98 | "ms460_NSA",42,19,"Never",2016-09-27,2016-10-25 99 | "ms460_NSA",20,20,"Sometimes",2016-09-27,2016-10-25 100 | 
"ms460_NSA",2,21,"Missing",2016-09-27,2016-10-25 101 | "ms460_NSA",36,22,"Never",2016-09-27,2016-10-25 102 | "ms460_NSA",7,23,"Sometimes",2016-09-27,2016-10-25 103 | "ms460_NSA",2,24,"Always",2016-09-27,2016-10-25 104 | "ms460_NSA",4,25,"Never",2016-09-27,2016-10-25 105 | "ms460_NSA",9,26,"Sometimes",2016-09-27,2016-10-25 106 | "ms460_NSA",48,27,"Always",2016-09-27,2016-10-25 107 | "ms460_NSA",11,28,"Sometimes",2016-09-27,2016-10-25 108 | "ms460_NSA",99,29,"Always",2016-09-27,2016-10-25 109 | "ms460_NSA",3,30,"Missing",2016-09-27,2016-10-25 110 | "ms460_NSA",41,31,"Never",2016-09-27,2016-10-25 111 | "ms460_NSA",168,32,"Sometimes",2016-09-27,2016-10-25 112 | "ms460_NSA",2,33,"Always",2016-09-27,2016-10-25 113 | "ms460_NSA",23,34,"Never",2016-09-27,2016-10-25 114 | "ms460_NSA",22,35,"Sometimes",2016-09-27,2016-10-25 115 | "ms460_NSA",38,36,"Always",2016-09-27,2016-10-25 116 | "ms460_NSA",1,37,"Missing",2016-09-27,2016-10-25 117 | "ms460_NSA",15,38,"Never",2016-09-27,2016-10-25 118 | "ms460_NSA",113,39,"Sometimes",2016-09-27,2016-10-25 119 | -------------------------------------------------------------------------------- /color_deficiency.Rmd: -------------------------------------------------------------------------------- 1 | # Color vision deficiency 2 | 3 | Margaret Reed and Yaxin Hu 4 | 5 | ```{r} 6 | library(tidyverse) 7 | library(ggokabeito) 8 | ``` 9 | 10 | Color vision deficiency is typically referred to as color blindness and can be defined as the inability to distinguish certain colors, or any colors at all. Research shows that about 8% of men and 0,5% of women suffer from color blindness deficiency to some extent. 11 | 12 | As mentioned in class, color vision deficiency is an important issue to be aware of when creating data visualizations. While red-green confusion is the most well known there are several types, all of which should be considered when creating accessible visualizations. 
13 | 14 | Problems faced by color blind people when it comes to data visualization: 15 | 16 | * Not being able to recognize series or categories: 17 | The biggest problem occurs when people suffering from CVD have to distinguish between categories. Certain colors that look totally different to people with typical color vision may look almost identical to color blind people. 18 | 19 | * Not being able to recognize the key message at a glance: 20 | Another problem that might occur is that color blind people may not be able to recognize the key message at a glance. For example, you might color code your graph so that increases are colored in green and decreases in red, but color blind readers wouldn't notice the distinction. 21 | 22 | So it is important to design color vision deficiency friendly graphs when creating visualizations. 23 | 24 | 25 | ### Types of color vision deficiencies: 26 | 27 | #### Red-green color blindness 28 | 29 | The most common type of color blindness makes it hard to tell the difference between red and green. 30 | 31 | There are 4 types of red-green color blindness: 32 | 33 | **Deuteranomaly** is the most common type of red-green color blindness. It makes green look more red. This type is mild and doesn't usually get in the way of normal activities. 34 | 35 | **Protanomaly** makes red look more green and less bright. This type is mild and usually doesn't get in the way of normal activities. 36 | 37 | **Protanopia** and **deuteranopia** both make you unable to tell the difference between red and green at all. 38 | 39 | #### Blue-yellow color blindness 40 | 41 | This less-common type of color blindness makes it hard to tell the difference between blue and green, and between yellow and red. 42 | 43 | There are 2 types of blue-yellow color blindness: 44 | 45 | **Tritanomaly** makes it hard to tell the difference between blue and green, and between yellow and red.
46 | 47 | **Tritanopia** makes you unable to tell the difference between blue and green, purple and red, and yellow and pink. It also makes colors look less bright. 48 | 49 | #### Complete color blindness 50 | 51 | If you have complete color blindness, you can't see colors at all. This is also called **monochromacy**, and it's quite uncommon. Depending on the type, you may also have trouble seeing clearly and you may be more sensitive to light. 52 | 53 | Source: 54 | 55 | ### Test your plots 56 | 57 | While not an exact science, there are some great tools to test how your visualizations may look to someone with these conditions. The one I like the most is Sim Daltonism. ([Download here](https://apps.apple.com/us/app/sim-daltonism/id693112260?mt=12)) 58 | 59 | However, this application only works on Macs, so if you have a Windows or Linux machine, I would recommend the program that was suggested in class. 60 | 61 | Here is an example of what the standard ggplot color palette looks like with different color vision deficiencies.
62 | 63 | #### Original: 64 | 65 | ```{r} 66 | iris %>% 67 | ggplot( 68 | aes(x = Sepal.Length, y = Petal.Width, color = Species) 69 | ) + 70 | geom_point() 71 | ``` 72 | 73 | 74 | Here is how this plot might look with the above conditions: 75 | 76 | ![](resources/color_deficiency_images/original-1.png){width="290"} ![](resources/color_deficiency_images/original-2.png){width="290"} 77 | ![](resources/color_deficiency_images/original-3.png){width="290"} ![](resources/color_deficiency_images/original-4.png){width="290"} 78 | ![](resources/color_deficiency_images/original-5.png){width="290"} ![](resources/color_deficiency_images/original-6.png){width="290"} 79 | ![](resources/color_deficiency_images/original-7.png){width="290"} ![](resources/color_deficiency_images/original-8.png){width="290"} 80 | 81 | **Ways to improve without changing color:** 82 | 83 | Increase Contrast: 84 | 85 | ```{r} 86 | iris %>% 87 | ggplot( 88 | aes(x = Sepal.Length, y = Petal.Width, color = Species) 89 | ) + 90 | geom_point() + 91 | theme_minimal() 92 | ``` 93 | 94 | Double Encoding: 95 | 96 | ```{r} 97 | iris %>% 98 | ggplot( 99 | aes(x = Sepal.Length, y = Petal.Width, color = Species, shape = Species) 100 | ) + 101 | geom_point() + 102 | theme_minimal() 103 | ``` 104 | 105 | **A Better Palette:** 106 | 107 | I really like the Okabe-Ito palette designed by Masataka Okabe and Kei Ito. 108 | 109 | (To learn more, read [here](https://jfly.uni-koeln.de/color/).) 110 | 111 | Here is a [good visual guide](https://mikemol.github.io/technique/colorblind/2018/02/11/color-safe-palette.html). 112 | 113 | To easily use this palette in R, I like the `ggokabeito` package, which can be downloaded from CRAN.
114 | 115 | Example with the new color palette: 116 | 117 | ```{r} 118 | iris %>% 119 | ggplot( 120 | aes(x = Sepal.Length, y = Petal.Width, color = Species) 121 | ) + 122 | geom_point() + 123 | scale_color_okabe_ito(order = c(3,7,9)) + 124 | theme_minimal() 125 | ``` 126 | 127 | 128 | 129 | ![](resources/color_deficiency_images/new-1.png){width="290"} ![](resources/color_deficiency_images/new-2.png){width="290"} 130 | ![](resources/color_deficiency_images/new-3.png){width="290"} ![](resources/color_deficiency_images/new-4.png){width="290"} 131 | ![](resources/color_deficiency_images/new-5.png){width="290"} ![](resources/color_deficiency_images/new-6.png){width="290"} 132 | ![](resources/color_deficiency_images/new-7.png){width="290"} ![](resources/color_deficiency_images/new-8.png){width="290"} 133 | 134 | 135 | ### Other considerations 136 | 137 | While it is very important to make sure that data science and visualizations are accessible to as many people as possible, these considerations for color extend beyond those with color vision deficiencies. Oftentimes the medium in which we present our visualizations can considerably impact how the color distinctions come across. Not every visualization is viewed on a large projector; some are distributed for individual viewing, while others may even be printed in black and white. Therefore it is very important to consider other techniques, such as double encoding, using shapes and textures, and using high-contrast colors, when creating visualizations.
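If you prefer to simulate color vision deficiencies from within R rather than with an external app, the `colorspace` package can help. A hedged sketch (assuming `colorspace` and `scales` are installed; `deutan()`, `protan()`, and `tritan()` simulate the corresponding deficiencies for a vector of hex colors):

```{r, eval=FALSE}
library(colorspace)

# Default ggplot2 discrete hues for three groups
pal <- scales::hue_pal()(3)

# Simulated versions of the same palette
rbind(original     = pal,
      deuteranopia = deutan(pal),
      protanopia   = protan(pal),
      tritanopia   = tritan(pal))

# Side-by-side swatches for a quick visual check
swatchplot(list(original = pal, deutan = deutan(pal), tritan = tritan(pal)))
```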
138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | -------------------------------------------------------------------------------- /file_ingestions.Rmd: -------------------------------------------------------------------------------- 1 | # Fantastic Data Files and How to Ingest Them 2 | 3 | Siyu Yang 4 | 5 | ```{r, include=FALSE} 6 | knitr::opts_chunk$set(echo = TRUE) 7 | library(tidyverse) 8 | library(readxl) 9 | library(rjson) 10 | library(XML) 11 | 12 | ``` 13 | 14 | ## Why This Tutorial? 15 | Whether in research, school projects or real-life work, the first step of data analysis is always data ingestion. Before whipping out `XGBoost` and `plotly` and diving into the fun part, the three- or four-letter suffix at the end of the data files we ingest is already giving us information about the data itself: where the data came from, how the data was compiled, and the type of analysis or transformations it may require. 16 | 17 | In this tutorial, I would like to introduce a few commonly used data file formats, why they look the way they look, and how to ingest each of them in `R`. Although the tutorial may not be highly technical, I hope my experience as a former consulting data scientist on the receiving end of many types of data (including a zip file of hundreds of `.pdf`s) can provide a perspective of some value. 18 | 19 | ### Comma Separated Value 20 | A Comma Separated Value file, also known as a `.csv` file, is a type of file where each line is a data record or a "row" in a tabular format, and each field or "column" is separated by commas. The history of the `.csv` format dates back even before the popularization of personal computers: it was originally supported by [IBM Fortran](https://blog.sqlizer.io/posts/csv-history/) in 1972, which has largely to do with the fact that Fortran programs were coded on punched cards, and `.csv` files were easier to create and less error prone.
It was quickly adopted by relational database systems to transfer data between systems and formats. 21 | 22 | #### Why are we happy when we receive CSVs? 23 | - Compact file size. CSVs only store the strings themselves and rely on the ingesting system to parse the delimiters and read the file into tabular format. 24 | - Structured in tabular form, which makes it easy to query and transform. 25 | - Converts easily between file and most database systems. Almost all popular database systems can directly translate CSV files to SQL. 26 | 27 | #### Ingesting csv into R 28 | Base R's `read.csv()` function provides the ability to directly ingest CSV files. 29 | ```{r} 30 | df <- read.csv('resources/file_ingestions/Iris.csv') 31 | glimpse(df) 32 | ``` 33 | You can also use `read_csv()` from the `tidyverse`, which provides faster ingestion when the file size is large. 34 | 35 | ```{r} 36 | df <- read_csv('resources/file_ingestions/Iris.csv') 37 | glimpse(df) 38 | ``` 39 | 40 | ### Excel 41 | More often than not, especially in the business world, we receive an Excel file in either `.xlsx` (default Excel 2007 format), `.xls` (old Excel format), or `.xlsm` (Excel workbook with macros enabled). I like to think of Excel as a "visual developing tool" -- users can visually inspect data organized in tabular form, and interactively define and create the value and format of every cell. It contains a lot more information than CSV files, and as a result often takes more space to store the same amount of data. In reality, an `xlsx` file is a [ZIP compressed archive](https://en.wikipedia.org/wiki/Microsoft_Excel) with a directory structure of XML text documents, the latter of which we will introduce in a later section. 42 | 43 | ![After adding formatting to the Iris file shown above, the Excel file takes up much more space than the csv file](resources/file_ingestions/fig1.png) 44 | 45 | #### What do we do when we receive .XLSX files?
46 | I would always recommend opening the file and inspecting it first. More often than not, an Excel worksheet is built to be worked on by humans, and as a result has a lot of visual or design elements that work well visually but do not play well with R data frames. Large spreadsheets often include multiple worksheets; real data often does not start on the first page, which is likely used as a title / index page. We also cannot assume column headers start in the first row of the spreadsheet -- analysts can leave header space in their spreadsheets, and a spreadsheet can have more than one header row. 47 | 48 | I often find myself manually cleaning the Excel spreadsheet, and using Excel's export option to output a `.csv` file for R ingestion. 49 | 50 | ![Excel worksheets can be very cumbersome and easy to break; hug your business school friends](resources/file_ingestions/fig2.png) 51 | #### Ingesting Excel files into R 52 | You can call the `read_excel()` function from the `readxl` library to import the file into R. 53 | 54 | ```{r} 55 | df <- read_excel('resources/file_ingestions/Iris.xlsx') 56 | glimpse(df) 57 | ``` 58 | ### JSON and XML 59 | If receiving a `.csv` file makes me assume the file is exported from a relational database, and receiving an `.xlsx` file leads me to assume "some poor analyst has worked on this", receiving `.json` or `.xml` almost always suggests that I am receiving raw data from some application, most likely web-native. 60 | 61 | `XML` was introduced [in the late 1990s](https://www.toptal.com/web/json-vs-xml-part-1) as a solution "to solve the problem of universal data interchange between dissimilar systems". [In April 2001](https://twobithistory.org/2017/09/21/the-rise-and-rise-of-json.html), Douglas Crockford and Chip Morningstar sent the first JSON message from a computer at Morningstar's Bay Area garage.
Both efforts were attempts by the industry to specify, with strict semantics, custom markup languages for any application, and to allow systems to talk to each other. The story of why JSON prevailed over XML is a fascinating piece of internet history -- where big enterprises tended to favor building overly complicated tools on top of XML, the simpler format of JSON was favored by more developers because of its simplicity and speed. 62 | 63 | ![More questions are asked about JSON than any other data exchange format. I do believe that the low percentage of .csv related questions speaks less to its lack of popularity, more to its simplicity](resources/file_ingestions/fig2.jpeg) 64 | 65 | #### Ingesting JSON and XML files into R 66 | You can call the `fromJSON()` function from the `rjson` library to import the file into R. 67 | 68 | ```{r} 69 | data <- fromJSON(file = 'resources/file_ingestions/sample.json') 70 | print(data) 71 | ``` 72 | ```{r} 73 | data_xml <- xmlParse(file = "resources/file_ingestions/sample.xml") 74 | print(data_xml) 75 | 76 | ``` 77 | #### Reading JSON and XML files in R 78 | 79 | Unlike CSV, JSON and XML files are not organized in a tabular format. They are organized in a tree format, and the tree structure can go beyond what can be represented in a two-dimensional layout. I can climb down the tree to explore each node: 80 | 81 | ```{r} 82 | print(data$quiz$maths) 83 | ``` 84 | 85 | When you reach a level where the data can be reasonably represented in a tabular format, you can use `xmlToDataFrame` to transform the xml file into a data frame.
86 | 87 | ```{r} 88 | dataframe <- xmlToDataFrame("resources/file_ingestions/sample.xml") 89 | print(dataframe) 90 | ``` 91 | 92 | 93 | 94 | -------------------------------------------------------------------------------- /appendix_pull_request_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | # Tutorial for pull request mergers 2 | 3 | ## General 4 | 5 | The following is a checklist of steps to perform before merging the pull request. At any point, if you're not sure what to do, request a review from one of the PR leaders. 6 | 7 | ## Check branch 8 | 9 | PR should be submitted from a **non-main** branch. 10 | 11 | 12 |
13 | 14 | If PR was submitted from the **main** branch, provide these instructions on how to fix the problem: 15 | 16 | 1. Close this PR. 17 | 18 | 2. Follow the instructions here for forgetting to branch if you committed and pushed to GitHub: https://edav.info/github#fixing-mistakes 19 | 20 | 3. If you have trouble with 2., then delete the local folder of the project, delete your fork on GitHub, and *start over.* 21 | 22 | 4. Open a new PR. 23 | 24 |
25 | 26 | 27 | 28 | 29 |
30 | 31 | ## Examine files that were added or modified 32 | 33 |

34 | 35 | - There should be only ONE `.Rmd` file. 36 | 37 | - All of the additional resources should be in the `resources//` folder. 38 | 39 | - There should be no other files in the root directory besides the `.Rmd` file. 40 | 41 | ## Check `.Rmd` filename 42 | 43 | - The `.Rmd` filename should be words only and joined with underscores, no white space. (Update: It does not need to be the same as the branch name.) 44 | - The `.Rmd` filename can only contain **lowercase letters**. (Otherwise the filenames do not sort nicely on the repo home page.) 45 | 46 | ## Check `.Rmd` file contents 47 | 48 | - The file should **not** contain a YAML header nor a `---` line. 49 | - The second line should be blank, followed by the author name(s). 50 | - The first line should start with a **single hashtag `#`**, followed by a **single whitespace**, and then the title. 51 | - There should be no additional single hashtag headers in the chapter. (If there are, new chapters will be created.) 52 | - Other hashtag headers should **not** be followed by numbers since the hashtags will create numbered subheadings. Correct: `## Subheading`. Incorrect: `## 3. Subheading`. 53 | - If the file contains a setup chunk in `.Rmd` file, it should **not** contain a `setup` label. (The bookdown render will fail if there are duplicate chunk labels.) 54 |
i.e. use `{r, include=FALSE}` instead of `{r setup, include=FALSE}`. 55 |
[See sample `.Rmd`](https://github.com/jtr13/cc21/blob/main/sample_project.Rmd) 56 | - Links to internal files must contain `resources//` in the path, such as: `![Test Photo](resources/sample_project/election.jpg)` 57 | - The file should not contain any `install.packages()`, `write` functions, `setwd()`, or `getwd()`. 58 | - If there's anything else that looks odd but you're not sure, assign `jtr13` to review and explain the issue. 59 | 60 | 61 | ## Request changes 62 | 63 | If there are problems with any of the checks listed above, explain why the pull request cannot be merged and request changes by following these steps: 64 | 65 | 66 | 67 | 68 | 69 | 70 | Then, add a `changes requested` label to this pull request. 71 | 72 | Your job for this pull request is done for now. Once contributors fix their requests, review again and either move forward with the merge or explain what changes still need to be made. 73 | 74 |
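The `.Rmd` content checks above lend themselves to a quick automated pre-screen. A hypothetical R helper (not part of the official workflow; the regular expression and filename below are illustrative only):

```{r, eval=FALSE}
# Hypothetical pre-screening helper for a submitted chapter.
check_rmd <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # calls disallowed by the checklist above
  pattern <- "install\\.packages\\(|setwd\\(|getwd\\(|write\\.(csv|table)\\("
  hits <- grep(pattern, lines)
  if (length(hits) > 0) {
    cat("Disallowed calls on line(s):", paste(hits, collapse = ", "), "\n")
  } else {
    cat("No disallowed calls found.\n")
  }
  invisible(hits)
}

check_rmd("submitted_chapter.Rmd")  # illustrative filename
```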
74 | 75 | ## Steps to Merge the PR 76 | 77 | Before we click "Merge", there are a few more things to do. 78 | 79 | ### Update the branch 80 | 81 | If an "Update Branch" button is visible toward the end of the Conversation tab of the pull request, click on it. This will ensure that we are working with the most up-to-date versions of `_bookdown.yml` and `DESCRIPTION`. 82 | 83 | Next we will make changes to these files on the contributor's branch. 84 | 85 | ### Add the filename of the chapter to `_bookdown.yml` 86 | 87 | - Go to "Files Changed" and copy the filename of the `.Rmd` file. 88 | 89 | 90 | - Open the branch of the submitted PR by following these steps: 91 | 92 | 93 | + To access the PR branch: 94 | 95 | 96 | + Make sure you are on the PR branch by checking that the PR branch name is shown (not `main`): 97 | 98 | 99 | 100 | - Add the name of the new file in single quotes followed by a comma under the labelled section (e.g., Cheatsheets, Tutorials, etc.). 101 | 102 | - Save the edited version. 103 | 104 | ### (Add part names to `.Rmd` for every first article in part) 105 | 106 | Only do this if you are adding the first chapter in a PART. 107 | 108 | One person should manage this, otherwise it will be hard to keep the project organized. 109 | 110 | For every first article of each part, add the chapter name at the top of the `.Rmd` file, then propose changes. The example is like this. 111 | 112 |

114 | 115 | ### Add new libraries to `DESCRIPTION`. 116 | 117 | - Check the `.Rmd` for libraries needed. If any are missing, add them to the `DESCRIPTION` file on the contributor's branch, in the same manner that we edited the `_bookdown.yml` file. 118 | 119 | 120 | ### Merge the pull request 121 | 122 | **If you're not sure that you did things correctly, assign one of the other maintainers or @jtr13 to review before you merge the PR.** 123 | 124 | - Return to the PR on the main page of the repo `www.github.com/jtr13/...` 125 | 126 | - If necessary, resolve merge conflicts by clicking on the resolve merge conflicts button: 127 | 128 | 129 | 130 | Then delete the lines with `<<<<<<< xxxx`, `=======` and `>>>>>>>> main` and edit the file as desired. Click the "Marked as resolved" button and then the green "Commit merge" button. 131 | 132 | - Click "Merge pull request" and then "Confirm merge". Add a thank you note, perhaps with an emoji such as `:tada:`. 133 | 134 | ### Check Actions 135 | 136 | - After a few minutes, click on the Actions tab and check whether the build has been successful: a green dot indicates a successful run, a red X indicates a failed run. 137 | 138 | - If the run failed, check the log to figure out what went wrong, and if you can, fix it. If you're not sure what to do, not a problem; just open up an issue linking to the failed run so others can help (this is important so we can fix problems quickly). (Do not click `revert merge`.) 139 | 140 | -------------------------------------------------------------------------------- /gradient_descent.Rmd: -------------------------------------------------------------------------------- 1 | # Gradient Descent Algorithm in R 2 | 3 | Cong Chen 4 | 5 | ```{r, include=FALSE} 6 | knitr::opts_chunk$set(echo = TRUE) 7 | ``` 8 | 9 | ## Motivation 10 | 11 | The Gradient Descent Algorithm is a popular algorithm in optimization.
However, I seldom see people apply the algorithm in R, so I apply the Gradient Descent Algorithm to a simple optimization problem in R. I also use three different step size rules when applying the Gradient Descent Algorithm and draw contour maps of the search paths to give a more comprehensive understanding of the differences between these methods. 12 | 13 | Through the project, I understood the principle of the algorithm more clearly and became skilled at applying it. I also realized how important choosing initial values is. Next time, I may try other step size rules. 14 | 15 | ## Gradient Descent Algorithm 16 | 17 | The Gradient Descent Algorithm is widely used in optimization. For a simple optimization problem: 18 | 19 | $$ 20 | \min_{x\in X}f(x) 21 | $$ 22 | where $X$ is the domain of the variable $x$, the algorithm's specific steps are shown below: 23 | 24 | 1. Choose the initial value of $x$, called $x^{(0)}$. 25 | 26 | 2. Choose the stopping tolerance of the algorithm, $\varepsilon$. 27 | 28 | 3. Select a specific step size rule to determine $\alpha^{(t)}$ for $t=0,1,...$. 29 | 30 | 4. For $t=0,1,...$, update: 31 | $$ 32 | x^{(t+1)}=x^{(t)}-\alpha^{(t)}\nabla f(x^{(t)}) 33 | $$ 34 | 5. If $\lVert x^{(t+1)}-x^{(t)} \rVert\leq \varepsilon$, stop the algorithm. 35 | 36 | ## Three common step size rules 37 | 38 | There are three common step size rules: 39 | 40 | 1. Fixed: $\alpha^{(t)}$ is a constant. 41 | 42 | 2. Exact line search 43 | 44 | 3. Backtracking line search 45 | 46 | The exact line search method finds the best $\alpha^{(t)}$ at each step $t$, that is: 47 | $$ 48 | \alpha^{(t)}=\arg\min_{\alpha\geq 0} f\left(x^{(t)}-\alpha\nabla f(x^{(t)})\right) 49 | $$ 50 | 51 | 52 | The backtracking line search method reduces the step size as the iteration count $t$ increases. 53 | 54 | ## A simple problem 55 | 56 | We apply these three methods to a simple optimization problem to find its optimal solution.
We also compare the differences between these three methods based on the results. 57 | 58 | A simple optimization problem: 59 | $$ 60 | \min_{x=(x_1,x_2)} f(x)\\ 61 | f(x)=\frac{10x_1^2+x_2^2}{2} 62 | $$ 63 | ```{r function} 64 | # function f(x) 65 | f<-function(x){ 66 | return((10*x[1]^2+x[2]^2)/2) 67 | } 68 | 69 | # gradient of f(x) 70 | grad<-function(x){ 71 | return(c(10*x[1],x[2])) 72 | } 73 | ``` 74 | 75 | In all three methods, the search begins at the point $x^{(0)}=(1.5,-1.5)$. 76 | 77 | ### Fixed step size 78 | 79 | ```{r fixed} 80 | fixstep<-function(x0, grad, tol = 1e-6, alpha = 0.01, max_iteration = 1000){ 81 | k = 1 82 | xf = x0 83 | xl = xf - alpha * grad(xf) 84 | x = c(x0,xl) 85 | while (sqrt(sum((xl - xf)^2)) > tol && k < max_iteration){ 86 | xf = xl 87 | xl = xf - alpha * grad(xf) 88 | k = k + 1 89 | x = c(x, xl) 90 | } 91 | return(list(iteration = k - 1, 92 | x1 = x[seq(1, 2 * k - 1, 2)], 93 | x2 = x[seq(2, 2 * k, 2)])) 94 | } 95 | ``` 96 | We try three different fixed step sizes to observe their different search processes: 97 | 98 | 1. Small: $\alpha = 0.01$. 99 | 100 | 2. Medium: $\alpha = 0.1$. 101 | 102 | 3. Large: $\alpha = 0.7$.
103 | ```{r plot contour for fixed step} 104 | # implement the method 105 | step1 = fixstep(c(1.5, -1.5), grad, alpha = 0.01) 106 | step2 = fixstep(c(1.5, -1.5), grad, alpha = 0.1) 107 | step3 = fixstep(c(1.5, -1.5), grad, alpha = 0.7, max_iteration = 50) 108 | 109 | z = matrix(0, 100, 100) 110 | x1 = seq(-1.5, 1.5, length = 100) 111 | x2 = seq(-1.5, 1.5, length = 100) 112 | 113 | # store function value for every grid point 114 | for(i in 1:100){ 115 | for(j in 1:100){ 116 | z[i,j] = f(c(x1[i],x2[j])) 117 | } 118 | } 119 | 120 | # plot contour map 121 | contour(x1, x2, z, nlevels=20) 122 | for(i in 1:step1$iteration){ 123 | segments(step1$x1[i],step1$x2[i],step1$x1[i+1],step1$x2[i+1],lty=2,col='red') 124 | } 125 | for(i in 1:step2$iteration){ 126 | segments(step2$x1[i],step2$x2[i],step2$x1[i+1],step2$x2[i+1],lty=2,col='blue') 127 | } 128 | for(i in 1:step3$iteration){ 129 | segments(step3$x1[i],step3$x2[i],step3$x1[i+1],step3$x2[i+1],lty=2,col='green') 130 | } 131 | ``` 132 | 133 | The small $\alpha$ (red line in the graph) takes more iterations to find the optimal solution, while the large $\alpha$ (green line in the graph) leads to huge oscillations in the search, which make it difficult to find the optimal solution.
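The behavior of the three fixed step sizes can also be checked analytically. For this quadratic, the gradient step is a linear map:

$$
x^{(t+1)}=x^{(t)}-\alpha\nabla f(x^{(t)})=(I-\alpha A)x^{(t)},\qquad A=\begin{pmatrix}10 & 0\\ 0 & 1\end{pmatrix},
$$

so $x_1^{(t+1)}=(1-10\alpha)\,x_1^{(t)}$ and $x_2^{(t+1)}=(1-\alpha)\,x_2^{(t)}$. Convergence requires $|1-10\alpha|<1$ and $|1-\alpha|<1$, that is, $0<\alpha<0.2$. For $\alpha=0.7$ the $x_1$ factor is $1-7=-6$, so that coordinate flips sign and grows at every step, which is exactly the oscillating green path; for $\alpha=0.01$ the factors $0.9$ and $0.99$ are close to $1$, which explains the slow red path.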
134 | 135 | ### Exact line search 136 | 137 | ```{r exact line search} 138 | exactline<-function(x0, grad, tol = 1e-6, max_iteration = 1000){ 139 | A = matrix(c(10, 0, 0, 1), ncol = 2) 140 | k = 1 141 | xf = x0 142 | r = matrix(-grad(xf), ncol = 1) 143 | alpha = as.numeric((t(r)%*%r)/(t(r)%*%A%*%r)) 144 | xl = xf - alpha * grad(xf) 145 | x = c(x0,xl) 146 | while (sqrt(sum((xl - xf)^2)) > tol && k < max_iteration){ 147 | xf = xl 148 | r = matrix(-grad(xf), ncol = 1) 149 | alpha = as.numeric((t(r)%*%r)/(t(r)%*%A%*%r)) 150 | xl = xf - alpha * grad(xf) 151 | k = k + 1 152 | x = c(x, xl) 153 | } 154 | return(list(iteration = k - 1, 155 | x1 = x[seq(1, 2 * k - 1, 2)], 156 | x2 = x[seq(2, 2 * k, 2)])) 157 | } 158 | ``` 159 | 160 | 161 | ```{r plot contour for exact line search} 162 | # implement the method 163 | step1 = exactline(c(1.5, -1.5),grad) 164 | 165 | z = matrix(0, 100, 100) 166 | x1 = seq(-1.5, 1.5, length = 100) 167 | x2 = seq(-1.5, 1.5, length = 100) 168 | 169 | # store function value for every grid point 170 | for(i in 1:100){ 171 | for(j in 1:100){ 172 | z[i,j] = f(c(x1[i],x2[j])) 173 | } 174 | } 175 | 176 | # plot contour map 177 | contour(x1, x2, z, nlevels=20) 178 | for(i in 1:step1$iteration){ 179 | segments(step1$x1[i],step1$x2[i],step1$x1[i+1],step1$x2[i+1],lty=2,col='red') 180 | } 181 | ``` 182 | 183 | Although every step of exact line search is optimal, each iteration is more expensive than in the other methods, because the exact step size must be computed by solving an inner minimization problem.
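The closed-form step size used in the code above follows from minimizing $f$ along the search direction. Writing $r=-\nabla f(x)=-Ax$ for our quadratic $f(x)=\frac{1}{2}x^\top A x$, the inner line search problem is

$$
\varphi(\alpha)=f(x+\alpha r)=f(x)-\alpha\, r^\top r+\frac{\alpha^2}{2}\, r^\top A r,
$$

and setting $\varphi'(\alpha)=-r^\top r+\alpha\, r^\top A r=0$ gives

$$
\alpha^\star=\frac{r^\top r}{r^\top A r},
$$

which is exactly the `alpha = (t(r)%*%r)/(t(r)%*%A%*%r)` expression in the code.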
184 | 185 | ### Backtracking Line Search 186 | 187 | ```{r backtracking line search} 188 | backtracking<-function(x0, grad, tol = 1e-6, alpha = 0.01, beta = 0.8,max_iteration = 1000){ 189 | k = 1 190 | xf = x0 191 | xl = xf - alpha * grad(xf) 192 | x = c(x0, xl) 193 | while (sqrt(sum((xl - xf)^2)) > tol && k < max_iteration){ 194 | xf = xl 195 | alpha = beta * alpha 196 | xl = xf - alpha * grad(xf) 197 | k = k + 1 198 | x = c(x, xl) 199 | } 200 | return(list(iteration = k - 1, 201 | x1 = x[seq(1, 2 * k - 1, 2)], 202 | x2 = x[seq(2, 2 * k, 2)])) 203 | } 204 | ``` 205 | Choose the initial $\alpha = 0.2$ and the factor $\beta = 0.8$: 206 | ```{r plot} 207 | # implement the method 208 | step1 = backtracking(c(1.5, -1.5), grad, alpha = 0.2, beta = 0.8) 209 | 210 | z = matrix(0,100,100) 211 | x1 = seq(-1.5, 1.5, length = 100) 212 | x2 = seq(-1.5, 1.5, length = 100) 213 | 214 | # store function value for every grid point 215 | for(i in 1:100){ 216 | for(j in 1:100){ 217 | z[i,j] = f(c(x1[i], x2[j])) 218 | } 219 | } 220 | 221 | # plot contour map 222 | contour(x1, x2, z, nlevels=20) 223 | for(i in 1:step1$iteration){ 224 | segments(step1$x1[i],step1$x2[i],step1$x1[i+1],step1$x2[i+1],lty=2,col='red') 225 | } 226 | ``` 227 | 228 | A large initial $\alpha$ can be used in backtracking line search, since after several steps $\alpha$ becomes very small. (Note that this simplified implementation shrinks $\alpha$ by the factor $\beta$ at every iteration, rather than only when a sufficient-decrease condition fails.)
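For reference, textbook backtracking line search differs slightly from the simplified decay rule implemented above: at each iteration it starts from a large trial step and shrinks it only until a sufficient-decrease (Armijo) condition holds,

$$
f\left(x^{(t)}-\alpha\nabla f(x^{(t)})\right)\leq f(x^{(t)})-c\,\alpha\,\lVert\nabla f(x^{(t)})\rVert^2,
$$

where $c\in(0,1)$ is a small constant (often $c=10^{-4}$), and $\alpha$ is replaced by $\beta\alpha$ as long as the condition fails.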
-------------------------------------------------------------------------------- /wordclouds_video_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | # Wordclouds: a video tutorial on how to make wordclouds from raw text 2 | 3 | Jianxun Guo and Longyuan Gao 4 | 5 | ```{r} 6 | library(stringi) # install.packages("stringi") 7 | library(dplyr) 8 | #library(devtools) 9 | #install_github("lchiffon/wordcloud2") 10 | library(wordcloud2) # Must be installed from source 11 | ``` 12 | ## Introduction 13 | Wordclouds are aggregations of the words that appear in a text, with each word sized by its frequency. Although word clouds are less used in rigorous EDAV, they are an eye-catching and powerful tool for presenting ideas and information to a general audience. In this project, we provide a short yet fairly comprehensive video tutorial that covers the R library wordcloud2, and we also provide several methods for an often-ignored issue: converting raw text to word-count data frames that can be fed directly into wordcloud2 functions. After watching this tutorial, we believe one can easily generate various kinds of word clouds from raw texts in R. 14 | 15 | ## Link to YouTube video 16 | ![Cover](resources/wordclouds_video_tutorial/cover.png) 17 | [Link](https://youtu.be/GXB54_iGWFc) 18 | 19 | ## Slides 20 | [The slides are at this link, and are also in the resources folder](https://drive.google.com/drive/folders/1uCz7hWf-uUpV-SlDNEExPw0Hd9baPIuz?usp=share_link) 21 | 22 | ## Code Examples 23 | Here is the example we are going to use in this tutorial, which is from https://www.gsas.columbia.edu/content/academic-integrity-and-responsible-conduct-research, Academic Integrity and Responsible Conduct of Research, Columbia University.
24 | ```{r} 25 | code <- "Students should be aware that academic dishonesty (for example, plagiarism, cheating on an 26 | examination, or dishonesty in dealing with a faculty member or other University official) or 27 | the threat of violence or harassment are particularly serious offenses and will be dealt 28 | with severely under Dean's Discipline. Graduate students are expected to exhibit the high level of personal and academic integrity 29 | and honesty required of all members of an academic community as they engage in scholarly 30 | discourse and research. Scholars draw inspiration from the work done by other scholars; they argue their claims with 31 | reference to others’ work; they extract evidence from the world or from earlier scholarly 32 | works. When a student engages in these activities, it is vital to credit properly the source 33 | of his or her claims or evidence. Failing to do so violates one’s scholarly responsibility. In practical terms, students must not cheat on examinations, and deliberate plagiarism is of 34 | course prohibited. Plagiarism includes buying, stealing, borrowing, or otherwise obtaining 35 | all or part of a paper (including obtaining or posting a paper online); hiring someone to 36 | write a paper; copying from or paraphrasing another source without proper citation or 37 | falsification of citations; and building on the ideas of another without citation. " 38 | ``` 39 | ### First step: from raw text to word count dataframe 40 | Since here we assign the string directly, we do not need to do the readline from file step. 
However, if you need to do so, you can do it by 1) reading the lines from the file and 2) converting the read lines into one long string: 41 | ```{r} 42 | #code <- readLines(".\\conductCode.txt") 43 | #code <- paste(code, collapse = " ") 44 | ``` 45 | Next, we use regular expressions to extract all individual words and convert them to lower case, so that the same word with different capitalization is counted together: 46 | ```{r} 47 | code <- stri_extract_all_regex(code, "(\\w+[[:punct:]]\\w+)|(\\b\\w+\\b)") 48 | code <- lapply(code, tolower) 49 | ``` 50 | Notice that here we use regular expressions to extract all words either with no special characters in them or with special characters within them, like "one's", "that's", etc. You can also change the regex to catch more possible words. Also notice that there are many other libraries that can do the same thing. 51 | 52 | Next, we convert the extracted words to a frequency table and then to a data frame (so that we can feed the data frame directly into wordcloud functions): 53 | ```{r} 54 | code <- table(code) 55 | code <- data.frame(code) 56 | head(code) 57 | ``` 58 | 59 | Notice that there are some function words like "in", "and", "or" that make up a good portion of the frequency table. You can consider filtering some of these function words out: 60 | ```{r} 61 | filterwords <- c("and", "a", "of","or", "is","are","in","from","the","on","at","not","an","as") # list of unwanted words 62 | filterwords <- paste(c("^",paste(filterwords,collapse = "$|^"),"$"),collapse = "") # glue words together, forming a regex of exact matches 63 | code <- code %>% dplyr::filter(!grepl(filterwords,code)) 64 | head(code) 65 | ``` 66 | ### Step 2: from word counts to wordclouds using wordcloud2 67 | First, we explore some functionalities of the function 68 | wordcloud2.
The data is the word count data frame: 69 | 70 | #### wordcloud2 function 71 | 72 | ```{r} 73 | wordcloud2(data = code, 74 | color = "random-light", 75 | fontFamily = "Times New Roman", 76 | shape="cardioid") 77 | ``` 78 | There are several parameters you can tune: 79 | 80 | * Parameters 81 | + color: color theme of the wordcloud 82 | + min/maxRotation: if words rotate, the range (in radians) over which a word will rotate 83 | + shape: the shape of the wordcloud 84 | + figPath: path to the figure used for masking the wordcloud 85 | + rotateRatio: the proportion of words to be rotated (ranging from 0 to 1) 86 | + ... 87 | 88 | You can find more detailed descriptions on the [wordcloud2 CRAN site](https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html). The parameters allow for a great variety of wordclouds. 89 | 90 | Here is another example using rotation: 91 | ```{r} 92 | wordcloud2(data = code, 93 | minRotation = pi/4, 94 | maxRotation = pi/3, 95 | rotateRatio = 1) 96 | 97 | ``` 98 | Here is an example with figPath set. The figure will define the shape of the generated wordcloud: 99 | 100 | **Important note**: There is an inherent bug in the wordcloud2 library. When figPath is set or we use the letterCloud function, the generated wordcloud may not render. Possible fixes are to run this code in an R script and open the wordcloud in a browser. Alternatively, you can hit the refresh button in the R viewer until the image appears. The logo is from [Wikipedia](https://commons.wikimedia.org/wiki/File:Apple_logo_black.svg) 101 | 102 | 103 | ```{r} 104 | wordcloud2(data = code, figPath = "resources/wordclouds_video_tutorial/Apple_logo_black.png.png") 105 | ``` 106 | 107 | **Rendered results are in the slides.** 108 | 109 | #### letterCloud function 110 | This function allows for wordclouds masked by a letter or a word. In this function we must set the "word" argument. 111 | Some parameters we can tune: 112 | 113 | * Parameters 114 | + word: a required argument.
Creates the shape for the wordcloud 115 | + wordSize: default=2. Size of the masking word 116 | + letterFont: set the font 117 | + Other parameters are the same as for the wordcloud2 function. 118 | 119 | Here is an example with "E" masking: 120 | 121 | **The aforementioned bug may prevent this from rendering. Please see the slides for a rendered image.** 122 | 123 | ```{r} 124 | letterCloud(code, 125 | word = "E", 126 | wordSize = 2) 127 | ``` 128 | 129 | Here is an example with "EDAV" word masking: 130 | 131 | **The aforementioned bug may prevent this from rendering. Please see the slides for a rendered image.** 132 | 133 | ```{r} 134 | letterCloud(code, 135 | word = "EDAV", 136 | wordSize = 2) 137 | 138 | ``` 139 | 140 | ## Reflections 141 | From this project, we consolidated our R programming skills. More specifically, we learned many ways to deal with raw text in R, such as extracting English words from the text using regex. We also mastered the wordcloud functions and parameters in detail. 142 | Regarding what we might do differently next time, we will probably spend more time dealing with the inherent bugs in the library, like the one that makes rendered word clouds disappear, and finding ways to overcome or circumvent them. 143 | -------------------------------------------------------------------------------- /sandbox.Rmd: -------------------------------------------------------------------------------- 1 | # Testing 2 | 3 | ```{r, include=FALSE} 4 | knitr::opts_chunk$set(echo = TRUE) 5 | library(tidyverse) 6 | ``` 7 | 8 | ```{r} 9 | # install.packages("ggmap") 10 | library(ggmap) 11 | # You will also need to activate an API key! Use a line like the following: 12 | # register_google("") 13 | ``` 14 | 15 | ```{r, include=FALSE} 16 | # New API key added 2022-10-18 17 | # Why isn't this working!?
18 | register_google(Sys.getenv("API_TOKEN")) 19 | ``` 20 | 21 | ## Link to Video Tutorial 22 | 23 | https://www.youtube.com/watch?v=AKEEAaCbtog 24 | 25 | ## Getting Started 26 | 27 | This demonstration will show you how to make static maps within the tidyverse framework using ggmap. Before we get started, make sure you have ggmap installed and ready to go, since you may not have it installed by default. 28 | 29 | One thing that can be somewhat pesky when working with ggmap is Google Maps. As of 2018, this service requires an API key to use. To access the full features of this demonstration, you will have to register a Google API key. Details on how to do this can be found in the help file for 'register_google'. 30 | 31 | **Note: If you wish to run the code found in this file, you will need to register a Google API key!** 32 | 33 | ## Why ggmap? 34 | 35 | ggmap is useful for our demonstration, since it uses syntax that you may be familiar with already! ggmap is designed to be quite compatible with ggplot2 (in fact, Hadley Wickham, one of the authors of ggplot2, is a co-author!). Much of ggmap is, under the hood, composed of ggplot2 components. ggmap, however, features the added benefit of abstracting away many of the more involved components of setting up a ggplot2 visual with maps, and will doubtless prove easier to use than base ggplot2 alone. ggmap also has many options for customizing its resulting visuals, though we will not explore many of these options in depth. 36 | 37 | It should be noted that ggmap is an excellent tool for *static* maps. If you are looking for more dynamic maps that provide more potential for user interaction, you may find other libraries, such as leaflet, that better suit your needs. 38 | 39 | ## Dataset for Demonstration 40 | 41 | For our demonstration, we will utilize data on public WiFi locations in New York City.
The dataset can be found here: 42 | https://data.cityofnewyork.us/City-Government/NYC-Wi-Fi-Hotspot-Locations/yjub-udmw . We will save this dataset to 'wifi_data': 43 | 44 | ```{r} 45 | wifi_data <- read.csv("https://data.cityofnewyork.us/api/views/yjub-udmw/rows.csv?accessType=DOWNLOAD") 46 | head(wifi_data) 47 | ``` 48 | 49 | ## Creating Maps (No plotting) 50 | 51 | Before we dive into how to overlay datapoints and other visuals over our graphs, we'll start by going over how to create maps in the first place. 52 | 53 | ### By location name 54 | 55 | The first, and perhaps more approachable way we can plot a map is by simply specifying the name of the location we'd like to see. This works through the geocode function. Here is an example with the city of Paris, France: 56 | 57 | ```{r} 58 | geocode("Paris") 59 | ``` 60 | 61 | If you want to verify these coordinates, you can try for yourself! Simply plug 48.85661, 2.352222 into Google Maps, and you will see that this is indeed the location of Paris, France (or, a location roughly in the center of Paris). revgeocode and mapdist are two other useful functions to do the inverse of geocode and calculate distances, respectively. 62 | 63 | To generate a map using a location name, we run two steps. First we use get_map(), specifying the location name. This gives a ggmap object. Second, we generate a ggplot2-compatible plot, using ggmap() and providing our ggmap object from step 1 as the input. To demonstrate our location lookup works with more abstract names, let us try finding the Parthenon, in Athens: 64 | 65 | ```{r} 66 | parthenon <- get_map(location = "parthenon, athens") 67 | parthenon_map <- ggmap(parthenon) 68 | parthenon_map 69 | ``` 70 | 71 | Clearly the Parthenon cannot be viewed from space. Let us adjust some variables to give us a clearer view. First, we will adjust the 'zoom' parameter. By default, get_map will select this for us, but we can specify it manually. 
Here, it is initially set to 10, which is about right for a city. Let us try 16 instead: 72 | 73 | ```{r} 74 | parthenon <- get_map(location = "parthenon, athens", zoom = 16) 75 | parthenon_map <- ggmap(parthenon) 76 | parthenon_map 77 | ``` 78 | 79 | That is better. The style of this map may be a little distracting, but we will show how to change this later. 80 | 81 | ### By geographic coordinates 82 | 83 | Another way to plot a map is with geographic coordinates directly. Here, we will input the coordinates for the Grand Canyon: 84 | 85 | ```{r} 86 | grand_canyon <- get_map(location = c(-112, 36.1), zoom = 9) 87 | grand_canyon_map <- ggmap(grand_canyon) 88 | grand_canyon_map 89 | ``` 90 | 91 | 92 | ### Using shape files 93 | 94 | Shape files are also an option for plotting maps with ggmap, but we will not cover them here. 95 | 96 | ## Quick Map Plot 97 | 98 | In the event that you would like a quick visual plotted, given a set of latitude and longitude points that you would like to superimpose on a map, there is an easy option in the form of qmplot. Here is an example, using our sample dataset: 99 | 100 | ```{r} 101 | qmplot(data=wifi_data, x = Longitude, y = Latitude, color = I('Blue'), alpha = I(0.05)) 102 | ``` 103 | 104 | If you are looking for a quick visual, this may suffice in some cases. 105 | 106 | ## Plotting by Overlaying ggplot2 Visuals 107 | 108 | In most cases, you will want more elaborate visuals, which may be easier to accomplish with ggplot2 and ordinary ggmaps. Here are some examples: 109 | 110 | ### Scatterplot 111 | 112 | Let us start with a basic plotting. Note that 'lat' and 'lon' are the expected names of the x and y values. 113 | 114 | ```{r} 115 | wifi_data_renamed <- mutate(wifi_data, lat = Latitude, lon = Longitude) 116 | nyc <- get_map(location = "manhattan", zoom = 12) 117 | ggmap(nyc) + 118 | geom_point(data = wifi_data_renamed, aes(color = Type), alpha=.2) 119 | ``` 120 | 121 | We see two issues. 
First, the legend could be inset at the top left. This can be modified within our ggmap. Second, our map colors may be confusing, given that parks and water are the same colors as several of our categories. Let us correct these issues: 122 | 123 | ```{r} 124 | nyc <- get_map(location = "manhattan", zoom = 12, source = 'stamen', maptype = "toner") 125 | ggmap(nyc, extent = 'device', legend = "topleft") + 126 | geom_point(data = wifi_data_renamed, aes(color = Type), alpha=.2) 127 | ``` 128 | 129 | We used arguments extent = 'device' and legend = "topleft" to move our legend, and source = 'stamen' and maptype = "toner" to adjust our map's appearance. 130 | 131 | ### Heatmap with Rectangular Bins 132 | 133 | In the event of overplotting, a rectangular heatmap can be used. The below graph shows a rectangular heatmap for the area of manhattan north of central park. 134 | 135 | ```{r} 136 | nyc <- get_map(location = c(-73.96, 40.8), zoom = 14, source = 'stamen', maptype = "toner") 137 | ggmap(nyc, extent = 'device', legend = "topleft") + 138 | geom_bin_2d(data = wifi_data_renamed, aes(color = Type, fill = Type), bins = 40, alpha=.7) 139 | ``` 140 | 141 | ### Contour Plots 142 | 143 | One final visual of interest is the contour plot. Here is an example of one: 144 | 145 | ```{r} 146 | nyc <- get_map(location = "manhattan", zoom = 12, color = 'bw') 147 | ggmap(nyc, extent = 'device', legend = "topleft") + 148 | geom_density_2d_filled(data = wifi_data_renamed, alpha = 0.3) 149 | ``` 150 | 151 | ### Overlaying Routes 152 | 153 | One other neat feature of ggmap is the ability to superimpose routes. 
Here is a map that shows the available WiFi hotspots near a walking route from Columbia to Central Park: 154 | 155 | ```{r} 156 | to_cp <- route(from = 'Columbia University', to = 'Frederick Douglass Circle', mode = 'walking') 157 | nyc <- get_map(location = c(-73.96, 40.803), zoom = 16, color = 'bw') 158 | ggmap(nyc) + 159 | geom_leg(aes(x = start_lon, y = start_lat, xend = end_lon, yend = end_lat), size = 2, color = 'blue', data=to_cp) + 160 | geom_point(data = wifi_data_renamed, color = 'red') 161 | ``` 162 | -------------------------------------------------------------------------------- /ggfortify_and_autoplotly_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | # Tutorials for ggfortify and autoplotly 2 | 3 | Yujia Chen 4 | 5 | ```{r, include=FALSE} 6 | knitr::opts_chunk$set(echo = TRUE) 7 | ``` 8 | 9 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 10 | library(autoplotly) 11 | library(ggplot2) 12 | library(ggfortify) 13 | library(cluster) 14 | library(changepoint) 15 | library(KFAS) 16 | library(strucchange) 17 | library(splines) 18 | ``` 19 | 20 | ### Motivation 21 | 22 | The first time I used the `ggfortify` package was to draw a time series graph in the `ggplot2` style. The `ggfortify` package has a very easy-to-use and uniform programming interface that enables users to visualize statistical results with one line of code, using `ggplot2` as building blocks. The `ggfortify` package extends `ggplot2` to plot results from some popular **R** packages using a standardized approach, encapsulated in the function `autoplot()`. 23 | 24 | The `autoplotly` package provides functionalities to automatically generate interactive visualizations for many popular statistical results supported by the `ggfortify` package, in the plotly and `ggplot2` styles. 25 | 26 | For example, we can hover the mouse over each point in the plot to see more details, such as principal components information and the species this particular data point belongs to.
We can also use the interactive selector to drag and select an area to zoom in, and zoom out by double clicking anywhere in the plot. 27 | 28 | `ggfortify` and `autoplotly` can be widely used in statistical analysis and can be seen as powerful additions to the `ggplot2` package. 29 | 30 | 31 | ### Examples 32 | 33 | #### Linear Models 34 | 35 | The `autoplotly()` function is able to interpret `lm` fitted model objects and allows the user to 36 | select the subset of desired plots through the `which` parameter, which specifies the subplots to display, for example: 37 | 38 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 39 | autoplotly( 40 | lm(Petal.Width ~ Petal.Length, data = iris), 41 | which = c(4, 6), colour = "dodgerblue3", 42 | smooth.colour = "black", smooth.linetype = "dashed", 43 | ad.colour = "blue", label.size = 3, label.n = 5, 44 | label.colour = "blue") 45 | ``` 46 | 47 | #### Principal Component Analysis 48 | 49 | The `autoplotly()` function works for the two essential classes of objects for principal component analysis (PCA) obtained from the `stats` package: `stats::prcomp` and `stats::princomp`, for example: 50 | 51 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 52 | pca <- autoplotly(prcomp(iris[c(1, 2, 3, 4)]), data = iris, 53 | colour = 'Species', frame = TRUE) 54 | pca 55 | ``` 56 | 57 | #### Clustering 58 | 59 | The `autoplotly` package also supports the `cluster::clara`, `cluster::fanny`, `cluster::pam` as well as `cluster::silhouette` classes. It automatically infers the object type and generates interactive plots of the results from those packages with a single function call. Visualization of the convex hull of each cluster in the `iris` data set: 60 | 61 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 62 | autoplotly(fanny(iris[-5], 3), frame = TRUE) 63 | ``` 64 | 65 | By specifying frame, we are able to draw boundaries of different shapes.
The different frame types can be found in `ggplot2::stat_ellipse`’s `type` keyword via `frame.type` option. Visualization of probability ellipse for `iris` data set: 66 | 67 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 68 | autoplotly(pam(iris[-5], 3), frame = TRUE, frame.type = 'norm') 69 | ``` 70 | 71 | #### Forecasting 72 | 73 | Forecasting packages such as `forecast`, `changepoint`, `strucchange`, and `KFAS`, are popular choices for statisticians and researchers. Interactive visualizations of predictions and statistical results from those packages can be generated automatically using the functions provided by `autoplotly` with the help of `ggfortify`. 74 | 75 | The `changepoint` package provides a simple approach for identifying shifts in mean and/or variance in a time series. `ggfortify` supports `cpt` object in `changepoint` package. 76 | 77 | Visualization of the change points with optimal positioning for the `AirPassengers` data set found in the `changepoint` package using the `cpt.meanvar` function: 78 | 79 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 80 | autoplotly(cpt.meanvar(AirPassengers)) 81 | ``` 82 | 83 | Visualization of the original and smoothed line in `KFAS` package: 84 | 85 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 86 | model <- SSModel( 87 | Nile ~ SSMtrend(degree=1, Q=matrix(NA)), H=matrix(NA) 88 | ) 89 | 90 | fit <- fitSSM(model=model, inits=c(log(var(Nile)),log(var(Nile))), method="BFGS") 91 | smoothed <- KFS(fit$model) 92 | autoplotly(smoothed) 93 | ``` 94 | 95 | Visualization of filtered result and smoothed line: 96 | 97 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 98 | trend <- signal(smoothed, states="trend") 99 | filtered <- KFS(fit$model, filtering="mean", smoothing='none') 100 | p <- autoplot(filtered) 101 | autoplotly(trend, ts.colour = 'blue', p = p) 102 | ``` 103 | 104 | Visualization of optimal break points where possible structural changes happen in the regression models built by `strucchange::breakpoints`: 
105 | 106 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 107 | autoplotly(breakpoints(Nile ~ 1), ts.colour = "blue", ts.linetype = "dashed", 108 | cpt.colour = "dodgerblue3", cpt.linetype = "solid") 109 | ``` 110 | 111 | #### Splines 112 | 113 | The `autoplotly` can also automatically generate interactive plots for results produced by **splines**. 114 | 115 | B-spline basis points visualization for natural cubic spline with boundary knots produced from `splines::ns`: 116 | 117 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 118 | autoplotly(ns(ggplot2::diamonds$price, df = 6)) 119 | ``` 120 | 121 | ### Extensibility and Composability 122 | 123 | The plots generated using `autoplotly()` can be easily extended by applying additional `ggplot2` elements or components. For example, we can add title and axis labels to the originally generated plot using `ggplot2::ggtitle` and `ggplot2::labs`: 124 | 125 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 126 | autoplotly( 127 | prcomp(iris[c(1, 2, 3, 4)]), data = iris, colour = 'Species', frame = TRUE) + 128 | ggplot2::ggtitle("Principal Components Analysis") + 129 | ggplot2::labs(y = "Second Principal Component", x = "First Principal Component") 130 | ``` 131 | 132 | Similarly, we can add additional interactive components using `plotly`. 
The following example adds a custom `plotly` annotation element placed at the center of the plot with an arrow: 133 | 134 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 135 | p <- autoplotly(prcomp(iris[c(1, 2, 3, 4)]), data = iris, 136 | colour = 'Species', frame = TRUE) 137 | 138 | p %>% plotly::layout(annotations = list( 139 | text = "Example Content", 140 | font = list( 141 | family = "Courier New, monospace", 142 | size = 18, 143 | color = "black"), 144 | x = 0, 145 | y = 0, 146 | showarrow = TRUE)) 147 | ``` 148 | 149 | We can also stack multiple plots generated from `autoplotly()` together in a single view using `subplot()`; in the following example, two interactive spline visualizations with different degrees of freedom are stacked into one view: 150 | 151 | ```{r echo=TRUE, message=FALSE, warning=FALSE} 152 | subplot( 153 | autoplotly(ns(ggplot2::diamonds$price, df = 6)), 154 | autoplotly(ns(ggplot2::diamonds$price, df = 3)), nrows = 2, margin = 0.03) 155 | ``` 156 | 157 | This tutorial provides a brief introduction to interactive data visualization tools. As we can see, `ggfortify` and `autoplotly` generate interactive visualizations for many popular statistical results. I learned how to build beautiful visualizations of statistical results with concise code. In the future, I will try to plot more statistical models with `ggfortify` and `autoplotly`. 158 | 159 | 160 | ### References 161 | 162 | 1. http://www.sthda.com/english/wiki/ggfortify-extension-to-ggplot2-to-handle-some-popular-packages-r-software-and-data-visualization 163 | 2. https://github.com/sinhrks/ggfortify 164 | 3. https://terrytangyuan.github.io/2018/02/12/autoplotly-intro/ 165 | 4. https://cran.r-project.org/web/packages/ggfortify/vignettes/basics.html 166 | 5. https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_ts.html 167 | 6.
https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html 168 | 169 | -------------------------------------------------------------------------------- /introduction_to_plotly.Rmd: -------------------------------------------------------------------------------- 1 | # Introduction to Plotly in R 2 | 3 | Jingqi Huang, Yi Lu 4 | 5 | ## Overview 6 | 7 | As we learned exploratory data analysis throughout this semester, we discovered various R packages and techniques that can help us perform analysis and conduct research on large datasets: for example, histograms and boxplots for one-dimensional continuous variables, and scatterplots and heatmaps for two continuous variables. However, in addition to static graphs, interactive ones also play an important part in the modern standard of data analysis and research. To help the Columbia Data Science community stay holistically equipped, our team decided to bring this topic to the table. One of the packages that implements interactive plots is plotly, an R package for creating interactive publication-quality graphs. https://plotly.com/r/ provides very rich examples of the package in R under various tags. We want to extract the most basic logic of using plotly and combine the most relevant examples on the same page. 8 | 9 | ## Install and Packages 10 | ```{r, results='hide', warning=FALSE, message=FALSE} 11 | # install.packages("ggplot2") 12 | # install.packages("plotly") 13 | library(ggplot2) 14 | library(plotly) 15 | library(dplyr) 16 | library(gapminder) 17 | ``` 18 | 19 | ## Basic Grammar 20 | The basic grammar of plotly is simple. If type or mode are not specified, plotly defaults to whatever makes the most sense for the data.
21 | ```{r, eval=FALSE} 22 | p <- plot_ly(dataframe, x=~column, y=~column2, type="graph type such as scatter, bar, box, heatmap, etc.", mode="mode type such as markers, lines, etc.") 23 | p <- p %>% add_trace() 24 | p <- p %>% add_annotations() 25 | p <- p %>% layout() 26 | p 27 | ``` 28 | Plotly graphs are described by two components: traces and layout. Multiple parts of the same graph can be added with add_trace(), each with its own configuration. Multiple text labels can be placed with add_annotations() by specifying locations with xref and yref. Axes and the title can be set with layout(). Plotly can also convert a ggplot graph to an interactive one by wrapping the ggplot object p with ggplotly(p). 29 | 30 | With these in mind, let's go through the package together. 31 | 32 | ## Basic examples 33 | ### Histogram 34 | ```{r} 35 | df <- data.frame(type=rep(c("A", "B"), each=500), subtype=rep(c("1", "2"), each=250), value=rnorm(1000)) 36 | hist <- ggplot(df, aes(x=value, fill=subtype))+ 37 | geom_histogram(position="identity", alpha=0.5, binwidth=0.2)+ 38 | facet_grid(~type) 39 | ggplotly(hist) 40 | ``` 41 | 42 | ### 2D Histogram 43 | ```{r} 44 | p <- plot_ly(x = filter(df, type=="A")$value, y = filter(df, type=="B")$value) 45 | hist_2d <- subplot(add_markers(p, alpha=0.4), add_trace(p, type='histogram2dcontour'), add_histogram2d(p)) 46 | hist_2d 47 | ``` 48 | 49 | ### Boxplot 50 | ```{r, message=FALSE} 51 | boxplt <- plot_ly(diamonds, x = ~price/carat, y = ~clarity, color = ~clarity, type = "box") %>% 52 | layout(title="Interactive BoxPlot with Plotly") 53 | 54 | # Second method 55 | p <- ggplot(diamonds, aes(x=clarity, y=price/carat, color=clarity)) + 56 | geom_boxplot() + 57 | coord_flip() + 58 | ggtitle("BoxPlot with ggplot2") 59 | # the following line generates the same interactive graph 60 | # ggplotly(p) 61 | 62 | boxplt 63 | ``` 64 | 65 | 66 | ### 2D Scatter Plot 67 | ```{r, message=FALSE} 68 | scatter2d <- plot_ly(filter(diamonds, color=='D'), x=~carat, y=~price, color=~clarity, marker=list(size=4,
opacity=0.5), 69 | # hover text 70 | text = ~paste("Price: ", price, "$
<br>Cut: ", cut, "<br>
Clarity: ", clarity)) %>% 71 | # set title and axis 72 | layout(title="Interactive Scatter Plot with Plotly") 73 | 74 | scatter2d 75 | ``` 76 | 77 | ### 3D Scatter Plot 78 | ```{r, message=FALSE} 79 | scatter3d <- plot_ly(diamonds[sample(nrow(diamonds), 1000), ], x=~price/carat, y=~table, z=~depth, color=~cut, 80 | marker = list(size=4, opacity=0.5))%>% 81 | layout(title="Interactive 3D Scatter Plot with Plotly") 82 | 83 | scatter3d 84 | ``` 85 | 86 | ```{r, warning=FALSE, message=FALSE} 87 | mtcars <- mutate(mtcars, type = case_when(mtcars$am == 0 ~ "Auto", mtcars$am == 1 ~ "Manual")) 88 | plot_ly(mtcars, x=~mpg, y=~wt, z=~hp, color=~type) %>% 89 | layout(title="Interactive 3D Scatter Plot with Plotly", 90 | scene = list(xaxis = list(title = "mpg"), yaxis = list(title = "weight"), zaxis = list(title = "horsepower"))) 91 | ``` 92 | 93 | 94 | ### Line Plot 95 | ```{r, message=FALSE} 96 | a <- rnorm(100, 5, 1) 97 | b <- rnorm(100, 0, 1) 98 | c <- rnorm(100, -5, 1) 99 | df <- data.frame(x=c(1:100), a, b, c) 100 | 101 | lineplt <- plot_ly(df, x = ~x) %>% 102 | add_trace(y = ~a, name="line", type="scatter", mode = "lines", line=list(color='rgb(23, 54, 211)', width=2)) %>% 103 | add_trace(y=~b, name="dot line with markers", mode = "lines+markers", line=list(dash='dot')) %>% 104 | add_trace(y=~c, name="scatter markers only", mode = "markers") %>% #same as scatter plot 105 | layout(title="Interactive Line Plot with Plotly", yaxis=list(title="value")) 106 | 107 | lineplt 108 | ``` 109 | 110 | ### Bar Plot 111 | ```{r} 112 | barplt <- plot_ly(count(diamonds, cut, clarity), x=~cut, y=~n, color=~clarity, type="bar", text=~n, marker=list(opacity=0.4, line=list(color='rgba(8,48,148, 1)', width=1.5))) %>% 113 | layout(barmode = 'group') 114 | 115 | barplt 116 | ``` 117 | 118 | ### Some interesting examples 119 | ```{r} 120 | # Initialization 121 | question <- c('The course was effectively
<br>organized', 122 | 'The course developed my<br>
abilities and skills for<br>
the subject', 123 | 'I would recommend this<br>
course to a friend', 124 | 'Any other questions') 125 | df <- data.frame(question, sa=c(21, 24, 27, 29), a=c(30, 31, 26, 24), ne=c(21, 19, 23, 15), ds=c(16, 15, 11, 18), sds=c(12, 11, 13, 14)) 126 | answer_label <- c('Strongly<br>
agree', 'Agree', 'Neutral', 'Disagree', 'Strongly<br>
disagree') 127 | 128 | # Interactive plot 129 | p <- plot_ly(df, x=~sa, y=~question, type="bar", orientation="h", marker = list(color = 'rgba(38, 24, 74, 0.8)', 130 | line = list(color = 'rgb(248, 248, 249)', width = 1))) %>% 131 | add_trace(x=~a, marker = list(color = 'rgba(71, 58, 131, 0.8)')) %>% 132 | add_trace(x=~ne, marker = list(color = 'rgba(122, 120, 168, 0.8)')) %>% 133 | add_trace(x=~ds, marker = list(color = 'rgba(164, 163, 204, 0.85)')) %>% 134 | add_trace(x=~sds, marker = list(color = 'rgba(190, 192, 213, 1)')) %>% 135 | layout(barmode='stack', 136 | xaxis = list(title = "", showgrid = FALSE,showticklabels = FALSE, zeroline = FALSE), 137 | yaxis = list(title=""), 138 | paper_bgcolor = 'rgb(248, 248, 255)', 139 | plot_bgcolor = 'rgb(248, 248, 255)', 140 | margin = list(l = 120, r = 10, t = 40, b = 10), 141 | showlegend = FALSE) %>% 142 | add_annotations(xref='x', yref='y', x=~sa/2, y=~question, text=paste(df[,"sa"], '%'), font=list(size=12, color="white"), showarrow=FALSE) %>% 143 | add_annotations(x=~sa+a/2, text=paste(df[,"a"], '%'), font=list(size=12, color="white"), showarrow=FALSE) %>% 144 | add_annotations(x=~sa+a+ne/2, text=paste(df[,"ne"], '%'), font=list(size=12, color="white"), showarrow=FALSE) %>% 145 | add_annotations(x=~sa+a+ne+ds/2, text=paste(df[,"ds"], '%'), font=list(size=12, color="white"), showarrow=FALSE) %>% 146 | add_annotations(x=~sa+a+ne+ds+sds/2, text=paste(df[,"sds"], '%'), font=list(size=12, color="white"), showarrow=FALSE) %>% 147 | add_annotations(xref = 'x', yref = 'paper', 148 | x = c(21 / 2, 21 + 30 / 2, 21 + 30 + 21 / 2, 21 + 30 + 21 + 16 / 2, 149 | 21 + 30 + 21 + 16 + 12 / 2), y = 1.05, 150 | text = answer_label, 151 | font = list(size = 12, color = 'rgb(67, 67, 67)'), showarrow = FALSE) 152 | 153 | p 154 | ``` 155 | 156 | 157 | ## Animations with Plotly 158 | ```{r} 159 | df <- data.frame(x = c(1:10), y = rnorm(10), time = c(1:10)) 160 | plot_ly(df, x = ~x, y = ~y, frame = ~time, type = 'scatter', mode = 'markers', 
showlegend = F) 161 | ``` 162 | 163 | ```{r, warning=FALSE, message=FALSE} 164 | df <- gapminder 165 | plot_ly(df, x=~gdpPercap, y=~lifeExp, color=~continent, size=~pop, frame=~year, type="scatter", mode="markers", text=~country, hoverinfo = "text") %>% 166 | layout(xaxis = list(type = "log")) 167 | ``` 168 | 169 | 170 | ## Self-evaluation 171 | Our project generally meets the expectations set by the motivation described at the beginning. We explained the basic language format and provided a code example for each graph. We learned how to use the plotly package to build various types of graphs, and we combined the rich but scattered examples on the plotly website into a more concentrated form. 172 | 173 | However, we did not include detailed explanations of the examples, relying instead on the code and graphs to speak for themselves. We also used simple datasets built into R, which produce examples similar to those on the website. Next time, we will make the tutorial more detailed and better explained, and use newer data. 174 | 175 | ## Reference: 176 | https://plotly.com/r/ 177 | 178 | 179 | 180 | -------------------------------------------------------------------------------- /dygraph_tutorial.Rmd: -------------------------------------------------------------------------------- 1 | # Introduction to Dygraph for Financial Analysis 2 | 3 | Nathaniel Ho 4 | 5 | ## Project Motivation 6 | 7 | The motivation of this project is to give an introduction to dygraph, while specifically highlighting tools within the package that are useful for time-series financial data. Currently, the most popular package for financial analysis is Quantmod. While Quantmod is useful for retrieving time-series financial data and has a set of rudimentary graphs it can generate, it does not have an interactive component.
Therefore, this project utilizes the data retrieved with Quantmod to show the different types of graphs the user can create for individual or multiple stocks, as well as customization options for visualizing the data, including user interaction. 8 | 9 | ## Gathering Data 10 | 11 | We utilize the Quantmod package to get stock market data for AMZN and GME. We will not cover the specifics of the code here, as there is a writeup from another group that covers the package very well. We simply get individual stock prices, and merge the closing prices of the stocks we choose into one dataframe. 12 | 13 | ```{r} 14 | library(quantmod) 15 | library(dygraphs) 16 | tickers <- c("AMZN", "GME") 17 | getSymbols(tickers) 18 | closePrices <- do.call(merge, lapply(tickers, function(x) Cl(get(x)))) 19 | 20 | amzn <- getSymbols("AMZN") 21 | gme <- getSymbols("GME") 22 | 23 | ``` 24 | 25 | ## Types of Graphs 26 | 27 | 28 | ### Individual Dynamic Chart 29 | 30 | The first is an individual dynamic chart. These are useful if you want to display summary data for a single financial instrument (such as the opening, closing, or daily high/low of a stock). Here, we use AMZN's closing price as an example. 31 | 32 | ```{r} 33 | dygraph(AMZN[,4], main="AMZN Closing Price") 34 | 35 | ``` 36 | 37 | ### Multiline Dynamic Chart 38 | 39 | If two series share the same x-axis (such as date), then we can compare them directly by putting them in the same dynamic chart. Common usages include prices or percent changes between two stocks, as well as more technical aspects of a stock such as rolling alpha/beta, the Sharpe ratio, etc. Below, we plot both GameStop's and Amazon's stock prices on a multiline dynamic chart.
40 | 41 | ```{r} 42 | stocks <- cbind(AMZN[,4], GME[,4]) 43 | dygraph(stocks, main = "Closing Prices of Amazon and Gamestop") %>% 44 | dySeries(c("AMZN.Close"), label = "AMZN") %>% 45 | dySeries(c("GME.Close"), label = "GME") 46 | 47 | ``` 48 | 49 | This proves to be useful in multiple ways. For example, we can also combine different types of graphs. Here, we show AMZN's stock price, as well as a step chart below that shows total trading volume for financial analysis. This creates a chart that closely mirrors Quantmod's generated graph, but this graph is interactive. 50 | 51 | ```{r} 52 | stocks <- cbind(AMZN[1000:2000,4], AMZN[1000:2000,5]/100000000) 53 | dygraph(stocks, main = "Closing Price of Amazon with Daily Trading Volume") %>% 54 | dySeries("AMZN.Close", label = "AMZN") %>% 55 | dySeries("AMZN.Volume", stepPlot = TRUE, fillGraph = TRUE, color = "grey", label="Volume (In Tens of Millions)") 56 | 57 | ``` 58 | 59 | ### Moving Average/Rolling 60 | 61 | If we wanted to create moving averages for a stock (some common ones are 50 MA, 200 MA), we can utilize another built-in tool in the dygraphs package called "rollPeriod." The tool gives the option to choose the number of data points to average, creating a smoothed version of the graph (i.e. a graph more focused on trends than day-to-day volatility). The following example is the 50-day MA of AMZN stock: 62 | 63 | ```{r} 64 | dygraph(AMZN[,4], main = "50-day Moving Average of AMZN") %>% 65 | dyRoller(rollPeriod = 50) %>% 66 | dySeries("AMZN.Close", label = "Moving Average Price") 67 | 68 | ``` 69 | 70 | ### Candlestick 71 | 72 | Finally, there is the candlestick graph. The candlestick graph is useful for showing daily moves for a stock, as well as multiple data points such as opening price, closing price, daily low, daily high, and range all in one graph, making analysis of day-to-day stock movement a lot easier to digest. 
73 | 74 | ```{r} 75 | 76 | dygraph(AMZN[3000:3050,1:4], main = "AMZN Stock Price") %>% 77 | dyCandlestick() %>% 78 | dySeries("AMZN.Open", label = "Open") %>% 79 | dySeries("AMZN.High", label = "High") %>% 80 | dySeries("AMZN.Low", label = "Low") %>% 81 | dySeries("AMZN.Close", label = "Close") 82 | 83 | ``` 84 | 85 | ## In-Graph Modifications 86 | 87 | These different graphs can be further modified to give a clearer visualization of why a stock moves a certain direction. This tutorial will cover shading, events, and upper/lower bars. 88 | 89 | ### Shading 90 | 91 | Shading is useful if you want to denote a particular range where a macroeconomic event is happening. For example, many analysts use technical analysis to predict how a stock may move (common examples are bull/bear flags, channels, and the infamous "dead cat bounce"). In this example, we highlight where the initial wave of COVID lockdowns occurred on AMZN's stock price chart to better visualize its movement during that period. 92 | 93 | ```{r} 94 | dygraph(AMZN[,4], main="AMZN Closing Price w/COVID Lockdown Highlighted") %>% 95 | dyShading(from = "2020-2-14", to = "2020-8-12", color = "#FFE6E6") 96 | ``` 97 | 98 | ### Events and Limits 99 | 100 | If there is a factor that shocks the market within one day, we can add an event to pinpoint the immediate effects of that positive/negative shock on that financial instrument. For example, we can highlight the stock market crash of September 2008 on AMZN's stock to better visualize how this economic shock led to a steep decline in the stock price. 101 | 102 | ```{r} 103 | dygraph(AMZN[0:1000,4], main="AMZN Closing Price w/ Great Recession Crash") %>% 104 | dyEvent("2008-9-29", "Great Recession Crash", labelLoc = "bottom") 105 | ``` 106 | 107 | Alternatively, we can add limits onto the graph to better visualize the absolute max and min of a stock. AMZN's stock price peaked on July 8, 2021. We can create a limit line that shows this absolute max.
108 | 109 | ```{r} 110 | dygraph(AMZN[,4], main="AMZN Closing Price w/ All-Time High Limit") %>% 111 | dyLimit(as.numeric(AMZN['2021-07-08', 4]), as.numeric(AMZN['2021-07-08', 4]), color = "black") 112 | ``` 113 | 114 | ### Upper/Lower Bars 115 | 116 | Another way we can directly modify the graph is by adding upper/lower bounds to the graph. These are particularly useful if you are using prediction ML models for time-series data and want to provide an upper/lower bound. For the sake of clarity, we utilize daily low/high for our ranges for AMZN stock to visualize upper/lower bounds rather than an ML prediction model. 117 | ```{r} 118 | dygraph(AMZN[2000:2050,2:4], main="AMZN Stock w/ Daily Low/High") %>% 119 | dySeries(c("AMZN.Low", "AMZN.Close", "AMZN.High"), label = "AMZN") 120 | ``` 121 | ## User Interactivity 122 | 123 | In addition to modifications that directly apply to the graph, there are also modules for user interactivity that you could add to your visualization: namely, selection highlighting and range selection. 124 | 125 | ### Selection Highlighting 126 | 127 | Selection Highlighting is a useful interaction for users who want a clear indicator of what time-series they are following if there are multiple ones on the plot. For example, in the graph below, if the user hovers over the stock they are tracking, the line is emboldened, which makes it easier to trace the trend of that particular stock. Additionally, we add a circle to where the user is hovering to make it easier to distinguish the data point of interest. 
128 | 129 | ```{r} 130 | stocks <- cbind(AMZN[,4], GME[,4]) 131 | dygraph(stocks, main = "Closing Prices of Amazon and Gamestop w/ Selection Highlighting") %>% 132 | dySeries(c("AMZN.Close"), label = "AMZN") %>% 133 | dySeries(c("GME.Close"), label = "GME") %>% 134 | dyHighlight(highlightCircleSize = 3, 135 | highlightSeriesBackgroundAlpha = 0.4, 136 | hideOnMouseOut = FALSE) 137 | 138 | ``` 139 | ### Range Selection 140 | 141 | Additionally, you can give users control over the time range for the data that they want to view. This is useful if you want to import the entirety of a dataset, but want to give the user the option to analyze only a subset of the data. An example of this is seen below with the default range being from 2020 to 2022: 142 | 143 | ```{r} 144 | dygraph(AMZN[,4], main="AMZN Stock w/ Range Selection") %>% 145 | dyRangeSelector(dateWindow = c("2020-01-01", "2022-01-01")) 146 | 147 | ``` 148 | 149 | ## Project Summary and Evaluation 150 | 151 | This project's aim was to give a brief introduction to dygraph and its applications to create interactive graphs for time-series data of financial instruments. This tutorial covers the basics needed to create a compelling visual story for a financial instrument, but there are far more extensions and customization options that the reader can explore to better tell a data-driven story. If there were more time, I would expand on dygraph's use cases beyond financial instruments, such as in time-series trends in fields such as health and technology.
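As a small pointer toward that broader use, the same dygraph tools apply to any time series, not just stock prices. The sketch below is my own illustrative addition (not part of the original tutorial); it plots the built-in `ldeaths` dataset, monthly deaths from lung diseases in the UK, with a range selector:

```{r}
# ldeaths is a built-in monthly ts object (UK lung disease deaths, 1974-1979);
# dygraph() accepts any object convertible to xts, including a plain ts
dygraph(ldeaths, main = "Monthly UK Lung Disease Deaths") %>%
  dyRangeSelector()
```

The in-graph modifications covered above (dyShading(), dyEvent(), dyHighlight()) compose with this graph in exactly the same way as with the stock charts.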
-------------------------------------------------------------------------------- /bubble_chart_application.Rmd: -------------------------------------------------------------------------------- 1 | # Bubble chart --- **GPAs of Courses at Columbia University** 2 | 3 | Longcong Xu 4 | 5 | ```{r,message=FALSE, warning=FALSE} 6 | library(ggplot2) 7 | library(dplyr) 8 | library(plotly) 9 | ``` 10 | 11 | ## Motivation for the project 12 | 13 | In just a few days it will be time for Columbia students to begin choosing their spring 2023 classes. The University offers us a wealth of courses, and I am sure that many students, like me, are unsure of how to choose among the many courses of interest. Yet we don't have a platform that discloses some of the information we care most about to help us make decisions. During my undergraduate years, there was a website that visually provided information about the courses, such as course name, course capacity, average GPA of the course, etc. I borrowed the idea from [UIUC Course evaluation](https://waf.cs.illinois.edu/discovery/gpa_of_every_course_at_illinois/) and hope to build a platform to showcase the courses that belong to Columbia University. If this platform can be built and made public, it can greatly help students in course selection. 14 | 15 | On the University of Illinois at Urbana-Champaign website, the most dominant chart is the bubble chart. Therefore, I would also like to take this opportunity to supplement the EDAV course on bubble charts. The two remaining sections of the document on this page are: an introduction to bubble charts using the mtcars data built into R, and an introduction to the plotly package. Although the plotly package is briefly described on the EDAV website, I would like to give more examples and introduce how to scale bubble size and how to set pop-up text in this package.
16 | 17 | ## Two methods to create bubble chart --- Example of mtcars 18 | 19 | The bubble chart is essentially a scatter plot that adds up to two new dimensions. Scatter plots are mainly applicable to two-dimensional continuous variable data, while bubble plots can also encode categorical data. The bubble size provides a new, intuitive display perspective, and in addition, the bubble color shade provides another indicator of the data. Due to limitations on the data available, I first chose the mtcars dataset for a simple illustration of the bubble chart. 20 | 21 | ### Use the package ggplot2 22 | 23 | One of the most commonly used packages, ggplot2, extends the scatter plot to a bubble plot. From the mtcars dataset, I choose the wt (car weight, 1000 lbs), disp (displacement, cu.in.), hp (gross horsepower) and cyl (number of cylinders) attributes. We start with the simplest version, using gross horsepower to depict bubble size and number of cylinders to distinguish bubble colors. As for alpha, it represents the transparency of the bubbles. 24 | 25 | ```{r,message=FALSE, warning=FALSE} 26 | data("mtcars") 27 | df <- mtcars 28 | 29 | g1 <- ggplot(df, aes(x = wt, y = disp, size = hp, color = cyl)) + 30 | geom_point(alpha = 0.5) + 31 | xlab("Weight") + 32 | ylab("Displacement") 33 | g1 34 | ``` 35 | 36 | Since the number-of-cylinders attribute is originally numeric, we see a change in color from dark to light as the number of cylinders increases. Although cyl is numeric, it takes only three distinct values, so we transform it into a factor to better distinguish this indicator.
37 | 38 | ```{r,message=FALSE, warning=FALSE} 39 | 40 | df$cyl <- as.factor(df$cyl) 41 | g2 <- ggplot(df, aes(x = wt, y = disp, size = hp, color = cyl)) + 42 | geom_point(alpha = 0.5) + 43 | xlab("Weight") + 44 | ylab("Displacement") 45 | g2 46 | ``` 47 | 48 | It is clear from the graph that the displacement increases in a roughly linear fashion as the number of cylinders increases. 49 | 50 | In addition, we can set the color of the bubbles explicitly and scale the size of the bubbles. Furthermore, if you do not know how to match colors, you can refer to a [color palette](https://hexcolorpedia.com/). 51 | 52 | ```{r,message=FALSE, warning=FALSE} 53 | 54 | g3 <- ggplot(df, aes(x = wt, y = disp, color = cyl, size = hp)) + 55 | geom_point(alpha = 0.5) + 56 | scale_color_manual(values = c("#AA4371", "#E7B800", "#FC4E07")) + 57 | scale_size(range = c(1, 13)) + 58 | xlab("Weight") + 59 | ylab("Displacement") + 60 | theme_bw() + theme(legend.position = "bottom") 61 | g3 62 | ``` 63 | 64 | ### The package plotly 65 | 66 | The chart implemented with ggplot2 has an obvious drawback: although we can clearly see the differences between vehicles and some relationships between indicators, we can't see the brand of each car, so we don't know which one to choose. plotly is a package that greatly improves the interactive functionality of charts: when the mouse hovers over a bubble, pop-up content is generated. 67 | 68 | ```{r,message=FALSE, warning=FALSE} 69 | g4 <- plot_ly(df, x = ~wt, y = ~disp, 70 | text = ~cyl, size = ~qsec, 71 | color = ~cyl, sizes = c(10, 50), 72 | marker = list(opacity = 0.7, sizemode = "diameter"))%>% add_markers()%>% 73 | layout(xaxis = list(title = "Weight"),yaxis = list(title = "Displacement")) 74 | 75 | g4 76 | ``` 77 | 78 | We can see that if we don't set it, plotly defaults to displaying the horizontal and vertical values, but still doesn't show the brand of the car we want.
The mtcars dataset stores the brand of the car as row names, so I added a new column named Brand to store the brand of each car. The most important step is setting the hovertemplate; the paste function converts its arguments to character strings and concatenates them. 79 | 80 | We pass Brand into the function if we want to display the brand of the car. Also, the `<br>` tag produces a line break in the hover text. This way, we can customize what we want to show in the pop-up of the bubble chart. As you can see, the hover is able to display not only the parameters used to draw the bubble chart (wt, disp, hp, etc.) but also any other parameters in the dataset. 81 | 82 | ```{r,message=FALSE, warning=FALSE} 83 | df['Brand'] <- row.names(df) 84 | g5 <- plot_ly(df, x = ~wt, y = ~disp, 85 | size = ~hp, sizes = c(10, 50), 86 | color = ~cyl, 87 | mode = 'markers', 88 | hovertemplate = paste("Brand:",df$Brand, "
<br>Weight:" ,df$wt, "<br>
", "Displacement:", df$disp), 89 | marker = list(opacity = 0.7,sizemode = "diameter")) %>% add_markers()%>% 90 | layout(xaxis = list(title = "Weight"),yaxis = list(title = "Displacement")) 91 | g5 92 | ``` 93 | 94 | ## Application of bubble chart --- GPAs of Courses at Columbia University 95 | 96 | Next, we applied the bubble chart to something more useful, showing information about Columbia's courses. Due to limitations on the data and my lack of access to the data, I am getting basic course data from the website [CU Class Directory](http://www.columbia.edu/cu/bulletin/uwb/home.html), like course department, course name, enrollment for the fall 2022 semester, etc. The data regarding average gpa is not accurate enough. If anyone know how to get authoritative average gpa data for each course, please feel free to contact me at lx2305\@columbia.edu. 97 | 98 | To avoid a lot of repetitive drawing code, I used for loop iterator for each college, because of the difficulty of collecting data, I chose part of courses in 4 colleges as examples. The data needs to be fetched directly from the web page, so I have stored it in my own public GitHub file. 99 | 100 | ```{r, warning=FALSE} 101 | data <- readr::read_csv("https://raw.githubusercontent.com/RicoXu727/Leetcode-Note/main/ccData.csv", show_col_types = FALSE) 102 | schools <- list("Engineering", "Law","Teachers College","Professional Studies") 103 | 104 | l <- htmltools::tagList() 105 | for (school in schools) { 106 | d <- data[data$School == school,] 107 | l[[school]] <- plot_ly(d, x = ~CourseNumber, y = ~Department, 108 | size = ~ClassSize, sizes = c(10, 30), 109 | color = ~Average_GPA, 110 | mode = 'markers', 111 | hovertemplate = paste(d$CourseTitle, "
<br>Class Size:",d$ClassSize, "<br>
", "Average gpa:", d$Average_GPA), 112 | marker = list(opacity = 0.7,sizemode = "diameter")) %>% add_markers()%>%layout(title = school, xaxis = list(title = 'Course Number')) 113 | 114 | } 115 | l 116 | ``` 117 | 118 | ## Differently next time 119 | 120 | Since we learned how to implement bubble charts and know how to customizely set pop up, the most important thing for the application is to get reliable data. If you have a way to get accurate data like average gpa, or if you are familiar with crawlers and are interested in contributing to this course GPA sharing platform, you are welcome to contact me at lx2305\@columbia.edu. 121 | 122 | If our platform is implemented, students will have more active access to information about all Columbia courses, not just their own college. Perhaps it would also be more helpful to students in choosing courses across colleges and in choosing their majors. The maintainers of the platform will also update as much accurate data as possible. 123 | 124 | In addition, each course may be taught by many professors, and if possible, we can also expand a new board on the platform to show different section workload sizes and average GPAs. 
125 | 126 | ### Sources: 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | -------------------------------------------------------------------------------- /tutorial_of_making_different_types_of_interactive_graphs.Rmd: -------------------------------------------------------------------------------- 1 | 2 | # Tutorial of making different types of charts interactive 3 | 4 | Shumin Song 5 | 6 | ```{r, include=FALSE} 7 | knitr::opts_chunk$set(warning = FALSE, message = FALSE) 8 | ``` 9 | 10 | ```{r} 11 | library(ggplot2) 12 | library(plotly) 13 | library("httr") 14 | library("readxl") 15 | library(dplyr) 16 | library(collapsibleTree) 17 | # devtools::install_github("rstudio/d3heatmap") 18 | library(d3heatmap) # needs to be installed from source 19 | library(gapminder) 20 | library(ggridges) 21 | library(networkD3) 22 | library(igraph) 23 | library(quantmod) 24 | library(dygraphs) 25 | ``` 26 | 27 | ## Motivation 28 | We have already learned how to create different types of charts and graphs using the `ggplot2` package throughout the class. However, most of the charts and graphs we made are static, which means readers can only read them passively and focus on general information and the distribution of the entire data set. 29 | 30 | In order to deal with such problems, we can make interactive charts and graphs. Interactive data visualization provides tools for readers to engage, explore and adjust data information in a more efficient way. For example, readers can observe specific values of different aspects of a data point on the graph, which allows them to identify trends or relationships within a data set. In addition, when analyzing complex data, interactive controls like zooming and filtering will introduce simplified and ordered information for readers that helps them generate insights to solve a problem. In this tutorial, I will show how to create interactive charts and graphs in R for several graph categories.
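Before the individual chart types, here is a minimal sketch of the simplest pattern that recurs in this tutorial (my own illustrative example, not from the original text): build a static ggplot2 object as usual, then wrap it with plotly's `ggplotly()` to get zooming, panning and hover tooltips for free.

```{r}
# build a static ggplot as usual
static_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# wrapping it with ggplotly() makes it interactive
ggplotly(static_plot)
```

The sections below use this wrapper approach where a ggplot already exists, and plotly's own `plot_ly()` interface where a chart is built from scratch.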
31 | 32 | ## Interactive Scatter Plot 33 | We can create an interactive scatter plot directly using the `plotly` package. I will use the built-in "mtcars" dataset in the following example. We plot data points with wt (weight) as the x-axis and mpg (miles per gallon) as the y-axis. You can find a more detailed introduction to the plot_ly() function at this URL: https://www.rdocumentation.org/packages/plotly/versions/4.10.0/topics/plot_ly 34 | 35 | ```{r} 36 | plot_ly(mtcars, type = "scatter", x = ~wt, y = ~mpg, mode = "markers", 37 | hovertemplate = paste( 38 | "%{xaxis.title.text}: %{x:.2f}
", 39 | "%{yaxis.title.text}: %{y:.2f}
" 40 | ) 41 | ) 42 | ``` 43 | 44 | In this graph, you can: \ 45 | 1. zoom in or zoom out of graph to focus on data points within a specific area \ 46 | 2. hover on to data points to check exact wt and mpg values of each point \ 47 | 3. double click the graph to return to default view of graph 48 | 49 | ## Interactive Bubble Plot 50 | Similarly, we can create an interactive bubble plot of dataset "mtcars" using `plotly`. We plot data points regarding wt(weight) as x-axis and mpg(miles per gallon) as y-axis. In addition, we include cyl(cylinder of engine) as color categories for data points and qsec(1/4 mile time) as data point bubble size. 51 | 52 | ```{r} 53 | plot_ly(mtcars, x = ~wt, y = ~mpg, text = ~cyl, size = ~qsec, 54 | color = ~cyl, sizes = c(10, 50), 55 | marker = list(opacity = 0.6, sizemode = "diameter"), 56 | hovertemplate = paste( 57 | "%{xaxis.title.text}: %{x:.2f}
", 58 | "%{yaxis.title.text}: %{y:.2f}
", 59 | "cyl:%{text}
" 60 | )) 61 | ``` 62 | 63 | In this bubble graph, you can do the same things as that in previous scatter plot. Moreover, you can hover on the bubbles to see specific cyl values of different data points and distinguish data points' differences in qsec values from bubble size. 64 | 65 | ## Interactive Heatmap 66 | We can use `d3heatmap` package to create an interactive heatmap graph. I will still use `mtcars` dataset in the following example. You can find a detailed d2heatmap() function usage description from https://rdrr.io/github/rstudio/d3heatmap/man/d3heatmap.html. 67 | 68 | ```{r} 69 | d3heatmap(mtcars, scale = "column", col = "BuPu", 70 | xlab = "Variable", ylab = "Car Name") 71 | 72 | ``` 73 | 74 | In the graph above, every row represents an observation of car so the x-axis represents each car's name. Every column of graph represents values of a particular variable such as "cyl", "wt", "mpg" of all cars in data set so the y-axis represents each variable in data set. I use "BuPu" as color palette which cause cells with low values be filled with blue while cells with high values be filled with purple. This can help readers observe and interpret difference of a variable value between cars. 75 | 76 | There are several interact ways for you in interactive heapmap graphs: \ 77 | 1. You can hover on to a cell to see specific information of car name, variable in that column and exact value of that variable for car. \ 78 | 2. Zoom in or zoom out of graph to focus on a specific area and double click to return to default view. 79 | 3. Click on a specific car name or variable to select a particular row or column of graph. 80 | 81 | ## Interactive Tree Diagrams 82 | To create an interactive tree diagrams, we need to use `collapsibleTree` package. You can find a detailed usage description of collapsibleTree() function on https://www.rdocumentation.org/packages/collapsibleTree/versions/0.1.7/topics/collapsibleTree. 

I will use the "geography" dataset in the following example. You can download this dataset from https://data.world/glx/geography-table.

```{r}
# Import the geography dataset
GET("https://query.data.world/s/mmol5szlwinfp4mfzkxa73qlrp2yli", write_disk(tf <- tempfile(fileext = ".xlsx")))
geography <- read_excel(tf)

# View the first rows of the geography dataset
head(geography)

geography %>%
  group_by(continent, type) %>%
  summarize(`Number of Countries` = n()) %>%
  collapsibleTreeSummary(
    hierarchy = c("continent", "type"),
    root = "geography",
    width = 600,
    attribute = "Number of Countries"
  )
```

In this graph, you can click on a node to show or hide all of its child nodes. In addition, you can hover over a node to see the total number of countries among its descendant nodes in the tree.

## Interactive Ridgeline Plot
We can use the `plotly` package to create an interactive ridgeline plot. I will use the "gapminder" dataset from the `gapminder` package in the following example, plotting the density of life expectancy across countries for each year.

```{r}
ridgeline <- ggplot(data = gapminder, aes(x = lifeExp, fill = year)) +
  geom_density() +
  facet_grid(year ~ .) +
  xlab("Life Expectancy at Birth") +
  ylab("Year")
(interactive_ridgeline <- ggplotly(ridgeline))
```

In this graph, you can hover along a ridgeline horizontally to see how the density changes across life expectancy values. You can also move vertically across the facets to see how the overall density shifts from year to year.

## Interactive Network Graph
We will use the `networkD3` package to make interactive network graphs. I will create the nodes and edges manually.
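An edge list is simply a two-column data frame giving the endpoints of each link, and everything the widget reveals on hover (which nodes touch which) can be derived from it. Here is a minimal base-R sketch on an arbitrary toy edge list; `neighbors` is a hypothetical helper, not part of `networkD3`:

```r
# Toy edge list: each row is one undirected link (node labels are arbitrary)
edges <- data.frame(
  from = c("A", "A", "B", "C"),
  to   = c("B", "E", "C", "A"),
  stringsAsFactors = FALSE
)

# Direct neighbors of a node: every endpoint paired with it in either column
neighbors <- function(node, edges) {
  sort(unique(c(edges$to[edges$from == node],
                edges$from[edges$to == node])))
}

neighbors("A", edges)  # nodes directly connected to A
```

This is the same connectivity information simpleNetwork() displays interactively when you hover over a node.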
You can find a detailed usage description of the simpleNetwork() function at https://www.rdocumentation.org/packages/networkD3/versions/0.4/topics/simpleNetwork.

```{r}
# Create the edges of the graph as a source-target data frame
df <- data.frame(
  from = c("A", "A", "B", "D", "C", "D", "E", "B", "C", "B"),
  to = c("B", "E", "F", "A", "C", "A", "B", "D", "A", "C")
)

simpleNetwork(df, height = "300px", width = "300px",
              linkColour = "red", nodeColour = "blue", zoom = TRUE)
```

In this graph, you can zoom in with the scroll wheel and hover over a node to see which nodes are directly connected to it. You can also drag a node to see its neighbors and incident edges more clearly.

## Interactive Time Series Graph
We will use the `dygraphs` package to make interactive time series graphs. In the following example, I will show how the price of Apple stock (AAPL) changes over time. You can find a detailed description of the dygraph() function at https://www.rdocumentation.org/packages/dygraphs/versions/1.1.1.6/topics/dygraph.

```{r}
# Get AAPL price data
getSymbols("AAPL")
dygraph(OHLC(AAPL))

# Focus on the AAPL price change from 2020 to the current date
graph <- dygraph(OHLC(AAPL))
dyShading(graph, from = "2020-01-01",
          to = "2022-11-11", color = "#FFE6E6")
```

In these graphs, you can hover along the price lines to see how the AAPL stock price varies within a day, or how the closing price varies over a time period.

## My Evaluation of Tutorial
Through this tutorial, I learned several crucial packages for creating interactive graphs, such as `plotly`, `d3heatmap`, and `dygraphs`, and showed the advantages of interactive graphs. Nevertheless, there are some shortcomings that could be improved.
For instance, I did not explain the parameters of graph-making functions such as d3heatmap() in detail, which might confuse readers when they make graphs by themselves. In addition, I only showed basic applications of interactive graphs. With more time and space, I would introduce some advanced applications of interactive charts.

## Citation Sources
R Graph Gallery: https://r-graph-gallery.com/interactive-charts.html

Plotly R Open Source Graphing Libraries: https://plotly.com/r/

networkD3 R package: http://christophergandrud.github.io/networkD3/

dygraphs for R: https://rstudio.github.io/dygraphs/

collapsible tree diagrams in R: https://github.com/AdeelK93/collapsibleTree