├── DMP_Checklist_2013.pdf
├── DMPs
│   ├── Full_life_cycle_Report.pdf
│   ├── Media_418168_smxx.pdf
│   ├── Media_442573_smxx.pdf
│   ├── NGS_DataManPlan.pdf
│   ├── RIO_article_11624.pdf
│   └── esrc_z-proso-DMP.pdf
├── File_man_pract.md
├── LICENSE.md
├── README.md
├── Trainers
│   ├── DMP_model_answers.pdf
│   └── Guidance.md
├── _config.yml
├── data_formatting.pdf
├── data_formatting.pptx
├── data_management.pdf
├── data_management.pptx
├── data_share_pract.md
├── data_sharing_backup.pdf
├── data_sharing_backup.pptx
├── excel.md
├── excel.pdf
├── file_management.pdf
├── file_management.pptx
├── images
│   ├── DCC_Data_Releases.png
│   ├── DCC_Data_Releases_README.png
│   ├── Data_Life_Cycle.png
│   ├── Data_Life_Cycle_transparent.png
│   ├── Five_selfish_reasons.png
│   ├── Five_selfish_reasons_YouTube.png
│   ├── Hierachical_folders.png
│   ├── Hierachical_folders_Jing.png
│   ├── Hierachical_folders_transparent.png
│   ├── UoC_Research_Data_Management_Policy.png
│   └── jesse-bowser-c0I4ahyGIkA-unsplash.jpg
├── index.md
├── new-index.md
├── patient_data.txt
├── patient_data_cleaned.tsv
├── pre_may_2023_slides
│   ├── Source_docs
│   │   ├── DMP_model_answers.docx
│   │   ├── data_formatting.pptx
│   │   ├── data_management.pptx
│   │   ├── data_sharing_backup.pptx
│   │   └── file_management.pptx
│   ├── data_formatting.pdf
│   ├── data_formatting.pptx
│   ├── data_management.pdf
│   ├── data_management.pptx
│   ├── file_management.pdf
│   ├── file_management.pptx
│   ├── file_management_2021Mar19_share_v02.pdf
│   ├── file_management_2021Mar19_share_v04.pdf
│   └── file_management_2022Nov7_share.pdf
├── refine_demo.Rmd
└── refine_demo.pdf

/DMP_Checklist_2013.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMP_Checklist_2013.pdf
--------------------------------------------------------------------------------
/DMPs/Full_life_cycle_Report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/Full_life_cycle_Report.pdf -------------------------------------------------------------------------------- /DMPs/Media_418168_smxx.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/Media_418168_smxx.pdf -------------------------------------------------------------------------------- /DMPs/Media_442573_smxx.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/Media_442573_smxx.pdf -------------------------------------------------------------------------------- /DMPs/NGS_DataManPlan.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/NGS_DataManPlan.pdf -------------------------------------------------------------------------------- /DMPs/RIO_article_11624.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/RIO_article_11624.pdf -------------------------------------------------------------------------------- /DMPs/esrc_z-proso-DMP.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/DMPs/esrc_z-proso-DMP.pdf 
--------------------------------------------------------------------------------
/File_man_pract.md:
--------------------------------------------------------------------------------
1 | ## File Management practical
2 |
3 | __Working as a group__.
4 |
5 | Select one of the Data Management Plans and examine it.
6 |
7 | For the above, note what you deem to be examples of good File Management
8 | principles as referred to in the previous talk.
9 |
10 | You may want to use the [DMP checklist](DMP_Checklist_2013.pdf).
11 |
12 | What are your opinions of how these DMPs are documented?
13 |
14 | What management practices/which principles are to be implemented?
15 |
16 | Is there anything they have missed or omitted?
17 |
18 | We will have a 5-minute session at the end of the practical to share findings.
19 |
20 | ### Data Management Plans for practicals
21 |
22 | [Drosophila BBSRC project](DMPs/Media_418168_smxx.pdf).
23 | [Signalling pathways MRC project](DMPs/Media_442573_smxx.pdf).
24 | [Bioinformatics software BBSRC project](DMPs/RIO_article_11624.pdf).
25 | [Pathways to violence & crime ESRC project](DMPs/esrc_z-proso-DMP.pdf).
26 | [scRNAseq analysis of neurons](DMPs/NGS_DataManPlan.pdf).
27 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | All content of this repo is released under the CC-BY 4.0 license, with attribution required to **Cambridge Data Champions Programme**.
2 |
3 | Read the [full text of the license](https://creativecommons.org/licenses/by/4.0/legalcode), or a [succinct summary](https://creativecommons.org/licenses/by/4.0/).
4 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Managing your Research Data: Best practices in Research Data Management for Biological Sciences
2 |
3 | ## When and where?
4 | 5 | - Monday 7th November 2022; 10:00 - 16:30; Online via Zoom 6 | 7 | ## Description and timetable 8 | 9 | - [Description, aims, objectives and timetable](index.md) 10 | 11 | ## Materials 12 | 13 | - [Introduction, Data Management Plans](data_management.pdf) 14 | - [Data Formatting](data_formatting.pdf) Also contains a Data Management plan practical. 15 | - [Open Refine Practical](refine_demo.pdf) 16 | - [File Management Best Practices](file_management.pdf) 17 | - [Data Sharing & Backup](data_sharing_backup.pdf) 18 | - [Quiz](https://goo.gl/forms/njTHfyBoOqiAmDA23) 19 | - [Excel practical](excel.md) 20 | - [Patient Data for practicals](https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/master/patient_data.txt) - Right-click on link and select *Save Link As...* 21 | - [Electronic whiteboard (Etherpad)](https://etherpad.wikimedia.org/p/myrd28_11_19) 22 | 23 | ## References 24 | 25 | - [Data Carpentry Open Refine for Ecology](http://www.datacarpentry.org/OpenRefine-ecology-lesson/) 26 | - [Data Carpentry Workshops](http://lgatto.github.io/2016-05-16-CAM/) 27 | - [The Data Organisation Tutorial by Karl Broman](http://kbroman.org/dataorg/) 28 | - [The Quartz guide to bad data](https://github.com/Quartz/bad-data-guide/blob/master/README.md) 29 | - [Three common bad practices in sharing tables and spreadsheets and how to avoid them](http://luisdva.github.io/pls-don't-do-this/) 30 | - [Five Selfish Reasons to work reproducibly - Florian Markowetz](http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7) 31 | - [Keith Baggerly lecture on Duke reproducibility scandal](https://youtu.be/7gYIs7uYbMo) 32 | - [Biologists: this is why bioinformaticians hate you...](http://www.opiniomics.org/biologists-this-is-why-bioinformaticians-hate-you/) 33 | - [Issues related to data preservation | BBC Domesday Project](https://en.wikipedia.org/wiki/BBC_Domesday_Project) 34 | - [Research Data Management at Cambridge 
University](https://www.data.cam.ac.uk)
35 | - [External Support](https://www.data.cam.ac.uk/support/external)
36 | - [Managing and sharing data, UK Data Archive](http://www.data-archive.ac.uk/media/2894/managingsharing.pdf)
37 | - [Information Compliance - Data protection](https://www.information-compliance.admin.cam.ac.uk/data-protection)
38 | - [File management, Cornell University](https://data.research.cornell.edu/content/file-management)
39 | - [File management, Curtin University](http://libguides.library.curtin.edu.au/c.php?g=202401&p=1333189)
40 | - [File organization, MIT Libraries](https://libraries.mit.edu/data-management/files/2014/05/file-organization-july2014.pdf)
41 |
42 | ## License
43 |
44 | ![](https://i.creativecommons.org/l/by/4.0/88x31.png) This work is licensed under the [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
45 |
--------------------------------------------------------------------------------
/Trainers/DMP_model_answers.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/Trainers/DMP_model_answers.pdf
--------------------------------------------------------------------------------
/Trainers/Guidance.md:
--------------------------------------------------------------------------------
1 | ### Guidance to MYRD trainers
2 |
3 | _Overall theme_
4 | * Teach good practice
5 | * Point to further, fuller guidance
6 | * Try to relate principles to your own experiences
7 | * Encourage critical analysis of DMPs, reporting back/presentation skills, team working
8 |
9 | _Session hints_
10 | 1. Introduction, Data Management Plans
11 | * Introduce the class to the aims of the class (introduce tutors)
12 | * Check that their computers are loaded with s/w & materials
13 | (In Clinical School - show them how to log in, e.g. for username substitute xx for monitor label e.g. D3)
14 | (Virtual classroom - check that they've followed Paul's instructions and logged into the OpenRefine VMs)
15 | * Cover the why and what of Data Management forms & importance like Expt. Design & Statistical analysis.
16 | 2. Data formatting
17 | Are spreadsheets evil & how do you make them less so (validation)?
18 | The Gibbs rules of data formatting:
19 | Rule 1 - Never work directly on the raw data.
20 | Rule 2 - Maintain consistency.
21 | Rule 3 - Don't use 0 to mean missing.
22 | Rule 4 - Fill in all the cells.
23 | Rule 5 - Make it rectangular.
24 | 3. OpenRefine practical (Live coding).
25 | The example data file has many typical data entry errors; use OR as an example tool to fix them (reassure
26 | that all is not lost!).
27 | Relate what OR does to complying with our principles on data formatting.
28 | 4. File management.
29 | Get across principles of organising files in sub-dirs according to role e.g. figs/plots, data, logs etc.
30 | File naming - what is a good idea?
31 | 5. File management in DMP practical
32 | Building on what we learnt in the previous session, class groups search example DMPs for instances
33 | of good practice & form opinions as to whether these are adequate. Make notes in Google Doc and give a brief report
34 | back.
35 | 6. Data Sharing & Backup
36 | Different types of backup - extra disk, USB stick, archive, cloud, all with pros & cons.
37 | Duty to share findings (often required by journals) - suitable repositories.
38 | 7. Data Sharing & Backup in DMP practical.
39 | Building on what we learnt in the previous session, class groups search example DMPs for instances
40 | of good practice & form opinions as to whether these are adequate. Make notes in Google Doc and give a brief report
41 | back.
42 | 8. Wrap-up & close
43 | Recap on the principles taught today. Point them to further resources. Remind them about the feedback survey.
44 |
45 | _Class organisation_.
46 | * In the E-learning suite we spread the class around the tables (6 per table) and, in all but the OR practical, work as teams.
47 | * In Zoom Paul creates breakout rooms (Hope to swap people around in each practical).
48 |
49 | _The example DMPs_.
50 | Have been chosen to give a bit of variety to reflect their nature plus a less biology-focussed one for non-biology students.
51 | Also to show requirements in forms from different funders.
52 | Drosophila BBSRC project.
53 | Signalling pathways MRC project.
54 | Bioinformatics software BBSRC project. * A different one, as it deals with research software.
55 | Pathways to violence & crime ESRC project. * A less biological one for non-biologists who may attend.
56 | Horizon 2020 EU Project. * An uberDMP - extra-long as the EU love their paperwork.
57 |
58 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | theme: jekyll-theme-leap-day
2 | google_analytics: UA-63148050-12
3 |
--------------------------------------------------------------------------------
/data_formatting.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_formatting.pdf
--------------------------------------------------------------------------------
/data_formatting.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_formatting.pptx
--------------------------------------------------------------------------------
/data_management.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_management.pdf
--------------------------------------------------------------------------------
/data_management.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_management.pptx
--------------------------------------------------------------------------------
/data_share_pract.md:
--------------------------------------------------------------------------------
1 | ## Data Sharing practical
2 |
3 | __Working as a group__.
4 |
5 | Re-examine the example DMPs, maybe picking a different one from the one you read before.
6 |
7 | Note what you deem to be or derive examples of good data sharing principles
8 | as referred to in the previous talk.
9 |
10 | What are your opinions of how these are documented/which principles did you
11 | apply to your data?
12 |
13 | You may want to use the [DMP checklist](DMP_Checklist_2013.pdf).
14 |
15 | We will have a 5-minute session at the end of the practical to share findings.
16 |
--------------------------------------------------------------------------------
/data_sharing_backup.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_sharing_backup.pdf
--------------------------------------------------------------------------------
/data_sharing_backup.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/data_sharing_backup.pptx
--------------------------------------------------------------------------------
/excel.md:
--------------------------------------------------------------------------------
1 | ## Spreadsheet Validation practical
2 | A new experimental trial is being carried out with a new drug that treats arthritis. Half of the patients will be treated with the drug and half with a placebo.
3 |
4 | _Data will be recorded as follows:_
5 |
6 | - Patient's first & last name (text)
7 |
8 | - ID number (alphanumeric)
9 |
10 | - Placebo (Y/N)
11 |
12 | - Movement score (physician assessed on scale 1-5)
13 |
14 | - Pain score (patient assessed on scale 0-5)
15 |
16 | - Inflammation (None/Low/Medium/High)
17 |
18 | ### Task
19 | Create a new spreadsheet with the following columns:
20 |
21 | Firstname
22 |
23 | Lastname
24 |
25 | IDno
26 |
27 | Placebo
28 |
29 | Movescore
30 |
31 | Painscore
32 |
33 | Inflammation
34 |
35 | Use the data validation feature to ensure that all fields are filled in with the correct range/type of data.
36 |
37 | - Is it possible to ensure that cells are not left blank?
38 |
39 | - Try out Date validation: just add another column for an assessment date.
40 |
41 | - Give helpful feedback messages if data is input incorrectly.
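For comparison with the spreadsheet's built-in validation, the same rules can be sketched in code. This is only an illustrative sketch, not part of the course materials: the function name `validate_record` and the helper `_in_range` are hypothetical, and the rules are taken from the field list above (names non-blank, alphanumeric ID, Y/N placebo, movement score 1-5, pain score 0-5, four inflammation levels).

```python
# Hypothetical sketch: the practical's validation rules expressed as a function.
ALLOWED_INFLAMMATION = {"None", "Low", "Medium", "High"}


def _in_range(value, low, high):
    """Return True if value is a whole number (no decimals) within low..high."""
    try:
        return low <= int(str(value).strip()) <= high
    except ValueError:
        return False


def validate_record(record):
    """Return a list of feedback messages; an empty list means the row passed."""
    errors = []
    # Text fields must not be left blank.
    for field in ("Firstname", "Lastname"):
        if not str(record.get(field, "")).strip():
            errors.append(f"{field} must not be blank")
    # str.isalnum() is False for an empty string, so a blank ID is rejected too.
    if not str(record.get("IDno", "")).isalnum():
        errors.append("IDno must be alphanumeric")
    if record.get("Placebo") not in ("Y", "N"):
        errors.append("Placebo must be Y or N")
    if not _in_range(record.get("Movescore"), 1, 5):
        errors.append("Movescore must be a whole number from 1 to 5")
    if not _in_range(record.get("Painscore"), 0, 5):
        errors.append("Painscore must be a whole number from 0 to 5")
    if record.get("Inflammation") not in ALLOWED_INFLAMMATION:
        errors.append("Inflammation must be None, Low, Medium or High")
    return errors
```

Calling `validate_record` on a dictionary keyed by the column names returns one message per broken rule, which mirrors the "helpful feedback messages" asked for in the task.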
42 | 43 | Test your sheet by entering some made-up data ; can you break your sheet? 44 | 45 | -------------------------------------------------------------------------------- /excel.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/excel.pdf -------------------------------------------------------------------------------- /file_management.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/file_management.pdf -------------------------------------------------------------------------------- /file_management.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/file_management.pptx -------------------------------------------------------------------------------- /images/DCC_Data_Releases.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/DCC_Data_Releases.png -------------------------------------------------------------------------------- /images/DCC_Data_Releases_README.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/DCC_Data_Releases_README.png -------------------------------------------------------------------------------- /images/Data_Life_Cycle.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Data_Life_Cycle.png -------------------------------------------------------------------------------- /images/Data_Life_Cycle_transparent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Data_Life_Cycle_transparent.png -------------------------------------------------------------------------------- /images/Five_selfish_reasons.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Five_selfish_reasons.png -------------------------------------------------------------------------------- /images/Five_selfish_reasons_YouTube.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Five_selfish_reasons_YouTube.png -------------------------------------------------------------------------------- /images/Hierachical_folders.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Hierachical_folders.png -------------------------------------------------------------------------------- /images/Hierachical_folders_Jing.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Hierachical_folders_Jing.png -------------------------------------------------------------------------------- /images/Hierachical_folders_transparent.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/Hierachical_folders_transparent.png -------------------------------------------------------------------------------- /images/UoC_Research_Data_Management_Policy.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/UoC_Research_Data_Management_Policy.png -------------------------------------------------------------------------------- /images/jesse-bowser-c0I4ahyGIkA-unsplash.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/images/jesse-bowser-c0I4ahyGIkA-unsplash.jpg -------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | # Managing your Research Data: Best practices in Research Data Management for Biological Sciences 2 | 3 | ## When and where? 4 | 5 | - Tuesday 2nd May 2023 10:00-16:30 6 | - ONLINE using Craik-Marshall ZOOM environment 7 | 8 | ## Outline 9 | 10 | It has been said that 80% of data analysis is spent on the process of cleaning 11 | and preparing the data. 
Not only does this represent a significant time 12 | investment for the data analyst, but is often a hurdle for the non-specialist 13 | trying to get to grips with analysing their own data after attending an R or 14 | Python course. Despite the best intentions, a spreadsheet that is intuitive and 15 | easily-understandable by human eyes can lead to disaster when trying to process 16 | computationally. 17 | 18 | This workshop will go through the basic principles that we can all adopt in 19 | order to work with data more effectively and “think like a computer”. Moreover, 20 | we will discuss the best practices for data management and organisation so that 21 | our research is auditable and reproducible by ourselves, and others, in the 22 | future. Part of the journey will be via critical evaluation of example Data 23 | Management Plans (Often a condition of Grant). 24 | 25 | ## Description 26 | 27 | 33 | 34 | As a researcher, you will encounter research data in many forms, ranging from 35 | measurements, numbers and images to documents and publications. Whether you 36 | create, receive or collect data, you will certainly need to organise it at some 37 | stage of your project. This workshop will provide an overview of some basic 38 | principles on how we can work with data more effectively. We will discuss the 39 | best practices for research data management and organisation so that our 40 | research is auditable and reproducible by ourselves, and others, in the future. 41 | 42 | ## Aims: During this course you will learn about: 43 | 44 | 53 | 54 | ## Objectives: After this course you should be able to: 55 | 56 | 62 | 63 | ## Timetable 64 | 65 | _Trainers_. 66 | Abigail Edwards (CRUK Cambridge Institute). 67 | Ashley Sawle (CRUK Cambridge Institute). 
68 | 69 | | | Timetable | 70 | |---|---| 71 | | 10:00 - 10:20 | [Introduction, Data Management Plans](data_management.pdf) (Ash) | 72 | | 10:20 - 11:00 | [Data formatting](data_formatting.pdf) (Ash) | 73 | | 11:00 - 11:10 | Break | 74 | | 11:10 - 12:00 | [OpenRefine practical](refine_demo.pdf) (Abbi) | 75 | | 12:00 - 12:15 | [Spreadsheet validation practical](excel.md) (Abbi) | 76 | | 12:15 - 13:00 | [File management](file_management.pdf) (Ash) | 77 | | 13:00 - 13:45 | Lunch break | 78 | | 13:45 - 14:15 | [File management in DMP practical](File_man_pract.md) (Ash) | 79 | | 14:15 - 15:00 | [Data Sharing & Backup](data_sharing_backup.pdf) (Abbi) | 80 | | 15:00 - 15:10 | Break | 81 | | 15:10 - 15:45 | [Data Sharing & Backup in DMP practical](data_share_pract.md) (Abbi) | 82 | | 15:45 - 16:00 | Wrap-up & close | 83 | 84 | **Please fill in the feedback survey at end of course** [link](https://www.surveymonkey.co.uk/r/5S2DTTS) 85 | 86 | ### Data for OpenRefine practical 87 | 88 | [patient_data.txt](https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/master/patient_data.txt) 89 | 90 | 91 | ### Data Management Plans for practicals 92 | 93 | [Drosophila BBSRC project](DMPs/Media_418168_smxx.pdf). 94 | [Signalling pathways MRC project](DMPs/Media_442573_smxx.pdf). 95 | [Bioinformatics software BBSRC project](DMPs/RIO_article_11624.pdf). 96 | [Pathways to violence & crime ESRC project](DMPs/esrc_z-proso-DMP.pdf). 97 | [scRNAseq analysis of neurons](DMPs/NGS_DataManPlan.pdf). 98 | 99 | **Useful checklist:** [A Data management plan checklist](DMP_Checklist_2013.pdf). 
100 |
101 | ### Further reading
102 |
103 | - [Journal of Cheminformatics: Too many tags spoil the metadata: investigating the knowledge management of scientific research with semantic web technologies](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0345-8)
104 | - [Article on Electronic Lab Notebooks (ELNs) by Labfolder CEO](https://www.labfolder.com/electronic-lab-notebook-eln-research-guide/)
105 | - [Blog article on removing units using OpenRefine - Wayback Machine version of deleted blog article](https://web.archive.org/web/20201026004029/https://susanemcgregor.com/removing-unwanted-units-from-data-with-chomp-in-google-refine/)
106 | - [Quite extensive course on OpenRefine](https://itsmecevi.github.io/openrefine)
107 | - [Wikipedia article on FAIR principles](https://en.wikipedia.org/wiki/FAIR_data).
108 |
109 | ### Further viewing ###
110 | - [Introduction to OpenRefine video tutorial](https://www.youtube.com/watch?v=wGVtycv3SS0)
111 | - [Keith Baggerly talk on how cut and paste errors could endanger patients](https://www.youtube.com/watch?v=7gYIs7uYbMo)
112 | - [Florian's talk on why to work reproducibly](https://www.youtube.com/watch?v=S8bU1CyEkRM)
113 | - [Fun look at why organisation matters - Bad Project (Lady Gaga parody)](https://www.youtube.com/watch?v=Fl4L4M8m4d0)
114 |
--------------------------------------------------------------------------------
/new-index.md:
--------------------------------------------------------------------------------
1 | # Managing your Research Data: Best practices in Research Data Management for Biological Sciences
2 |
3 | ## When and where?
4 | 5 | - Thursday 28th November 2019; 12:30 - 17:00; eLearning 3 - School of Clinical Medicine, Addenbrookes site 6 | ## Timetable 7 | 8 | | Slot | Topics covered | 9 | |---|---| 10 | | 12:30 - 13:30 | [Introduction, Data Management Plans](data_management.pdf) & [Data formatting](data_formatting.pdf) (MF) | 11 | | 13:30 - 14:30 | [OpenRefine practical (Live coding)](refine_demo.pdf)[DMP-Data formatting exercise]() (MF+JS+AP) | 12 | | 14:30 - 14:45 | Break - Tea, coffee & cookies | 13 | | 14:45 - 15:35 | [File management](file_management.pdf) (JS) | 14 | | 15:35 - 16:20 | [Data Sharing & Backup](data_sharing_backup.pdf) (AP) | 15 | | 16:20 - 16:30 | Wrap-up & close | 16 | 17 | ## Outline 18 | 19 | It has been said that 80% of data analysis is spent on the process of cleaning and preparing the data for 20 | computer-based analysis. 21 | Not only does this represent a significant time investment for the data analyst, but is often a hurdle for the 22 | non-specialist trying to get to grips with analysing their own data after attending an R or Python course. 23 | Despite the best intentions, a spreadsheet that is intuitive and easily-understandable by human eyes can 24 | lead to disaster when trying to process computationally. 25 | 26 | This workshop will go through the basic principles that we can all adopt in order to work with data more 27 | effectively and “think like a computer”. Moreover, we will discuss the best practices for data management 28 | and organisation so that our research is auditable and reproducible by ourselves, and others, in the future. 29 | We have updated this course to be centred on the concept of Data Management Plans (described in more detail in 30 | course) which cover the life-cycle of your project data. DMPs depend on good data practices (which obviously 31 | can be vital even without a DMP). 32 | 33 | ## Description 34 | 35 | - Do you know what a Data Management Plan is and what it covers? 
36 | - How much data would you lose if your laptop was stolen?
37 | - Have you ever emailed your colleague a file named 'final_final_versionEDITED'?
38 | - Have you ever struggled to import your spreadsheets into R?
39 |
40 | As a researcher, you will encounter research data in many forms, ranging from measurements, numbers and images
41 | to documents and publications. Whether you create, receive or collect data, you will certainly need to organise
42 | it at some stage of your project. This workshop will provide an overview of some basic principles on how we can
43 | work with data more effectively. We will discuss the best practices for research data management and organisation
44 | so that our research is auditable and reproducible by ourselves, and others, in the future.
45 |
46 | ## Aims: During this course you will learn about:
47 |
48 | - What Research Funders expect
49 | - Options for backing up your computer
50 | - Ideas for naming and organising your files
51 | - Strategies for exchanging files with collaborators
52 | - Tips and tricks to make sure that your spreadsheets are readable by programming languages such as R
53 | - How to use tools such as OpenRefine for data cleaning
54 | - Preparing high-throughput biological data for submission to a public repository
55 |
56 | ## Objectives: After this course you should be able to:
57 |
58 | - Select an appropriate backup strategy for your data
59 | - Organise your files in a more structured and consistent manner
60 | - Avoid common pitfalls in spreadsheet manipulation
61 | - Know what resources are available at the University of Cambridge for Research Data Management
62 |
--------------------------------------------------------------------------------
/patient_data.txt:
--------------------------------------------------------------------------------
1 | "ID" "Name" "Race" "Sex" "Smokes" "Height" "Weight" "Birth" "State" "Pet" "Grade_Level" "Died" "Count" "Date Entered Study"
2 | "AC/AH/001"
"Demetrius" "White" "Male" "FALSE" "182.87cm" "76.57kg" 1972-02-06 "Georgia" "Dog" 2 FALSE 0.01 "2015-12-01" 3 | "AC/AH/017" "Rosario" "White" "Male" "FALSE" "179.12cm" "80.43kg" 1972-06-15 "Missouri" "Dog" 2 FALSE -1.31 "" 4 | "AC/AH/020" "Julio" "Black" " Male" "FALSE" "169.15cm" "75.48kg" 1972-07-09 "Pennsylvania" "None" 2 FALSE -0.17 "" 5 | "AC/AH/022" "Lupe" "White" "Male" "FALSE" "175.66cm" "94.54kg" 1972-08-17 "Florida" "Cat" 1 FALSE -1.1 "" 6 | "AC/AH/029" "Lavern" "White" "Female" "FALSE" "164.47cm" "71.78kg" 1973-06-12 "Iowa" "NULL" 2 TRUE 1.42 "" 7 | "AC/AH/033" "Bernie" "Dog" "Female" "TRUE" "158.27cm" "69.9kg" 1973-07-01 "Maryland" "Dog" 2 FALSE 0.29 "" 8 | "AC/AH/037" "Samuel" "White" "Female" "FALSE" "161.69cm" "68.85kg" 1972-03-26 "Pennsylvania" "None" 1 FALSE 0.16 "" 9 | "AC/AH/044" "Clair" "White" "Female" "No" "165.84cm" "70.44kg" 1973-05-11 "North Carolina" "None" 1 FALSE -0.07 "" 10 | "AC/AH/045" "Shirley" "White" "Male" "FALSE" "181.32cm" "76.9kg" 1971-12-31 "Louisiana" "Dog" 1 FALSE -1.43 "" 11 | "AC/AH/048" "Merle" "Hispanic" " Male" "FALSE" "167.37cm" "79.06kg" 1973-07-19 "North Carolina" "None" 2 FALSE 0.54 "2015-12-31" 12 | "AC/AH/049" "Martin" "White" "Female" "FALSE" "160.06cm" "72.37kg" 1972-05-04 "California" "Horse" 2 TRUE -2.41 "" 13 | "AC/AH/050" "Frances" "White" "Female" "FALSE" "166.48cm" "67.34kg" 1971-11-14 "Michigan" "None" 1 FALSE 1.05 "" 14 | "AC/AH/052" "Courtney" "White" "Male" "TRUE" "175.39cm" "92.22kg" 1972-03-22 "Indiana" "Bird" 3 FALSE -0.04 "" 15 | "AC/AH/053" "FRANCIS" "White" "Female" "TRUE" "164.7cm" "75.69kg" 1971-11-22 "Virginia" "Dog" 1 FALSE -0.65 "" 16 | "AC/AH/057" "Vernon" "White" "Female" "TRUE" "163.79cm" "65.76kg" 1972-01-12 "Illinois" "Cat" 3 FALSE 0.06 "" 17 | "AC/AH/061" "Lester" "Black" "Male" "FALSE" "181.13cm" "72.33kg" 1972-11-22 "Wisconsin" "Dog" 99 TRUE -0.15 "" 18 | "AC/AH/063" "Robin" "Hispanic" "Male" "FALSE" "169.24cm" "73.3kg" 1971-11-22 "Illinois" "None" 3 FALSE 0.68 "" 19 | "AC/AH/076" 
"Albert" "White" "Male" "FALSE" "176.22cm" "97.67kg" 1973-04-14 "Louisiana" "Cat" 2 FALSE -1.97 "" 20 | "AC/AH/077" "Tommy" "Black" "Male" "FALSE" "174.09cm" "72.2kg" 1973-02-07 "Washington" "Cat" 3 FALSE -1.18 "" 21 | "AC/AH/086" "KYLE" "Black" "Male" "TRUE" "180.11cm" "75.72kg" 1973-05-18 "Georgia" "Cat" 3 FALSE 0.4 "" 22 | "AC/AH/089" "DONG" "White" "Male" "FALSE" "179.24cm" "75.54kg" 1972-03-17 "California" "None" 2 TRUE -0.47 "" 23 | "AC/AH/100" "MICHEL" "White" "Female" "FALSE" "161.92cm" "69.92kg" 1973-01-02 "Georgia" "Dog" 1 FALSE 0.5 "" 24 | "AC/AH/104" "JEREMY" "White" "Male" "TRUE" "169.85cm" "90.63kg" 1972-04-18 "Kentucky" "None" 1 TRUE -0.2 "" 25 | "AC/AH/112" "Pat" "Black" " Female" "FALSE" "160.57cm" "63.54kg" 1973-07-02 "California" "NA" 99 TRUE 0.69 "2016-01-31" 26 | "AC/AH/113" "Eugene" "White" " Female" "No" "168.24cm" "69.57kg" 1972-02-13 "Massachusetts" "NA" 2 FALSE 0.38 "" 27 | "AC/AH/114" "Kris" "Hispanic" "Male" "FALSE" "177.75cm" "74.84kg" 1972-11-25 "Pennsylvania" "Bird" 3 FALSE 0.15 "" 28 | "AC/AH/115" "Tracy" "Bi-Racial" "Male" "TRUE" "183.21cm" "83.36kg" 1973-10-05 "California" "Dog" 2 FALSE 0.05 "" 29 | "AC/AH/127" "Jame" "White" "Male" "FALSE" "167.75cm" "82.06kg" 1972-11-04 "Texas" "Dog" 1 TRUE 0.63 "" 30 | "AC/AH/133" "CLYDE" "Hispanic" "Male" "FALSE" "181.15cm" "83.93kg" 1973-10-19 "Washington" "Cat" 3 TRUE -0.81 "2016-03-02" 31 | "AC/AH/150" "Brett" "White" "Male" "TRUE" "181.56cm" "79.54kg" 1972-05-09 "Kentucky" "Dog" 1 TRUE 0.92 "" 32 | "AC/AH/154" "Tony" "White" "Female" "FALSE" "160.03cm" "64.3kg" 1973-09-05 "California" "DOG" 1 TRUE -0.1 "" 33 | "AC/AH/156" "GEORGE" "White" "Male" "FALSE" "165.62cm" "76.72kg" 1972-07-15 "California" "Dog" 1 TRUE 0.61 "" 34 | "AC/AH/159" "Edward" "White" "Male" "No" "181.64cm" "96.91kg" 1972-12-10 "Connecticut" "Cat" 2 FALSE -0.39 "" 35 | "AC/AH/160" "Rory" "Asian" "Female" "FALSE" "159.67cm" "71.88kg" 1973-09-28 "Florida" "Cat" 2 TRUE 0.74 "" 36 | "AC/AH/164" "Shane" "Hispanic" "Male" "TRUE" 
"177.03cm" "74.04kg" 1972-02-24 "Florida" "None" 2 FALSE 0.15 "" 37 | "AC/AH/171" "DEVIN" "White" "Female" "No" "163.35cm" "70.46kg" 1973-04-22 "California" "Bird" 3 TRUE 1.68 "2016-03-31" 38 | "AC/AH/176" "Jerry" "Asian" "Male" "FALSE" "175.21cm" "83.65kg" 1973-05-07 "Virginia" "Dog" 3 TRUE 0.41 "" 39 | "AC/AH/180" "Drew" "White" "Female" "FALSE" "160.8cm" "64.77kg" 1973-02-24 "Oregon" "CAT" 1 TRUE -2.17 "" 40 | "AC/AH/185" "Ronald" "White" "Male" "FALSE" "166.46cm" "76.83kg" 1972-08-23 "Colorado" "None" 99 TRUE -1.32 "" 41 | "AC/AH/186" "Christopher" "White" "Female" "FALSE" "157.95cm" "67.41kg" 1972-05-12 "New Jersey" "Dog" 3 TRUE -1.89 "" 42 | "AC/AH/192" "Dominique" "White" "Male" "FALSE" "180.61cm" "83.59kg" 1972-03-30 "Michigan" "None" 3 TRUE -0.6 "" 43 | "AC/AH/198" "Van" "White" "Female" "FALSE" "159.52cm" "67.99kg" 1972-12-08 "Missouri" "CAT" 2 FALSE -0.4 "" 44 | "AC/AH/207" "Bobbie" "White" "Female" "FALSE" "163.01cm" "65.19kg" 1973-05-23 "Florida" "Dog" 2 FALSE 1.07 "" 45 | "AC/AH/208" "Lawrence" "Hispanic" "Female" "FALSE" "165.8cm" "71.77kg" 1973-08-13 "Louisiana" "None" 1 FALSE -1.21 "" 46 | "AC/AH/210" "Keith" "Hispanic" "Female" "TRUE" "170.03cm" "66.68kg" 1972-09-03 "New York" "Dog" 99 FALSE -0.66 "" 47 | "AC/AH/211" "Son" "White" "Female" "FALSE" "157.16cm" "69.64kg" 1973-07-20 "California" "Cat" 2 TRUE -0.7 "2016-05-01" 48 | "AC/AH/213" "Charlie" "White" "Female" "Yes" "164.58cm" "72.99kg" 1972-01-31 "Louisiana" "Dog" 1 FALSE -0.32 "" 49 | "AC/AH/219" "Jay" "White" "Female" "FALSE" "163.47cm" "72.89kg" 1972-04-13 "North Caroline" "Bird" 1 TRUE 0.33 "" 50 | "AC/AH/220" "Richard" "White" "Male" "FALSE" "185.43cm" "87.23kg" 1973-07-19 "Florida" "Cat" 1 FALSE 1.29 "" 51 | "AC/AH/221" "Carlos" "White" "Female" "FALSE" "165.34cm" "70.84kg" 1972-02-07 "Michigan" "Dog" 99 TRUE 1.69 "" 52 | "AC/AH/225" "Gail" "White" "Female" "FALSE" "163.45cm" "67.67kg" 1972-10-27 "Michigan" "Cat" 2 FALSE -0.74 "" 53 | "AC/AH/233" "Marion" "White" " Female" "FALSE" 
"163.97cm" "66.71kg" 1971-12-29 "Ohio" "Cat" 3 TRUE -1.07 "" 54 | "AC/AH/241" "LINDSAY" "White" "Female" "FALSE" "161.38cm" "73.55kg" 1972-02-14 "Florida" "Cat" 3 FALSE -0.82 "2016-05-31" 55 | "AC/AH/244" "Sean" "White" "Female" "FALSE" "160.09cm" "65.93kg" 1973-01-31 "Maryland" "None" 99 TRUE 0.5 "" 56 | "AC/AH/248" "Andrea" "White" "Male" "FALSE" "178.64cm" "97.05kg" 1973-01-18 "Indiana" "Cat" 1 TRUE -1.21 "" 57 | "AC/AH/249" "Jesus" "Hispanic" "Female" "TRUE" "159.78cm" "68.31kg" 1972-04-29 "Alabama" "Cat" 2 TRUE 0.52 "" 58 | "AC/SG/002" "Jan" "White" "Female " "TRUE" "161.57cm" "67.92kg" 1973-07-09 "Arizona" "Dog" 3 FALSE 0.33 "" 59 | "AC/SG/003" "Walter" "White" "Female " "No" "161.83cm" "66.03kg" 1972-07-17 "Oregon" "None" 2 TRUE -0.9 "" 60 | "AC/SG/008" "Dana" "White" " Male" "Yes" "169.66cm" "77.3kg" 1973-06-01 "Nevada" "Dog" 1 TRUE -2.36 "" 61 | "AC/SG/009" "Sammy" "White" "Male" "FALSE" "166.84cm" "88.25kg" 1972-03-10 "Vermont" "Dog" 1 FALSE 1.51 "2016-07-01" 62 | "AC/SG/010" "THEO" "Asian" "Female" "FALSE" "159.32cm" "64.92kg" 1973-02-04 "New York" "Cat" 2 TRUE 0.71 "" 63 | "AC/SG/015" "Shaun" "White" "Male" "Yes" "170.51cm" "84.35kg" 1972-11-15 "New Jersey" "DOG" 3 TRUE 0.12 "" 64 | "AC/SG/016" "Jimmie" "Black" "Female" "FALSE" "161.84cm" "69.97kg" 1972-04-09 "Arizona" "Cat" 3 TRUE -0.57 "" 65 | "AC/SG/046" "Carl" "Hispanic" "Male" "FALSE" "171.41cm" "81.7kg" 1973-08-11 "Mississippi" "Bird" 2 TRUE 0.43 "" 66 | "AC/SG/055" "EVAN" "White" "Male" "FALSE" "166.75cm" "79.06kg" 1972-03-01 "Illinois" "Bird" 3 TRUE -0.55 "2016-07-31" 67 | "AC/SG/056" "Merrill" "Asian" "Female" "Yes" "166.19cm" "67.46kg" 1972-12-03 "Indiana" "NULL" 3 TRUE -1.06 "" 68 | "AC/SG/064" "JON" "White" "Male " "FALSE" "169.16cm" "90.08kg" 1972-10-10 "Illinois" "Cat" 2 TRUE -0.1 "" 69 | "AC/SG/065" "SHAYNE" "White" "Female" "FALSE" "157.01cm" "66.56kg" 1972-04-11 "California" "Dog" 3 TRUE 0.64 "" 70 | "AC/SG/067" "Thomas" "White" "Male" "FALSE" "167.51cm" "84.15kg" 1972-07-25 
"Pennsylvania" "Bird" 2 TRUE -1.46 "" 71 | "AC/SG/068" "VALENTINE" "Hispanic" "Female" "FALSE" "160.47cm" "68.2kg" 1972-04-21 "Tennessee" "Cat" 3 TRUE 0.83 "" 72 | "AC/SG/072" "Cameron" "Black" "Female" "TRUE" "162.33cm" "66.47kg" 1972-02-21 "Californa" "NULL" 3 FALSE 1.02 "" 73 | "AC/SG/074" "Eddie" "Hispanic" "Male" "FALSE" "175.67cm" "88.82kg" 1973-10-10 "Georgia" "None" 99 FALSE -3.14 "" 74 | "AC/SG/084" "Brian" "Hispanic" "Male" "FALSE" "174.25cm" "80.93kg" 1972-03-12 "Virginia" "DOG" 2 TRUE 1.79 "" 75 | "AC/SG/095" "Matthew" "White" " Female" "FALSE" "158.94cm" "65.14kg" 1973-06-01 "Hawaii" "Dog" 1 FALSE 0.47 "" 76 | "AC/SG/099" "LESLIE" "Asian" " Male" "FALSE" "172.72cm" "67.62kg" 1972-02-10 "Ohio" "Cat" 1 FALSE 0.05 "" 77 | "AC/SG/101" "Jason" "White" "Female" "FALSE" "159.23cm" "69.96kg" 1973-10-04 "Michigan" "Dog" 2 TRUE -0.81 "" 78 | "AC/SG/107" "Sol" "White" "Male" "FALSE" "176.54cm" "90.76kg" 1973-02-03 "Hawaii" "None" 3 FALSE -1.48 "2016-08-31" 79 | "AC/SG/116" "Connie" "Black" "Male" "FALSE" "184.34cm" "90.41kg" 1972-06-11 "Florida" "None" 3 TRUE -1.41 "" 80 | "AC/SG/121" "Rudy" "White" "Female" "FALSE" "163.94cm" "71.47kg" 1973-03-18 "Michigan" "Cat" 3 FALSE -0.25 "" 81 | "AC/SG/122" "MICHAL" "Hispanic" "Female " "FALSE" "160.09cm" "68.94kg" 1971-12-22 "South Carolina" "DOG" 1 FALSE -0.99 "" 82 | "AC/SG/123" "Darnell" "White" " Female" "TRUE" "162.32cm" "72.72kg" 1972-09-09 "North Caroline" "Bird" 1 TRUE -0.92 "" 83 | "AC/SG/134" "Daryl" "White" "Female " "TRUE" "162.59cm" "69.76kg" 1972-06-03 "Texas" "CAT" 2 TRUE -0.28 "" 84 | "AC/SG/139" "JORDAN" "White" "Male" "FALSE" "171.94cm" "82.11kg" 1973-10-12 "Michigan" "None" 1 FALSE -1.18 "" 85 | "AC/SG/142" "Kenneth" "White" "Female" "FALSE" "158.07cm" "69.8kg" 1972-05-21 "Kansas" "Dog" 3 FALSE 0.54 "" 86 | "AC/SG/155" "Raymond" "White" "Female" "FALSE" "158.35cm" "69.72kg" 1972-06-08 "California" "Cat" 3 TRUE 1.42 "" 87 | "AC/SG/165" "Elmer" "White" "Female" "FALSE" "162.18cm" "67.81kg" 1972-03-31 
"Washington" "Bird" 1 TRUE 1.68 "" 88 | "AC/SG/167" "Jimmy" "White" "Female" "FALSE" "159.38cm" "70.37kg" 1973-10-06 "Washington" "None" 2 TRUE 0.02 "2016-10-01" 89 | "AC/SG/172" "Whitney" "White" "Male" "No" "171.45cm" "84.29kg" 1972-03-02 "Florida" "Dog" 2 TRUE -0.19 "" 90 | "AC/SG/173" "Britt" "White" "Female" "TRUE" "163.17cm" "64.47kg" 1973-06-25 "Californa" "NONE" 2 FALSE 0.69 "" 91 | "AC/SG/179" "LOGAN" "White" "Male" "FALSE" "183.1cm" "82.47kg" 1972-10-30 "Ohio" "Dog" 3 TRUE 0.77 "" 92 | "AC/SG/181" "Terry" "Hispanic" "Male" "FALSE" "177.14cm" "88.7kg" 1971-11-30 "Indiana" "CAT" 3 TRUE 1.76 "" 93 | "AC/SG/182" "Jamie" "Hispanic" "Male" "TRUE" "171.08cm" "72.51kg" 1973-03-31 "Louisiana" "None" 3 TRUE -0.22 "" 94 | "AC/SG/191" "Lacy" "Hispanic" "Female" "FALSE" "159.33cm" "70.68kg" 1973-06-27 "Texas" "None" 3 TRUE -1.07 "" 95 | "AC/SG/193" "Ronnie" "White" "Male " "TRUE" "185.43cm" "73.63kg" 1973-06-11 "Iowa" "Dog" 3 FALSE -0.29 "" 96 | "AC/SG/194" "Joseph" "White" "Female" "FALSE" "162.65cm" "73.99kg" 1972-08-10 "Maryland" "CAT" 3 FALSE 0.87 "2016-10-31" 97 | "AC/SG/197" "Stacy" "White" "Female" "FALSE" "159.44cm" "66.21kg" 1972-11-14 "New York" "Cat" 1 TRUE 1.21 "" 98 | "AC/SG/204" "Anthony" "White" "Female" "FALSE" "164.11cm" "70.66kg" 1972-06-23 "California" "Dog" 3 FALSE -0.17 "" 99 | "AC/SG/216" "Alva" "White" " Female" "FALSE" "159.13cm" "66.96kg" 1972-06-25 "Alabama" "None" 1 TRUE 0.95 "" 100 | "AC/SG/217" "DEAN" "White" "Female" "FALSE" "160.58cm" "71.49kg" 1972-11-17 "Ohio" "None" 1 TRUE -0.78 "" 101 | "AC/SG/234" "LUIS" "Hispanic" "Female" "FALSE" "164.88cm" "68.07kg" 1971-11-16 "Pennsylvania" "Cat" 3 TRUE 0.35 "" 102 | -------------------------------------------------------------------------------- /patient_data_cleaned.tsv: -------------------------------------------------------------------------------- 1 | ID Name Race Sex Smokes Height Weight Birth State Pet Grade_Level Died Count Data Entered Study 2 | AC/AH/001 Demetrius White Male FALSE 
182.87 76.57 1972-01-31 Georgia Dog 2 FALSE 0.01 2015-11-25 3 | AC/AH/017 Rosario White Male FALSE 179.12 80.43 1972-06-09 Missouri Dog 2 FALSE -1.31 2015-11-25 4 | AC/AH/020 Julio Black Male FALSE 169.15 75.48 1972-07-03 Pennsylvania None 2 FALSE -0.17 2015-11-25 5 | AC/AH/022 LUPE White Male FALSE 175.66 94.54 1972-08-11 Florida Cat 1 FALSE -1.1 2015-11-25 6 | AC/AH/029 Lavern White Female FALSE 164.47 71.78 1973-06-06 Iowa NULL 2 TRUE 1.42 2015-11-25 7 | AC/AH/033 Bernie Dog Female TRUE 158.27 69.9 1973-06-25 Maryland Dog 2 FALSE 0.29 2015-11-25 8 | AC/AH/037 Samuel White Female FALSE 161.69 68.85 1972-03-20 Pennsylvania None 1 FALSE 0.16 2015-11-25 9 | AC/AH/044 CLAIR White Female No 165.84 70.44 1973-05-05 North Caroline None 1 FALSE -0.07 2015-11-25 10 | AC/AH/045 Shirley White Male FALSE 181.32 76.9 1971-12-25 Louisiana Dog 1 FALSE -1.43 2015-11-25 11 | AC/AH/048 MERLE Hispanic Male FALSE 167.37 79.06 1973-07-13 North Carolina None 2 FALSE 0.54 2015-12-25 12 | AC/AH/049 Martin White Female FALSE 160.06 72.37 1972-04-28 California Horse 2 TRUE -2.41 2015-12-25 13 | AC/AH/050 Frances White Female FALSE 166.48 67.34 1971-11-08 Michigan None 1 FALSE 1.05 2015-12-25 14 | AC/AH/052 Courtney White Male TRUE 175.39 92.22 1972-03-16 Indiana Bird 3 FALSE -0.04 2015-12-25 15 | AC/AH/053 Francis White Female TRUE 164.7 75.69 1971-11-16 Virginia Dog 1 FALSE -0.65 2015-12-25 16 | AC/AH/057 Vernon White Female TRUE 163.79 65.76 1972-01-06 Illinois Cat 3 FALSE 0.06 2015-12-25 17 | AC/AH/061 Lester Black Male FALSE 181.13 72.33 1972-11-16 Wisconsin Dog 99 TRUE -0.15 2015-12-25 18 | AC/AH/063 ROBIN Hispanic Male FALSE 169.24 73.3 1971-11-16 Illinois None 3 FALSE 0.68 2015-12-25 19 | AC/AH/076 Albert White Male FALSE 176.22 97.67 1973-04-08 Louisiana Cat 2 FALSE -1.97 2015-12-25 20 | AC/AH/077 Tommy Black Male FALSE 174.09 72.2 1973-02-01 Washington Cat 3 FALSE -1.18 2015-12-25 21 | AC/AH/086 KYLE Black Male TRUE 180.11 75.72 1973-05-12 Georgia Cat 3 FALSE 0.4 2015-12-25 22 | 
AC/AH/089 Dong White Male FALSE 179.24 75.54 1972-03-11 Californa None 2 TRUE -0.47 2015-12-25 23 | AC/AH/100 Michel White Female FALSE 161.92 69.92 1972-12-27 Georgia Dog 1 FALSE 0.5 2015-12-25 24 | AC/AH/104 Jeremy White Male TRUE 169.85 90.63 1972-04-12 Kentucky None 1 TRUE -0.2 2015-12-25 25 | AC/AH/112 Pat Black Female FALSE 160.57 63.54 1973-06-26 California NA 99 TRUE 0.69 2016-01-25 26 | AC/AH/113 Eugene White Female No 168.24 69.57 1972-02-07 Massachusetts NA 2 FALSE 0.38 2016-01-25 27 | AC/AH/114 Kris Hispanic Male FALSE 177.75 74.84 1972-11-19 Pennsylvania Bird 3 FALSE 0.15 2016-01-25 28 | AC/AH/115 Tracy Bi-Racial Male TRUE 183.21 83.36 1973-09-29 California Dog 2 FALSE 0.05 2016-01-25 29 | AC/AH/127 Jame White Male FALSE 167.75 82.06 1972-10-29 Texas Dog 1 TRUE 0.63 2016-01-25 30 | AC/AH/133 Clyde Hispanic Male FALSE 181.15 83.93 1973-10-13 Washington Cat 3 TRUE -0.81 2016-02-25 31 | AC/AH/150 Brett White Male TRUE 181.56 79.54 1972-05-03 Kentucky Dog 1 TRUE 0.92 2016-02-25 32 | AC/AH/154 TONY White Female FALSE 160.03 64.3 1973-08-30 California DOG 1 TRUE -0.1 2016-02-25 33 | AC/AH/156 George White Male FALSE 165.62 76.72 1972-07-09 Californa Dog 1 TRUE 0.61 2016-02-25 34 | AC/AH/159 Edward White Male No 181.64 96.91 1972-12-04 Connecticut Cat 2 FALSE -0.39 2016-02-25 35 | AC/AH/160 RORY Asian Female FALSE 159.67 71.88 1973-09-22 Florida Cat 2 TRUE 0.74 2016-02-25 36 | AC/AH/164 Shane Hispanic Male TRUE 177.03 74.04 1972-02-18 Florida None 2 FALSE 0.15 2016-02-25 37 | AC/AH/171 Devin White Female No 163.35 70.46 1973-04-16 California Bird 3 TRUE 1.68 2016-03-25 38 | AC/AH/176 Jerry Asian Male FALSE 175.21 83.65 1973-05-01 Virginia Dog 3 TRUE 0.41 2016-03-25 39 | AC/AH/180 Drew White Female FALSE 160.8 64.77 1973-02-18 Oregon CAT 1 TRUE -2.17 2016-03-25 40 | AC/AH/185 Ronald White Male FALSE 166.46 76.83 1972-08-17 Colorado None 99 TRUE -1.32 2016-03-25 41 | AC/AH/186 CHRISTOPHER White Female FALSE 157.95 67.41 1972-05-06 New Jersey Dog 3 TRUE -1.89 
2016-03-25 42 | AC/AH/192 Dominique White Male FALSE 180.61 83.59 1972-03-24 Michigan None 3 TRUE -0.6 2016-03-25 43 | AC/AH/198 Van White Female FALSE 159.52 67.99 1972-12-02 Missouri CAT 2 FALSE -0.4 2016-03-25 44 | AC/AH/207 Bobbie White Female FALSE 163.01 65.19 1973-05-17 Florida Dog 2 FALSE 1.07 2016-03-25 45 | AC/AH/208 Lawrence Hispanic Female FALSE 165.8 71.77 1973-08-07 Louisiana None 1 FALSE -1.21 2016-03-25 46 | AC/AH/210 Keith Hispanic Female TRUE 170.03 66.68 1972-08-28 New York Dog 99 FALSE -0.66 2016-03-25 47 | AC/AH/211 SON White Female FALSE 157.16 69.64 1973-07-14 California Cat 2 TRUE -0.7 2016-04-25 48 | AC/AH/213 Charlie White Female Yes 164.58 72.99 1972-01-25 Louisiana Dog 1 FALSE -0.32 2016-04-25 49 | AC/AH/219 JAY White Female FALSE 163.47 72.89 1972-04-07 North Caroline Bird 1 TRUE 0.33 2016-04-25 50 | AC/AH/220 Richard White Male FALSE 185.43 87.23 1973-07-13 Florida Cat 1 FALSE 1.29 2016-04-25 51 | AC/AH/221 Carlos White Female FALSE 165.34 70.84 1972-02-01 Michigan Dog 99 TRUE 1.69 2016-04-25 52 | AC/AH/225 Gail White Female FALSE 163.45 67.67 1972-10-21 Michigan Cat 2 FALSE -0.74 2016-04-25 53 | AC/AH/233 MARION White Female FALSE 163.97 66.71 1971-12-23 Ohio Cat 3 TRUE -1.07 2016-04-25 54 | AC/AH/241 Lindsay White Female FALSE 161.38 73.55 1972-02-08 Florida Cat 3 FALSE -0.82 2016-05-25 55 | AC/AH/244 Sean White Female FALSE 160.09 65.93 1973-01-25 Maryland None 99 TRUE 0.5 2016-05-25 56 | AC/AH/248 Andrea White Male FALSE 178.64 97.05 1973-01-12 Indiana Cat 1 TRUE -1.21 2016-05-25 57 | AC/AH/249 JESUS Hispanic Female TRUE 159.78 68.31 1972-04-23 Alabama Cat 2 TRUE 0.52 2016-05-25 58 | AC/SG/002 JAN White Female TRUE 161.57 67.92 1973-07-03 Arizona Dog 3 FALSE 0.33 2016-05-25 59 | AC/SG/003 Walter White Female No 161.83 66.03 1972-07-11 Oregon None 2 TRUE -0.9 2016-05-25 60 | AC/SG/008 Dana White Male Yes 169.66 77.3 1973-05-26 Nevada Dog 1 TRUE -2.36 2016-05-25 61 | AC/SG/009 SAMMY White Male FALSE 166.84 88.25 1972-03-04 Vermont 
Dog 1 FALSE 1.51 2016-06-25 62 | AC/SG/010 THEO Asian Female FALSE 159.32 64.92 1973-01-29 New York Cat 2 TRUE 0.71 2016-06-25 63 | AC/SG/015 Shaun White Male Yes 170.51 84.35 1972-11-09 New Jersey DOG 3 TRUE 0.12 2016-06-25 64 | AC/SG/016 Jimmie Black Female FALSE 161.84 69.97 1972-04-03 Arizona Cat 3 TRUE -0.57 2016-06-25 65 | AC/SG/046 Carl Hispanic Male FALSE 171.41 81.7 1973-08-05 Mississippi Bird 2 TRUE 0.43 2016-06-25 66 | AC/SG/055 EVAN White Male FALSE 166.75 79.06 1972-02-24 Illinois Bird 3 TRUE -0.55 2016-07-25 67 | AC/SG/056 Merrill Asian Female Yes 166.19 67.46 1972-11-27 Indiana NULL 3 TRUE -1.06 2016-07-25 68 | AC/SG/064 Jon White Male FALSE 169.16 90.08 1972-10-04 Illinois Cat 2 TRUE -0.1 2016-07-25 69 | AC/SG/065 Shayne White Female FALSE 157.01 66.56 1972-04-05 California Dog 3 TRUE 0.64 2016-07-25 70 | AC/SG/067 THOMAS White Male FALSE 167.51 84.15 1972-07-19 Pennsylvania Bird 2 TRUE -1.46 2016-07-25 71 | AC/SG/068 VALENTINE Hispanic Female FALSE 160.47 68.2 1972-04-15 Tennessee Cat 3 TRUE 0.83 2016-07-25 72 | AC/SG/072 Cameron Black Female TRUE 162.33 66.47 1972-02-15 California NULL 3 FALSE 1.02 2016-07-25 73 | AC/SG/074 Eddie Hispanic Male FALSE 175.67 88.82 1973-10-04 Georgia None 99 FALSE -3.14 2016-07-25 74 | AC/SG/084 Brian Hispanic Male FALSE 174.25 80.93 1972-03-06 Virginia DOG 2 TRUE 1.79 2016-07-25 75 | AC/SG/095 Matthew White Female FALSE 158.94 65.14 1973-05-26 Hawaii Dog 1 FALSE 0.47 2016-07-25 76 | AC/SG/099 Leslie Asian Male FALSE 172.72 67.62 1972-02-04 Ohio Cat 1 FALSE 0.05 2016-07-25 77 | AC/SG/101 Jason White Female FALSE 159.23 69.96 1973-09-28 Michigan Dog 2 TRUE -0.81 2016-07-25 78 | AC/SG/107 Sol White Male FALSE 176.54 90.76 1973-01-28 Hawaii None 3 FALSE -1.48 2016-08-25 79 | AC/SG/116 Connie Black Male FALSE 184.34 90.41 1972-06-05 Florida None 3 TRUE -1.41 2016-08-25 80 | AC/SG/121 Rudy White Female FALSE 163.94 71.47 1973-03-12 Michigan Cat 3 FALSE -0.25 2016-08-25 81 | AC/SG/122 Michal Hispanic Female FALSE 160.09 
68.94 1971-12-16 South Carolina DOG 1 FALSE -0.99 2016-08-25 82 | AC/SG/123 DARNELL White Female TRUE 162.32 72.72 1972-09-03 North Carolina Bird 1 TRUE -0.92 2016-08-25 83 | AC/SG/134 Daryl White Female TRUE 162.59 69.76 1972-05-28 Texas CAT 2 TRUE -0.28 2016-08-25 84 | AC/SG/139 Jordan White Male FALSE 171.94 82.11 1973-10-06 Michigan None 1 FALSE -1.18 2016-08-25 85 | AC/SG/142 Kenneth White Female FALSE 158.07 69.8 1972-05-15 Kansas Dog 3 FALSE 0.54 2016-08-25 86 | AC/SG/155 Raymond White Female FALSE 158.35 69.72 1972-06-02 California Cat 3 TRUE 1.42 2016-08-25 87 | AC/SG/165 Elmer White Female FALSE 162.18 67.81 1972-03-25 Washington Bird 1 TRUE 1.68 2016-08-25 88 | AC/SG/167 Jimmy White Female FALSE 159.38 70.37 1973-09-30 Washington None 2 TRUE 0.02 2016-09-25 89 | AC/SG/172 Whitney White Male No 171.45 84.29 1972-02-25 Florida Dog 2 TRUE -0.19 2016-09-25 90 | AC/SG/173 BRITT White Female TRUE 163.17 64.47 1973-06-19 California NONE 2 FALSE 0.69 2016-09-25 91 | AC/SG/179 Logan White Male FALSE 183.1 82.47 1972-10-24 Ohio Dog 3 TRUE 0.77 2016-09-25 92 | AC/SG/181 Terry Hispanic Male FALSE 177.14 88.7 1971-11-24 Indiana CAT 3 TRUE 1.76 2016-09-25 93 | AC/SG/182 Jamie Hispanic Male TRUE 171.08 72.51 1973-03-25 Louisiana None 3 TRUE -0.22 2016-09-25 94 | AC/SG/191 Lacy Hispanic Female FALSE 159.33 70.68 1973-06-21 Texas None 3 TRUE -1.07 2016-09-25 95 | AC/SG/193 Ronnie White Male TRUE 185.43 73.63 1973-06-05 Iowa Dog 3 FALSE -0.29 2016-09-25 96 | AC/SG/194 Joseph White Female FALSE 162.65 73.99 1972-08-04 Maryland CAT 3 FALSE 0.87 2016-10-25 97 | AC/SG/197 Stacy White Female FALSE 159.44 66.21 1972-11-08 New York Cat 1 TRUE 1.21 2016-10-25 98 | AC/SG/204 Anthony White Female FALSE 164.11 70.66 1972-06-17 California Dog 3 FALSE -0.17 2016-10-25 99 | AC/SG/216 Alva White Female FALSE 159.13 66.96 1972-06-19 Alabama None 1 TRUE 0.95 2016-10-25 100 | AC/SG/217 Dean White Female FALSE 160.58 71.49 1972-11-11 Ohio None 1 TRUE -0.78 2016-10-25 101 | AC/SG/234 Luis 
Hispanic Female FALSE 164.88 68.07 1971-11-10 Pennsylvania Cat 3 TRUE 0.35 2016-10-25 -------------------------------------------------------------------------------- /pre_may_2023_slides/Source_docs/DMP_model_answers.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/Source_docs/DMP_model_answers.docx -------------------------------------------------------------------------------- /pre_may_2023_slides/Source_docs/data_formatting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/Source_docs/data_formatting.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/Source_docs/data_management.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/Source_docs/data_management.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/Source_docs/data_sharing_backup.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/Source_docs/data_sharing_backup.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/Source_docs/file_management.pptx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/Source_docs/file_management.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/data_formatting.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/data_formatting.pdf -------------------------------------------------------------------------------- /pre_may_2023_slides/data_formatting.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/data_formatting.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/data_management.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/data_management.pdf -------------------------------------------------------------------------------- /pre_may_2023_slides/data_management.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/data_management.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/file_management.pdf: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/file_management.pdf -------------------------------------------------------------------------------- /pre_may_2023_slides/file_management.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/file_management.pptx -------------------------------------------------------------------------------- /pre_may_2023_slides/file_management_2021Mar19_share_v02.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/file_management_2021Mar19_share_v02.pdf -------------------------------------------------------------------------------- /pre_may_2023_slides/file_management_2021Mar19_share_v04.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/file_management_2021Mar19_share_v04.pdf -------------------------------------------------------------------------------- /pre_may_2023_slides/file_management_2022Nov7_share.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/pre_may_2023_slides/file_management_2022Nov7_share.pdf -------------------------------------------------------------------------------- /refine_demo.Rmd: 
--------------------------------------------------------------------------------
1 | ---
2 | title: "Demo of Open Refine"
3 | author: "Mark Dunning"
4 | output:
5 |   pdf_document: default
6 |   html_document: default
7 | ---
8 |
9 | ```{r setup, include=FALSE}
10 | knitr::opts_chunk$set(echo = TRUE)
11 | ```
12 |
13 | ![](images/refine.png)
14 |
15 | # Open Refine demo
16 |
17 | (adapted from [Data Carpentry materials](http://lgatto.github.io/OpenRefine-ecology/00-getting-started.html))
18 |
19 | Open Refine (previously Google Refine) is an open-source tool that can help you to clean up messy datasets. It presents itself as a spreadsheet-like interface, but all operations we do to the data are recorded and can be repeated or reversed. We will show how it can be used to solve some of the issues we have highlighted previously. You can use Open Refine to build up a data-cleaning pipeline which you can apply to multiple files. We will not go that far today though. There are some nice introductory videos
20 |
21 |
22 |
23 |
24 |
25 | Open Refine runs in a web browser, although you do not have to be online to use it.
26 |
27 | ## Some example data
28 |
29 | We will use some data that have been simulated to demonstrate many of the problems we have seen already. Each row represents a different patient in a fictitious study. The file can be downloaded from the [course website](https://raw.githubusercontent.com/bioinformatics-core-shared-training/avoid-data-disaster/master/patient-data.txt). (Right-click and `Save Link As...`)
30 |
31 | ## Importing the data
32 |
33 | Start the program. On Windows, double-click on the openrefine.exe file. Java services will start on your machine, and Refine will open in your Firefox browser. On the Mac, you've probably installed the package into your Applications folder.
34 |
35 | Note the file types Open Refine handles: TSV, CSV, *SV, Excel (.xls .xlsx), JSON, XML, RDF as XML, Google Data documents.
Support for other formats can be added with Open Refine extensions.
36 |
37 | Once Refine is open, you’ll be asked if you want to Create, Open, or Import a Project.
38 |
39 | - Click ***Browse***, find `patient-data.txt`
40 | - Click ***next*** to open `patient-data.txt`
41 | Refine gives you a preview - a chance to show you it understood the file. If, for example, your file was really comma-separated, the preview might look strange; you would then choose the correct separator in the box shown and click “update preview.”
42 | - You should see something like...
43 |
44 | ![](images/refine-import.png)
45 |
46 | If all looks well, click ***Create Project***.
47 |
48 | ## Faceting
49 |
50 | *Faceting* gives you a snapshot of the entries in a particular column and allows you to filter down to particular rows. It can also quickly highlight problems with the data.
51 |
52 | Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of that column, and allows you to filter to some subset of rows whose cells in that column satisfy some constraint. That’s a bit abstract, so let’s jump into some examples. Before we start, ***how many different entries would we expect to find in a column that is supposed to be just `Male` or `Female`?***
53 |
54 |
55 | - Scroll over to the `Sex` column
56 | - Click the down arrow and choose ***Facet*** -> ***Text facet***
57 | - In the left margin, you’ll see a box containing every distinct value in the `Sex` column. Refine shows you how many times each value occurs in the column (a count), and allows you to sort (order) your facets by name or count.
58 |
59 | In this case, we have found ***6*** different ways for Male or Female to be entered.
60 |
61 |
62 | Edit. Note that at any time, in any cell of the Facet box, or data cell in the Refine window, you have access to edit and can fix an error immediately.
Refine will even ask you if you’d like to make that same correction to every value it finds like that one (or not).
63 |
64 | ## Trimming whitespace
65 |
66 | Leading or trailing *whitespace* is a blank space at the beginning, or end, of a text entry. It can be difficult to spot by eye, and to the computer ` Male` and `Male` are completely distinct entries. This can have undesired consequences in a data analysis.
67 |
68 | ```{r echo=FALSE}
69 | library(stringr)
70 | patients <- read.delim("patient-data.txt")
71 | ht <- as.numeric(str_replace_all(patients$Height,"cm", ""))
72 | boxplot(ht ~ patients$Sex,las=2,ylab="Height")
73 | ```
74 |
75 | Fortunately, Open Refine has a straightforward solution to this problem:
76 |
77 | - Select the `Sex` column
78 | - Select ***Edit cells***
79 | - ***Common transforms*** -> ***Trim trailing and leading whitespace***
80 |
81 | Now try the text facet operation from above. What do you notice?
82 |
83 | ## Clustering
84 |
85 | *Clustering* in Open Refine is used to identify and consolidate similar entries into a consistent term. Let's try this on the `Pet` column. The first thing we will notice is that `Cat` and `CAT` are distinct entries but probably shouldn't be. We can fix that by clustering.
86 |
87 | - Select the `Pet` column
88 | - ***Facet*** -> ***Text Facet*** as before
89 | - Now, click the ***cluster*** button in the top-right of the panel that shows you all the different values in this column.
90 | - It will now suggest various groups of entries that it thinks represent the same thing. You can click the checkbox to merge into a single group and have the ability to choose the new name.
91 | - Clicking ***Merge Selected and Close*** will perform the operation.
92 |
93 | ## Bulk-editing
94 |
95 | Staying with the `Pet` column, missing data are also represented inconsistently, with `NA`, `None` or `NULL` used.
Languages such as R would prefer `NA` to be used, although in practice we can use any as long as we are consistent.
96 |
97 | - Click on `None` in the Facet panel. Only rows where the value of `Pet` is `None` will be shown.
98 | - Click the ***edit*** button next to the `None` value in any particular row. This will give you the chance to edit the value.
99 | - Change the value to `NA`. Clicking ***Apply to all identical cells*** will change all occurrences of `None` to `NA`.
100 | - You could also try changing `NULL`...
101 |
102 | ## Splitting into several columns
103 |
104 | Sometimes multiple pieces of information can be encoded in a single cell. In our particular case, the ID assigned to each patient contains a hospital identifier (either `AH` or `SG`) and a numeric ID. For some analyses we might want to quickly perform operations that take the hospital as a factor.
105 |
106 | - Select the `ID` column
107 | - ***Edit column*** -> ***Split into several columns***
108 | - It will ask you what text character splits the IDs into different parts. In this case we specify `/`
109 | - Each new column is assigned a new name automatically. You can change the names by ***Edit column*** -> ***Rename this column***
110 |
111 | ## Filling missing values
112 |
113 | The final column `Date Entered Study` was used to indicate the date at which each patient was enrolled onto the study in question. Patients were enrolled in batches. However, the person filling out the form thought it was helpful to include this information only once for each batch of new patients.
114 |
115 | - Select the `Date Entered Study` column
116 | - Select ***Edit cells*** -> ***Fill down***
117 | - Empty cells will now be filled with the appropriate date.
118 |
119 | ## Upper- and lower-case transformations
120 |
121 | For consistency, we might want the text entries in a particular column to be all `lower` or `UPPER` case.
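
For readers who later want to script the same idea, here is a minimal sketch in base R on made-up names (not the course data). `to_title` is a hypothetical helper written for this illustration, not an Open Refine or R built-in, and it assumes single-word entries.

```r
# Made-up name entries with inconsistent capitalisation (not the course data)
name <- c("KYLE", "dong", "Lupe")

name_upper <- toupper(name)  # all upper case
name_lower <- tolower(name)  # all lower case

# Hypothetical helper: first letter upper case, remainder lower case.
# Assumes single-word entries.
to_title <- function(x) {
  paste0(toupper(substring(x, 1, 1)), tolower(substring(x, 2)))
}

name_title <- to_title(name)
print(name_title)
```

Open Refine offers the same transformations through its menus, so no code is needed during the demo.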

- Select the `Name` column
- ***Edit cells*** -> ***Common transforms***
- You can choose `To uppercase` or `To lowercase` if required.
- However, we will use the operation `To titlecase`. It makes the first letter of each word *upper case* and the rest of the text *lower case*.

## More-advanced text operations

Open Refine has its own language, the General Refine Expression Language (GREL), for performing custom text operations in a column.

The `Height` and `Weight` columns are problematic because they contain the units information (`cm` and `kg` respectively). Languages such as R will interpret the values in such a column as text, and not numeric data. Simple plotting and numeric analysis will not be possible without extra manipulation.

- Select the `Height` column
- Select ***Edit cells*** -> ***Transform...***
- In the Expression box, enter `replace(value, "cm","")`


## Things to try

- Can you split the `Birth` column into Year, Month and Day?
- Can you make the `Smokes` column suitable for analysis?
- Tidy up the `Weight` column for analysis
- The `Race` column contains one value that is very suspicious. Can you find it and change it to something suitable?
- Look at the `State` column and try faceting and clustering. Are there any entries that should be joined into one? You may need to experiment with different clustering methods.


## Exporting your data / project

You can export the modified table into a new file:

- ***Export*** -> ***Tab-separated value*** or ***Export*** -> ***Comma-separated value*** would seem to be sensible choices.


# Impact on analysis

Let's suppose we want to look at the difference in weight between males and females in the study.
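
Why the cleaning steps matter for such a comparison can be sketched with a few made-up values. The vectors below are hypothetical and are not taken from the patient data; the sketch uses base R only and assumes nothing beyond the two problems described earlier (stray whitespace and unit suffixes).

```r
# Hypothetical values for illustration only (not from the workshop file)
sex <- c("Male", " Male", "Female", "Female ")
weight <- c("76.2kg", "81.0kg", "64.5kg", "70.1kg")

# Labels that differ only by whitespace are counted as separate groups
table(sex)

# trimws() removes leading and trailing whitespace, leaving two groups
sex_clean <- trimws(sex)
table(sex_clean)

# Remove the unit suffix so the values can be converted to numbers
weight_clean <- as.numeric(sub("kg", "", weight, fixed = TRUE))

# Mean weight per group, now that both columns are clean
means <- tapply(weight_clean, sex_clean, mean)
means
```

Without the trimming step, a plot or summary by `sex` would show four categories rather than two, which is the kind of artefact the cleaned file avoids.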

```{r}
# Data cleaned in Open Refine: Weight is numeric and Sex has no stray whitespace
patients <- read.delim("patient_data_cleaned.tsv")
boxplot(patients$Weight ~ patients$Sex)
```

```{r}
library(stringr)
# The same clean-up done in R on the raw file
patients <- read.delim("patient_data.txt")
# Remove the "kg" suffix so that Weight can be converted to numeric
patients$Weight <- as.numeric(str_replace_all(patients$Weight, "kg", ""))
# Trim leading and trailing whitespace from Sex
patients$Sex <- str_trim(patients$Sex)
boxplot(patients$Weight ~ patients$Sex)
```

--------------------------------------------------------------------------------
/refine_demo.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/bioinformatics-core-shared-training/Managing-your-research-data/04c8a04fa1781fff4921c1a3fe6d0a71d0e23263/refine_demo.pdf
--------------------------------------------------------------------------------