├── .gitignore ├── CONTRIBUTING.md ├── LICENSE.md ├── README.md ├── assets ├── README.md ├── docs │ ├── blog.md │ ├── media.md │ └── slides.md └── images │ ├── highlighted_facts.png │ └── software │ ├── ami │ ├── ami2-species-genus-results.png │ └── ami2-species-genus.png │ ├── getpapers │ ├── getpapers-complex1.png │ ├── getpapers-folder.png │ ├── getpapers-inspectctree1.png │ ├── getpapers-inspectctree2.png │ ├── getpapers-inspecturlstxt.png │ ├── getpapers-metadataonly.png │ ├── getpapers-noexecute.png │ ├── getpapers-simplequery.png │ └── getpapers-simpleresult.png │ ├── journal-scrapers │ ├── 001.png │ ├── 002.png │ ├── 003.png │ ├── 004.png │ ├── 005.png │ ├── 006.png │ ├── 007.png │ ├── 008.png │ ├── 009.png │ ├── 010.png │ ├── 011.png │ └── 012.png │ ├── norma │ ├── cprojectwithshtml.png │ ├── html2shtml.graphml │ ├── html2shtml.png │ ├── nlm2htmlwithwarnings.png │ ├── normahtml2shtml0.dot │ ├── normahtml2shtml0.png │ ├── normasmall0.dot │ ├── normasmall0.png │ ├── normaxml2shtml0.dot │ ├── normaxml2shtml0.png │ ├── pdf2txt.graphml │ ├── pdf2txt.png │ ├── xml2shtml.graphml │ └── xml2shtml.png │ ├── quickscrape │ └── quickscrape-url.png │ ├── shell │ ├── cat.png │ ├── cd.png │ ├── cp.png │ ├── head.png │ ├── ls.png │ ├── mkdir.png │ ├── mv.png │ ├── rm.png │ ├── tail.png │ ├── tree.png │ └── wc.png │ └── vms │ ├── desktop.png │ ├── memorysettings.png │ ├── starting-vm.png │ ├── terminal.png │ ├── vm-crunchbang.png │ ├── vm-error.png │ ├── vm-icon.png │ ├── vm-installer.png │ ├── vm-popup.png │ ├── vm-poweredoff.png │ ├── vm-running.png │ ├── vm-startscreen.png │ └── vm-virtualbox-download.png ├── software-tutorials ├── README.md ├── ami │ └── README.md ├── canary │ └── README.md ├── cat │ └── README.md ├── cproject │ └── README.md ├── getpapers │ ├── README.md │ ├── getpapers-arxiv-queries.md │ ├── getpapers-eupmc-queries.md │ └── getpapers-ieee-queries.md ├── installation │ └── README.md ├── journal-scrapers │ └── README.md ├── norma │ ├── README.md │ └── 
notes.txt ├── quickscrape │ └── README.md ├── sHTML │ └── README.md ├── shell │ └── README.md └── vms │ └── README.md ├── training-guidelines ├── README.md ├── evaluation-assessment.md ├── how-to-setup-a-training.md ├── session-formats.md └── workflow.md └── training-modules ├── A-About-ContentMine ├── README.md ├── about-contentmine.odp ├── about-contentmine.pdf ├── bubbles │ ├── README.md │ ├── bubbles.png │ ├── bubbles0.png │ ├── bubbles1.png │ ├── bubbles3.png │ ├── bubbles4.png │ └── bubbles5.png ├── chemistry │ ├── README.md │ ├── chemicaltagger0.png │ ├── chemicaltagger1.png │ ├── chemicaltagger2.png │ ├── chemicaltagger3.png │ └── chemicaltagger4.png ├── fotd │ ├── README.md │ ├── fotd.png │ └── fotd1.png └── iucn │ ├── README.md │ ├── iucn1.png │ ├── iucn2.png │ ├── iucn3.png │ ├── iucn4.png │ ├── iucn5.png │ └── iucn6.png ├── A-CM-Facts ├── README.md ├── contentmine-facts.odp ├── contentmine-facts.pdf ├── iucn_walkthrough.md └── manual_markup.pdf ├── A-Canary └── README.md ├── A-Legal-Responsible ├── README.md ├── jisctdm.pdf ├── jisctdm.pptx ├── legal-responsible.odp ├── legal-responsible.pdf ├── legal_true_false.pdf └── true_false_answers.pdf ├── A-Participate-Contribute └── README.md ├── B-Architecture ├── README.md ├── contentmine-architecture.odp └── contentmine-architecture.pdf ├── B-Building-corpus └── README.md ├── B-Fact-Extraction └── README.md ├── B-Normalization └── README.md ├── B-Scraping ├── README.md ├── scraping.odp └── scraping.pdf ├── B-VM-Commandline └── README.md ├── B-Working-with-Facts └── README.md ├── C-Own-Usecase └── README.md ├── C-Regex └── README.md ├── C-Writing-scrapers └── README.md └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | notes.md 2 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # How to contribute 2 | 3 | :+1: 
We're happy you're thinking about contributing to ContentMine! The material offered here is subject to continuous change and improvement, so please check again regularly. We're also very happy to include your feedback, suggestions and contributions. 4 | 5 | There are many ways to contribute: 6 | - by reporting an issue regarding the training material 7 | - by suggesting new content 8 | - by submitting pull requests and closing issues 9 | - by writing, blogging or tweeting about the project 10 | - by translating the materials 11 | 12 | ### How to set up your development environment 13 | 14 | The ContentMine software is built on two platforms: Node.js for getpapers and quickscrape, and Java for norma/ami-plugins. An installation and environment guide is [here](http://contentmine.github.io/). 15 | 16 | ### How to report a bug 17 | 18 | If you encounter a bug, please let us know. You can raise a new issue for each part of the software in the respective repositories: 19 | 20 | * [getpapers](https://github.com/ContentMine/getpapers/issues) 21 | * [quickscrape](https://github.com/ContentMine/quickscrape/issues) 22 | * [norma](https://github.com/ContentMine/norma/issues) 23 | * [ami-plugin](https://github.com/ContentMine/ami-plugin/issues) 24 | 25 | Please include as much information in your report as possible, to help maintainers reproduce the problem. 26 | 27 | * Use a clear and descriptive title. 28 | * Describe the exact steps which reproduce the problem, e.g. the commands you entered and the file types you were operating on. 29 | * Describe the software behaviour following those steps, and where the problem occurred. 30 | * Explain how it differed from what you expected to happen. 31 | * Attach additional information to the report, such as error messages or corrupted files. 32 | * Add a `bug` label to the issue. 33 | 34 | Before submitting a bug, please check the list of existing issues to see whether a similar one is already open. 
You can then help by adding your information to an existing report. 35 | 36 | ### How to request an enhancement 37 | 38 | There is always room for improvement and we'd like to hear your perspective on it. Before creating a pull request, please raise an issue to discuss the proposed changes first. That way we can make the best use of your efforts. 39 | 40 | If you want to discuss an idea in general, you can also start a new topic on [discuss](http://discuss.contentmine.org/). 41 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). 2 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ![ContentMine logo](https://github.com/ContentMine/assets/blob/master/png/Content_mine(small).png) 2 | 3 | 4 | # ContentMine training material 5 | 6 | This repository contains material that helps you learn content mining yourself and train others to do so. It includes tutorials for working with the ContentMine toolchain on your own PC or for using web services offered by ContentMine. 7 | 8 | 9 | # Table of Contents 10 | 11 | 1. [Purpose of this repository](#purpose-of-this-repository) 12 | 1. [For whom is this repository](#for-whom-is-this-repository) 13 | 1. [Description of content](#description-of-content) 14 | 1. [Contribute](#contribute) 15 | 1. [Licence](#licence) 16 | 17 | ## Purpose of this repository 18 | 19 | * This repository helps facilitators and organizers work out a robust, workable schedule built around a fitting use case and its accompanying narrative. 
20 | * The goal of the workshop design must always be to identify and fulfill the requirements and expectations of the audience/participants. 21 | 22 | 23 | ## For whom is this repository? 24 | 25 | * Learners: A large part of this repository is devoted to software tutorials, which can be used in the workshops but also for self-guided learning. 26 | * Organizers and Facilitators: The guidelines and modules help you set up a working workshop framework and develop its content. 27 | 28 | 29 | ## Description of content 30 | 31 | There are three main sections in this repository: 32 | 33 | * [software tutorials](software-tutorials): The basic usage of ContentMine tools can be learned step-by-step with the help of the tutorials. They cover installation and describe the functionality, what results to expect, and how to link the different elements of the content mining pipeline. The tutorials can be used in workshops as well as for self-guided learning. 34 | 35 | * [training guidelines](training-guidelines): This folder contains organizer and facilitator resources for the organisation of a ContentMine workshop. It describes the facilitator and organizer workflow: which steps to perform when, and which dependencies exist. A checklist helps facilitators and organizers prepare their sessions. Additionally, it contains a description of the teaching methods used in a workshop as well as links to helpful external teaching resources. 36 | 37 | * [training modules](training-modules): This folder contains learning modules which can be combined as needed by workshop facilitators. 38 | Each module contains a description of learning goals, links to additional resources like slides or software tutorials, and a description of the course of action. 39 | 40 | ## Contribute 41 | 42 | There are many ways to contribute; please have a closer look [here](CONTRIBUTING.md). 
43 | 44 | You can find us online at: 45 | - [contentmine.org](http://contentmine.org) 46 | - [@thecontentmine](http://twitter.com/thecontentmine) 47 | - training ett contentmine dot org 48 | 49 | If you have questions or suggestions on specific parts of the training material, you can also contact us directly via mail (mail ett stefankasberger dot at, web ett christopherkittel dot eu). 50 | 51 | ## Licence 52 | 53 | This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/). 54 | -------------------------------------------------------------------------------- /assets/README.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | == additional assets == 4 | === Zotero === 5 | about 50 links in: 6 | https://www.zotero.org/groups/contentmine/items 7 | 8 | 9 | 10 | 11 | -------------------------------------------------------------------------------- /assets/docs/blog.md: -------------------------------------------------------------------------------- 1 | 2 | - [The Hague Declaration on Knowledge Discovery in the Digital Age](http://contentmine.org/2015/05/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/): About the release of the Hague Declaration 3 | - [Wellcome Trust Workshop – 2 Days of Hacking, Networking & Collaboration](http://contentmine.org/2015/04/wellcome-trust-workshop-after-the-event/): Sum up of workshop 4 | - [Issues in electronic theses and open research data](http://contentmine.org/2015/04/issues-in-electronic-theses-and-open-research-data/): Sum Up of workshop 5 | - [Wellcome Trust Workshop – Before the Event](http://contentmine.org/2015/04/wellcome-trust-workshop-before-the-event/): invitation with anticipated outcome 6 | - [Guidelines for collaboration requests](http://contentmine.org/2015/03/guidelines-for-collaboration-requests/): are of interest regarding FAQs 7 | - [Clinical Trials Workshop, UK Cochrane 
Centre](http://contentmine.org/2015/03/clinical-trials-workshop-uk-cochrane-centre/): Sum up of workshop. Interesting regarding FAQs, Workshop Agenda 8 | - [Clinical Trials Content Mining in Oxford is happening](http://contentmine.org/2015/03/clinical-trials-content-mining-in-oxford-is-happening/): Sum up with agenda 9 | - [Explaining the difference between getpapers and quickscrape](http://contentmine.org/2015/07/explaining-the-difference-between-getpapers-and-quickscrape/): Comparison of getpapers and quickscrape 10 | - [Content Mining for Science at OAI9](http://contentmine.org/2015/06/content-mining-for-science-at-oai9/): Sum up of content from poster presentation. The whole pipeline gets mentioned 11 | - [The importance of Figures and Captions](http://contentmine.org/2015/06/the-importance-of-figures-and-captions/): discussing classification of figures and their role in publications 12 | - [LIBER Response to STM Statement on Text and Data Mining](http://contentmine.org/2015/06/liber-response-to-stm-statement-on-text-and-data-mining/): legal material 13 | - [#MozSprint – June 4th – 5th, 2015](http://contentmine.org/2015/06/mozsprint-june-4th-5th-2015/): Sum up of the sprint 14 | - [“We think ContentMine has the potential to transform the way we work!”](http://contentmine.org/2015/06/we-think-it-contentmine-has-the-potential-to-transform-the-way-we-work/): states the problems in neuroscience (Edinburgh) 15 | - []() 16 | - []() 17 | - []() 18 | - []() 19 | - []() 20 | - []() 21 | - []() 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /assets/docs/media.md: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | - [TheContentMine II](https://vimeo.com/110908526) 5 | - [ContentMining Hackday in Cambridge facilitated by ContentMine](https://www.youtube.com/watch?v=YWG57l-v_g4) 6 | - [Wellcome Trust, 13 - 14th April 2015](https://www.youtube.com/watch?v=rnKUkrIOF4k) 7 | - 
[Group discussion f/t CharlesOppenhein/Richard Danbury on Text and Data Mining](https://www.youtube.com/watch?v=MDb_8kWeJak) 8 | - [Group discussion f/t CharlesOppenhein/Yvonne Nobis on Text and Data Mining](https://www.youtube.com/watch?v=xQ2FLJwkEIk) 9 | - [Right to read, right to mine | Jenny Molloy](https://www.youtube.com/watch?v=QdEfGx5VAso) 10 | - [Content Mining of the bioscience literature](https://www.youtube.com/watch?v=G9LePsd9R9A) 11 | - []() 12 | 13 | 14 | 15 | 16 | where is the poster? 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | -------------------------------------------------------------------------------- /assets/docs/slides.md: -------------------------------------------------------------------------------- 1 | 2 | - [Architecture of ContentMine Components contentmine.org](http://www.slideshare.net/petermurrayrust/architecture-of) 3 | - [Automatic Extraction of Science and Medicine from the scholarly literature](http://www.slideshare.net/petermurrayrust/automatic-extraction-of-science-and-medicine-from-the-scholarly-literature) 4 | - [ContentMine Architecture](http://www.slideshare.net/petermurrayrust/contentmine-architecture) 5 | - [ContentMining for Synthetic Biology](http://www.slideshare.net/petermurrayrust/contentmining-for-synthetic-biology) 6 | - [Content Mining at Wellcome Trust](http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust) 7 | - [ContentMine: Liberating scholarship from Open publications and theses](http://www.slideshare.net/petermurrayrust/contentmine-liberating-scholarship-from-open-publications-and-theses) 8 | - [ContentMining and Clinical Trials](http://www.slideshare.net/petermurrayrust/contentmining-and-clinical-trials) 9 | - [Copyright Reform and Open Data](http://www.slideshare.net/petermurrayrust/copyright-reform-and-open-data) 10 | - [Content Mining for Machines and Humans](http://www.slideshare.net/petermurrayrust/content-mining-for-machines-and-humans) 11 | - 
[Petermrjisc20141201](http://www.slideshare.net/petermurrayrust/petermrjisc20141201) 12 | - [Petermrbl20141127](http://www.slideshare.net/petermurrayrust/petermrbl20141127) 13 | - [Embrace the Open Revolution](http://www.slideshare.net/petermurrayrust/embrace-the-open-revolution) 14 | - [Contentmineatopencon2](http://www.slideshare.net/petermurrayrust/contentmineatopencon2) 15 | - [Disruptive Communities and Technology](http://www.slideshare.net/petermurrayrust/opencon1) 16 | - [ContentMine: Open Data and Social Machines](http://www.slideshare.net/petermurrayrust/contentmine-open-data-and-social-machines) 17 | - [Mining Scientific Images](http://www.slideshare.net/petermurrayrust/mining-scientific-images) 18 | - [ContentMine and WikiData](http://www.slideshare.net/petermurrayrust/contentmine-and-wikidata) 19 | - [Making Theses USEFUL](http://www.slideshare.net/petermurrayrust/making-theses-useful) 20 | - [Csvconf](http://www.slideshare.net/petermurrayrust/csvconf) 21 | - [The Content Mine (presented at UKSG)](http://www.slideshare.net/petermurrayrust/the-content-mine-presented-at-uksg) 22 | - [The Value and Benefits of Open Access to Research Literature](http://www.slideshare.net/rossmounce/the-value-and-benefits-of-open-access-to-research-literature) 23 | - [Content Mining](http://www.slideshare.net/rossmounce/content-mining) 24 | - [ContentMine at EuropePMC AGM](http://www.slideshare.net/JennyMolloy/j-molloy-europe-pmc-slides) 25 | - [SciDataCon 2014 TDM Workshop Intro Slides](http://www.slideshare.net/JennyMolloy/scidaintro-slides) 26 | - [Legal Framework for TDM](http://www.slideshare.net/JennyMolloy/contentmine) 27 | - [ContentMine Presentation for WHO Health Data Seminar](http://www.slideshare.net/JennyMolloy/contentmine-presentation-for-who) 28 | - [ContentMine (EMBL-EBI Industry Programme)](http://www.slideshare.net/JennyMolloy/contentmine-49511121) 29 | - []() 30 | - []() 31 | 32 | 33 | 34 | 
-------------------------------------------------------------------------------- /assets/images/highlighted_facts.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/highlighted_facts.png -------------------------------------------------------------------------------- /assets/images/software/ami/ami2-species-genus-results.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/ami/ami2-species-genus-results.png -------------------------------------------------------------------------------- /assets/images/software/ami/ami2-species-genus.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/ami/ami2-species-genus.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-complex1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-complex1.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-folder.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-folder.png -------------------------------------------------------------------------------- 
/assets/images/software/getpapers/getpapers-inspectctree1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspectctree1.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-inspectctree2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspectctree2.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-inspecturlstxt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspecturlstxt.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-metadataonly.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-metadataonly.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-noexecute.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-noexecute.png -------------------------------------------------------------------------------- 
/assets/images/software/getpapers/getpapers-simplequery.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-simplequery.png -------------------------------------------------------------------------------- /assets/images/software/getpapers/getpapers-simpleresult.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-simpleresult.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/001.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/001.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/002.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/002.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/003.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/003.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/004.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/004.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/005.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/005.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/006.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/006.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/007.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/007.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/008.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/008.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/009.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/009.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/010.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/010.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/011.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/011.png -------------------------------------------------------------------------------- /assets/images/software/journal-scrapers/012.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/012.png -------------------------------------------------------------------------------- /assets/images/software/norma/cprojectwithshtml.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/cprojectwithshtml.png -------------------------------------------------------------------------------- /assets/images/software/norma/html2shtml.graphml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | quickscrape 32 | 
33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | $ tree dinosaurs-htmls/ 48 | dinosaurs-htmls/ 49 | ├── http_europepmc.org_articles_PMC2214819 50 | │   ├── fulltext.html 51 | │   └── results.json 52 | ├── http_europepmc.org_articles_PMC2635535 53 | │   ├── fulltext.html 54 | │   └── results.json 55 | ├── http_europepmc.org_articles_PMC2997427 56 | │   ├── fulltext.html 57 | │   └── results.json 58 | ... 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | $ tree dinosaurs-htmls/ 76 | dinosaurs-htmls/ 77 | ├── http_europepmc.org_articles_PMC2214819 78 | │   ├── fulltext.html 79 | │   ├── fulltext.xhtml 80 | │   └── results.json 81 | ├── http_europepmc.org_articles_PMC2635535 82 | │   ├── fulltext.html 83 | │   ├── fulltext.xhtml 84 | │   └── results.json 85 | ├── http_europepmc.org_articles_PMC2997427 86 | │   ├── fulltext.html 87 | │   ├── fulltext.xhtml 88 | │   └── results.json 89 | ... 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | $ tree dinosaurs-htmls/ 107 | dinosaurs-htmls/ 108 | ├── http_europepmc.org_articles_PMC2214819 109 | │   ├── fulltext.html 110 | │   ├── fulltext.xhtml 111 | │   ├── results.json 112 | │   └── scholarly.html 113 | ├── http_europepmc.org_articles_PMC2635535 114 | │   ├── fulltext.html 115 | │   ├── fulltext.xhtml 116 | │   ├── results.json 117 | │   └── scholarly.html 118 | ├── http_europepmc.org_articles_PMC2997427 119 | │   ├── fulltext.html 120 | │   ├── fulltext.xhtml 121 | │   ├── results.json 122 | │   └── scholarly.html 123 | ... 
124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | $ quickscrape \ 140 | -r urllist.txt \ 141 | -o dinosaurs-htmls \ 142 | -d journal-scrapers/scrapers 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | $ norma \ 162 | -q dinosaurs-htmls/ \ 163 | -i fulltext.html \ 164 | -o fulltext.xhtml \ 165 | --html jsoup 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | 175 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | $ norma \ 185 | -q dinosaurs-htmls/ \ 186 | -i fulltext.xhtml \ 187 | -o scholarly.html \ 188 | --transform nature2html 189 | 190 | 191 | 192 | 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | 201 | 202 | 203 | 204 | 205 | -------------------------------------------------------------------------------- /assets/images/software/norma/html2shtml.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/html2shtml.png -------------------------------------------------------------------------------- /assets/images/software/norma/nlm2htmlwithwarnings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/nlm2htmlwithwarnings.png -------------------------------------------------------------------------------- /assets/images/software/norma/normahtml2shtml0.dot: -------------------------------------------------------------------------------- 1 | digraph norma { 2 | graph [nodesep=0.4 ranksep=0.7] 3 | 4 | "apiquery" [label="apiquery", style="filled", color="yellow"]; 5 | "urllist" [label="urllist", style="filled", color="yellow"]; 6 | 7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"] 8 | "qs" 
[label="quickscrape", style="filled", color="cyan", shape="box"] 9 | 10 | "apiquery" -> "getp"; 11 | "urllist" -> "qs"; 12 | 13 | "f.xml" [label="fulltext.xml", penwidth="2"]; 14 | "f.html" [label="fulltext.html", penwidth="2"]; 15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"]; 16 | 17 | /*"f.pdf.html" [label="fulltext.pdf.html"]; 18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"]; 19 | */ 20 | "f.xhtml" [label="fulltext.xhtml"]; 21 | 22 | "png" [label="png", style="filled" color="pink", penwidth="2"]; 23 | /* 24 | "png.hocr.html" [label="HTML-OCR"] 25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"] 26 | */ 27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"]; 28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"]; 29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"]; 30 | 31 | "getp" -> {"f.xml" "f.pdf" "sdata" } [style="bold", color="grey"]; 32 | "qs" -> {"f.xml" "f.pdf" "png" "sdata"} [style="bold", color="grey"]; 33 | "qs" -> "f.html" [style="bold", color="red"]; 34 | 35 | "f.xml" -> "n.stylesheets" [style="bold", color="grey"]; 36 | 37 | "f.html" -> "n.tidy" [style="bold", color="red"]; 38 | "n.tidy" -> "f.xhtml" [style="bold", color="red"]; 39 | 40 | "f.xhtml" -> "n.htmlstyle" [style="bold", color="red"]; 41 | 42 | /* 43 | "f.pdf" -> "n.pdf2txt" ; 44 | "n.pdf2txt" -> "f.pdf.txt"; 45 | */ 46 | /* 47 | "f.pdf" -> "n.pdf2svg"; 48 | "n.pdf2svg" -> "png" 49 | "n.pdf2svg" -> "svg"; 50 | */ 51 | 52 | /* 53 | "f.pdf" -> "n.pdf2html" ; 54 | "n.pdf2html" -> {"f.pdf.html"}; 55 | */ 56 | /* 57 | "png" -> "n.ocr" ; 58 | "n.ocr" -> "png.hocr.html" 59 | */ 60 | /* 61 | "png.hocr.html" -> "n.ocr2"; 62 | "n.ocr2" -> "png.hocr.svg" 63 | */ 64 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"] 65 | "n.stylesheets" -> "tagger" [style="bold", color="grey"]; 66 | "n.htmlstyle" -> "tagger" [style="bold", color="red"]; 67 | 
//{"f.pdf.html" } -> "tagger"; 68 | "tagger" -> "s.html" [style="bold", color="red"]; 69 | 70 | /* 71 | "sdata" -> "n.ctree"; 72 | "n.ctree" -> {"doc" "csv"}; 73 | */ 74 | subgraph cluster_norma { 75 | label="norma" color="cyan" penwidth="3"; 76 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"] 77 | "n.tidy" [label="tidy", style="filled", color="cyan", shape="box"] 78 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"] 79 | // "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"] 80 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"] 81 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"] 82 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"] 83 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"] 84 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"] 85 | } 86 | 87 | 88 | } -------------------------------------------------------------------------------- /assets/images/software/norma/normahtml2shtml0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normahtml2shtml0.png -------------------------------------------------------------------------------- /assets/images/software/norma/normasmall0.dot: -------------------------------------------------------------------------------- 1 | digraph norma { 2 | graph [nodesep=0.4 ranksep=0.7] 3 | 4 | "apiquery" [label="apiquery", style="filled", color="yellow"]; 5 | "urllist" [label="urllist", style="filled", color="yellow"]; 6 | 7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"] 8 | "qs" [label="quickscrape", style="filled", color="cyan", shape="box"] 9 | 10 | "apiquery" -> "getp"; 11 | "urllist" -> "qs"; 12 | 13 | "f.xml" 
[label="fulltext.xml", penwidth="2"]; 14 | "f.html" [label="fulltext.html", penwidth="2"]; 15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"]; 16 | 17 | /*"f.pdf.html" [label="fulltext.pdf.html"]; 18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"]; 19 | */ 20 | "f.xhtml" [label="fulltext.xhtml"]; 21 | 22 | "png" [label="png", style="filled" color="pink", penwidth="2"]; 23 | /* 24 | "png.hocr.html" [label="HTML-OCR"] 25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"] 26 | */ 27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"]; 28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"]; 29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"]; 30 | 31 | "getp" -> {"f.xml" "f.pdf" "sdata" } [style="bold"]; 32 | "qs" -> {"f.xml" "f.pdf" "f.html" "png" "sdata"} [style="bold"]; 33 | 34 | "f.xml" -> "n.stylesheets" [style="bold"]; 35 | 36 | "f.html" -> "n.tidy" [style="bold"]; 37 | "n.tidy" -> "f.xhtml" [style="bold"]; 38 | 39 | "f.xhtml" -> "n.htmlstyle" [style="bold"]; 40 | 41 | /* 42 | "f.pdf" -> "n.pdf2txt" ; 43 | "n.pdf2txt" -> "f.pdf.txt"; 44 | */ 45 | /* 46 | "f.pdf" -> "n.pdf2svg"; 47 | "n.pdf2svg" -> "png" 48 | "n.pdf2svg" -> "svg"; 49 | */ 50 | 51 | /* 52 | "f.pdf" -> "n.pdf2html" ; 53 | "n.pdf2html" -> {"f.pdf.html"}; 54 | */ 55 | /* 56 | "png" -> "n.ocr" ; 57 | "n.ocr" -> "png.hocr.html" 58 | */ 59 | /* 60 | "png.hocr.html" -> "n.ocr2"; 61 | "n.ocr2" -> "png.hocr.svg" 62 | */ 63 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"] 64 | {"n.stylesheets" "n.htmlstyle"} -> "tagger" [style="bold"]; 65 | //{"f.pdf.html" } -> "tagger"; 66 | "tagger" -> "s.html" [style="bold"]; 67 | 68 | /* 69 | "sdata" -> "n.ctree"; 70 | "n.ctree" -> {"doc" "csv"}; 71 | */ 72 | subgraph cluster_norma { 73 | label="norma" color="cyan" penwidth="3"; 74 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"] 75 | 
"n.tidy" [label="tidy", style="filled", color="cyan", shape="box"] 76 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"] 77 | // "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"] 78 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"] 79 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"] 80 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"] 81 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"] 82 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"] 83 | } 84 | 85 | 86 | } -------------------------------------------------------------------------------- /assets/images/software/norma/normasmall0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normasmall0.png -------------------------------------------------------------------------------- /assets/images/software/norma/normaxml2shtml0.dot: -------------------------------------------------------------------------------- 1 | digraph norma { 2 | graph [nodesep=0.4 ranksep=0.7] 3 | 4 | "apiquery" [label="apiquery", style="filled", color="yellow"]; 5 | "urllist" [label="urllist", style="filled", color="yellow"]; 6 | 7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"] 8 | "qs" [label="quickscrape", style="filled", color="cyan", shape="box"] 9 | 10 | "apiquery" -> "getp"; 11 | "urllist" -> "qs"; 12 | 13 | "f.xml" [label="fulltext.xml", penwidth="2"]; 14 | "f.html" [label="fulltext.html", penwidth="2"]; 15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"]; 16 | 17 | /*"f.pdf.html" [label="fulltext.pdf.html"]; 18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"]; 19 | */ 20 | "f.xhtml" [label="fulltext.xhtml"]; 21 | 22 | "png" 
[label="png", style="filled" color="pink", penwidth="2"]; 23 | /* 24 | "png.hocr.html" [label="HTML-OCR"] 25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"] 26 | */ 27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"]; 28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"]; 29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"]; 30 | 31 | "getp" -> {"f.pdf" "sdata" } [style="bold", color="grey"]; 32 | "getp" -> {"f.xml"} [style="bold", color="red"]; 33 | "qs" -> {"f.xml" "f.pdf" "f.html" "png" "sdata"} [style="bold", color="grey"]; 34 | 35 | "f.xml" -> "n.stylesheets" [style="bold", color="red"]; 36 | 37 | "f.html" -> "n.tidy" [style="bold", color="grey"]; 38 | "n.tidy" -> "f.xhtml" [style="bold", color="grey"]; 39 | 40 | "f.xhtml" -> "n.htmlstyle" [style="bold", color="grey"]; 41 | 42 | /* 43 | "f.pdf" -> "n.pdf2txt" ; 44 | "n.pdf2txt" -> "f.pdf.txt"; 45 | */ 46 | /* 47 | "f.pdf" -> "n.pdf2svg"; 48 | "n.pdf2svg" -> "png" 49 | "n.pdf2svg" -> "svg"; 50 | */ 51 | 52 | /* 53 | "f.pdf" -> "n.pdf2html" ; 54 | "n.pdf2html" -> {"f.pdf.html"}; 55 | */ 56 | /* 57 | "png" -> "n.ocr" ; 58 | "n.ocr" -> "png.hocr.html" 59 | */ 60 | /* 61 | "png.hocr.html" -> "n.ocr2"; 62 | "n.ocr2" -> "png.hocr.svg" 63 | */ 64 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"] 65 | "n.htmlstyle" -> "tagger" [style="bold", color="grey"]; 66 | "n.stylesheets" -> "tagger" [style="bold", color="red"]; 67 | //{"f.pdf.html" } -> "tagger"; 68 | "tagger" -> "s.html" [style="bold", color="red"]; 69 | 70 | /* 71 | "sdata" -> "n.ctree"; 72 | "n.ctree" -> {"doc" "csv"}; 73 | */ 74 | subgraph cluster_norma { 75 | label="norma" color="cyan" penwidth="3"; 76 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"] 77 | "n.tidy" [label="tidy", style="filled", color="cyan", shape="box"] 78 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"] 79 | 
// "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"] 80 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"] 81 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"] 82 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"] 83 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"] 84 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"] 85 | } 86 | 87 | 88 | } -------------------------------------------------------------------------------- /assets/images/software/norma/normaxml2shtml0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normaxml2shtml0.png -------------------------------------------------------------------------------- /assets/images/software/norma/pdf2txt.graphml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | $ cd dinosaur-pdfs 25 | $ tree 26 | . 
27 | ├── Natarajan - Bone Cancer.pdf 28 | ├── Taylor - Aspects of sauropod dinosaurs.pdf 29 | ├── Taylor - Sauropods Giraffes.pdf 30 | └── Wedel - Posteranial Pneumacity in Dinosaurs.pdf 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | $ tree dinosaurs-pdfs/ 48 | dinosaurs-pdfs/ 49 | ├── Natarajan - Bone Cancer 50 | │   └── fulltext.pdf 51 | ├── Taylor - Aspects of sauropod dinosaurs 52 | │   └── fulltext.pdf 53 | ├── Taylor - Sauropods Giraffes 54 | │   └── fulltext.pdf 55 | └── Wedel - Posteranial Pneumacity in Dinosaurs 56 | └── fulltext.pdf 57 | 58 | 4 directories, 4 files 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | $ tree dinosaurs-pdfs/ 76 | dinosaurs-pdfs/ 77 | ├── Natarajan - Bone Cancer 78 | │   ├── fulltext.pdf 79 | │   └── fulltext.pdf.txt 80 | ├── Taylor - Aspects of sauropod dinosaurs 81 | │   ├── fulltext.pdf 82 | │   └── fulltext.pdf.txt 83 | ├── Taylor - Sauropods Giraffes 84 | │   ├── fulltext.pdf 85 | │   └── fulltext.pdf.txt 86 | └── Wedel - Posteranial Pneumacity in Dinosaurs 87 | ├── fulltext.pdf 88 | └── fulltext.pdf.txt 89 | 90 | 4 directories, 8 files 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | $ for fname in *.pdf; do 108 | filename=$(basename "$fname"); 109 | filename="${filename%.*}"; 110 | mkdir "$filename"; 111 | mv "$fname" "$filename"/fulltext.pdf; 112 | done; 113 | cd .. 
114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | $ norma \ 133 | -q dinosaurs-pdfs \ 134 | -i fulltext.pdf \ 135 | -o fulltext.pdf.txt \ 136 | --transform pdf2txt 137 | 138 | 139 | 140 | 141 | 142 | 143 | 144 | 145 | 146 | 147 | 148 | 149 | 150 | 151 | 152 | 153 | -------------------------------------------------------------------------------- /assets/images/software/norma/pdf2txt.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/pdf2txt.png -------------------------------------------------------------------------------- /assets/images/software/norma/xml2shtml.graphml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | $ tree dinosaurs-xmls 24 | dinosaurs-xmls/ 25 | ├── eupmc_results.json 26 | ├── fulltext_html_urls.txt 27 | ├── PMC3893193 28 | │   └── fulltext.xml 29 | ├── PMC3893247 30 | │   └── fulltext.xml 31 | ... 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | getpapers 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | $ tree dinosaurs-xmls 72 | dinosaurs-xmls/ 73 | ├── eupmc_results.json 74 | ├── fulltext_html_urls.txt 75 | ├── PMC3893193 76 | │   ├── fulltext.xml 77 | │   └── scholarly.html 78 | ├── PMC3893247 79 | │   ├── fulltext.xml 80 | │   └── scholarly.html 81 | ... 
82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | $ getpapers \ 99 | -q dinosaurs \ 100 | -o dinosaurs-xmls -x 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | $ norma \ 119 | -q dinosaurs-xmls \ 120 | -i fulltext.xml \ 121 | -o scholarly.html \ 122 | --transform nlm2html 123 | 124 | 125 | 126 | 127 | 128 | 129 | 130 | 131 | 132 | 133 | 134 | 135 | 136 | 137 | 138 | 139 | -------------------------------------------------------------------------------- /assets/images/software/norma/xml2shtml.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/xml2shtml.png -------------------------------------------------------------------------------- /assets/images/software/quickscrape/quickscrape-url.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/quickscrape/quickscrape-url.png -------------------------------------------------------------------------------- /assets/images/software/shell/cat.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cat.png -------------------------------------------------------------------------------- /assets/images/software/shell/cd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cd.png --------------------------------------------------------------------------------
/assets/images/software/shell/cp.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cp.png -------------------------------------------------------------------------------- /assets/images/software/shell/head.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/head.png -------------------------------------------------------------------------------- /assets/images/software/shell/ls.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/ls.png -------------------------------------------------------------------------------- /assets/images/software/shell/mkdir.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/mkdir.png -------------------------------------------------------------------------------- /assets/images/software/shell/mv.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/mv.png -------------------------------------------------------------------------------- /assets/images/software/shell/rm.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/rm.png 
-------------------------------------------------------------------------------- /assets/images/software/shell/tail.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/tail.png -------------------------------------------------------------------------------- /assets/images/software/shell/tree.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/tree.png -------------------------------------------------------------------------------- /assets/images/software/shell/wc.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/wc.png -------------------------------------------------------------------------------- /assets/images/software/vms/desktop.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/desktop.png -------------------------------------------------------------------------------- /assets/images/software/vms/memorysettings.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/memorysettings.png -------------------------------------------------------------------------------- /assets/images/software/vms/starting-vm.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/starting-vm.png -------------------------------------------------------------------------------- /assets/images/software/vms/terminal.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/terminal.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-crunchbang.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-crunchbang.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-error.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-error.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-icon.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-installer.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-installer.png -------------------------------------------------------------------------------- 
/assets/images/software/vms/vm-popup.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-popup.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-poweredoff.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-poweredoff.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-running.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-running.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-startscreen.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-startscreen.png -------------------------------------------------------------------------------- /assets/images/software/vms/vm-virtualbox-download.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-virtualbox-download.png -------------------------------------------------------------------------------- /software-tutorials/README.md: -------------------------------------------------------------------------------- 1 | ![ContentMine 
logo](https://github.com/ContentMine/assets/blob/master/png/Content_mine(small).png) 2 | 3 | The usage of ContentMine tools can be learned step-by-step with the help of the tutorials. They describe functionalities, what results to expect, and how to link the different elements of the content mining pipeline. They are based on the ContentMine virtual machine, which has all necessary software pre-installed. 4 | 5 | The tutorials can be used in workshops as well as for self-guided learning. 6 | 7 | # Table of contents 8 | 9 | 1. [Purpose and installation of the ContentMine-VirtualMachine](vms/README.md)
A VirtualBox-Image contains all necessary software as well as sample datasets for getting started with content mining. This tutorial explains how to install the ContentMine-VM and use it as a sandbox environment. 10 | 11 | 1. [Introduction to the command line interface](shell/README.md)
12 | This tutorial introduces the basic UNIX-commands and shows how to navigate folders and handle files. 13 | 14 | 1. [Getting started with getpapers](getpapers/README.md)
15 | This tutorial demonstrates how to create an initial corpus for fact extraction. 16 | 17 | 1. [Getting started with quickscrape](quickscrape/README.md)
18 | This tutorial introduces quickscrape, and how to use it to extract semi-structured information from web pages. 19 | 20 | 1. [Create your own scraper definition](journal-scrapers/README.md)
21 | This tutorial shows how to contribute to and extend the ContentMine scraper collection. If you need a specific scraper definition for use with quickscrape, you can learn here how to create it. 22 | 23 | 1. [Normalizing scholarly literature](norma/README.md)
24 | This tutorial shows how to normalize scientific literature into a unified format which can be processed by machines. 25 | 26 | 1. [ContentMine data structure: CProject](cproject/README.md)
27 | This tutorial gives an overview of the data structure used, and how it can be integrated in your analysis. 28 | 29 | 1. [Extracting facts with AMI-plugins](ami/README.md)
30 | This tutorial demonstrates how to extract, aggregate, and filter facts from scholarly.html. 31 | -------------------------------------------------------------------------------- /software-tutorials/ami/README.md: -------------------------------------------------------------------------------- 1 | # ami 2 | ============================== 3 | 4 | ## Table of Contents 5 | 6 | 1. [Description](#description) 7 | 1. [Preparations](#preparations) 8 | 1. [Input data](#input-data) 9 | 1. [Tutorial](#tutorial) 10 | 1. [ami2-species](#ami2-species) 11 | 1. [ami2-gene](#ami2-gene) 12 | 1. [ami2-sequence](#ami2-sequence) 13 | 1. [ami2-regex](#ami2-regex) 14 | 1. [ami2-word](#ami2-word) 15 | 1. [Summarization of results](#summarization-of-results) 16 | 1. [Summary](#summary) 17 | 1. [Next Steps](#next-steps) 18 | 1. [Further Materials](#further-materials) 19 | 20 | ## Description 21 | 22 | **What does ami do?** 23 | 24 | ami is a collection of plugins that extract pieces of information ('facts') from structured documents. It uses dictionaries or regexes to look up various kinds of facts, such as species names and gene or protein sequences. 25 | At this stage, individual facts that are spread across the corpus of papers are extracted and collected into machine-readable XML files, which can then be used for visualization or statistics. 26 | We are integrating new dictionaries to cover different types of encoded knowledge. Please drop us a message if you have an idea or resource for a new fact extractor! 27 | 28 | **Why do we need ami?** 29 | 30 | ami helps researchers with repetitive cognitive tasks like pattern matching, for example when trying to keep up to date with new species occurrences or when tracking one specific gene. ami also helps with large amounts of literature that have to be filtered, e.g. for meta-studies, by highlighting the papers that are really necessary to read.
31 | 32 | **How can I use ami?** 33 | 34 | ami is used after the papers have been normalized: you take the output of [norma](../norma/README.md) and apply a plugin. 35 | 36 | **What you will learn here** 37 | 38 | This tutorial shows you: 39 | - how to use the species plugin 40 | - how to use the genes plugin 41 | - how to use the sequences plugin 42 | - how to use regular expressions 43 | - what to do with the extracted facts 44 | 45 | **How to use the tutorial** 46 | 47 | The following conventions are used throughout the tutorial. 48 | - Placeholder variables are always written in capitals, like NAME, YOURDIRECTORY, etc. 49 | 50 | 51 | **Glossary** 52 | 53 | 54 | ## Preparations 55 | ### Pre-Requisites 56 | 57 | ### Used Software 58 | - [Future TDM Virtual Machine](LINK) 59 | - ami2-species 60 | - ami2-gene 61 | - ami2-sequence 62 | - ami2-regex 63 | 64 | ### Installation 65 | 66 | On the ContentMine-VM, ami is already installed. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform. 67 | 68 | You can find the technical documentation for `ami` in its [repository](https://github.com/ContentMine/ami-plugin). 69 | 70 | ## Input data 71 | 72 | The input for ami is always a [CProject](../cproject) containing documents in [scholarly HTML](../sHTML). ami then applies one of the available plugins, extracts the relevant content from the sHTML, and stores the results in the respective paper folder. ami plugins require `scholarly.html` files as input. To create them, please follow the instructions for [norma](../norma/README.md).
Your project directory should look like this: 73 | 74 | ```bash 75 | tree ursus 76 | ursus/ 77 | ├── eupmc_results.json 78 | ├── fulltext_html_urls.txt 79 | ├── PMC3893193 80 | │   ├── fulltext.xml 81 | │   └── scholarly.html 82 | ├── PMC3893247 83 | │   ├── fulltext.xml 84 | │   └── scholarly.html 85 | ├── PMC3898307 86 | │   ├── fulltext.xml 87 | │   └── scholarly.html 88 | ... 89 | ``` 90 | 91 | ## Tutorial 92 | 93 | This tutorial is based on release [0.2.24](https://github.com/ContentMine/ami/releases). 94 | 95 | ### ami2-species 96 | 97 | You can search for all occurrences of a species name with: 98 | ```bash 99 | ami2-species --project CPROJECTFOLDER -i INPUTFILE --sp.species --sp.type SPECIESTYPE 100 | ``` 101 | 102 | You have to choose one of three different types of species terms for ```SPECIESTYPE```: 103 | - ```genus```, which will extract terms like *Brachiosaurus* ([more details](https://en.wikipedia.org/wiki/Genus)) 104 | - ```binomial```, which will extract terms like *B. altithorax* ([more details](https://en.wikipedia.org/wiki/Binomial_nomenclature)) 105 | - or ```genussp```, which will extract terms like *Bacillus sp* or *Ursus spp.* 106 | 107 | ```bash 108 | ami2-species --project ursus/ -i scholarly.html --sp.species --sp.type genus 109 | ``` 110 | 111 | ![ami2-species-genus](../../assets/images/software/ami/ami2-species-genus.png) 112 | 113 | Inspecting the folders with `tree ursus` should now give output like the following. If no matches could be found, an **empty.xml** is created to indicate that the plugin has been run on this particular paper, but without results. If matches have been found, a **results.xml** is created.
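
To get a quick overview of how many papers produced matches, the two file names can simply be counted. The following is a minimal sketch that assumes only the layout described above (`results.xml` for matches, `empty.xml` for none). `count_ami_outputs` is a hypothetical helper name and not part of ami.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ami): count how many result files
# with matches (results.xml) and without matches (empty.xml) exist
# anywhere below a CProject folder.
count_ami_outputs() {
  local project="$1"
  local found empty
  found=$(find "$project" -type f -name 'results.xml' | wc -l)
  empty=$(find "$project" -type f -name 'empty.xml' | wc -l)
  # $((...)) strips the padding some wc implementations add
  echo "results.xml: $((found)), empty.xml: $((empty))"
}
```

Calling `count_ami_outputs ursus` prints both counts on one line. Because the search is recursive, the numbers cover all plugins and types that have been run on the project so far.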
114 | 115 | ![ami2-species-genus-results](../../assets/images/software/ami/ami2-species-genus-results.png) 116 | 117 | A for-loop performs all extractions in sequence: 118 | ``` 119 | for type in genus binomial genussp; do 120 | ami2-species --project ursus/ -i scholarly.html --sp.species --sp.type "$type" 121 | done 122 | ``` 123 | 124 | The results are all saved as XML in the corresponding CTree inside results/species, in a per-type folder named after the SPECIESTYPE. 125 | 126 | ``` 127 | tree ursus 128 | ursus 129 | ├── eupmc_results.json 130 | ├── fulltext_html_urls.txt 131 | ├── PMC3893193 132 | │   ├── fulltext.xml 133 | │   ├── results 134 | │   │   └── species 135 | │   │   ├── binomial 136 | │   │   │   └── results.xml 137 | │   │   ├── genus 138 | │   │   │   └── results.xml 139 | │   │   └── genussp 140 | │   │   └── results.xml 141 | │   └── scholarly.html 142 | ├── PMC3893247 143 | │   ├── fulltext.xml 144 | │   ├── results 145 | │   │   └── species 146 | │   │   ├── binomial 147 | │   │   │   └── results.xml 148 | │   │   ├── genus 149 | │   │   │   └── results.xml 150 | │   │   └── genussp 151 | │   │   └── results.xml 152 | │   └── scholarly.html 153 | ... 154 | ``` 155 | 156 | ```bash 157 | cat ursus/PMC4349051/results/species/genussp/results.xml 158 | ``` 159 | 160 | ```xml 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | ``` 169 | 170 | The results.xml contains a varying number of lines. Every line represents one extracted fact and carries the following attributes: 171 | - `exact`: the exact match, i.e. the fact, e.g. `exact="Parasaurolophus sp"` for the fact "Parasaurolophus sp" 172 | - `pre`: 99 characters before the match 173 | - `post`: 99 characters after the match 174 | 175 | In practical terms, this is what fact extraction produces in the end: a list of terms extracted from the literature, each with the characters before and after it as context.
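
Since every fact carries its matched term in the `exact` attribute, a first summary can be produced with standard shell tools. The following is a rough sketch, not an ami feature: `list_exact_matches` is a hypothetical helper, and it relies only on the `exact="..."` attribute described above, not on any particular tag name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ami): print every distinct matched
# term found in the given results.xml files, preceded by how often it
# occurs. Output format: COUNT<TAB>TERM, sorted by term.
list_exact_matches() {
  grep -oh 'exact="[^"]*"' "$@" \
    | sed -e 's/^exact="//' -e 's/"$//' \
    | sort | uniq -c \
    | awk '{ n = $1; $1 = ""; sub(/^ +/, ""); print n "\t" $0 }'
}
```

For example, `list_exact_matches ursus/*/results/species/genussp/results.xml` summarizes the `genussp` matches across all papers of the project. XML entities inside the attribute values are left as they are, so this is only meant for a quick look and does not replace a real XML parser.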
176 | 177 | 178 | ### ami2-gene 179 | 180 | The search for genes works in the same way, just with another command: 181 | ```bash 182 | ami2-gene --project CPROJECTFOLDER -i INPUTFILE --g.gene --g.type GENETYPE 183 | ``` 184 | 185 | At the moment there is only one GENETYPE available (`human`). Results are again stored within the CTree. 186 | 187 | ```bash 188 | ami2-gene --project ursus/ -i scholarly.html --g.gene --g.type human 189 | ``` 190 | 191 | This then creates a folder gene/human inside results next to the results of the species module. 192 | 193 | ``` 194 | tree ursus 195 | ursus 196 | ├── eupmc_results.json 197 | ├── fulltext_html_urls.txt 198 | ├── PMC3893193 199 | │   ├── fulltext.xml 200 | │   ├── results 201 | │   │   └── gene 202 | │   │   └── human 203 | │   │   └── results.xml 204 | │   └── scholarly.html 205 | ├── PMC3893247 206 | │   ├── fulltext.xml 207 | │   ├── results 208 | │   │   └── gene 209 | │   │   └── human 210 | │   │   └── results.xml 211 | │   └── scholarly.html 212 | ... 213 | ``` 214 | 215 | The `results.xml` has a similar structure: a results-tag with pre, post, and exact attributes. 216 | 217 | ``` 218 | cat ursus/PMC4454486/results/gene/human/results.xml 219 | ``` 220 | 221 | ```xml 222 | 223 | 224 | 225 | 226 | 227 | ``` 228 | 229 | ### ami2-sequence 230 | 231 | The search for sequences follows the same structure: 232 | ```bash 233 | ami2-sequence --project CPROJECTFOLDER -i INPUTFILE --sq.sequence --sq.type SEQUENCETYPE 234 | ``` 235 | 236 | - SEQUENCETYPE is one of `dna rna prot prot3 carb3`. 
237 | 238 | You can run one type with this query: 239 | ```bash 240 | ami2-sequence --project ursus/ -i scholarly.html --sq.sequence --sq.type rna 241 | ``` 242 | 243 | You can also run all types in sequence with this loop: 244 | ```bash 245 | for type in dna rna prot prot3 carb3; do 246 | ami2-sequence --project ursus -i scholarly.html --sq.sequence --sq.type $type; 247 | done 248 | ``` 249 | 250 | This creates a separate folder called `sequence/sequencetype` inside results. The results generally have the same structure as before, with an additional attribute `xpath` that shows the location of the match within the HTML structure of the `scholarly.html`. 251 | 252 | ```bash 253 | cat ursus/PMC4447998/results/sequence/rna/results.xml 254 | ``` 255 | 256 | ```xml 257 | 258 | 259 | 260 | 261 | 262 | 263 | ``` 264 | 265 | ### ami2-regex 266 | 267 | Regex is short for [regular expression](https://en.wikipedia.org/wiki/Regular_expression) and is used to match search patterns inside text. A regex can be a plain string such as "ursus", but it also allows far more complex patterns. For example, `[Uu]rsus` accepts an upper or lower case first letter, so it finds both Ursus and ursus. 268 | 269 | Digits can be matched with e.g. `[0-9]`, which matches any single digit, or as fixed sequences such as `(00111001)`. Characters that have a special meaning in a regex, such as `.`, have to be escaped with `\`, so if you want to search for the number 3.14 explicitly, the regex looks like `(3\.14)`. 270 | 271 | We will now look at several regexes and how they are used in an XML file. 272 | 273 | **How to use regex with ami** 274 | 275 | 276 | ```bash 277 | ami2-regex --project CPROJECTFOLDER -i INPUTFILE --context PRE POST --r.regex REGEXFILE.xml 278 | ``` 279 | 280 | - `PRE`: tells ami how many characters before a match should be captured 281 | - `POST`: tells ami how many characters after a match should be captured 282 | - `REGEXFILE`: target location of your regex XML file.
It contains all regex queries. 283 | 284 | **How to create a custom regex XML file** 285 | 286 | The content of REGEXFILE.xml needs to be wrapped in an opening and a closing root tag. The `TITLE` sets the name of the folder where the output gets stored. 287 | 288 | ```xml 289 | 290 | 291 | ``` 292 | 293 | In the regex XML each regex query is written on a new line and is enclosed in an opening and a closing regex tag. Two attributes must be declared within the opening tag: 294 | - `weight`: the relative importance given to each match (influences indexing engines). The default value is `1.0` 295 | - `fields`: corresponds to the regex query and specifies the name of the query 296 | 297 | ```xml 298 | 299 | 300 | 301 | 302 | ``` 303 | 304 | What is missing now is the regex query itself. It is placed between the opening and closing regex tags (shown as `query`) and is framed by round brackets `()`. 305 | 306 | In line two one field ("food") is defined. We want to match both upper and lower case, and `[Ff]` matches either `F` or `f`: `([Ff]ood)`. The following characters `ood` are fixed for this query, so they have to be matched exactly. 307 | 308 | For the second query, we want to find all mentions of "predator regime" or "predator regimes". For this we need `\s`, a special sequence that matches a whitespace character such as a blank. The question mark in `[s]?` makes the "s" optional: `([Pp]redator\sregime[s]?)` 309 | 310 | ```xml 311 | 312 | ([Ff]ood) 313 | ([Pp]redator\sregime[s]?) 314 | 315 | ``` 316 | 317 | We now construct a ```regex.xml``` like this (any text editor will do) and place this XML in our project folder as `ursusfood.xml`. We run ami with it and, because we want some context around our matches, add the ```PRE``` and ```POST``` options, which capture characters before and after the match.
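The two patterns can be tried on a line of text before ami is run. This is a rough sketch that assumes a `grep` supporting extended regular expressions (`-E`) and the `-o` option. The sample sentence is invented, and because `grep -E` does not reliably know `\s`, the POSIX class `[[:space:]]` stands in for it. ami's regex engine may treat edge cases differently.

```bash
# Invented sample sentence, only for trying out the patterns.
sample='Food supply and food webs shape Predator regimes; one predator regime may shift.'

# -o prints each match on its own line, -E enables extended regex syntax.
food_hits=$(printf '%s\n' "$sample" | grep -oE '[Ff]ood')
regime_hits=$(printf '%s\n' "$sample" | grep -oE '[Pp]redator[[:space:]]regimes?')

printf '%s\n' "$food_hits" "$regime_hits"
```

Once the patterns behave as expected, the ami run looks like this: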
318 | 319 | ```bash 320 | ami2-regex --project ursus/ -i scholarly.html --r.regex ursus/ursusfood.xml --context 50 50 321 | ``` 322 | 323 | The output contains 50 characters `pre` and 50 characters `post` the `value0`, as well as the `xpath` of the match in the scholarly.html. 324 | 325 | ### ami2-word 326 | 327 | Word frequency can be used to categorize documents. The simplest approach is to count the words in documents, or within chunks of documents. 328 | 329 | ```bash 330 | ami2-word --project ursus/ -i scholarly.html --w.words wordFrequencies 331 | ``` 332 | creates 333 | ``` 334 | eupmc 335 | ├── eupmc_results.json 336 | ├── fulltext_html_urls.txt 337 | ├── PMC2275095 338 | │   ├── fulltext.xml 339 | │   ├── results 340 | │   │   └── word 341 | │   │   └── frequencies 342 | │   │   ├── results.html 343 | │   │   └── results.xml 344 | │   └── scholarly.html 345 | ├── PMC2586803 346 | │   ├── fulltext.xml 347 | │   ├── results 348 | │   │   └── word 349 | │   │   └── frequencies 350 | │   │   ├── results.html 351 | │   │   └── results.xml 352 | ``` 353 | 354 | The first lists word frequencies as: 355 | 356 | ``` 357 | 358 | 359 | 360 | 361 | 362 | 363 | 364 | 365 | 366 | 367 | ``` 368 | 369 | and the second creates a "Word Cloud"-like HTML display with the most frequent words in order and with fonts proportional to the count. This mainly reflects the frequency in the English language, so we can remove the commonest words by using *stopwords*. We have a range of stopword files in different languages. It is also possible to create your own files and add them. 370 | 371 | ```bash 372 | ami2-word --project ursus --w.words wordFrequencies --w.stopwords STOPWORDS.txt 373 | ``` 374 | 375 | The format is a simple list of words: 376 | ``` 377 | a 378 | about 379 | above 380 | across 381 | after 382 | afterwards 383 | again 384 | against 385 | ``` 386 | 387 | and the file can be referenced either through a URL format or relative/absolute filename. 
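What stopword filtering does can be sketched with standard shell tools. This is not how ami is implemented; the sample text and the three-word stopword list are invented for illustration.

```bash
workdir=$(mktemp -d)
# Invented stopword list, one word per line as in the format shown above.
printf '%s\n' a the of > "$workdir/stopwords.txt"

text='The bear ate a seal and the bear slept'

# Lower-case, put one word per line, drop stopwords, then count and rank.
freq=$(printf '%s\n' "$text" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -s ' ' '\n' \
  | grep -vxFf "$workdir/stopwords.txt" \
  | sort | uniq -c | sort -rn)
printf '%s\n' "$freq"

# The most frequent remaining word is in the first line of the ranking.
top_word=$(printf '%s\n' "$freq" | head -n 1 | awk '{print $2}')
```

With ami, the stopword file is passed on the command line: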
388 | 389 | ```bash 390 | ami2-word --project ursus/ -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt 391 | ``` 392 | 393 | gives `results.xml` as 394 | ``` 395 | 396 | 397 | 398 | 399 | 400 | 401 | 402 | 403 | 404 | 405 | 406 | 407 | ``` 408 | 409 | ### Summarization of results 410 | 411 | The ami plugins can automatically aggregate results. To aggregate the results over all plugins, run: 412 | 413 | ``` 414 | ami2-sequence --project ursus --filter file\(\*\*/results.xml\) -o sequencesfiles.xml 415 | ``` 416 | 417 | To aggregate only the results of a specific plugin and option, e.g. `dna`, run: 418 | ``` 419 | ami2-sequence --project ursus --filter file\(\*\*/dna/results.xml\)xpath\(//result\) -o dnasnippets.xml 420 | ``` 421 | 422 | 423 | ## Summary 424 | 425 | * A project folder containing ctrees is always the input. 426 | * Plugins are separate pieces of software, each with its own command. 427 | * Results/Facts are stored within the ctree in a plugin-specific folder. 428 | * Results also store the context of 99 characters before and after the fact. 429 | 430 | **Next steps** 431 | 432 | * Back to the [tutorial overview](..)
433 | 434 | ## Further material 435 | 436 | **ContentMine** 437 | - [contentmine.org](http://contentmine.org) 438 | - office ett contentmine dot org 439 | - [@TheContentMine](http://twitter.com/thecontentmine) 440 | 441 | **Slides** 442 | 443 | 444 | **Assets** 445 | 446 | **Videos** 447 | 448 | 449 | **Learning Materials** 450 | 451 | **www** 452 | 453 | 454 | **Papers** 455 | -------------------------------------------------------------------------------- /software-tutorials/canary/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/software-tutorials/canary/README.md -------------------------------------------------------------------------------- /software-tutorials/cat/README.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/software-tutorials/cat/README.md -------------------------------------------------------------------------------- /software-tutorials/cproject/README.md: -------------------------------------------------------------------------------- 1 | ![ContentMine logo](https://github.com/ContentMine/assets/blob/master/png/Content_mine(small).png) 2 | 3 | # What is a CProject? 4 | 5 | This page documents the current implementation of a CProject as of May 16, 2016. 6 | 7 | The CProject is a naming convention for interacting with input and output files of ContentMine tools. A CProject is a filesystem tree containing a set of folders and files with reserved names. Any part of the toolkit (quickscrape, getpapers, norma, ami, or custom downstream processing tools) will read and write to a CProject. 8 | 9 | The structure of a CProject is being [discussed here](https://github.com/ContentMine/cmine/issues/10). Please join the debate with your suggestions and observations.
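To make the idea of reserved names concrete, the sketch below builds an empty skeleton by hand. It is only an illustration: the files are empty placeholders, the location is a temporary directory, and the names are borrowed from the getpapers example on this page. Real projects are created by the tools themselves.

```bash
# Temporary location for the illustration; any folder name would do.
project="$(mktemp -d)/output"

# One subfolder per paper, each holding files with reserved names.
mkdir -p "$project/PMC4683095" "$project/PMC4690148"
touch "$project/eupmc_results.json" \
      "$project/PMC4683095/fulltext.xml" \
      "$project/PMC4690148/fulltext.xml"

# Count the per-paper folders and list everything that was created.
papers=$(find "$project" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
echo "papers: $papers"
find "$project" | sort
```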
10 | 11 | # What does a CProject look like? 12 | 13 | #### Quickscrape 14 | 15 | An example output folder after scraping two random papers, one open access and one not. 16 | 17 | ``` 18 | output 19 | ├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085 20 | │ └── results.json 21 | └── https_elifesciences.org_content_5_e10647v3 22 | ├── fulltext.pdf 23 | ├── fulltext.xml 24 | └── results.json 25 | ``` 26 | 27 | #### getpapers 28 | 29 | An example output folder showing two papers from EuPMC. Notice that in this case the results file eupmc_results.json is a single file for the whole project rather than one file per folder as in quickscrape. I'm not sure if this is a good thing, because it means that the per-paper folder structure is missing a chunk of information. The folder has to be kept in the right CProject so that this data isn't lost. The data could instead be split up and placed in a per-paper results.json, as quickscrape does. It also means that the data is stored in a file whose name changes depending on which API it was obtained from. 30 | 31 | ``` 32 | output 33 | ├── eupmc_fulltext_html_urls.txt 34 | ├── eupmc_results.json 35 | ├── PMC4683095 36 | │ └── fulltext.xml 37 | ├── PMC4690148 38 | └── fulltext.xml 39 | ``` 40 | 41 | #### ami 42 | 43 | If you run a plugin (e.g.
`ami2-gene --g.gene --g.type human --project malaria`) it results in a tree like 44 | 45 | ``` 46 | └── PMC4831192 47 | ├── fulltext.xml 48 | ├── results 49 | │ └── gene 50 | │ └── human 51 | │ └── empty.xml OR results.xml 52 | └── scholarly.html 53 | ``` 54 | 55 | After running all plugins from `ami` via the `cmine` command, the full extent of a CProject looks like: 56 | 57 | ``` 58 | ├── PMC4831192 59 | │ ├── fulltext.xml 60 | │ ├── gene.human.count.xml 61 | │ ├── gene.human.snippets.xml 62 | │ ├── results 63 | │ │ ├── gene 64 | │ │ │ └── human 65 | │ │ │ └── empty.xml 66 | │ │ ├── sequence 67 | │ │ │ └── dnaprimer 68 | │ │ │ └── empty.xml 69 | │ │ ├── species 70 | │ │ │ ├── binomial 71 | │ │ │ │ └── results.xml 72 | │ │ │ └── genus 73 | │ │ │ └── empty.xml 74 | │ │ └── word 75 | │ │ └── frequencies 76 | │ │ ├── results.html 77 | │ │ └── results.xml 78 | │ ├── scholarly.html 79 | │ ├── sequence.dnaprimer.count.xml 80 | │ ├── sequence.dnaprimer.snippets.xml 81 | │ ├── species.binomial.count.xml 82 | │ ├── species.binomial.snippets.xml 83 | │ ├── species.genus.count.xml 84 | │ ├── species.genus.snippets.xml 85 | │ ├── word.frequencies.count.xml 86 | │ └── word.frequencies.snippets.xml 87 | ├── sequence.dnaprimer.count.xml 88 | ├── sequence.dnaprimer.documents.xml 89 | ├── sequence.dnaprimer.snippets.xml 90 | ├── species.binomial.count.xml 91 | ├── species.binomial.documents.xml 92 | ├── species.binomial.snippets.xml 93 | ├── species.genus.count.xml 94 | ├── species.genus.documents.xml 95 | ├── species.genus.snippets.xml 96 | ├── word.frequencies.count.xml 97 | ├── word.frequencies.documents.xml 98 | └── word.frequencies.snippets.xml 99 | ``` 100 | -------------------------------------------------------------------------------- /software-tutorials/getpapers/README.md: -------------------------------------------------------------------------------- 1 | # getpapers 2 | 3 | This tutorial covers the installation of getpapers, demonstrates how to construct simple
and more complex queries, and shows what output can be expected from getpapers. 4 | 5 | ## Table of contents 6 | 7 | 1. [Description](#description) 8 | 1. [Preparations](#preparations) 9 | 1. [Installation](#installation) 10 | 1. [Input data](#input-data) 11 | 1. [Tutorial](#tutorial) 12 | 1. [Construct a simple query and compare results](#construct-a-simple-query-and-compare-results) 13 | 1. [Getting pdfs and other files](#getting-pdfs-and-other-files) 14 | 1. [Complex queries for EuropePMC](#complex-queries-for-europepmc) 15 | 1. [Summary and next steps](#summary-and-next-steps) 16 | 17 | ## Description 18 | 19 | **What is getpapers?** 20 | 21 | **getpapers** is one of the entry points of the ContentMine pipeline. It is designed for use in content mining, but you may also find it useful for quickly acquiring large numbers of papers for reading, or metadata for bibliometrics. getpapers accesses publisher APIs, queries them for search terms and returns metadata, PDFs or XML files and, if available, supplementary information. In contrast, the other entry point [quickscrape](../quickscrape/README.md) takes URLs as input and scrapes the whole page. 22 | 23 | **How does getpapers help me?** 24 | 25 | getpapers automates the search and download process and helps in building an initial corpus of documents for content mining. 26 | 27 | **Glossary** 28 | - JSON ([Wikipedia](https://en.wikipedia.org/wiki/JSON)) 29 | - XML ([Wikipedia](https://en.wikipedia.org/wiki/XML)) 30 | - API ([Wikipedia](https://en.wikipedia.org/wiki/API)) 31 | 32 | ## Preparations 33 | 34 | ### Prerequisites 35 | 36 | ### Installation 37 | 38 | On the ContentMine-VM getpapers is already provided. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform. 39 | 40 | If you are a developer you can find the technical documentation for `getpapers` in its [repository](https://github.com/ContentMine/getpapers).
41 | 42 | ### Input data 43 | 44 | In this tutorial we will mainly use open access literature from [Europe PMC](http://europepmc.org/). We can search within their database of 3.5 million fulltext papers from the life sciences. About one million of these are Open Access. Please refer to [Europe PMC-Data](http://europepmc.org/FtpSite) for details. 45 | 46 | ## Tutorial 47 | 48 | After starting the VM, we open a terminal and start in the workshop folder. 49 | 50 | ![gpfolder](../../assets/images/software/getpapers/getpapers-folder.png) 51 | 52 | ### Construct a simple query and compare results 53 | 54 | We will now check how many results we can expect for a search for `ursus maritimus` on Europe PMC. If no API is specified, getpapers will choose Europe PMC as the default data source. The query structure is the same as would be entered in the search field on the Europe PMC website. With the `-n` flag getpapers will report how many results match the query, but it will not actually download anything. `-o OUTPUTFOLDER` tells getpapers where to store results. 55 | 56 | ``` 57 | getpapers -q 'ursus maritimus' -n -o ursus 58 | getpapers -q ursus maritimus -n -o ursus 59 | ``` 60 | ![gpnoexecute](../../assets/images/software/getpapers/getpapers-noexecute.png) 61 | 62 | Please note the different result numbers. With quotation marks, `-q 'ursus maritimus'` searches for the exact match, while `-q ursus maritimus` is not as constrained and also returns matches for only `ursus` or `maritimus`. 63 | 64 | In the next step we will download the metadata of our results. To keep waiting times low we use the narrower search. If your result set turns out too large and you don't want to wait, you can abort the query with `Ctrl + C`.
65 | ``` 66 | getpapers -q 'ursus maritimus' -o ursus 67 | ``` 68 | ![gpmetadataonly](../../assets/images/software/getpapers/getpapers-metadataonly.png) 69 | 70 | This query creates two files, `eupmc_fulltext_html_urls.txt` and `eupmc_results.json`, and stores them in the `ursus` folder. Depending on Europe PMC, not all search results necessarily have a downloadable HTML version of the paper. In rare cases no HTML fulltexts at all are available for the query, in which case the file `eupmc_fulltext_html_urls.txt` will not be created. We now have a look at the contents of the `eupmc_fulltext_html_urls.txt` file, which contains a list of HTML sources for results. 71 | ``` 72 | ls ursus 73 | wc -l ursus/eupmc_fulltext_html_urls.txt 74 | head ursus/eupmc_fulltext_html_urls.txt 75 | ``` 76 | ![inspecturlstxt](../../assets/images/software/getpapers/getpapers-inspecturlstxt.png) 77 | 78 | The `eupmc_results.json` file contains metadata about each search result in the [JSON](https://en.wikipedia.org/wiki/JSON) format. Metadata are e.g. a DOI, the title, authors, and additional bibliographic data. 79 | 80 | ### Getting pdfs and other files 81 | 82 | Until now, our queries only resulted in metadata and a list of URLs. PDF files can be retrieved by adding a `-p` flag to the query. Please note that a **generic query** will result in a **large number of results** and long processing times. If that was not intended, you can cancel a search with `Ctrl+C` in the command line. 83 | 84 | ```bash 85 | getpapers -q 'ursus maritimus' -o ursus -p 86 | ``` 87 | 88 | For every PDF found, getpapers creates a subfolder named after the Europe PMC paper ID, which holds the fulltext.pdf and any additional files added later. To have a look at the folder and file structure, use the ```tree``` command. 89 | 90 | ``` 91 | tree ursus 92 | ``` 93 | 94 | Not all results include a PDF, so we now run another query with the `-x` flag for XML results.
In contrast to PDF, XML is a format that machines can understand well, and XML enables better mining results further down the pipeline. 95 | 96 | ```bash 97 | getpapers -q 'ursus maritimus' -o ursus -x 98 | ``` 99 | 100 | The existing resultfolders get updated, no results are lost or overwritten. 101 | 102 | ``` 103 | tree ursus 104 | ``` 105 | 106 | As a last step, we will download supplementary information with the `-s` flag. Europe PMC provides supplementary information in compressed ZIP-files. 107 | 108 | ```bash 109 | getpapers -q 'ursus maritimus' -o ursus -s 110 | ``` 111 | 112 | ``` 113 | tree ursus 114 | ``` 115 | ![inspectctree1](../../assets/images/software/getpapers/getpapers-inspectctree1.png) 116 | 117 | ![inspectctree2](../../assets/images/software/getpapers/getpapers-inspectctree2.png) 118 | 119 | This is the [CProject](../cproject/README.md)-structure, which is the main data structure of the ContentMine pipeline, and any further operations are going to be centered around the CProject. 120 | 121 | ### Complex queries for EuropePMC 122 | 123 | Queries are directed to the [Europe PMC API](http://europepmc.org/RestfulWebService). In their simplest form, they can be free text, like the one we executed before (`getpapers -q 'ursus maritimus' -o ursus -x`). 124 | 125 | Using the EuropePMC webservice's query language we can construct much more detailed queries. A selection of the most commonly useful search fields is explained [here](getpapers-eupmc-queries.md), and a complete documentation of possible queries is in Appendix I of the [EuropePMC reference PDF](http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf). 126 | 127 | For example we can restrict our search to only papers that mention 'ursus maritimus' in the abstract. 
128 | 129 | ```bash 130 | getpapers -q ABSTRACT:ursus maritimus -o ursus -n 131 | getpapers -q ABSTRACT:'ursus maritimus' -o ursus -n 132 | ``` 133 | ![complex1](../../assets/images/software/getpapers/getpapers-complex1.png) 134 | 135 | Please compare the result numbers again, without and with `''`. 136 | 137 | We can use the logical operators `AND` and `OR`, and can group operations using brackets. Please note that in the shell we have to enclose the whole query in single quotes `'` when we use brackets, and use double quotes for phrases inside it. 138 | 139 | ```bash 140 | getpapers -q '(LICENSE:"cc by" OR LICENSE:"cc-by") AND ABSTRACT:"ursus maritimus"' -o ursus -n 141 | ``` 142 | 143 | Search for papers which contain the phrase "ursus maritimus" in the introduction section and the phrase "survey" in the methods section. 144 | ```bash 145 | getpapers -q 'INTRO:"ursus maritimus" AND METHODS:survey' -o ursus -n 146 | ``` 147 | 148 | Search for papers whose author list contains "Smith" and which were published in either "Biology" or "Cell". You can look up journals on the [Europe PMC journal list](http://europepmc.org/journalList?journals) by clicking on the magnifying glass in the column "Search this Journal". 149 | ```bash 150 | getpapers -q 'AUTH:Smith AND (JOURNAL:biology OR JOURNAL:cell)' -o ursus -n 151 | ``` 152 | 153 | Download XML and PDFs for papers that contain "ursus maritimus" in the abstract and were published between 2010 and 2013. 154 | ```bash 155 | getpapers -q 'ABSTRACT:"ursus maritimus" AND PUB_YEAR:[2010 TO 2013]' -o ursus -p -x 156 | ``` 157 | 158 | Search for papers that contain "ursus maritimus" in the title and were first published between July 2009 and June 2013. 159 | ```bash 160 | getpapers -q 'TITLE:"ursus maritimus" AND FIRST_PDATE:[2009-07-01 TO 2013-06-30]' -o ursus -n 161 | ``` 162 | 163 | Search for papers about "ursus maritimus" where the European Research Council ("ERC") is mentioned in the acknowledgements section.
164 | ```bash 165 | getpapers -q '"ursus maritimus" AND (ACK_FUND:ERC OR ACK_FUND:"European Research Council")' -o ursus -n 166 | ``` 167 | 168 | You can find some common query options [here](getpapers-eupmc-queries.md). 169 | 170 | ## Summary and next steps 171 | 172 | A getpapers query must consist of `-q QUERY -o OUTPUTDIRECTORY`. 173 | * QUERY: the term(s) you want to look for. 174 | * OUTPUTDIRECTORY: Folder in which you want the output files and directory. The folder will be created if it does not already exist. 175 | * Use `-x` for machine-readable fulltext results (preferred, because XML-files provide better mining results in later stages of the tool chain). 176 | * Use `-p` if you want to retrieve human-readable fulltexts in PDF-format. 177 | * Use `-s` to download supplementary information. 178 | * Use `-n` to only check how many results can be expected without downloading anything. 179 | * By default, only Open Access papers will be returned. 180 | * Each API has a different native query language, please refer to the documentation ([EUPMC](getpapers-eupmc-queries.md), [ArXiv](getpapers-arxiv-queries.md), [IEEE](getpapers-ieee-queries.md)) 181 | 182 | **Next steps** 183 | * Back to the [tutorial overview](..) 184 | * Continue to [quickscrape](../quickscrape/README.md) for an introduction to scraping. 185 | * Continue to [norma](../norma/README.md) for the next step of the ContentMine pipeline. 186 | * Continue to [cproject](../cproject/README.md) for an introduction to the main datastructure. 
187 | -------------------------------------------------------------------------------- /software-tutorials/getpapers/getpapers-arxiv-queries.md: -------------------------------------------------------------------------------- 1 | [Official documentation](http://arxiv.org/help/api/user-manual) 2 | 3 | ### Restrict by fields 4 | 5 | | fields | description | 6 | |-----------|---------------| 7 | | `ti` | Title | 8 | | `au` | Author | 9 | | `abs` | Abstract | 10 | | `co` | Comment | 11 | | `jr` | Journal Reference | 12 | | `cat` | Subject Category | 13 | | `rn` | Report Number | 14 | | `id` | Id (use id_list instead) | 15 | | `all` | All of the above | 16 | 17 | ### Boolean operators 18 | 19 | ``` 20 | AND # logical "and" 21 | OR # logical "or" 22 | ANDNOT # logical "exclude" 23 | ``` 24 | 25 | #### Grouping operators 26 | 27 | | operator | description | 28 | |-----------|---------------| 29 | | `()` | Used to group Boolean expressions for Boolean operator precedence. | 30 | | `""` double quotes | Used to group multiple words into phrases to search a particular field. | -------------------------------------------------------------------------------- /software-tutorials/getpapers/getpapers-eupmc-queries.md: -------------------------------------------------------------------------------- 1 | [Official Documentation, Appendix 1](http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf) 2 | 3 | ### Restrict search by bibliographic metadata 4 | 5 | | Field | Description | Example | 6 | |-----------|---------------|---------------| 7 | | `PMCID:` | Search for a publication by its PubMed Central ID, where applicable (i.e. 
available as full text) | `PMCID:PMC1287967` | 8 | | `TITLE:` | Search for a term or terms in publication titles | `TITLE:aspirin`, `TITLE:"protein knowledgebase"` | 9 | | `ABSTRACT:` | Search for a term or terms in publication abstracts | `ABSTRACT:malaria`, `ABSTRACT:"chicken pox"` | 10 | | `AUTH:` | Search for a surname and (optionally) initial(s) in publication author lists | `AUTH:einstein`, `AUTH:"Smith AB"` | 11 | | `JOURNAL:` | Journal title – searchable either in full or abbreviated form | `JOURNAL:"biology letters"`, `JOURNAL:"biol lett"` | 12 | | `LICENSE:` | Search for content according to the assigned Creative Commons license (where provided). | `LICENSE:"cc by" OR LICENSE:"cc-by"`, `LICENSE:cc` | 13 | | `PUB_YEAR:` | Search by year of publication in YYYY format; note syntax for range searching. | `PUB_YEAR:2000`, `PUB_YEAR:[2000 TO 2001]` | 14 | | `E_PDATE:` | Electronic publication date, when an article was first published online. | `E_PDATE:2013-12-15`, `E_PDATE:20070930`, `E_PDATE:[2000-12-18 TO 2014-12-30]`, `E_PDATE:[20040101 TO 20140101]` | 15 | | `FIRST_PDATE:` | The date of first publication, whichever is first, electronic or print publication. Where a date is not fully available e.g. year only, an algorithm is applied to determine the value. | `FIRST_PDATE:1995-02-01`, `FIRST_PDATE:20000101`, `FIRST_PDATE:[2000-10-14 TO 2010-11-15]`, `FIRST_PDATE:[20040101 TO 20140101]` | 16 | | `P_PDATE:` | Print publication date of journal issue, when an article appeared in print format.
| `P_PDATE:1982-10-01`, `P_PDATE:20140101`, `P_PDATE:[2000-12-18 TO 2014-12-30]`, `P_PDATE:[20031114 TO 20141115]` | 17 | 18 | 19 | ### Restrict by article metadata 20 | 21 | | Field | Description | Example | 22 | |---------------|--------------------------------------------------|-------------------------------------------------------| 23 | | `DISEASE:` | Search for mined diseases | `DISEASE:dysthymias` | 24 | | `GENE_PROTEIN:` | Search for records that have GENE_PROTEINS mined | `GENE_PROTEIN:gng11` | 25 | | `GOTERM:` | Search for records that have GOTERM mined | `GOTERM:apoptosis` | 26 | | `CHEM:` | Limit your search by MeSH substance | `CHEM:propantheline`, `CHEM:"protein kinases"` | 27 | | `ORGANISM:` | Search for mined organisms | `ORGANISM:terebratulide` | 28 | | `PUB_TYPE:` | Limit your search by publication type | `PUB_TYPE:review`, `PUB_TYPE:"retraction of publication"` | 29 | 30 | ### Section-level search 31 | 32 | | Field | Description | Example | 33 | |------------|----------------------------------------------------------------------|--------------------------------| 34 | | `INTRO:` | Find articles with a phrase in the Introduction & Background section | `INTRO:"protein interactions"` | 35 | | `METHODS:` | Find articles with a phrase in the Materials & Methods section | `METHODS:"yeast two-hybrid"` | 36 | | `RESULTS:` | Find articles with a phrase in the Results section | `RESULTS:"in vivo"` | 37 | | `DISCUSS:` | Find articles with a phrase in the Discussion section | `DISCUSS:cardiovascular` | 38 | | `ACK_FUND:` | Find articles with a phrase in the Acknowledgements & Funding section | `ACK_FUND:ERC` | 39 | -------------------------------------------------------------------------------- /software-tutorials/getpapers/getpapers-ieee-queries.md: -------------------------------------------------------------------------------- 1 | [Official documentation](http://ieeexplore.ieee.org/gateway/) 2 | 3 | ### Search in Fields 4 | 5 | | field | description | 6 |
|-----------|---------------| 7 | | `au` | Terms to search for in Author | 8 | | `ti` | Terms to search for in Document Title | 9 | | `ab` | Terms to search for in Abstract | 10 | | `doi` | Terms to search for in DOI | 11 | | `cs` | Terms to search for in Affiliations | 12 | | `jn` | Terms to search for in Publication Title | 13 | | `isbn` | Terms to search for in ISBN | 14 | | `issn` | Terms to search for in ISSN | 15 | | `py` | Terms to search for in Publication Year | 16 | | `partnum` | Terms to search by Part Number | 17 | | `thsrsterms` | Terms to search for in Thesaurus Terms | 18 | | `cntrlterms` | Terms to search for in Controlled Index Terms | 19 | | `idxterms` | Terms to search for in Index Terms | 20 | | `md` | Terms to search for in all configured metadata fields and abstract. Accepts complex queries involving field names and boolean operators. | 21 | | `querytext` | Terms to search for in all configured metadata fields, abstract and document text. Accepts complex queries involving field names and boolean operators. | 22 | 23 | 24 | ### Filtering parameters 25 | 26 | | parameter | description | 27 | |-----------|---------------| 28 | | `oa` | Open Access only (1 - true; 0 - false) | 29 | | `pn` | Publication number | 30 | | `pys` | Start value of Publication Year to restrict results by. | 31 | | `pye` | End value of Publication Year to restrict results by. | 32 | | `pu` | Publisher. One of: IEEE/AIP/IET/AVS/IBM | 33 | | `ctype` | Content Type. One of: Conferences/Journals/Books/Early Access/Standards/Educational Courses | 34 | 35 | ### Paging parameters 36 | 37 | | parameter | description | 38 | |-----------|---------------| 39 | | `hc` | Number of records to fetch.
Default: 25; Maximum: 1000 | -------------------------------------------------------------------------------- /software-tutorials/installation/README.md: -------------------------------------------------------------------------------- 1 | # Installation 2 | 3 | The installation instructions have moved. Please look [here](http://contentmine.github.io/) for instructions for your platform. 4 | -------------------------------------------------------------------------------- /software-tutorials/norma/README.md: -------------------------------------------------------------------------------- 1 | # norma 2 | ============================== 3 | 4 | ## Table of Contents 5 | 6 | 1. [Description](#description) 7 | 1. [Preparations](#preparations) 8 | 1. [Data Used](#data-used) 9 | 1. [Tutorial](#tutorial) 10 | 1. [XML to sHTML](#xml-to-shtml) 11 | 1. [PDF to text](#pdf-to-text) 12 | 1. [Troubleshooting](#troubleshooting) 13 | 1. [Summary](#summary) 14 | 1. [Next Tutorial](#next-tutorial) 15 | 1. [Further Materials](#further-materials) 16 | 17 | ## Description 18 | 19 | **What norma does** 20 | 21 | Norma transforms different file formats such as PDF, XML or HTML, provided by publishing websites and APIs, into our internal standard for scientific literature called [scholarly HTML](../sHTML/). Norma can also reverse engineer graphs, e.g. to extract data points from a timeline. 22 | 23 | **Why do we need norma?** 24 | 25 | Norma parses and merges different formats and file standards of scientific literature, and offers a unified output which can be used for further processing.
26 | 27 | **How can I use norma?** 28 | 29 | Norma offers paths from three different input streams to sHTML: 30 | * from XML-files collected by an API-query of [getpapers](../getpapers/README.md) 31 | * from HTML-files collected by a URL-scrape of [quickscrape](../quickscrape/README.md) 32 | * from an existing collection of PDFs 33 | 34 | **What you will learn here** 35 | 36 | This tutorial shows you how to 37 | * normalize XML-files or HTML-files into sHTML 38 | * normalize a collection of PDFs into simple text files 39 | 40 | **How to use the tutorial** 41 | 42 | A few conventions are used throughout the tutorial. 43 | - Placeholder variables are always written in capitals, like NAME, YOURDIRECTORY etc. 44 | 45 | **Glossary** 46 | - API 47 | - PDF 48 | - XML 49 | - HTML 50 | - Normalizing 51 | - Parsing 52 | - Scraping 53 | 54 | 55 | ## Preparations 56 | ### Prerequisites 57 | 58 | ### Used Software 59 | - [Future TDM Virtual Machine](LINK) 60 | - norma 0.1.4 61 | 62 | ### Installation 63 | 64 | On the ContentMine-VM norma is already provided. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform. 65 | 66 | You can find the technical documentation for `norma` in its [repository](https://github.com/ContentMine/norma). 67 | 68 | ## Data used 69 | 70 | ![normasmall0](../../assets/images/software/norma/normasmall0.png) 71 | 72 | Norma can take four different file formats for the publications (XML, PDF, HTML, XHTML) and additional files for supplementary materials and PNGs. getpapers or quickscrape can be used to create a corpus of papers, but you can also use your own PDFs or HTML. The task is to unify the different input streams into a single data format. 73 | 74 | ## Tutorial 75 | 76 | This tutorial is based on release [0.2.26](https://github.com/ContentMine/norma/releases).
 77 | 78 | ### XML to sHTML 79 | 80 | ![normaxml2shtml](../../assets/images/software/norma/normaxml2shtml0.png) 81 | 82 | We continue after retrieving a number of papers with `getpapers -q 'ursus maritimus' -o ursus -x`. 83 | 84 | getpapers returns a [CProject](../cproject)-folder. It contains subfolders, each holding one paper and its associated files: fulltexts in PDF or XML format and, if requested, supplementary files. Just a quick reminder of what it may look like: 85 | 86 | ![inspectctree1](../../assets/images/software/getpapers/getpapers-inspectctree2.png) 87 | 88 | ![inspectctree2](../../assets/images/software/getpapers/getpapers-inspectctree1.png) 89 | 90 | We now transform the fulltext.xml-files into sHTML. This can be done in bulk by passing the project folder with `--project`. The input/output-parameters `-i` and `-o` name the files to read from and write to within each paper folder. The parameter `--transform nlm2html` selects a specific transformation from one format to another. 91 | 92 | ```bash 93 | norma --project ursus -i fulltext.xml -o scholarly.html --transform nlm2html 94 | ``` 95 | 96 | ![nlm2htmlwithwarnings](../../assets/images/software/norma/nlm2htmlwithwarnings.png) 97 | 98 | norma will print a lot of process information in the form of format warnings. 99 | 100 | If you inspect the folder with `tree ursus`, you will see that norma has added a `scholarly.html`-file to those papers where a format conversion was possible. 101 | 102 | ![cprojectwithshtml](../../assets/images/software/norma/cprojectwithshtml.png) 103 | 104 | ### PDF to text 105 | 106 | PDF is a notoriously bad format for automatic processing. While understandable for the human reader, PDF is a real obstacle for computers and so for content mining. This lies in the nature of the document, which - from a machine's perspective - is essentially a 2-dimensional plane with symbols on it. The only information that a machine readily knows about any symbol is its x- and y-location on the plane.
Meaning, relations with other symbols, or logical concepts are not present in a PDF and have to be reconstructed from outside information. 107 | 108 | This process is still in an experimental stage, and is best suited to cases where you want to extract the *raw content* of a PDF for plain text analysis. 109 | 110 | Before we can apply norma to a collection of PDFs, we have to move them into a CProject structure. First place all target PDFs into a folder (here `my-pdfs`), and [cd](../shell) into it. You can then use the following code to create folders based on the names of the PDFs, move each PDF into the corresponding folder, and rename it to fulltext.pdf. 111 | 112 | ```bash 113 | for fname in *.pdf; do 114 | filename=$(basename "$fname"); 115 | filename="${filename%.*}"; 116 | mkdir "$filename"; 117 | mv "$fname" "$filename"/fulltext.pdf; 118 | done 119 | ``` 120 | 121 | This is now our CProject folder. We then move one level back in the hierarchy with `cd ..`, in order to run norma from outside the folder. 122 | 123 | The folder structure should then look like this: 124 | 125 | ``` 126 | tree my-pdfs 127 | my-pdfs 128 | ├── foldername1 129 | │   └── fulltext.pdf 130 | ├── foldername2 131 | │   └── fulltext.pdf 132 | ├── foldername3 133 | │   └── fulltext.pdf 134 | └── foldername4 135 | └── fulltext.pdf 136 | 137 | 4 directories, 4 files 138 | ``` 139 | 140 | To finally convert the PDFs into text files, run the following norma command: 141 | 142 | ```bash 143 | norma --project my-pdfs/ -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt 144 | ``` 145 | 146 | ## Summary 147 | 148 | * use publications downloaded via getpapers and convert the XML to uniform sHTML with `norma --project PROJECTFOLDER -i fulltext.xml -o scholarly.html --transform nlm2html` 149 | * convert PDFs to text files with `norma --project PROJECTFOLDER -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt` 150 | 151 | ## Next tutorial 152 | * Back to the [tutorial overview](..)
 153 | * Continue to [sHTML](../sHTML) if you want to learn more about scholarly HTML. 154 | * Continue to [ami](../ami) for the next step of the ContentMine pipeline. 155 | -------------------------------------------------------------------------------- /software-tutorials/norma/notes.txt: -------------------------------------------------------------------------------- 1 | * mention CProject 2 | * mention it everywhere along the pipeline 3 | * catch: redirect to getpapers - carry out steps blabla 4 | * catch: redirect to quickscrape 5 | * check norma-logs for success/errors 6 | * test norma-logging levels 7 | * delete norma-paths 8 | * delete explanations that are peripheral 9 | * sHTML tutorial 10 | * delimit PDF more clearly: much worse quality, no sHTML 11 | * state the experimental status of PDF2text 12 | * explain PDF2text results -------------------------------------------------------------------------------- /software-tutorials/quickscrape/README.md: -------------------------------------------------------------------------------- 1 | # quickscrape 2 | 3 | ## Table of contents 4 | 5 | 1. [Description](#description) 6 | 1. [Installation](#installation) 7 | 2. [Scraper definitions](#scraper-definitions) 8 | 3. [Scraping](#scraping) 9 | 4. [Other sources](#other-sources) 10 | 5. [Summary and next steps](#summary-and-next-steps) 11 | 12 | ## Description 13 | **What is quickscrape?** 14 | 15 | Together with [getpapers](../getpapers/README.md), `quickscrape` is one of the entry points of the ContentMine pipeline. It is designed to enable large-scale content mining, and retrieves PDFs, images and fulltext HTML files of scientific literature. 16 | 17 | **Why do we need quickscrape?** 18 | 19 | **How can I use quickscrape?** 20 | 21 | 22 | ## Installation 23 | 24 | On the ContentMine-VM quickscrape is already provided.
If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform. 25 | 26 | You can find the technical documentation for `quickscrape` in its [repository](https://github.com/ContentMine/quickscrape). 27 | 28 | ## Scraper definitions 29 | 30 | **Quickscrape is incomplete without the scraper definitions**. They are developed individually for each journal to account for different page layouts or html-tags. Scraper definitions are maintained in a [separate repository](https://github.com/ContentMine/journal-scrapers.git), and it is possible to [create your own definitions](../journal-scrapers/README.md) for a journal. 31 | 32 | At the moment definitions exist for the following publishers/journals: 33 | * BMC 34 | * PLoS 35 | * PeerJ 36 | * PNAS 37 | * eLife 38 | 39 | You can download the newest scraper definitions with this command: 40 | ```bash 41 | git clone https://github.com/ContentMine/journal-scrapers.git 42 | ``` 43 | The scraper definitions will then be found in `your_path/journal-scrapers/scrapers/`. Remember your path, because it will be needed later on. 44 | 45 | ## Scraping 46 | 47 | There are two possible inputs for quickscrape: a single URL, or a list of URLs. A single URL can be passed directly as a parameter on the command line; a list of URLs can be collected manually or taken from a basic getpapers query. Quickscrape will then visit every URL in this list and grab everything it can. This includes sections according to tags, images or tables. This process depends heavily on the format that is provided by the publisher. 48 | 49 | ### Single URLs 50 | 51 | A minimum query consists of a URL (or URL-list) and the path to a specific scraper (or a folder containing scraper definitions). You pass the URL with `-u` or `--url`, the scraper definition with `-s` or `--scraper` and the output-directory with `-o` or `--output`.
 52 | 53 | ```bash 54 | quickscrape -u url -s journal-scrapers/scrapers/scraper.json -o test_folder 55 | ``` 56 | 57 | The scraper you use should correspond to the URL you provide; if none is available, choose `generic_open.json`. 58 | 59 | ```bash 60 | quickscrape \ 61 | --url https://peerj.com/articles/384 \ 62 | --scraper journal-scrapers/scrapers/peerj.json \ 63 | --output peerj-384 64 | ``` 65 | 66 | ![quickscrape-url](../../assets/images/software/quickscrape/quickscrape-url.png) 67 | 68 | Quickscrape now creates a subfolder for each scraped URL, named after the article source. It contains a fulltext.html with the scraping results and a results.json containing metadata of the article, e.g. authors, title, abstract and bibliographic data. It may include other files such as fulltext PDFs, fulltext XMLs, or scraped images. This is also one of the starting points for a [ctree](../ctree/README.md), the main data structure of the ContentMine pipeline. 69 | 70 | ```bash 71 | tree peerj-384/ 72 | peerj-384/ 73 | └── https_peerj.com_articles_384 74 | ├── fig-1-full.png 75 | ├── fulltext.html 76 | ├── fulltext.pdf 77 | ├── fulltext.xml 78 | └── results.json 79 | 80 | 1 directory, 5 files 81 | ``` 82 | 83 | ### getpapers URL-lists 84 | 85 | In the next example we take the output we get from a [basic getpapers query](../getpapers/README.md#construct-a-simple-query_and-compare-results), e.g. `getpapers -q 'dinosaurs' --api eupmc -o test_eupmc`. This returns two files in a search results folder: an *apiname*_results.json, which contains metadata about the search results, and a fulltext_html_urls.txt, which contains a list of URLs of fulltext papers. A *valid list of URLs* is a text file with exactly one valid URL per line. A *valid URL* is a URL that leads to a fulltext page, e.g. [https://peerj.com/articles/384](https://peerj.com/articles/384).
86 | 87 | ```bash 88 | tree test_eupmc 89 | test_eupmc 90 | ├── eupmc_results.json 91 | └── fulltext_html_urls.txt 92 | 93 | cat test_eupmc/fulltext_html_urls.txt 94 | http://europepmc.org/articles/PMC4040045 95 | http://europepmc.org/articles/PMC4055607 96 | http://europepmc.org/articles/PMC4022087 97 | http://europepmc.org/articles/PMC3394943 98 | http://europepmc.org/articles/PMC4448809 99 | http://europepmc.org/articles/PMC3115283 100 | ``` 101 | 102 | We now take the fulltext_html_urls.txt as input for quickscrape. quickscrape will choose a scraper automatically, if one is available. If not, a scraper with very generic definitions will be used, and the result will not be as precise. 103 | 104 | ```bash 105 | quickscrape -r test_eupmc/fulltext_html_urls.txt -d journal-scrapers/scrapers/ -o test_eupmc 106 | tree test_eupmc 107 | test_eupmc 108 | ├── eupmc_results.json 109 | ├── fulltext_html_urls.txt 110 | ├── http_europepmc.org_articles_PMC1234567 111 | │   ├── fulltext.html 112 | │   └── results.json 113 | ├── http_europepmc.org_articles_PMC1234568 114 | │ ├── fig-1-full.png 115 | │ ├── fig-2-full.png 116 | │ ├── fulltext.html 117 | │ ├── fulltext.pdf 118 | │ ├── fulltext.xml 119 | │ └── results.json 120 | ├── ... 121 | ... 122 | └── http_europepmc.org_articles_PMC4448809 123 | ├── fulltext.html 124 | ├── fulltext.pdf 125 | └── results.json 126 | 127 | 19 directories, 43 files 128 | 129 | ``` 130 | 131 | ## Other sources 132 | From other searches or citation data you may have a list of DOIs ([Digital Object Identifier](https://en.wikipedia.org/wiki/Digital_object_identifier)), such as `https://dx.doi.org/10.7717/peerj.384`. This is not a valid URL input for quickscrape. You must first resolve the DOI, in this case [https://dx.doi.org/10.7717/peerj.384](https://dx.doi.org/10.7717/peerj.384) leads to [https://peerj.com/articles/384/](https://peerj.com/articles/384/). 
Other examples are [http://dx.doi.org/10.4103%2F1817-1745.131497](http://dx.doi.org/10.4103%2F1817-1745.131497) which leads to the [article on pediatricneurosciences.com](http://www.pediatricneurosciences.com/article.asp?issn=1817-1745;year=2014;volume=9;issue=1;spage=79;epage=81;aulast=Vitaliti), or [http://dx.doi.org/10.1074%2Fmcp.M111.014167](http://dx.doi.org/10.1074%2Fmcp.M111.014167) which leads to the [article on MCPOnline.org](http://www.mcponline.org/content/11/7/M111.014167). It is important to distinguish between the DOI and the landing page. **Content can only be scraped from the landing page.** 133 | 134 | ## Summary and next steps 135 | 136 | * A minimum query consists of a URL (or URL-list) and the path to a specific scraper (or a folder containing scraper definitions). 137 | * Please be a respectful and responsible miner and apply a reasonable rate limit with `-i` or `--ratelimit` (recommended between 3 and 6 scrapes per minute); note that `-r` is already used for the URL list. 138 | * The result will be a collection of [ctrees](../ctree/README.md) containing fulltexts in various formats (PDF, XML), a results.json with metadata, and possibly images. 139 | 140 | **Next steps** 141 | * Continue to [journal-scrapers](../journal-scrapers/README.md) if you want to define your own scraper. 142 | * Continue to [norma](../norma/README.md) for the next step of the ContentMine pipeline. 143 | * Continue to [ctree](../ctree/README.md) for an introduction to the main data structure.
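
**Checking a URL list before scraping**

The tutorial describes a valid URL list as a text file with exactly one URL per line. A minimal sketch of how such a file could be checked before it is handed to quickscrape is given here. The helper name `check_urllist` is hypothetical and not part of quickscrape, and the check is purely syntactic: it flags bare DOIs and blank lines, but it cannot tell whether an `http(s)` URL is a fulltext landing page or a DOI resolver link.

```bash
# check_urllist is a hypothetical helper, not a quickscrape command.
# It prints every line (with its line number) that is not exactly one
# http(s) URL, and returns non-zero if it finds any such line.
check_urllist() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  # -v inverts the match, -n adds line numbers, -E enables extended regex
  if grep -vnE '^https?://[^[:space:]]+$' "$file"; then
    return 1
  fi
  return 0
}

check_urllist test_eupmc/fulltext_html_urls.txt && echo "URL list looks fine"
```

If the helper prints any lines, fix or remove them in the text file and run it again before starting the scrape.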
 144 | -------------------------------------------------------------------------------- /software-tutorials/sHTML/README.md: -------------------------------------------------------------------------------- 1 | See [the sHTML spec](https://github.com/ScholarlyHTML/spec/) 2 | 3 | Breaking news - 2015-12 - There is now a very exciting [new W3C development in ScholarlyHTML](https://github.com/w3c/scholarly-html) which we are 4 | deeply involved in helping create 5 | -------------------------------------------------------------------------------- /software-tutorials/shell/README.md: -------------------------------------------------------------------------------- 1 | ![ContentMine logo](https://github.com/ContentMine/assets/blob/master/png/Content_mine(small).png) 2 | 3 | ### Basic shell commands 4 | 5 | In general: autocompletion with 'tab' may save you a lot of typing. If you want to interrupt and cancel the execution of a command, press ```CTRL``` + ```c```. 6 | 7 | **ls**: **l**i**s**ts files and directories, taking the current working directory as starting point. You can also look into the content of subdirectories by extending the path ```ls dir/nested_dir/nested_dir2``` ([wikipedia](https://en.wikipedia.org/wiki/Ls)). We inspect which files and folders are present in our current working directory and in the `shell-example` folder: 8 | ``` 9 | ls 10 | ls shell-example 11 | ``` 12 | ![ls](../../assets/images/software/shell/ls.png) 13 | 14 | 15 | **tree**: The tree-command provides a hierarchical overview of a folder's content, including subfolders. Please note that if you work on your local machine and not within our virtual machine, you may have to [install tree first](https://askubuntu.com/questions/431251/how-to-print-the-directory-tree-in-terminal). 16 | ``` 17 | tree shell-example 18 | ``` 19 | ![tree](../../assets/images/software/shell/tree.png) 20 | 21 | 22 | **cd**: **c**hange **d**irectory, moves the working directory to the target location ([wikipedia](https://en.wikipedia.org/wiki/Cd_(command))).
You can move back up in the directory hierarchy with `cd ..`. If you want to navigate to an absolute path, you have to start with a "/", e.g. `cd /home/workshop/workshop/`. 23 | ``` 24 | cd shell-example 25 | ls 26 | cd .. 27 | ``` 28 | ![cd](../../assets/images/software/shell/cd.png) 29 | 30 | 31 | **mkdir**: **m**a**k**e **dir**ectory: creates a new directory ([wikipedia](https://en.wikipedia.org/wiki/Mkdir)). 32 | ``` 33 | mkdir new-folder 34 | ls 35 | ``` 36 | ![mkdir](../../assets/images/software/shell/mkdir.png) 37 | 38 | 39 | **mv**: **m**o**v**es files and directories from the first location to the second. You can move them further down into already existing directories, but also up with ```mv dir ..```, and into the current directory with ```mv lower_dir ./new_lower_dir``` ([wikipedia](https://en.wikipedia.org/wiki/Mv_(Unix))). mv is also used to rename files or folders, e.g. ```mv old_filename.txt new_filename.txt```. 40 | ``` 41 | mv new-folder shell-example 42 | ls shell-example 43 | mv shell-example/new-folder ./ 44 | ls 45 | ``` 46 | ![mv](../../assets/images/software/shell/mv.png) 47 | 48 | 49 | **cp**: **c**o**p**ies files from the first location to the second ([wikipedia](https://en.wikipedia.org/wiki/Cp_(Unix))). If you want to copy a folder, you have to use ```cp -r source_dir target_dir``` where ```-r``` stands for recursive. 50 | ``` 51 | cp -r shell-example shell-example-2 52 | ls 53 | tree shell-example 54 | tree shell-example-2 55 | ``` 56 | ![cp](../../assets/images/software/shell/cp.png) 57 | 58 | 59 | **rm**: **r**e**m**oves the specified file ([wikipedia](https://en.wikipedia.org/wiki/Rm_(Unix))). If you want to remove a directory, use ```rm -r dir```, but be sure you really want this: the directory and everything in it is deleted without going to a trash folder.
60 | ``` 61 | rm -r shell-example-2 62 | ls 63 | ``` 64 | ![rm](../../assets/images/software/shell/rm.png) 65 | 66 | 67 | **cat**: **cat**enates the content of a file sequentially and prints line by line to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Cat_%28Unix%29)) 68 | ``` 69 | cat shell-example/lines.txt 70 | ``` 71 | ![cat](../../assets/images/software/shell/cat.png) 72 | 73 | 74 | **head**: prints the first *n* lines to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Head_(Unix))). Default is 10. 75 | ``` 76 | head -5 shell-example/lines.txt 77 | ``` 78 | ![head](../../assets/images/software/shell/head.png) 79 | 80 | 81 | **tail**: prints the last *n* lines to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Tail_(Unix))). Default is 10. 82 | ``` 83 | tail -5 shell-example/lines.txt 84 | ``` 85 | ![tail](../../assets/images/software/shell/tail.png) 86 | 87 | 88 | **wc**: Counts different things in a file, e.g. words, lines, or bytes ([wikipedia](https://en.wikipedia.org/wiki/Wc_%28Unix%29)). **wc -l filename** counts lines. 89 | ``` 90 | wc -l shell-example/lines.txt 91 | wc -w shell-example/lines.txt 92 | ``` 93 | ![wc](../../assets/images/software/shell/wc.png) 94 | 95 | 96 | * [Return to tutorial overview](..) 97 | * [Proceed to getpapers-tutorial](../getpapers) -------------------------------------------------------------------------------- /software-tutorials/vms/README.md: -------------------------------------------------------------------------------- 1 | # ContentMine Virtual Machines 2 | 3 | ## Table of Contents 4 | 5 | 1. [Description](#description) 6 | 2. [Installation](#installation) 7 | 3. [Troubleshooting](#troubleshooting) 8 | 4. [Components](#components) 9 | 10 | ## DESCRIPTION 11 | 12 | **What is a virtual machine?** 13 | A virtual machine is a simulated operating system 'within' your operating system (think Inception for operating systems). 
It consists of two parts: 14 | * [VirtualBox](https://www.virtualbox.org/): the software which runs the virtual machine 15 | * [ContentMine virtual machine image](): an image which holds the whole operating system with its configuration and our ContentMine software packages 16 | 17 | **Why does ContentMine use a virtual machine?** 18 | Virtual machines make it easy to use pre-configured software environments on different operating systems. In our case, this allows us to run the ContentMine software easily on all kinds of operating systems (Linux, Windows and Mac). It is used mostly for hands-on workshops and allows all attendees to run the software without having to modify their own systems, so we can start content mining quickly and smoothly, with a minimum of fuss. 19 | 20 | **How can I use the virtual machine?** 21 | For this, you have to install [VirtualBox](https://www.virtualbox.org/) and use it to start the [ContentMine virtual machine image](). You will find more details in the installation section. 22 | 23 | ## INSTALLATION 24 | 25 | ### Install VirtualBox 26 | VirtualBox runs on Windows, Linux, Macintosh, and Solaris and is licensed under the GNU General Public License (GPL) version 2. 27 | 28 | **Requirements** 29 | * Reasonably powerful x86 hardware. Any recent Intel or AMD processor should do. 30 | All other requirements depend on the virtual machine image; you can find them below. 31 | 32 | 1. Download the VirtualBox platform installer for your operating system from the [VirtualBox website](https://www.virtualbox.org/wiki/Downloads). 33 | 2. Run the installer and follow the on-screen instructions. 34 | 35 | ### Installing the ContentMine Virtual Machine image 36 | 37 | **Requirements** 38 | * 64-bit architecture (unless otherwise stated) 39 | * 3 GB RAM 40 | * Adequate hard drive space for the VM (at least 5 GB) 41 | 42 | 1.
Download the required ContentMine virtual machine image 43 | * 'Current VM' as at [May 2016](https://drive.google.com/open?id=0B7pJKedx9b97QVJWUzNnMGdPb00) 44 | * getpapers v0.4.5 45 | * quickscrape v0.4.7 46 | * norma v0.2.26 47 | * ami v0.2.24 48 | * 'Contentmine-FTDM' VM for Cambridge workshop/s [2015-12-10/11](http://contentmine.org/wp-content/uploads/static/contentmine-VM.ova) (3534356480 bytes on MAC-OSX) 49 | * 'Biology' VM for University of Bath workshop [28/07/15](https://onedrive.live.com/redir?resid=1652077CF1AA4E9F!1280&authkey=!AGyzu9zuzzKeJok&ithint=file%2cova) 50 | * 'Neuro' VM for Edinburgh Neuroscience hack [26/05/15] - [direct link](https://www.dropbox.com/s/yes9af47fn8vnz7/ContentMine-VM.ova?dl=0) 51 | * 'Cochrane' VM for Oxford Cochrane centre workshop [2015/03/15] - [direct link](https://drive.google.com/file/d/0B6ChGXuXmOEDemRtb1JBakREYWc/view?usp=sharing) 52 | * 'Playground' VM for [EBI workshop](https://github.com/ContentMine/EBI_workshop_20150330) [2015/03/30] - [direct link](https://drive.google.com/uc?export=download&confirm=dp8f&id=0B6ChGXuXmOEDNWx2d0EwbDkyY00) - [installation instructions](https://github.com/ContentMine/EBI_workshop_20150330/blob/master/docs/pre-workshop_installation.pdf) 53 | 54 | The image is fairly large (>1GB, now ca 3.3GB). Depending on your connection the download can take between 10 and 60 minutes (or much longer if you have a very slow connection). 55 | 2. Double-click on the downloaded file (```.ova``` file-extension). VirtualBox opens and offers to import the virtual machine. Please follow the on-screen instructions to complete the import. 56 | 3. Configure the import of the image 57 | 58 | 59 | This is a series of screenshots to show what you should be seeing when you first install VirtualBox and the virtual machine. These are for MAC-OSX and there will be minor differences for other operating systems. 60 | 61 | * Virtual Box Download (e.g. from https://www.virtualbox.org/wiki/Downloads);
62 | VirtualBox download 63 |
Pick your operating system 64 |
65 | 66 | * Virtual Box Installation:
67 | (MAC-OSX) click on downloaded file (creates
68 | MAC-OSX installer 69 |
70 | Then click on the package/box icon (1) and it should install in `Applications | VirtualBox.app` 71 | 72 | * Possible error (ignore). You might see:
73 | Ignore error 74 |
75 | If so, click the "Do not show this message again" and continue. 76 |
 77 | 78 | ### Starting the ContentMine Virtual Machine image 79 | 80 | After installing VirtualBox and importing the virtual machine image you can select the machine from the VirtualBox interface. 81 | 82 | ![Start the vm](../../assets/images/software/vms/starting-vm.png) 83 | 84 | Please go to "Settings" first and make sure you allocate at least 2000MB RAM to the Base Memory. Then click OK. 85 | 86 | ![Memory Settings](../../assets/images/software/vms/memorysettings.png) 87 | 88 | To start the virtual machine, select your image and click the "Start" button. 89 | 90 | After a few seconds you land on the desktop. 91 | 92 | ![CM vm desktop](../../assets/images/software/vms/desktop.png) 93 | 94 | You can shut down the vm with a right click, then "Exit", and then "Power off". 95 | 96 | **If you attend a workshop: Please try to get the virtual machine running _before_ the workshop. There will be little time on the day to help with VirtualBox issues!** 97 | 98 | 99 | * Starting the VM. (MAC-OSX) Click on the `Applications | VirtualBox.app` and you may see
100 | Powered off 101 |
102 | 103 | * When the VM is ready you will see
104 | Running 105 |
106 | (and on MAC-OSX the icons:
107 | icons) 108 | 109 | There can be more than one VM - we release different ones for different tutorials, and you can switch between them on the LH side. 110 | 111 | * when the VM is running you should see a screen such as: 112 | Startscreen 113 |
114 | 115 | * Right-click on the main window and get a popup: 116 | popup 117 |
118 | 119 | 120 | * Select `terminal` and you will get: 121 | crunchbang 122 |
 123 | 124 | * try ```ls -lt``` at the command line. If the `-` comes out as a German ß, the keyboard layout is set to German; type: 125 | ```setxkbmap gb``` to convert to GB or ```setxkbmap us``` for US. 126 | 127 | ### Usage 128 | 129 | Basic entry to different applications starts with a right click on the desktop. The following options are of interest to us: 130 | * Terminal: command line interface. This is the basic way to operate the ContentMine software. It opens a text-based interface, from where we can navigate folders, look into files, and interact with the ContentMine software. 131 | * File Manager: visually navigate through the folders 132 | * Web Browser: go onto the web. 133 | 134 | **Terminal** 135 | The execution of the updates can take a while (depending on your internet connection). After everything is checked, you should see a window like this: 136 | 137 | ![Terminal](../../assets/images/software/vms/terminal.png) 138 | 139 | You can maximize it to fullscreen by double clicking on the title bar. 140 | 141 | **Command Line / Shell** 142 | 143 | The command line is going to be the main interface with ContentMine. Some basic commands for using and navigating the command line are documented [here](../shell/README.md); please have a look if you are new to using the command line. 144 | 145 | **Copy & Paste** 146 | 147 | The settings of the virtual machine allow you to share the clipboard with the host machine. That means you can copy any text from the host machine (e.g. browser, text file, terminal) and paste it into the application you are using in the virtual machine, and vice versa. 148 | 149 | If you want to paste something into the command line, this is possible with right-click+"Paste", or ```Ctrl+Shift+V```. If you want to copy something out of the command line, e.g. an error message, highlight the message with the cursor, and then either right-click+"Copy" or use ```Ctrl+Shift+C```.
 150 | 151 | **File import/export between the host system and the vm** 152 | 153 | If you want to transfer files between the host system and the vm, you have to set up a shared folder. This has to be done in VirtualBox before starting the vm. Go to "Settings-Shared Folders" 154 | 155 | ## TROUBLESHOOTING 156 | 157 | The most common error is an incomplete download of the large VM image file; please verify that the download has been successfully and _fully_ completed. 158 | 159 | If you have any problems getting VirtualBox, or downloading and starting the virtual image, don't worry. Please contact the workshop organizer for support, or ask us on our [website](http://contentmine.org/contact). 160 | 161 | Note that some VMs may have a German locale and you may need to issue ```setxkbmap gb``` or ```setxkbmap us``` in the terminal to change the keyboard settings. 162 | 163 | ## COMPONENTS 164 | 165 | The VM comes with the following packages and environments installed: 166 | 167 | Environments: 168 | - node.js v0.10.24 169 | - npm 1.3.21 170 | - zsh 4.3.17 171 | 172 | ContentMine tools: 173 | - getpapers - version 174 | - quickscrape - version 175 | - norma - version 176 | - AMI-plugins - version 177 | 178 | Data analysis packages: 179 | - [anaconda](http://continuum.io/downloads#py34) 180 | - [jupyter](http://jupyter.readthedocs.org/en/latest/install.html) 181 | - R 3.2 182 | -------------------------------------------------------------------------------- /training-guidelines/README.md: -------------------------------------------------------------------------------- 1 | # Facilitator Materials 2 | 3 | This directory contains resources to **organize**, **communicate**, **evaluate** and/or **facilitate a ContentMine training**. The following files are available to use: 4 | 5 | * [General guidelines](general.md) 6 | What should the setting and atmosphere be like? What should your mindset as a facilitator be? The general guidelines offer ideas about giving a workshop.
A 'lessons learned' section describes solutions to avoidable mistakes in planning or running a workshop. 7 | * [Workflow](workflow.md) 8 | The workflow gives facilitators and organizers some orientation on which steps to perform when, and which dependencies exist. It also includes a checklist which helps facilitators and organizers prepare their sessions. 9 | * [Evaluation & Assessment](evaluation-assessment.md) 10 | What to prepare for? What can be improved? This document describes how to identify the needs of your audience before the workshop and how to evaluate the workshop afterwards. Improving the materials and guidelines through github-issues or other feedback helps facilitators create better workshops, and gives participants a better experience. 11 | * [Teaching resources](teaching.md) 12 | This contains a description of the methods used in the workshop as well as links to helpful external teaching resources. 13 | -------------------------------------------------------------------------------- /training-guidelines/evaluation-assessment.md: -------------------------------------------------------------------------------- 1 | # Assessment and Evaluation 2 | 3 | Finding out the needs and the prior knowledge of the participants (assessment) and how well the workshop met them (evaluation) are two important steps for improving the quality of the trainings, the training materials and the experience for participants (and facilitators). Allocate time and resources before, during and after a workshop to identify the needs of your audience and check how well they were met afterwards. 4 | 5 | ## Pre-Training Assessment 6 | 7 | Pre-training assessments should fill the gaps in the facilitators' knowledge about the participants, so that the training can be prepared as well as possible for them. This leads to better trainings and fewer problems during them, but it should still be decided training by training whether the effort is worth it.
This depends mostly on organizational parameters, like how much time is left in advance, and how well you already know your target audience. 8 | 9 | Steps to do: 10 | 1. Create a registration form with an email field 11 | 2. Think about the uncertain variables regarding your participants. What do you want to know from them? => Ask yourself who they are. 12 | 3. Create the survey 13 | 4. Send out the link to the survey at least one week in advance 14 | 5. Evaluate the survey before the training 15 | 6. Implement the outcomes of the evaluation in the training concept and content 16 | 17 | 18 | The survey should answer questions like: 19 | * Who is participating? 20 | * Profession 21 | * Skill levels: TDM, Programming, Research, (Newbie, advanced or expert) 22 | * Technical systems: OS, Software they use, etc. 23 | * What is the prior knowledge regarding the content of the training? 24 | * What do the participants expect (regarding the knowledge communicated about the session in advance)? 25 | 26 | It is also important to think about questions you want to re-validate after the workshop, so you can check progress in the skills of the participants and their confidence in applying them. 27 | 28 | ## Post-Training Evaluation 29 | 30 | The survey after the training should evaluate the progress achieved by the training, the satisfaction of the participants with the training, and possible further improvements. 31 | 32 | Steps to do: 33 | 1. Set up the evaluation form 34 | 2. Send out an email with the link 35 | 3. Send out a reminder two weeks later to those who have not yet answered 36 | 4. Evaluate the results 37 | 5. Implement the outcomes in the training materials, the training concepts and the strategy 38 | 39 | For effective training and learning evaluation, the principal questions should be: 40 | 41 | * To what extent were the identified training needs objectives achieved by the programme? 42 | * To what extent were the learners' objectives achieved?
43 | * What specifically did the learners learn or was useful to be reminded of? 44 | * What was the learners' expertise in certain fields (programming, legal issues, ContentMine project) before the training, and what is it afterwards? 45 | * What commitment have the learners made about the learning they are going to implement on their return to work? 46 | * What do they want to learn next (in general, in terms of Content-Mining)? 47 | * What can be improved? 48 | 49 | -------------------------------------------------------------------------------- /training-guidelines/how-to-setup-a-training.md: -------------------------------------------------------------------------------- 1 | # How to set up a Training 2 | 3 | 1. [Goals and Strategy](#goals-and-strategy) 4 | 5 | 2. [Work out the Training](#work-out-the-training) 6 | 7 | 3. [Preparations](#preparations) 8 | 9 | 4. [On Spot](#on-spot) 10 | 11 | 5. [Follow Up](#follow-up) 12 | 13 | 6. [Lessons Learned](#lessons-learned) 14 | 15 | This document should help you to work out the training materials, develop a proper training concept and execute it on the spot. Additionally you should use the [checklist.md](LINK), which supports you through this process. 16 | 17 | This document works together with [checklist.md](LINK). In this document you get practical support on how to work out a training concept and what to do in the main steps toward it. The [checklist.md](LINK) is an easy-to-use list for checking whether you have done the most important steps, which are described in this document in more detail. 18 | 19 | ## 1. Goals and Strategy 20 | 21 | A training can be quite easy if you know exactly what you want to do and what your goal is. But that is not generic, so you have to think about it from the beginning for every training, every host and every group of participants. Also, don't underestimate the opportunity to improve your own training skills.
22 | 23 | The first goal is to identify the goal(s) and work out a proper strategy to reach them with the training. This is the smallest section in the guideline, but if you spend too little time on it, the outcome will most likely not satisfy you and a lot of time can be spent going in the wrong direction. 24 | 25 | So take some time and think in depth about the following points: 26 | * Who is your target audience? Think in detail: age, expertise, prior knowledge, expectations, diversity, etc. 27 | * What should be the concrete outcome for the participants, for the hosts and for ContentMine? 28 | * What is needed to achieve the outcomes? 29 | * What is the problem you want to solve with the things you teach? 30 | * What does the host expect? 31 | 32 | **Think especially about a strong narrative and a strong use-case you want to offer the participants.** 33 | 34 | Prepare short, concise answers to these questions: 35 | * What is a fact? 36 | * What is content mining? 37 | * What is ContentMine? 38 | * What is content mining good for? 39 | * What is ContentMine good for? 40 | * What does it have to do with the TDM exception? 41 | 42 | And to sum it up: **Make everyone happy: Hosts, participants and you, the facilitator!** 43 | 44 | ## 2. Work out the Training 45 | 46 | ### Workflow 47 | * plan time for changing the room setup 48 | * think about participants arriving late: how can you include them most easily, and how often will this happen? 49 | * use the [checklist.md](LINK) and add your specific tasks with time to it. 50 | * check previous trainings: github repository, documents, follow ups, lessons learned 51 | * work in groups as often as possible. peer learning is a great way to have fun and learn something 52 | * check whether people are excluded from some activities, e.g. non-devs, people without a laptop, etc. 53 | * prepare offline alternatives, WIFI can always fail.
54 | * Have a good balance between hands-on sessions, socially interactive sessions and theoretical talks. 55 | * Check whether there is time pressure anywhere in the timeline, both while planning and after finishing it. Keep free slots so everyone can relax. People should never think “I need a break!”. 56 | * Identify the central transitions from one block to the next and think about what is necessary for them, e.g. from presentation to hack session, or from world cafe group to the big circle 57 | * think about the transition from the last session and into the next session 58 | 59 | ### Culture 60 | * create an inclusive and diverse space: no racism, no sexism, etc. 61 | * have fun 62 | * be positive 63 | * help people 64 | * don't blame them for not knowing something! 65 | * be patient 66 | 67 | ### Content 68 | * be clear about problems, limitations and advantages, but always offer a solution 69 | * Is there some pre-knowledge assumed? 70 | * Is the content too difficult? 71 | * do you use too many technical terms? 72 | * don't overload participants with too much information 73 | 74 | ### Session Formats 75 | 76 | Here we offer you some common types of sessions which can be combined into a full training schedule 77 | * after each block of content, always sum up what you did and why 78 | 79 | #### Introduction 80 | * give your name and your background: why are you interested in Content-Mining? 81 | * talk about ContentMine: website, the idea, Shuttleworth Fellowship, Peter Murray-Rust, Team, 82 | * use established narratives [narratives.md](LINK) 83 | * give an overview of what will happen in the training 84 | * explain what problem we want to solve and why CM is important 85 | * short introduction round: why are they here, name, activities/profession 86 | 87 | #### Presentation 88 | This is a list of basic tips on how to present well.
If you are looking for how to create slides, please go down to "Create Training Materials" > "Slides" 89 | * ask the audience questions: have they heard of it? etc. 90 | * tell a strong story 91 | * ask yourself: what is the purpose of the talk? 92 | * breathe out before you start and pause for 2-3 seconds 93 | * stand still: imagine your feet are growing into the floor like the roots of a tree to give you a stable stance 94 | * if you are nervous, relax and try to find something positive/funny about the presentation 95 | 96 | #### Hack Session 97 | A Hack Session is when the participants, together with the facilitators, hack on their own laptops in an open framing. These can be short sessions of 5-15 minutes, but also longer ones of over an hour. The guide offers some basic support, but a lot depends on which software, which use-case and which data you use. The sessions can be done individually, but they usually work better in groups of 2-3 people. 98 | 99 | Hack Sessions can be combined easily with On-Screen sessions. After the introduction on the screen, people can hack on a sandbox or a small worked-out task by themselves in the open. 100 | 101 | Hack Sessions are best for around 10 to 20 participants with 2 facilitators. For every additional 10 participants, another facilitator is necessary/helpful. 102 | 103 | **Purpose** 104 | 105 | 106 | * let people do something by themselves 107 | * let everyone create individual results if possible 108 | * let them learn in groups 109 | * guide them to something helpful for their own interests 110 | * learning by doing 111 | * show them the power and magic of content mining in real life 112 | 113 | **Preparations** 114 | 115 | 116 | * work out tasks for the participants 117 | * small tasks of 10 minutes are often better than one hour slots of one big project.
consider splitting a big slot into 4 smaller chunks 118 | * find information xyz 119 | * we have x, we want z, please do y 120 | * tell them the timeslot and the goal they should achieve 121 | * Prepare an offline alternative 122 | 123 | **Timeline** 124 | * explain the session: what is the goal, what do they have to do for it, why is this of interest to them, when will it end 125 | * show a demo (optional): sometimes it can be helpful to offer some concrete ideas 126 | * ask if every group has a laptop, the data and if the software is running 127 | * ask advanced participants to help others once they have finished 128 | * ask if everything is clear 129 | * start hacking session 130 | * walk around: talk and help people 131 | * announce when 5min are left before the session ends 132 | * end hacking session 133 | * let the people present their results and interpretation. this is where the real magic happens and their own thoughts get connected with those of others. 134 | * what have you found? 135 | * what were the problems? 136 | * did you solve them? yes, how? no, what barrier? 137 | * discuss/compare outcomes 138 | * sum up at the end 139 | 140 | **Possible Problems** 141 | * WIFI is not working 142 | * too many participants: build new groups. group up to 4 people together, but not more. 143 | 144 | #### On Screen Tutorials 145 | 146 | This session shows hands-on material on the screen, and optionally lets everyone in the training follow the steps shown one by one (e.g. explaining iPython notebook). 147 | 148 | If you want to know more about preparing code for workshops, look under Preparations -> Software. 149 | 150 | **Purpose** 151 | 152 | 153 | * show the audience some things / use-cases which can be done with ContentMine. 154 | * Introduce the audience to the ContentMine software 155 | * offer some pre-coded functionality to make more sophisticated analyses and applications easier for the audience.
156 | 157 | **Preparations** 158 | 159 | 160 | * keep in mind that the beamer may have a lower resolution 161 | * Execute the tutorials in a sequential way 162 | * plan breaks when blocks are done 163 | * look for supporters: because one person is presenting at the front the whole time, there is very little capacity left to support participants and solve problems on their machines 164 | 165 | **Timeline** 166 | 167 | * explain the things you will do 168 | * ask people to help each other 169 | * check if the system is running 170 | * start the presentation 171 | * sum up at the end 172 | 173 | **Didactics** 174 | 175 | * take it easy and make intentional breaks to let users catch up. 176 | * take your time to type and execute commands. be aware that after executing a command, the command itself may disappear from the screen, so it is no longer visible to the participants. When you want to use shortcuts like tab-completion, explain and show them. 177 | 178 | 179 | **Possible Problems** 180 | 181 | 182 | * people get stuck and fall behind: plan breaks, have someone to support and help 183 | * the data gets changed 184 | 185 | #### Discussions 186 | 187 | A discussion is a familiar format, so not much needs to be said about it. This section mostly offers some hints for moderating discussions and getting them started. 188 | 189 | **Purpose** 190 | 191 | 192 | * let everyone speak 193 | * offer an inclusive environment 194 | * let people connect with each other 195 | 196 | **Preparations** 197 | 198 | * prepare a backup plan in case no one wants to start. 199 | * prepare 2-3 questions, especially one to kick-start the discussion 200 | * try to identify in advance people who are likely to start 201 | 202 | **Timeline** 203 | 204 | * explain the goal of the discussion 205 | * state the duration 206 | * start discussion: ask a general question. if no one starts, ask a person directly 207 | * moderate the comments.
everyone should say something 208 | * stop discussion: when time is over, or ask participants if they want to extend by 5-15min (but not more!) 209 | 210 | **Possible Problems** 211 | * no one says anything: wait, stay relaxed and offer a friendly and welcoming space. after 1min of waiting (but before it gets too weird), ask a person directly again 212 | 213 | #### World Cafe 214 | 215 | **Concept** 216 | 217 | A World Cafe is a discussion format that brings people and ideas together in small, intimate groups of 3-5 people. The groups discuss several topics proposed by the facilitators in a defined amount of time. Normally you do 3 rounds of around 10-20 minutes each. The length and the number of rounds depend mostly on the time you have, but also on what you want to achieve with your questions. 218 | 219 | **Purpose** 220 | * Connect with the people in a meaningful way 221 | * Let people interact on a personal and direct level 222 | * Create a social atmosphere 223 | * Get straight into the thoughts and ideas of the people, no talking around issues! 224 | 225 | **Application** 226 | * After people did some content mining, more towards the end of a workshop 227 | * useful to wrap up things and get a meta-perspective 228 | * After a theoretical, one-to-many block 229 | 230 | **Preparation** 231 | * Virtual bell: to stop the rounds 232 | * post-its 233 | * Pens and Sheets with questions on them 234 | * Sticky tape/pins to hang up the sheets after the World Cafe 235 | * Think of 3-4 questions that open up a discussion, e.g. 236 | * What are opportunities created through content mining? 237 | * What are barriers to content mining? 238 | * How can content mining help you? 239 | 240 | **Timeline** 241 | 1.
Explain the next session: 242 | * “Who knows what a world cafe is?” 243 | * Art of Hosting method 244 | * state the number of rounds and their length 245 | * mention the meta harvest if it is done at the end 246 | * state the questions to discuss 247 | * document on sheets offered by facilitators 248 | * groups should get straight into the topics! 249 | * separate into groups of 3-5 people 250 | * ask for help to re-structure the room if necessary 251 | * start the room setup: hand out pens and sheets 252 | * Sum up: get in your round, discuss and document things 253 | 2. Get in groups. 254 | 3. Discuss and document first question 255 | * Ask if everyone has a sheet and 2-3 pens 256 | 4. Warn one minute before the end 257 | 5. Ring virtual bell when the round has ended 258 | 6. Group change: one person stays (each table decides for itself who does what; if a meta-harvest is used, the person who stays should report at the end in the big circle), everyone else should look for a totally new group 259 | 7. Steps 3 to 6 are repeated until the last question has been discussed 260 | 8. Collect sheets and pin them on the wall 261 | 9. Circle (optional) 262 | * The Circle seats all people in a circle to start a meta-harvest of the world cafe rounds. 263 | * In this the main topics of the groups get presented and discussed. 264 | * From every group at least one person should speak about what happened and what were the main points. 265 | * A new sheet should be prepared for this. 266 | * This can take from 10 to 30 minutes and even more, if people have something to tell. 267 | * if there is not enough time to allow everyone to speak, just ask one person per table to present issues 268 | 269 | **Possible Problems** 270 | * not enough participants: just reduce the number of groups to the smallest number necessary to get 3-5 people together. 271 | * too many participants: scale up the group size to a maximum of 6 people. look for additional tables. have more sheets and pens prepared than you expect to need.
272 | 273 | ### Examples 274 | **This will come later** 275 | 276 | * offer some examples of slots in a workshop in general and for CM specifically. Which formats are available? 277 | * how do facilitators organize their activities during the training? notes, laptop docs?? 278 | * narrative: offer information on how to create a strong and compelling narrative in your training. this is a very crucial point 279 | * link to older training repos 280 | 281 | ## 3. Preparations 282 | 283 | ### Room 284 | * Pin sheet with bit.ly shorturl to github repo on the main wall 285 | * Pin sheet with most important steps for newbies on the wall 286 | * Check if WIFI is available 287 | * Beamer: check resolution 288 | * check for plugs 289 | * check for additional laptops for users who have none, or whose laptop cannot run the software 290 | 291 | ### Software / VM 292 | * if you work with data, check if it is okay to share it with the participants (privacy, copyright) 293 | * think about different system requirements: OS, CPU power, internet bandwidth, storage, RAM, conflicts with pre-installed packages, 294 | * offer a sandbox (optional): for their first steps, before getting deeper into the software (e.g. creating a folder, opening a file, printing your first fact, etc.).
295 | * document code well 296 | * offer data examples (optional) 297 | * offer code examples (optional) 298 | * comment code so people understand processes 299 | * tell when computation may take a while 300 | * explain errors and how to handle them 301 | * give them information on how to solve problems themselves 302 | * prepare a data set which can easily be reset after participants have changed it 303 | * test your system: on different operating systems, on different hardware 304 | * when you test out your concept 305 | * when your software is finished 306 | 307 | ### Online 308 | * GitHub repository 309 | * wordpress event page [.md](LINK) 310 | * Tweet about it 311 | 312 | ### Disseminate 313 | Share the content with the community, the hosts and associated people. 314 | 315 | ### Follow Ups 316 | This is a very important point: What do you want the participants to follow up on and take home from the training? You should offer some (not more than 3!) concrete ways to follow up with the project or Content-Mining after the training. It is also very important to get in touch with the most motivated people and find out whether they are interested in doing more around the topic and the project.
317 | 318 | Here are some examples: 319 | * Make Pull Request with own hacks/code 320 | * Communication: Email, Twitter, Mailinglist 321 | * Discourse 322 | * Invite to another training or related event 323 | * Write direct email with specific topic 324 | 325 | **Evaluation** 326 | * work out [evaluation](LINK) 327 | 328 | ### Create Training Materials 329 | * think about color-blindness 330 | 331 | #### Slides 332 | Some basic hints for creating and sharing slides 333 | 334 | * avoid lists unless the content obviously is a list 335 | * use a big enough font size 336 | * use high-contrast colors to improve visibility 337 | * keep in mind that the beamer may have a lower resolution 338 | * use a strong visual approach 339 | * Slideshare is the preferred way to share your slides 340 | 341 | #### Screencasts 342 | Depending on your operating system, you can/should use: 343 | * Ubuntu/Linux: RecordMyDesktop 344 | * MacOS: 345 | * Windows: 346 | 347 | #### Hand Outs 348 | * Add a bit.ly link to the overall materials 349 | 350 | 351 | #### GitHub repository 352 | * offer participants ways to get to the necessary level in advance: install sw, learn stuff, read things 353 | 354 | ## 4. On Spot 355 | 356 | * Check WIFI connection 357 | * Close interactive applications like Skype, Slack, Tweetdeck etc. so incoming messages don't disturb the sessions. 358 | * Open necessary folders, slides, applications, notes and browser tabs 359 | * mention follow ups, especially evaluation 360 | * mention online materials, especially the github repository 361 | * Don't pressure yourself and the participants: trainings must be easy going 362 | * Ask whether things outside the participants' control hold them back from implementing what they learned 363 | * check for plugs 364 | 365 | ### Dealing with problems 366 | 367 | * Offer advice for offline alternatives: Art of Hosting, always prepare something!
368 | * Offer hints for testing the software, the data and the slides for readiness 369 | * Social interaction: offer some basic methods and concrete examples for social interactions. Give hints for common problems (people not interested, people start to fight, people blame you for failing/weak training, etc.) 370 | * Which information needs to be available all the time, especially for newly arriving people? 371 | * Think through what happens if far more or fewer people arrive: will the sessions still work? what could help, what do you have to change? 372 | * How to deal with racism, sexism, etc. Think about this in advance when inviting people and put focus on it 373 | * Work out alternative strategies for 3 and for 50 people 374 | 375 | ## 5. Follow Up 376 | 377 | Also look at the follow-ups part of the preparation section above. 378 | 379 | **Participants** 380 | * Send out Email with follow ups and evaluation form 381 | 382 | **Hosts** 383 | * get some feedback from them 384 | 385 | **Online** 386 | * write blog post 387 | * social media 388 | 389 | **Facilitators** 390 | * Document lessons learned in [doc.md](LINK). 391 | 392 | ## 6. Lessons Learned 393 | 394 | This list describes common problems that may occur and how they can be solved. 395 | 396 | | PROBLEMS | LESSONS LEARNED | 397 | |----------|-----------------| 398 | | The central message (vertical integration, ease of access and scalability, sectioning of papers, use of supplementary information) does not come across. | Put focus on unique points of interaction (sHTML, ctree, results) and not the technical details. | 399 | | The use case demonstrated is not really appropriate for the audience. | Start with a powerful demo first, then go into technical details. | 400 | | Too much content in too little time, not enough time for teaching/delivering the key message | Start low but have more in-depth stuff available.
| 401 | | Missing narrative, difficult transitions between sessions | Change preparation time allocation from 80% tech/ 20% storyline to 50/50. | 402 | | Early technological problems (keyboard locales, VM on Windows) lose large parts of the group, and it is difficult to recover from there. | Have backup plans and material: reserve more time for error handling in case something does not work. Define alternative tasks while solving problems. | 403 | | Getting started with the VM can be problematic, especially for newcomers. | People should not have to worry about the CLI other than copying the relevant commands. Everything should be within one folder, which is the starting folder of the command line. | 404 | | Missing central documentation to direct people to. | Prepare a github repo and an etherpad. | 405 | | Long URLs are difficult to type. | When URLs are used, shorten them with bit.ly | 406 | | People start wandering off on their own e.g. when some are still stuck at installing, and others already completed the task. | Define clearer “group stages” (Now we’re installing for 5 min; then we’ll begin with notebook; play around with facts for X min; then talk about results) | 407 | | When a session loses focus, it is difficult to catch people's attention again and pull them back together. | Try to provide "a social solution for a technical problem" - clear statements what to do if something fails; have people group up in teams of two with different OS and experience level, so that they move more together. | 408 | -------------------------------------------------------------------------------- /training-guidelines/session-formats.md: -------------------------------------------------------------------------------- 1 | # Teaching resources 2 | 3 | This contains a description of the methods used in the workshop as well as links to helpful external teaching resources. 4 | 5 | 1. [Presentation](#presentation) 6 | 7 | 2. [World Cafe](#world-cafe) 8 | 9 | 3.
[Hands on](#hands-on) 10 | 11 | 4. [Hacking session](#hacking-session) 12 | 13 | 14 | ### Presentation 15 | 16 | ##### What is it about? 17 | 18 | ##### When can I use it? 19 | 20 | ##### How to setup / frame it 21 | 22 | ##### resources needed 23 | 24 | ##### min / max time allocation 25 | 26 | ##### min / max nr. of participants 27 | 28 | ##### possible problems 29 | 30 | * bad beamer resolution: 31 | * no fineprint / high-res images 32 | * bad beamer colors: 33 | * clearer contrasting slide colors 34 | 35 | 36 | ##### Links 37 | 38 | 39 | 40 | ### World Cafe 41 | 42 | ##### What is it about? 43 | 44 | * Connect with the people in a meaningful way 45 | * Let people interact on a very personal and direct level 46 | * Create a pleasant social atmosphere towards the end so people keep a good memory of the session 47 | * Get straight into it, no bullshit around! 48 | * Discuss 3 questions in small groups (~5), sum up in large circle 49 | 50 | ##### When can I use it? 51 | 52 | * After people did some content mining, more towards the end of a workshop 53 | * useful to wrap up things and get a meta-perspective 54 | 55 | ##### How to setup / frame it 56 | 57 | 58 | **Before workshop** 59 | 60 | 1. Think of 3-4 questions that open up a discussion, e.g. 61 | * What are opportunities created through content mining? 62 | * What are barriers to content mining? 63 | * How can content mining help you? 64 | 1. write these questions down on flipchart 65 | 1. rearrange room / tables to allow for groups of 4-5 people to sit together 66 | 67 | **At workshop** 68 | 69 | 1. Explain the next session (2min) 70 | * “Who knows what a world cafe is?” 71 | * Art of Hosting Method 72 | * 3 rounds of 5min with one question for each. 73 | * Groups of 4-5 people 74 | * Discuss at a table. start straight into it. 5min is not much, so don't waste time. 75 | * Write down your thoughts. For this you have pens and a sheet.
76 | * After the bell rings, one person stays and everyone else moves to another table. each table decides for itself who does what. 77 | * At the end, after the 3 rounds, we will discuss the outcome in a big round/circle where we are going to harvest on a meta level. 78 | * Sum Up: Get in your round, discuss the questions and document main points on your sheet. 79 | * “Questions?” 80 | 1. Set up space (1min) 81 | * “So, let’s start it. Please get together in groups of 4-5 people.” 82 | * give sheets and pens to groups 83 | * Everyone at a table, a sheet and 1-2 pens? 84 | 1. three rounds (5min + 1min moving): 85 | * Check the tables: "everything ok?" 86 | * signal after 4min: "1 minute left" 87 | * Finish with virtual bell 88 | * Switch tables 89 | 1. Switch to big circle (1min) 90 | * “Let’s rearrange the tables and chairs to one big circle” 91 | * when there is not enough space for all: 2nd, 3rd row 92 | * Circle with Harvesting (10min) 93 | * explain it: everyone can go into the middle and speak about something he/she experienced. meta level. we collect this on a separate sheet. a circle is the perfect form to bring things together. 94 | * if not enough time for all: ask one person from each table to summarize for meta-harvest after the three rounds. 95 | 96 | ##### resources needed 97 | 98 | * a bell - real or virtual to signal start/end of rounds 99 | * Flipchart pens and sheets 100 | * sticky tape 101 | * post-its 102 | 103 | ##### min / max time allocation 104 | 105 | * 45min 106 | 107 | ##### min / max nr. of participants 108 | 109 | ##### possible problems 110 | 111 | * too many participants (>20): 112 | * create additional tables for more rotation 113 | * not enough participants (<6): 114 | * do the rounds together 115 | * no one starts talking in the big circle: give your own quick impression 116 | 117 | ##### Links 118 | 119 | 120 | 121 | ### Hands-on 122 | 123 | ##### What is it about?
124 | 125 | * let people do something by themselves and create individual results 126 | * let them learn in groups 127 | * learning by doing 128 | * show them the magic of content mining 129 | 130 | 131 | ##### When can I use it? 132 | 133 | * as often as possible 134 | 135 | ##### How to setup / frame it 136 | 137 | 1. let people form teams of two (1min) 138 | * condition: have at least one working laptop per two 139 | 140 | 1. make sure every team has the software and data (5-10min) 141 | 142 | 1. explain what's going to happen and what they will be able to do after the session (use case) 143 | 144 | 1. introduce the concept to be learned by demoing an example 145 | 146 | 1. have them reproduce the example 147 | 148 | 1. repeat: what is the use case, what is the problem, how are we going to solve it? 149 | 150 | 1. go through code / explain raw data 151 | 152 | 1. provide some small task/quiz: 153 | * find information xyz 154 | * we have x, we want z, please do y 155 | 156 | 1. enable sandbox environment: 157 | * let people explore on their own 158 | * go around, check if help needed 159 | 160 | 1. summarize: 161 | * what have you found? 162 | * what were the problems? 163 | * did you solve them? yes, how? no, what barrier? - should be identified during sandboxing 164 | * discuss/compare outcomes 165 | 166 | 167 | ##### resources needed 168 | 169 | * laptops 170 | * code examples 171 | * data examples 172 | 173 | 174 | ##### min / max time allocation 175 | 176 | 45min - 90min 177 | 178 | ##### min / max nr. of participants 179 | 180 | 2 - 15 181 | 182 | ##### possible problems 183 | 184 | * high participant / facilitator ratio: 185 | * ask advanced participants to help others once they have finished 186 | 187 | ##### Links 188 | 189 | 190 | 191 | ### Hacking session 192 | 193 | ##### What is it about? 194 | 195 | ##### When can I use it?
196 | 197 | ##### How to setup / frame it 198 | 199 | ##### resources needed 200 | 201 | ##### min / max time allocation 202 | 203 | ##### min / max nr. of participants 204 | 205 | ##### possible problems 206 | 207 | ##### Links 208 | 209 | 210 | 211 | ## additional facilitator resources 212 | 213 | 214 | -------------------------------------------------------------------------------- /training-guidelines/workflow.md: -------------------------------------------------------------------------------- 1 | # General workflow 2 | 3 | This should give facilitators and organizers some orientation on which steps to perform when. The **days ahead** estimates roughly indicate when this process should start, since there are dependencies later on. 4 | 5 | The major points can be raised as personal issues for the facilitators/organizers, with the subpoints as issue notes. This helps to keep track of progress. Also, each major point should be discussed directly between facilitators/organizers at least once. 6 | 7 | 8 | 1. (**4-8 weeks ahead**): Logistics 9 | - [ ] Liaise with local org on basic info, decide on: 10 | - [ ] Date, Venue 11 | - [ ] Team 12 | - [ ] Hashtag 13 | - [ ] Duration and timeframe (start - end, lunch breaks) 14 | - [ ] check local tech situation (internet speed, beamer,...) 15 | - [ ] Define target audience 16 | - [ ] Find place for social event after workshop (bar, restaurant) 17 | - [ ] start with PR (together with host?): 18 | - [ ] How? 19 | - [ ] When? 20 | - [ ] Where? 21 | - [ ] Partners: Media, Organizers, Domain specific, institutions, 22 | - [ ] Send to @GrahamSteel to create event page on contentmine.org 23 | 1. (**4-6 weeks ahead**): Framing 24 | - [ ] Think/talk about what people want to learn, check if any of our major use cases applies 25 | - [ ] Think about requirements: skills, material, preparation 26 | - [ ] Think about participant preparation: what people need to do in advance 27 | 1.
(**4-6 weeks ahead, optional**): Guest speaker 28 | - [ ] ask for attendance 29 | - [ ] ask for compliance with date 30 | - [ ] send out email: 31 | - [ ] event link 32 | - [ ] hard facts: date, location, meeting time, participants, 33 | - [ ] ask for support in advertisement: personally and institutionally 34 | - [ ] questions? 35 | - [ ] agenda 36 | - [ ] short description of event 37 | 1. (**3-5 weeks ahead**): 38 | - [ ] Identify knowledge gaps and uncertainties and create pre-training assessment. 39 | - [ ] **optional, 4 weeks ahead:** Send out pre-training assessment 40 | 1. (**28d ahead**) Set up repo via template [LINK]() 41 | - [ ] Short description 42 | - [ ] Time-Schedule 43 | - [ ] Trainers with links to personal pages or Twitter 44 | - [ ] Preparatory material: links to learn in advance 45 | - [ ] Travelling information and directions 46 | 1. (**18d ahead**, optional) Remind participants of pre-training assessment 47 | 1. (**14d ahead**) Draft session schedule 48 | - [ ] Choose modules according to target audience after closing pre-training assessment. We have three module categories 49 | - A: Introduction sessions 50 | - B: application of CMtools to do content mining 51 | - C: more hacky sessions, working together on user problems, downstream analysis 52 | - The session schedule should include short breaks after each session, and possibly time and space for a social event in the evening. 53 | - think about: framing, target audience, purpose for project and participants 54 | - discuss what should not / can not be done and why 55 | - potential technical issues -> prepare solution (additional documentation) 56 | - potential content questions / problems -> prepare answers 57 | - offer some social event in the evening 58 | - follow-ups 59 | 1.
(**14d ahead**): Update repo 60 | - [ ] schedule 61 | - [ ] pre-requisites 62 | - [ ] downloads / installations / datasets / tech requirements 63 | - [ ] Ask everyone involved in the workshop preparation for feedback and corrections on the facts mentioned in the event invitation. 64 | 1. (**10-14d ahead**) Prepare materials: 65 | - [ ] prepare pad 66 | - [ ] workflow 67 | - [ ] commands 68 | - [ ] links 69 | - [ ] bit.ly for github-repo 70 | - [ ] check technical readiness: 71 | - [ ] canary-workspace 72 | - [ ] jupyter-notebook 73 | - [ ] if using VM prepare sticks 74 | - [ ] readable by Mac/Windows/Linux 75 | - [ ] delocalize VM (language, host system) 76 | - [ ] check content readiness: 77 | - [ ] training github pages repo 78 | - [ ] slides up to date 79 | - [ ] adapted to local audience 80 | - [ ] dataset compiled 81 | - [ ] personal notes 82 | - [ ] handouts / posters / flipchart workshop sheets 83 | - [ ] if more than one facilitator: assign roles 84 | - [ ] who does facilitating? 85 | - [ ] who does tech support? 86 | 1. (**5d ahead**) Prepare follow ups 87 | - Set up post-training evaluation form 88 | - Set up email for participants: post-training survey, follow ups 89 | 1. (**3d ahead**) Connect with participants: 90 | - [ ] welcome email with installation instructions and material 91 | - [ ] event link 92 | - [ ] hard facts again: date, location 93 | - [ ] offer single point of contact 94 | 1. 
Do workshop: 95 | - [ ] Coordinate with local support: 96 | - [ ] technological support 97 | - [ ] some refreshments (coffee, snacks) 98 | - [ ] check beamer resolution, adjust screen for visibility 99 | - [ ] Ask for permission -> Take photos 100 | - [ ] Tweet and encourage others to do so 101 | - prepare locality (beamer, seating arrangement, material for harvesting) 102 | - [ ] Let participants group up (2-3-4), target: 4-6 groups 103 | - [ ] Ensure most (if not all) participants have required software installed 104 | - [ ] Introduce the pad 105 | - [ ] ask about skill level, test crucial skills (shell, python, statistics, etc) with easy, incremental steps to find out who your audience is 106 | - [ ] go through scheduled sessions 107 | 1. (**1d later**) Send out email to participants with: 108 | - [ ] feedback forms - read the [guidance](https://github.com/ContentMine/workshop-resources/blob/master/training-guidelines/evaluation-assessment.md#evaluation) on evaluation, copy and edit [template feedback form](https://docs.google.com/forms/d/1nCaM6_sA-clrWDoNzdua5Luxg8bV7dcBMj82pERIIpQ/edit). 109 | - [ ] github-repo 110 | - [ ] follow-ups 111 | 1. (**1d later**) Document the workshop 112 | - [ ] document what you have done compared to the initial plan. This can then be used to interpret the evaluations properly 113 | - [ ] Review: document the lessons learned in a separate doc 114 | 1. (**3d later, optional**) create additional media 115 | - [ ] Blog post with images 116 | - [ ] add photos to Flickr, videos to YouTube 117 | 1. (**3d later**) Get feedback from host 118 | 1. 
(**asap**) update materials: 119 | - [ ] repo with content generated during workshop 120 | - [ ] Add latest VM to [VM repo](https://github.com/ContentMine/workshops) 121 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/README.md: -------------------------------------------------------------------------------- 1 | ## About ContentMine 2 | 3 | ### Contents 4 | * what is content mining in general, what is CM specifically? 5 | * Describe the different types of content that might be mined 6 | * Describe the standard procedure for mining (Crawl -> Scrape -> Extract) 7 | * Describe some of the legal and technical barriers to content mining 8 | * Describe 2-3 examples of content mining. 9 | * Describe what we can offer 10 | 11 | ### Learning Goals 12 | 13 | * what is the vision and purpose of content mining 14 | * get an overview about the products of ContentMine 15 | * how to relate content mining to their own work 16 | 17 | ### Activities and Methods 18 | 19 | * presentation 20 | * demo of CM product 21 | * other demos 22 | * Q&A 23 | 24 | ### Duration 25 | 26 | * 45min 27 | * 20min introduction presentation 28 | * 25min discussion of participants interests/visions 29 | 30 | ### Prerequisites 31 | 32 | * none 33 | 34 | ### Resources 35 | 36 | * slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-About-ContentMine/about-contentmine.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-About-ContentMine/about-contentmine.pdf)) 37 | * online demos for amusement 38 | - [IUCN redlist](iucn/README.md) 39 | - [Fact of the Day](fotd/README.md) 40 | - [Chemistry](chemistry/README.md) 41 | - [bubbles](bubbles/README.md) 42 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/about-contentmine.odp: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/about-contentmine.odp -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/about-contentmine.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/about-contentmine.pdf -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/README.md: -------------------------------------------------------------------------------- 1 | # Bubbles 2 | 3 | A modern approach to showing document structure. 4 | 5 | ## online demo 6 | 7 | see http://bubbles.contentmine.org . 8 | 9 | * 10 | * 11 | * 12 | * 13 | * 14 | 15 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles0.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles1.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles1.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles3.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles4.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/bubbles/bubbles5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles5.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/README.md: -------------------------------------------------------------------------------- 1 | # ChemicalTagger 2 | 3 | A tool that completely parses chemical text and interprets synthetic procedures. 4 | ## online demo 5 | 6 | see http://chemicaltagger.ch.cam.ac.uk . 8 | 9 | * before annotation: 9 | 10 | 
11 | * immediate effect of annotation after clicking "Process Text" 12 | 13 |
14 | * showing just one 15 |
16 | * 17 | 18 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/chemicaltagger0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger0.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/chemicaltagger1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger1.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/chemicaltagger2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger2.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/chemicaltagger3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger3.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/chemistry/chemicaltagger4.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger4.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/fotd/README.md: -------------------------------------------------------------------------------- 1 | # Fact of the Day 2 | 3 | Every day the documents are searched for Facts: species or genes. There are some false positives, but on the whole it's quite fun. Where possible we link to an image from Wikipedia. 4 | 5 | ## online demo 6 | 7 | see http://fotd.contentmine.org . Each Fact points to the previous and the next. There are some places with a few blank days but generally you can find some nice pictures. 8 | 9 | * 10 | * 11 | 12 | 13 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/fotd/fotd.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/fotd/fotd.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/fotd/fotd1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/fotd/fotd1.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/README.md: -------------------------------------------------------------------------------- 1 | # IUCN 2 | 3 | See http://iucn.org for general background 4 | 5 | ## online demo 6 | 7 | Facts are harvested daily and filtered against IUCN species; results are displayed at 
http://iucn.contentmine.org . 8 | 9 | * The landing page:
10 |
11 | 12 | * we search for `ursus`, and note the bear species
13 | 14 | * When entered we get a list of species with `ursus`
15 | 16 |
17 | we'll select `Ursus maritimus`:
18 | 19 | * All the hits for "ursus": on the left in textual context, on the right the title of the article they were located in 
20 | 21 | * and the actual article
22 | 23 | 24 | 25 | -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/iucn1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn1.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/iucn2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn2.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/iucn3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn3.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/iucn4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn4.png -------------------------------------------------------------------------------- /training-modules/A-About-ContentMine/iucn/iucn5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn5.png -------------------------------------------------------------------------------- 
/training-modules/A-About-ContentMine/iucn/iucn6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn6.png -------------------------------------------------------------------------------- /training-modules/A-CM-Facts/README.md: -------------------------------------------------------------------------------- 1 | ## What is a ContentMine Fact 2 | 3 | ### Contents 4 | 5 | * introduce concept of fact from a technical perspective 6 | * walk through the complete pipeline of CM, from inputs to outputs 7 | * which steps are necessary in a content mining workflow 8 | 9 | ### Learning Goals 10 | 11 | * Understand the impact of crawlers on websites 12 | * Describe steps one can take to limit impact of crawling e.g. limit requests 13 | 14 | ### Activities and Methods 15 | 16 | * paper markup exercise 17 | * Hand-out and follow the [worksheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/manual_markup.pdf) 18 | * Provide 1-2 papers and ensure there is at least one copy of one paper per participant (paper links to be added here and to resource section) 19 | * step-by-step walkthrough: showing an example input, and what happens after CM is applied to it 20 | 21 | ### Duration 22 | 23 | * 45min 24 | * 15min introduction 25 | * 30min manual markup exercise 26 | 27 | ### Prerequisites 28 | 29 | * none 30 | 31 | ### Resources 32 | 33 | * readymade executable examples 34 | * slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/contentmine-facts.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/contentmine-facts.pdf)) 35 | * Manual markup 
[worksheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/manual_markup.pdf) (need at least one copy per two participants) 36 | * Manual markup papers (links to be added here) 37 | 38 | ### Watch for 39 | 40 | * country-specificity 41 | -------------------------------------------------------------------------------- /training-modules/A-CM-Facts/contentmine-facts.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/contentmine-facts.odp -------------------------------------------------------------------------------- /training-modules/A-CM-Facts/contentmine-facts.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/contentmine-facts.pdf -------------------------------------------------------------------------------- /training-modules/A-CM-Facts/iucn_walkthrough.md: -------------------------------------------------------------------------------- 1 | # Walkthrough of interactive use of IUCN 2 | 3 | This file describes the current IUCN demonstration on contentmine.org. 4 | 5 | **WARNING** this may be unstable against changes to the actual implementation/service 6 | -------------------------------------------------------------------------------- /training-modules/A-CM-Facts/manual_markup.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/manual_markup.pdf -------------------------------------------------------------------------------- /training-modules/A-Canary/README.md: 
-------------------------------------------------------------------------------- 1 | ## Canary 2 | 3 | ### Contents 4 | 5 | ### Learning Goals 6 | 7 | * what is the purpose of canary? 8 | * how does it work? 9 | * how to create own workspace 10 | * how to input URLs 11 | * how to execute queries 12 | * how to extract facts 13 | * how to access facts 14 | 15 | ### Activities and Methods 16 | 17 | * present a selection of possible applications, including network analysis, indexing and search, re-using extracted facts 18 | * guided software development 19 | * pair programming 20 | 21 | ### Duration 22 | 23 | * 45min 24 | * 5min intro 25 | * 5min create own workspace, explain options 26 | * 10min execute query: finding papers, removing failures, run plugins (do it with a narrow query) 27 | * 5min error handling 28 | * 10min exploring facts: context, source, options 29 | * 10min own exploration: show documentation, then hands on, while answering questions 30 | 31 | ### Prerequisites 32 | 33 | * CM tools and workflow 34 | * basic programming skills helpful 35 | * version control helpful 36 | 37 | ### Resources 38 | 39 | * canary.contentmine.org 40 | 41 | ### Watch for 42 | 43 | -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/README.md: -------------------------------------------------------------------------------- 1 | ## Legal Aspects and Responsible Mining 2 | 3 | ### Contents 4 | 5 | * Under what circumstances is it legal to content mine, and how? 
6 | * copyright 7 | * publisher licences 8 | * country-specific legislation 9 | * legal areas that impact content mining 10 | * specify licensing / status of resulting facts 11 | * introduce concept of fact from a legal perspective 12 | 13 | ### Learning Goals 14 | 15 | * learn about the impact of copyright on the possibilities of CM 16 | * learn the differences between licences 17 | * learn about current legal developments regarding text and data mining, including updates on H2020, EU copyright revisions 18 | * learn about barriers erected by publishers, including restrictive licences, API ToS 19 | * learn about current consultations and plans in the EU around TDM 20 | * learn about copyright restrictions (or exceptions) in your jurisdiction. 21 | * Understand the impact of crawlers on websites 22 | * Describe steps one can take to limit impact of crawling e.g. limit requests 23 | 24 | ### Activities and Methods 25 | 26 | * warm-up: legal quiz 27 | * presentation 28 | * Q&A 29 | 30 | ### Duration 31 | 32 | * 50min 33 | * 10min presentation about legal aspects 34 | * 15min quiz + discussion of answers 35 | * 10min wrap up and questions 36 | * 10min presentation about responsible mining 37 | * 5min discussion / questions 38 | 39 | ### Prerequisites 40 | 41 | * none 42 | 43 | ### Resources 44 | 45 | * Up-to-date [legal quiz handouts](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal_true_false.pdf), one per participant. 
46 | * Legal quiz [answer sheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/true_false_answers.pdf) for trainer 47 | * Slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal-responsible.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal-responsible.pdf)) 48 | * [Article on Responsible Content Mining](http://contentmine.org/wp-content/uploads/2015/06/responsible-content-mining-1.pdf) 49 | * [Charles Oppenheim slides at JISC event](jisctdm.pptx) 50 | 51 | 52 | ### Watch for 53 | 54 | * country-specificity 55 | -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/jisctdm.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/jisctdm.pdf -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/jisctdm.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/jisctdm.pptx -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/legal-responsible.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal-responsible.odp -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/legal-responsible.pdf: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal-responsible.pdf -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/legal_true_false.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal_true_false.pdf -------------------------------------------------------------------------------- /training-modules/A-Legal-Responsible/true_false_answers.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/true_false_answers.pdf -------------------------------------------------------------------------------- /training-modules/A-Participate-Contribute/README.md: -------------------------------------------------------------------------------- 1 | ## Participate and Contribute 2 | 3 | ### Contents 4 | 5 | * CMunities 6 | * discourse 7 | * github 8 | * Cambridge-Meetups 9 | 10 | ### Learning Goals 11 | 12 | * how to continue to participate 13 | * how to contribute to code 14 | * learn about version control and how to contribute to CM (scrapers, code, plugins) but also through issues and bug reports 15 | 16 | ### Activities and Methods 17 | 18 | * raise issue together 19 | * exemplary coding and contribution to github-repo, e.g. 
workshop or setup extra test repo 20 | * hands-on work on own code/repo 21 | 22 | ### Duration 23 | 24 | * 15min 25 | * 10min presentation 26 | * 5min questions 27 | 28 | ### Prerequisites 29 | 30 | * github-account 31 | * basic command line knowledge 32 | 33 | ### Resources 34 | 35 | * slides [LINK] 36 | 37 | ### Watch for 38 | 39 | -------------------------------------------------------------------------------- /training-modules/B-Architecture/README.md: -------------------------------------------------------------------------------- 1 | ## Architecture 2 | 3 | ### Contents 4 | 5 | * Overview of tool chain 6 | * What happens where, where can I jump in? 7 | 8 | ### Learning Goals 9 | 10 | * get a first intuition about the building blocks of ContentMine 11 | * learn about the different steps of content mining 12 | * learn about the source of papers and facts 13 | * learn about intermediaries like scholarly.html and CProject/CTree 14 | * learn how this leads to the outcome of content mining: a fact in context 15 | 16 | ### Activities and Methods 17 | 18 | * opening, reading and interpreting the intermediaries of ContentMine 19 | 20 | ### Duration 21 | 22 | * 45min 23 | * 7min presentation 24 | * 7min looking at publisher APIs and scope of data sources 25 | * 7min looking at CProject/CTree 26 | * 7min looking at a fulltext.xml / scholarly.html 27 | * 7min looking at facts in context 28 | * 10min reserve for questions 29 | 30 | ### Prerequisites 31 | 32 | * none 33 | 34 | ### Resources 35 | 36 | * slides [LINK] 37 | * a precompiled dataset with ami-plugins run 38 | * links and documentation to 2-3 publisher APIs (EUPMC, ...) 
39 | * an example CProject/CTree 40 | * an example fact + surrounding context 41 | 42 | ### Watch for 43 | 44 | -------------------------------------------------------------------------------- /training-modules/B-Architecture/contentmine-architecture.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Architecture/contentmine-architecture.odp -------------------------------------------------------------------------------- /training-modules/B-Architecture/contentmine-architecture.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Architecture/contentmine-architecture.pdf -------------------------------------------------------------------------------- /training-modules/B-Building-corpus/README.md: -------------------------------------------------------------------------------- 1 | ## Building a corpus with getpapers 2 | 3 | ### Contents 4 | 5 | ### Learning Goals 6 | 7 | * learn about the challenges posed by different publisher APIs 8 | * learn how to use getpapers 9 | * learn how to create your own collection of papers 10 | 11 | ### Activities and Methods 12 | 13 | * write search queries 14 | * compare website source with scraper definition and subsequent scraping result 15 | * handling errors 16 | * look at found results 17 | 18 | ### Duration 19 | 20 | * 50min 21 | * 10min presentation 22 | * 5min looking at API 23 | * 5min demo of getpapers-query 24 | * 20min hands-on of querying 25 | * 10min reserve for error handling 26 | 27 | ### Prerequisites 28 | 29 | * command line 30 | 31 | ### Resources 32 | 33 | * [Tutorial on getpapers](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers) 34 | * [Screencast on 
getpapers](https://youtu.be/H2ESPjihnDA) 35 | 36 | ### Watch for 37 | 38 | * too many big queries crashing the connection 39 | -------------------------------------------------------------------------------- /training-modules/B-Fact-Extraction/README.md: -------------------------------------------------------------------------------- 1 | ## Extracting Facts with ami-plugins 2 | 3 | ### Contents 4 | 5 | ### Learning Goals 6 | 7 | * get an overview of possible plugins and what outcome to expect 8 | * run ami-plugins on your data 9 | * learn how to interpret results.xml 10 | 11 | ### Activities and Methods 12 | 13 | * demonstration of ami-plugins 14 | * discussion of use-cases 15 | 16 | ### Prerequisites 17 | 18 | * command line 19 | 20 | ### Duration 21 | 22 | * 60min 23 | * 10min presentation 24 | * 5min discussion of interesting use case 25 | * 5min demo of running the chosen ami-plugin 26 | * 15min hands on for everyone 27 | * 10min reserve for error handling 28 | * 10min looking at a results.xml in-depth 29 | * 5min reserve for questions 30 | 31 | ### Resources 32 | 33 | * [Tutorial on ami](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/ami) 34 | * prepared, normalized dataset 35 | 36 | ### Watch for 37 | 38 | * community-specific use case 39 | -------------------------------------------------------------------------------- /training-modules/B-Normalization/README.md: -------------------------------------------------------------------------------- 1 | ## Normalisation 2 | 3 | ### Contents 4 | 5 | ### Learning Goals 6 | 7 | * what are the difficulties of unstructured information 8 | * why it is desirable to have a normalized version and what is scholarly html 9 | * understand the process of normalization within CM 10 | 11 | ### Activities and Methods 12 | 13 | * demonstration of norma 14 | * compare one input document with an output document in detail 15 | 16 | ### Duration 17 | 18 | * 30min 19 | * 10min presentation 20 | * 5min 
hands on with norma 21 | * 5min buffer for problems 22 | * 10min discussion / questions 23 | 24 | ### Prerequisites 25 | 26 | * command line 27 | 28 | ### Resources 29 | 30 | * [Tutorial on norma](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/norma) 31 | 32 | ### Watch for 33 | 34 | -------------------------------------------------------------------------------- /training-modules/B-Scraping/README.md: -------------------------------------------------------------------------------- 1 | ## Scraping 2 | 3 | ### Contents 4 | 5 | ### Learning Goals 6 | 7 | * how does the general process of scraping work, including structure of a website 8 | * how to identify specific data 9 | * how to reorganize it into a reusable format 10 | * learn about the challenges posed by different publishing formats 11 | * learn how to use quickscrape with own urls 12 | 13 | ### Activities and Methods 14 | 15 | * compare website source with scraper definition and subsequent scraping result 16 | * write scraper 17 | 18 | ### Duration 19 | 20 | * 45min 21 | * 10min presentation 22 | * 5min demo of quickscrape 23 | * 20min hands on example, comparing source-html with scraper-definition and output 24 | * 10min reserve for questions 25 | 26 | ### Prerequisites 27 | 28 | * command line 29 | 30 | ### Resources 31 | 32 | * readymade executable examples 33 | * two or three tested links with example output 34 | * slides [LINK] 35 | 36 | ### Watch for 37 | 38 | -------------------------------------------------------------------------------- /training-modules/B-Scraping/scraping.odp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Scraping/scraping.odp -------------------------------------------------------------------------------- /training-modules/B-Scraping/scraping.pdf: 
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Scraping/scraping.pdf
--------------------------------------------------------------------------------
/training-modules/B-VM-Commandline/README.md:
--------------------------------------------------------------------------------
## How to work with ContentMine tools

### Contents

How to set up the working environment of the Virtual Machine

### Learning Goals

* how to install
* how to get updates
* basic steps of the command line

### Activities and Methods

* installing software together
* getting to know command line

### Duration

* 45min
  * 10min installation
  * 5min reserve for error handling
  * 15min hands on example of command line
  * 10min showing how to keep software updated
  * 5min reserve for questions

### Prerequisites

* command line

### Resources

* ready-made executable examples
* links to stable versions of software
* [Tutorial on VirtualMachines](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/vms)
* [Tutorial on shell](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/shell)
* slides [LINK]

### Watch for

* installation problems on different OS

--------------------------------------------------------------------------------
/training-modules/B-Working-with-Facts/README.md:
--------------------------------------------------------------------------------
## Working with ContentMine Facts

### Contents

### Learning Goals

* show that real science can be done (e.g.
supertree)
* introduce ipython-notebooks
* how to read the results.xml files

### Activities and Methods

* very hacky

### Prerequisites

* command line
* at least one other programming language
* domain knowledge

### Duration

* 60min
  * 5min introduce notebooks
  * 10min getting notebooks to run
  * 5min reserve for error handling
  * 30min going through notebook workflow
  * 10min discussion / questions

### Resources

* prepared ipython notebook
* [pyCProject](https://github.com/ContentMine/pyCProject/blob/master/source/readctree.py)

### Watch for

* community-specific use case

--------------------------------------------------------------------------------
/training-modules/C-Own-Usecase/README.md:
--------------------------------------------------------------------------------
## How to apply ContentMine tools to my own problem

### Contents

* very hacky
* interactive
* exploratory

### Learning Goals

* how to use the CM tool chain to solve one self-defined problem
* develop a custom application/tool/script/data pipeline to solve a problem

### Activities and Methods

* present a selection of possible applications, including network analysis, indexing and search, re-using extracted facts
* guided software development
* pair programming

### Duration

* 60min - 180min

### Prerequisites

* CM tools and workflow
* basic programming skills helpful
* version control helpful

### Resources

### Watch for

--------------------------------------------------------------------------------
/training-modules/C-Regex/README.md:
--------------------------------------------------------------------------------
## Regex in ami

### Contents

* what is a regex, why is it useful
for me?

### Learning Goals

* learn how to create their own regex queries
* be able to interpret why something is not as expected
* how to input a regex into ami and run it
* work with results

### Activities and Methods

* working on examples
* guided hacking
* pair programming

### Duration

* 60min - 90min

### Prerequisites

* CM tools and workflow
* basic programming skills helpful
* version control helpful

### Resources

* http://regexr.com/
* an example dataset

### Watch for

--------------------------------------------------------------------------------
/training-modules/C-Writing-scrapers/README.md:
--------------------------------------------------------------------------------
## Writing scraper definitions

### Contents

* What is a scraper definition?
* What do I need it for?
* How can I build one?

### Learning Goals

* learn how to create their own scraper definitions
* be able to interpret why something is not as expected
* how to add a new definition to quickscrape and run it
* get an intuition for results

### Activities and Methods

* working on examples
* guided hacking
* pair programming

### Duration

* 60min - 120min

### Prerequisites

* CM tools and workflow
* basic programming skills helpful
* version control helpful

### Resources

* an example dataset

### Watch for

--------------------------------------------------------------------------------
/training-modules/README.md:
--------------------------------------------------------------------------------
![ContentMine logo](https://github.com/ContentMine/assets/blob/master/png/Content_mine(small).png)

# Content of this folder

This folder contains learning modules which can be combined in any desired way by workshop facilitators.
Each module contains a description of learning goals, links to additional resources like slides or software tutorials, and a detailed description of steps.

Modules beginning with "A" contain a general introduction to content mining and the ContentMine project. They include an overview of legal and practical issues, as well as an introduction to core concepts of content mining.

Modules beginning with "B" contain practical examples and demonstrations of ContentMine tools, and are designed as more interactive sessions.

Modules beginning with "C" build on what has been learned in "B". They are designed as hacky, guided applications of ContentMine tools, and enable participants to explore and work on their own use cases with the help of our facilitators.

| Module | Description | Estimated length |
|--------|------------|-------------------|
| A | **Introduction sessions** | 195min |
| A1 | [What is content mining in general, what does ContentMine do specifically?](A-About-ContentMine) | 45min |
| A2 | [Under what circumstances is it legal to content mine, and how?](A-Legal-Responsible) | 45min |
| A3 | [What is a fact, where does it come from, what can I do with it?](A-CM-Facts) | 45min |
| A4 | [Participate and contribute to the ContentMine](A-Participate-Contribute) | 15min |
| A5 | [Getting an easy start with Canary](A-Canary) | 45min |
|----|------------------------------------------------|------|
| B | **Practical sessions** | 335min |
| B1 | [Overview of tool chain: What happens how, where can I jump in?](B-Architecture) | 45min |
| B2 | [How to work with ContentMine tools](B-VM-Commandline) | 45min |
| B3 | [Building a corpus with getpapers](B-Building-corpus) | 50min |
| B4 | [Scraping with quickscrape](B-Scraping) | 45min |
| B5 | [Normalizing the literature](B-Normalization) | 30min |
| B6 | [Extracting facts with AMI](B-Fact-Extraction) | 60min |
| B7 | [Putting facts in context with jupyter](B-Working-with-Facts) | 60min |
|----|------------------------------------------------|------|
| C | **Hacking sessions** | 180min - 390min |
| C1 | [ami-regex](C-Regex) | 60min - 90min |
| C2 | [Writing scraper definitions](C-Writing-scrapers) | 60min - 120min |
| C3 | [Explore own use case](C-Own-Usecase) | 60min - 180min |
--------------------------------------------------------------------------------