├── .gitignore
├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── assets
├── README.md
├── docs
│ ├── blog.md
│ ├── media.md
│ └── slides.md
└── images
│ ├── highlighted_facts.png
│ └── software
│ ├── ami
│ ├── ami2-species-genus-results.png
│ └── ami2-species-genus.png
│ ├── getpapers
│ ├── getpapers-complex1.png
│ ├── getpapers-folder.png
│ ├── getpapers-inspectctree1.png
│ ├── getpapers-inspectctree2.png
│ ├── getpapers-inspecturlstxt.png
│ ├── getpapers-metadataonly.png
│ ├── getpapers-noexecute.png
│ ├── getpapers-simplequery.png
│ └── getpapers-simpleresult.png
│ ├── journal-scrapers
│ ├── 001.png
│ ├── 002.png
│ ├── 003.png
│ ├── 004.png
│ ├── 005.png
│ ├── 006.png
│ ├── 007.png
│ ├── 008.png
│ ├── 009.png
│ ├── 010.png
│ ├── 011.png
│ └── 012.png
│ ├── norma
│ ├── cprojectwithshtml.png
│ ├── html2shtml.graphml
│ ├── html2shtml.png
│ ├── nlm2htmlwithwarnings.png
│ ├── normahtml2shtml0.dot
│ ├── normahtml2shtml0.png
│ ├── normasmall0.dot
│ ├── normasmall0.png
│ ├── normaxml2shtml0.dot
│ ├── normaxml2shtml0.png
│ ├── pdf2txt.graphml
│ ├── pdf2txt.png
│ ├── xml2shtml.graphml
│ └── xml2shtml.png
│ ├── quickscrape
│ └── quickscrape-url.png
│ ├── shell
│ ├── cat.png
│ ├── cd.png
│ ├── cp.png
│ ├── head.png
│ ├── ls.png
│ ├── mkdir.png
│ ├── mv.png
│ ├── rm.png
│ ├── tail.png
│ ├── tree.png
│ └── wc.png
│ └── vms
│ ├── desktop.png
│ ├── memorysettings.png
│ ├── starting-vm.png
│ ├── terminal.png
│ ├── vm-crunchbang.png
│ ├── vm-error.png
│ ├── vm-icon.png
│ ├── vm-installer.png
│ ├── vm-popup.png
│ ├── vm-poweredoff.png
│ ├── vm-running.png
│ ├── vm-startscreen.png
│ └── vm-virtualbox-download.png
├── software-tutorials
├── README.md
├── ami
│ └── README.md
├── canary
│ └── README.md
├── cat
│ └── README.md
├── cproject
│ └── README.md
├── getpapers
│ ├── README.md
│ ├── getpapers-arxiv-queries.md
│ ├── getpapers-eupmc-queries.md
│ └── getpapers-ieee-queries.md
├── installation
│ └── README.md
├── journal-scrapers
│ └── README.md
├── norma
│ ├── README.md
│ └── notes.txt
├── quickscrape
│ └── README.md
├── sHTML
│ └── README.md
├── shell
│ └── README.md
└── vms
│ └── README.md
├── training-guidelines
├── README.md
├── evaluation-assessment.md
├── how-to-setup-a-training.md
├── session-formats.md
└── workflow.md
└── training-modules
├── A-About-ContentMine
├── README.md
├── about-contentmine.odp
├── about-contentmine.pdf
├── bubbles
│ ├── README.md
│ ├── bubbles.png
│ ├── bubbles0.png
│ ├── bubbles1.png
│ ├── bubbles3.png
│ ├── bubbles4.png
│ └── bubbles5.png
├── chemistry
│ ├── README.md
│ ├── chemicaltagger0.png
│ ├── chemicaltagger1.png
│ ├── chemicaltagger2.png
│ ├── chemicaltagger3.png
│ └── chemicaltagger4.png
├── fotd
│ ├── README.md
│ ├── fotd.png
│ └── fotd1.png
└── iucn
│ ├── README.md
│ ├── iucn1.png
│ ├── iucn2.png
│ ├── iucn3.png
│ ├── iucn4.png
│ ├── iucn5.png
│ └── iucn6.png
├── A-CM-Facts
├── README.md
├── contentmine-facts.odp
├── contentmine-facts.pdf
├── iucn_walkthrough.md
└── manual_markup.pdf
├── A-Canary
└── README.md
├── A-Legal-Responsible
├── README.md
├── jisctdm.pdf
├── jisctdm.pptx
├── legal-responsible.odp
├── legal-responsible.pdf
├── legal_true_false.pdf
└── true_false_answers.pdf
├── A-Participate-Contribute
└── README.md
├── B-Architecture
├── README.md
├── contentmine-architecture.odp
└── contentmine-architecture.pdf
├── B-Building-corpus
└── README.md
├── B-Fact-Extraction
└── README.md
├── B-Normalization
└── README.md
├── B-Scraping
├── README.md
├── scraping.odp
└── scraping.pdf
├── B-VM-Commandline
└── README.md
├── B-Working-with-Facts
└── README.md
├── C-Own-Usecase
└── README.md
├── C-Regex
└── README.md
├── C-Writing-scrapers
└── README.md
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | notes.md
2 |
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # How to contribute
2 |
3 | :+1: We're happy you're thinking about contributing to ContentMine! The material offered here is subject to continuous change and improvement, so please check again regularly. We're also very happy to include your feedback, suggestions and contributions.
4 |
5 | There are many ways to contribute:
6 | - by reporting an issue regarding the training material
7 | - by suggesting new content
8 | - by submitting pull requests and closing issues
9 | - by writing, blogging or tweeting about the project
10 | - by translating the materials
11 |
12 | ### How to set up your development environment
13 |
14 | The ContentMine software is built on two platforms: Node.js for getpapers and quickscrape, and Java for norma/ami-plugins. An installation and environment guide is [here](http://contentmine.github.io/).
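
A quick way to confirm that both runtimes are available before installing the tools is sketched below. This is only an illustration: the executable names `node` and `java` are the conventional binary names and are an assumption here, as are the helper names `check_runtimes` and `missing_runtimes`; the installation guide linked above remains the authoritative reference.

```python
import shutil

# Runtimes named in this guide, mapped to the tools that depend on them.
# The executable names ("node", "java") are assumed, not taken from the guide.
REQUIRED_RUNTIMES = {
    "node": "getpapers, quickscrape",
    "java": "norma, ami-plugins",
}


def check_runtimes(which=shutil.which):
    """Return {executable: resolved path or None} for each required runtime."""
    return {exe: which(exe) for exe in REQUIRED_RUNTIMES}


def missing_runtimes(found):
    """Return the sorted names of runtimes that were not found on PATH."""
    return sorted(exe for exe, path in found.items() if path is None)


if __name__ == "__main__":
    found = check_runtimes()
    for exe, path in found.items():
        status = path if path else "NOT FOUND"
        print(f"{exe} (needed for {REQUIRED_RUNTIMES[exe]}): {status}")
```

The lookup function is passed in as a parameter so the check can be exercised without either runtime installed.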
15 |
16 | ### How to report a bug
17 |
18 | If you encounter a bug, please let us know. You can raise a new issue for each part of the software in the respective repositories:
19 |
20 | * [getpapers](https://github.com/ContentMine/getpapers/issues)
21 | * [quickscrape](https://github.com/ContentMine/quickscrape/issues)
22 | * [norma](https://github.com/ContentMine/norma/issues)
23 | * [ami-plugin](https://github.com/ContentMine/ami-plugin/issues)
24 |
25 | Please include as much information as possible in your report, to help maintainers reproduce the problem.
26 |
27 | * A clear and descriptive title
28 | * Describe the exact steps which reproduce the problem, e.g. the commands you entered and the filetypes you were operating on.
29 | * Describe the software behaviour following those steps, and where the problem occurred.
30 | * Explain where it was different from what you expected to happen.
31 | * Attach additional information to the report, such as error messages, or corrupted files.
32 | * Add a `bug` label to the issue.
33 |
34 | Before submitting a bug, please check the list of existing issues to see whether a similar one is already open. If so, you can help by adding your information to the existing report.
35 |
36 | ### How to request an enhancement
37 |
38 | There is always room for improvement and we'd like to hear your perspective on it. Before creating a pull request, please raise an issue to discuss the proposed changes first. That way we can make the best use of your efforts.
39 |
40 | If you want to discuss an idea in general, you can also start a new topic on [discuss](http://discuss.contentmine.org/).
41 |
--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
1 | This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
2 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | # ContentMine training material
5 |
6 | This repository contains material for learning content mining yourself and for training others to do so. It includes tutorials for working with the ContentMine toolchain on your own PC or for using the web services offered by ContentMine.
7 |
8 |
9 | # Table of Contents
10 |
11 | 1. [Purpose of this repository](#purpose-of-this-repository)
12 | 1. [For whom is this repository](#for-whom-is-this-repository)
13 | 1. [Description of content](#description-of-content)
14 | 1. [Contribute](#contribute)
15 | 1. [Licence](#licence)
16 |
17 | ## Purpose of this repository
18 |
19 | * This repository helps facilitators and organizers work out a solid, workable schedule built around a fitting use case and its accompanying narrative.
20 | * The goal of the workshop design must always be to identify and fulfill the requirements and expectations of the audience/participants.
21 |
22 |
23 | ## For whom is this repository?
24 |
25 | * Learners: A large part of this repository is devoted to software tutorials, which can be used in the workshops but also for self-guided learning.
26 | * Organizers and Facilitators: The repository provides a working framework for a workshop and helps them develop its content.
27 |
28 |
29 | ## Description of content
30 |
31 | There are three main sections in this repository:
32 |
33 | * [software tutorials](software-tutorials): The basic usage of ContentMine tools can be learned step-by-step with the help of the tutorials. It includes installation and describes functionalities, what results to expect, and how to link the different elements of the content mining pipeline. The tutorials can be used in workshops as well as for self-guided learning.
34 |
35 | * [training guidelines](training-guidelines): This folder contains organizer and facilitator resources for organizing a ContentMine workshop. It describes the workflow for facilitators and organizers: which steps to perform when, and which dependencies exist between them. A checklist helps facilitators and organizers prepare their sessions. The folder also describes the teaching methods used in a workshop and links to helpful external teaching resources.
36 |
37 | * [training modules](training-modules): This folder contains learning modules which can be combined as needed by workshop facilitators.
38 | Each module contains a description of learning goals, links to additional resources like slides or software tutorials, and a description of the course of action.
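
The linking of pipeline elements mentioned for the software tutorials can be sketched as a chain of three commands, where each step reads what the previous one wrote. The commands and flags below are copied from the norma diagram source in this repository (`html2shtml.graphml`); the function name `pipeline_commands`, its parameters, and the use of `dinosaurs-htmls` as the shared project directory for every step are illustrative assumptions, and the flags may differ in the version you have installed. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def pipeline_commands(
    url_list="urllist.txt",
    project="dinosaurs-htmls",
    scrapers="journal-scrapers/scrapers",
    transform="nature2html",
):
    """Build the scrape -> tidy -> transform commands as argument lists."""
    return [
        # 1. quickscrape: fetch fulltext.html for every URL in the list
        ["quickscrape", "-r", url_list, "-o", project, "-d", scrapers],
        # 2. norma: tidy fulltext.html into well-formed fulltext.xhtml
        ["norma", "-q", project, "-i", "fulltext.html",
         "-o", "fulltext.xhtml", "--html", "jsoup"],
        # 3. norma: transform fulltext.xhtml into scholarly.html
        ["norma", "-q", project, "-i", "fulltext.xhtml",
         "-o", "scholarly.html", "--transform", transform],
    ]


if __name__ == "__main__":
    for argv in pipeline_commands():
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes the hand-off between steps explicit: the `-o` file of the second command is the `-i` file of the third.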
39 |
40 | ## Contribute
41 |
42 | There are many ways to contribute; see [CONTRIBUTING.md](CONTRIBUTING.md) for details.
43 |
44 | You can find us online at:
45 | - [contentmine.org](http://contentmine.org)
46 | - [@thecontentmine](http://twitter.com/thecontentmine)
47 | - training ett contentmine dot org
48 |
49 | If you have questions or suggestions on specific parts of the training material, you can also contact us directly via mail (mail ett stefankasberger dot at, web ett christopherkittel dot eu).
50 |
51 | ## Licence
52 |
53 | This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
54 |
--------------------------------------------------------------------------------
/assets/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | == additional assets ==
4 | === Zotero ===
5 | about 50 links in:
6 | https://www.zotero.org/groups/contentmine/items
7 |
8 |
9 |
10 |
11 |
--------------------------------------------------------------------------------
/assets/docs/blog.md:
--------------------------------------------------------------------------------
1 |
2 | - [The Hague Declaration on Knowledge Discovery in the Digital Age](http://contentmine.org/2015/05/the-hague-declaration-on-knowledge-discovery-in-the-digital-age/): About the release of the Hague Declaration
3 | - [Wellcome Trust Workshop – 2 Days of Hacking, Networking & Collaboration](http://contentmine.org/2015/04/wellcome-trust-workshop-after-the-event/): Summary of the workshop
4 | - [Issues in electronic theses and open research data](http://contentmine.org/2015/04/issues-in-electronic-theses-and-open-research-data/): Summary of the workshop
5 | - [Wellcome Trust Workshop – Before the Event](http://contentmine.org/2015/04/wellcome-trust-workshop-before-the-event/): Invitation with anticipated outcome
6 | - [Guidelines for collaboration requests](http://contentmine.org/2015/03/guidelines-for-collaboration-requests/): Of interest for FAQs
7 | - [Clinical Trials Workshop, UK Cochrane Centre](http://contentmine.org/2015/03/clinical-trials-workshop-uk-cochrane-centre/): Summary of the workshop. Interesting for FAQs and the workshop agenda
8 | - [Clinical Trials Content Mining in Oxford is happening](http://contentmine.org/2015/03/clinical-trials-content-mining-in-oxford-is-happening/): Summary with agenda
9 | - [Explaining the difference between getpapers and quickscrape](http://contentmine.org/2015/07/explaining-the-difference-between-getpapers-and-quickscrape/): Comparison of getpapers and quickscrape
10 | - [Content Mining for Science at OAI9](http://contentmine.org/2015/06/content-mining-for-science-at-oai9/): Summary of the content of the poster presentation. The whole pipeline is mentioned
11 | - [The importance of Figures and Captions](http://contentmine.org/2015/06/the-importance-of-figures-and-captions/): discussing classification of figures and their role in publications
12 | - [LIBER Response to STM Statement on Text and Data Mining](http://contentmine.org/2015/06/liber-response-to-stm-statement-on-text-and-data-mining/): legal material
13 | - [#MozSprint – June 4th – 5th, 2015](http://contentmine.org/2015/06/mozsprint-june-4th-5th-2015/): Summary of the sprint
14 | - [“We think ContentMine has the potential to transform the way we work!”](http://contentmine.org/2015/06/we-think-it-contentmine-has-the-potential-to-transform-the-way-we-work/): states the problems in neuroscience (Edinburgh)
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
--------------------------------------------------------------------------------
/assets/docs/media.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | - [TheContentMine II](https://vimeo.com/110908526)
5 | - [ContentMining Hackday in Cambridge facilitated by ContentMine](https://www.youtube.com/watch?v=YWG57l-v_g4)
6 | - [Wellcome Trust, 13 - 14th April 2015](https://www.youtube.com/watch?v=rnKUkrIOF4k)
7 | - [Group discussion f/t CharlesOppenhein/Richard Danbury on Text and Data Mining](https://www.youtube.com/watch?v=MDb_8kWeJak)
8 | - [Group discussion f/t CharlesOppenhein/Yvonne Nobis on Text and Data Mining](https://www.youtube.com/watch?v=xQ2FLJwkEIk)
9 | - [Right to read, right to mine | Jenny Molloy](https://www.youtube.com/watch?v=QdEfGx5VAso)
10 | - [Content Mining of the bioscience literature](https://www.youtube.com/watch?v=G9LePsd9R9A)
11 |
12 |
13 |
14 |
15 |
16 | where is the poster?
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
--------------------------------------------------------------------------------
/assets/docs/slides.md:
--------------------------------------------------------------------------------
1 |
2 | - [Architecture of ContentMine Components contentmine.org](http://www.slideshare.net/petermurrayrust/architecture-of)
3 | - [Automatic Extraction of Science and Medicine from the scholarly literature](http://www.slideshare.net/petermurrayrust/automatic-extraction-of-science-and-medicine-from-the-scholarly-literature)
4 | - [ContentMine Architecture](http://www.slideshare.net/petermurrayrust/contentmine-architecture)
5 | - [ContentMining for Synthetic Biology](http://www.slideshare.net/petermurrayrust/contentmining-for-synthetic-biology)
6 | - [Content Mining at Wellcome Trust](http://www.slideshare.net/petermurrayrust/content-mining-at-wellcome-trust)
7 | - [ContentMine: Liberating scholarship from Open publications and theses](http://www.slideshare.net/petermurrayrust/contentmine-liberating-scholarship-from-open-publications-and-theses)
8 | - [ContentMining and Clinical Trials](http://www.slideshare.net/petermurrayrust/contentmining-and-clinical-trials)
9 | - [Copyright Reform and Open Data](http://www.slideshare.net/petermurrayrust/copyright-reform-and-open-data)
10 | - [Content Mining for Machines and Humans](http://www.slideshare.net/petermurrayrust/content-mining-for-machines-and-humans)
11 | - [Petermrjisc20141201](http://www.slideshare.net/petermurrayrust/petermrjisc20141201)
12 | - [Petermrbl20141127](http://www.slideshare.net/petermurrayrust/petermrbl20141127)
13 | - [Embrace the Open Revolution](http://www.slideshare.net/petermurrayrust/embrace-the-open-revolution)
14 | - [Contentmineatopencon2](http://www.slideshare.net/petermurrayrust/contentmineatopencon2)
15 | - [Disruptive Communities and Technology](http://www.slideshare.net/petermurrayrust/opencon1)
16 | - [ContentMine: Open Data and Social Machines](http://www.slideshare.net/petermurrayrust/contentmine-open-data-and-social-machines)
17 | - [Mining Scientific Images](http://www.slideshare.net/petermurrayrust/mining-scientific-images)
18 | - [ContentMine and WikiData](http://www.slideshare.net/petermurrayrust/contentmine-and-wikidata)
19 | - [Making Theses USEFUL](http://www.slideshare.net/petermurrayrust/making-theses-useful)
20 | - [Csvconf](http://www.slideshare.net/petermurrayrust/csvconf)
21 | - [The Content Mine (presented at UKSG)](http://www.slideshare.net/petermurrayrust/the-content-mine-presented-at-uksg)
22 | - [The Value and Benefits of Open Access to Research Literature](http://www.slideshare.net/rossmounce/the-value-and-benefits-of-open-access-to-research-literature)
23 | - [Content Mining](http://www.slideshare.net/rossmounce/content-mining)
24 | - [ContentMine at EuropePMC AGM](http://www.slideshare.net/JennyMolloy/j-molloy-europe-pmc-slides)
25 | - [SciDataCon 2014 TDM Workshop Intro Slides](http://www.slideshare.net/JennyMolloy/scidaintro-slides)
26 | - [Legal Framework for TDM](http://www.slideshare.net/JennyMolloy/contentmine)
27 | - [ContentMine Presentation for WHO Health Data Seminar](http://www.slideshare.net/JennyMolloy/contentmine-presentation-for-who)
28 | - [ContentMine (EMBL-EBI Industry Programme)](http://www.slideshare.net/JennyMolloy/contentmine-49511121)
29 |
30 |
31 |
32 |
33 |
34 |
--------------------------------------------------------------------------------
/assets/images/highlighted_facts.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/highlighted_facts.png
--------------------------------------------------------------------------------
/assets/images/software/ami/ami2-species-genus-results.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/ami/ami2-species-genus-results.png
--------------------------------------------------------------------------------
/assets/images/software/ami/ami2-species-genus.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/ami/ami2-species-genus.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-complex1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-complex1.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-folder.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-folder.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-inspectctree1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspectctree1.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-inspectctree2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspectctree2.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-inspecturlstxt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-inspecturlstxt.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-metadataonly.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-metadataonly.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-noexecute.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-noexecute.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-simplequery.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-simplequery.png
--------------------------------------------------------------------------------
/assets/images/software/getpapers/getpapers-simpleresult.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/getpapers/getpapers-simpleresult.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/001.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/002.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/002.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/003.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/003.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/004.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/004.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/005.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/005.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/006.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/006.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/007.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/007.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/008.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/008.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/009.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/009.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/010.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/010.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/011.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/011.png
--------------------------------------------------------------------------------
/assets/images/software/journal-scrapers/012.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/journal-scrapers/012.png
--------------------------------------------------------------------------------
/assets/images/software/norma/cprojectwithshtml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/cprojectwithshtml.png
--------------------------------------------------------------------------------
/assets/images/software/norma/html2shtml.graphml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 |
28 |
29 |
30 |
31 | quickscrape
32 |
33 |
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
43 |
44 |
45 |
46 |
47 | $ tree dinosaurs-htmls/
48 | dinosaurs-htmls/
49 | ├── http_europepmc.org_articles_PMC2214819
50 | │ ├── fulltext.html
51 | │ └── results.json
52 | ├── http_europepmc.org_articles_PMC2635535
53 | │ ├── fulltext.html
54 | │ └── results.json
55 | ├── http_europepmc.org_articles_PMC2997427
56 | │ ├── fulltext.html
57 | │ └── results.json
58 | ...
59 |
60 |
61 |
62 |
63 |
64 |
65 |
66 |
67 |
68 |
69 |
70 |
71 |
72 |
73 |
74 |
75 | $ tree dinosaurs-htmls/
76 | dinosaurs-htmls/
77 | ├── http_europepmc.org_articles_PMC2214819
78 | │ ├── fulltext.html
79 | │ ├── fulltext.xhtml
80 | │ └── results.json
81 | ├── http_europepmc.org_articles_PMC2635535
82 | │ ├── fulltext.html
83 | │ ├── fulltext.xhtml
84 | │ └── results.json
85 | ├── http_europepmc.org_articles_PMC2997427
86 | │ ├── fulltext.html
87 | │ ├── fulltext.xhtml
88 | │ └── results.json
89 | ...
90 |
91 |
92 |
93 |
94 |
95 |
96 |
97 |
98 |
99 |
100 |
101 |
102 |
103 |
104 |
105 |
106 | $ tree dinosaurs-htmls/
107 | dinosaurs-htmls/
108 | ├── http_europepmc.org_articles_PMC2214819
109 | │ ├── fulltext.html
110 | │ ├── fulltext.xhtml
111 | │ ├── results.json
112 | │ └── scholarly.html
113 | ├── http_europepmc.org_articles_PMC2635535
114 | │ ├── fulltext.html
115 | │ ├── fulltext.xhtml
116 | │ ├── results.json
117 | │ └── scholarly.html
118 | ├── http_europepmc.org_articles_PMC2997427
119 | │ ├── fulltext.html
120 | │ ├── fulltext.xhtml
121 | │ ├── results.json
122 | │ └── scholarly.html
123 | ...
124 |
125 |
126 |
127 |
128 |
129 |
130 |
131 |
132 |
133 |
134 |
135 |
136 |
137 |
138 |
139 | $ quickscrape \
140 | -r urllist.txt \
141 | -o dinosaurs-htmls \
142 | -d journal-scrapers/scrapers
143 |
144 |
145 |
146 |
147 |
148 |
149 |
150 |
151 |
152 |
153 |
154 |
155 |
156 |
157 |
158 |
159 |
160 |
161 | $ norma \
162 | -q dinosaurs-htmls/ \
163 | -i fulltext.html \
164 | -o fulltext.xhtml \
165 | --html jsoup
166 |
167 |
168 |
169 |
170 |
171 |
172 |
173 |
174 |
175 |
176 |
177 |
178 |
179 |
180 |
181 |
182 |
183 |
184 | $ norma \
185 | -q dinosaurs-htmls/ \
186 | -i fulltext.xhtml \
187 | -o scholarly.html \
188 | --transform nature2html
189 |
190 |
191 |
192 |
193 |
194 |
195 |
196 |
197 |
198 |
199 |
200 |
201 |
202 |
203 |
204 |
205 |
--------------------------------------------------------------------------------
/assets/images/software/norma/html2shtml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/html2shtml.png
--------------------------------------------------------------------------------
/assets/images/software/norma/nlm2htmlwithwarnings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/nlm2htmlwithwarnings.png
--------------------------------------------------------------------------------
/assets/images/software/norma/normahtml2shtml0.dot:
--------------------------------------------------------------------------------
1 | digraph norma {
2 | graph [nodesep=0.4 ranksep=0.7]
3 |
4 | "apiquery" [label="apiquery", style="filled", color="yellow"];
5 | "urllist" [label="urllist", style="filled", color="yellow"];
6 |
7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"]
8 | "qs" [label="quickscrape", style="filled", color="cyan", shape="box"]
9 |
10 | "apiquery" -> "getp";
11 | "urllist" -> "qs";
12 |
13 | "f.xml" [label="fulltext.xml", penwidth="2"];
14 | "f.html" [label="fulltext.html", penwidth="2"];
15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"];
16 |
17 | /*"f.pdf.html" [label="fulltext.pdf.html"];
18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"];
19 | */
20 | "f.xhtml" [label="fulltext.xhtml"];
21 |
22 | "png" [label="png", style="filled" color="pink", penwidth="2"];
23 | /*
24 | "png.hocr.html" [label="HTML-OCR"]
25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"]
26 | */
27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"];
28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"];
29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"];
30 |
31 | "getp" -> {"f.xml" "f.pdf" "sdata" } [style="bold", color="grey"];
32 | "qs" -> {"f.xml" "f.pdf" "png" "sdata"} [style="bold", color="grey"];
33 | "qs" -> "f.html" [style="bold", color="red"];
34 |
35 | "f.xml" -> "n.stylesheets" [style="bold", color="grey"];
36 |
37 | "f.html" -> "n.tidy" [style="bold", color="red"];
38 | "n.tidy" -> "f.xhtml" [style="bold", color="red"];
39 |
40 | "f.xhtml" -> "n.htmlstyle" [style="bold", color="red"];
41 |
42 | /*
43 | "f.pdf" -> "n.pdf2txt" ;
44 | "n.pdf2txt" -> "f.pdf.txt";
45 | */
46 | /*
47 | "f.pdf" -> "n.pdf2svg";
48 | "n.pdf2svg" -> "png"
49 | "n.pdf2svg" -> "svg";
50 | */
51 |
52 | /*
53 | "f.pdf" -> "n.pdf2html" ;
54 | "n.pdf2html" -> {"f.pdf.html"};
55 | */
56 | /*
57 | "png" -> "n.ocr" ;
58 | "n.ocr" -> "png.hocr.html"
59 | */
60 | /*
61 | "png.hocr.html" -> "n.ocr2";
62 | "n.ocr2" -> "png.hocr.svg"
63 | */
64 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"]
65 | "n.stylesheets" -> "tagger" [style="bold", color="grey"];
66 | "n.htmlstyle" -> "tagger" [style="bold", color="red"];
67 | //{"f.pdf.html" } -> "tagger";
68 | "tagger" -> "s.html" [style="bold", color="red"];
69 |
70 | /*
71 | "sdata" -> "n.ctree";
72 | "n.ctree" -> {"doc" "csv"};
73 | */
74 | subgraph cluster_norma {
75 | label="norma" color="cyan" penwidth="3";
76 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"]
77 | "n.tidy" [label="tidy", style="filled", color="cyan", shape="box"]
78 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"]
79 | // "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"]
80 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"]
81 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"]
82 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"]
83 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"]
84 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"]
85 | }
86 |
87 |
88 | }
--------------------------------------------------------------------------------
/assets/images/software/norma/normahtml2shtml0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normahtml2shtml0.png
--------------------------------------------------------------------------------
/assets/images/software/norma/normasmall0.dot:
--------------------------------------------------------------------------------
1 | digraph norma {
2 | graph [nodesep=0.4 ranksep=0.7]
3 |
4 | "apiquery" [label="apiquery", style="filled", color="yellow"];
5 | "urllist" [label="urllist", style="filled", color="yellow"];
6 |
7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"]
8 | "qs" [label="quickscrape", style="filled", color="cyan", shape="box"]
9 |
10 | "apiquery" -> "getp";
11 | "urllist" -> "qs";
12 |
13 | "f.xml" [label="fulltext.xml", penwidth="2"];
14 | "f.html" [label="fulltext.html", penwidth="2"];
15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"];
16 |
17 | /*"f.pdf.html" [label="fulltext.pdf.html"];
18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"];
19 | */
20 | "f.xhtml" [label="fulltext.xhtml"];
21 |
22 | "png" [label="png", style="filled" color="pink", penwidth="2"];
23 | /*
24 | "png.hocr.html" [label="HTML-OCR"]
25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"]
26 | */
27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"];
28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"];
29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"];
30 |
31 | "getp" -> {"f.xml" "f.pdf" "sdata" } [style="bold"];
32 | "qs" -> {"f.xml" "f.pdf" "f.html" "png" "sdata"} [style="bold"];
33 |
34 | "f.xml" -> "n.stylesheets" [style="bold"];
35 |
36 | "f.html" -> "n.tidy" [style="bold"];
37 | "n.tidy" -> "f.xhtml" [style="bold"];
38 |
39 | "f.xhtml" -> "n.htmlstyle" [style="bold"];
40 |
41 | /*
42 | "f.pdf" -> "n.pdf2txt" ;
43 | "n.pdf2txt" -> "f.pdf.txt";
44 | */
45 | /*
46 | "f.pdf" -> "n.pdf2svg";
47 | "n.pdf2svg" -> "png"
48 | "n.pdf2svg" -> "svg";
49 | */
50 |
51 | /*
52 | "f.pdf" -> "n.pdf2html" ;
53 | "n.pdf2html" -> {"f.pdf.html"};
54 | */
55 | /*
56 | "png" -> "n.ocr" ;
57 | "n.ocr" -> "png.hocr.html"
58 | */
59 | /*
60 | "png.hocr.html" -> "n.ocr2";
61 | "n.ocr2" -> "png.hocr.svg"
62 | */
63 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"]
64 | {"n.stylesheets" "n.htmlstyle"} -> "tagger" [style="bold"];
65 | //{"f.pdf.html" } -> "tagger";
66 | "tagger" -> "s.html" [style="bold"];
67 |
68 | /*
69 | "sdata" -> "n.ctree";
70 | "n.ctree" -> {"doc" "csv"};
71 | */
72 | subgraph cluster_norma {
73 | label="norma" color="cyan" penwidth="3";
74 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"]
75 | "n.tidy" [label="tidy", style="filled", color="cyan", shape="box"]
76 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"]
77 | // "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"]
78 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"]
79 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"]
80 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"]
81 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"]
82 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"]
83 | }
84 |
85 |
86 | }
--------------------------------------------------------------------------------
/assets/images/software/norma/normasmall0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normasmall0.png
--------------------------------------------------------------------------------
/assets/images/software/norma/normaxml2shtml0.dot:
--------------------------------------------------------------------------------
1 | digraph norma {
2 | graph [nodesep=0.4 ranksep=0.7]
3 |
4 | "apiquery" [label="apiquery", style="filled", color="yellow"];
5 | "urllist" [label="urllist", style="filled", color="yellow"];
6 |
7 | "getp" [label="getpapers", style="filled", color="cyan", shape="box"]
8 | "qs" [label="quickscrape", style="filled", color="cyan", shape="box"]
9 |
10 | "apiquery" -> "getp";
11 | "urllist" -> "qs";
12 |
13 | "f.xml" [label="fulltext.xml", penwidth="2"];
14 | "f.html" [label="fulltext.html", penwidth="2"];
15 | "f.pdf" [label="fulltext.pdf", style="filled", color="pink"];
16 |
17 | /*"f.pdf.html" [label="fulltext.pdf.html"];
18 | "f.pdf.txt" [label="fulltext.pdf.txt", style="filled", color="pink"];
19 | */
20 | "f.xhtml" [label="fulltext.xhtml"];
21 |
22 | "png" [label="png", style="filled" color="pink", penwidth="2"];
23 | /*
24 | "png.hocr.html" [label="HTML-OCR"]
25 | "png.hocr.svg" [label="OCR'ed image (SVG)", style="filled", color="pink"]
26 | */
27 | //"svg" [label="svg", style="filled", color="pink", penwidth="2"];
28 | "sdata" [label="suppdata", style="filled", color="pink", penwidth="2"];
29 | "s.html" [label="scholarly.html", style="filled", color="red", penwidth="2"];
30 |
31 | "getp" -> {"f.pdf" "sdata" } [style="bold", color="grey"];
32 | "getp" -> {"f.xml"} [style="bold", color="red"];
33 | "qs" -> {"f.xml" "f.pdf" "f.html" "png" "sdata"} [style="bold", color="grey"];
34 |
35 | "f.xml" -> "n.stylesheets" [style="bold", color="red"];
36 |
37 | "f.html" -> "n.tidy" [style="bold", color="grey"];
38 | "n.tidy" -> "f.xhtml" [style="bold", color="grey"];
39 |
40 | "f.xhtml" -> "n.htmlstyle" [style="bold", color="grey"];
41 |
42 | /*
43 | "f.pdf" -> "n.pdf2txt" ;
44 | "n.pdf2txt" -> "f.pdf.txt";
45 | */
46 | /*
47 | "f.pdf" -> "n.pdf2svg";
48 | "n.pdf2svg" -> "png"
49 | "n.pdf2svg" -> "svg";
50 | */
51 |
52 | /*
53 | "f.pdf" -> "n.pdf2html" ;
54 | "n.pdf2html" -> {"f.pdf.html"};
55 | */
56 | /*
57 | "png" -> "n.ocr" ;
58 | "n.ocr" -> "png.hocr.html"
59 | */
60 | /*
61 | "png.hocr.html" -> "n.ocr2";
62 | "n.ocr2" -> "png.hocr.svg"
63 | */
64 | "tagger" [label="tagger", style="filled", color="cyan", shape="box"]
65 | "n.htmlstyle" -> "tagger" [style="bold", color="grey"];
66 | "n.stylesheets" -> "tagger" [style="bold", color="red"];
67 | //{"f.pdf.html" } -> "tagger";
68 | "tagger" -> "s.html" [style="bold", color="red"];
69 |
70 | /*
71 | "sdata" -> "n.ctree";
72 | "n.ctree" -> {"doc" "csv"};
73 | */
74 | subgraph cluster_norma {
75 | label="norma" color="cyan" penwidth="3";
76 | "n.stylesheets" [label="stylesheets", style="filled", color="cyan", shape="box"]
77 | "n.tidy" [label="tidy", style="filled", color="cyan", shape="box"]
78 | "n.htmlstyle" [label="htmlstyle", style="filled", color="cyan", shape="box"]
79 | // "n.pdf2txt" [label="pdf2txt", style="filled", color="cyan", shape="box"]
80 | // "n.pdf2html" [label="pdf2html", style="filled", color="cyan", shape="box"]
81 | // "n.pdf2svg" [label="pdf2svg", style="filled", color="cyan", shape="box"]
82 | // "n.ctree" [label="ctree", style="filled", color="cyan", shape="box"]
83 | // "n.ocr" [label="tesseract", style="filled", color="cyan", shape="box"]
84 | // "n.ocr2" [label="ocr", style="filled", color="cyan", shape="box"]
85 | }
86 |
87 |
88 | }
--------------------------------------------------------------------------------
/assets/images/software/norma/normaxml2shtml0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/normaxml2shtml0.png
--------------------------------------------------------------------------------
/assets/images/software/norma/pdf2txt.graphml:
--------------------------------------------------------------------------------
24 | $ cd dinosaur-pdfs
25 | $ tree
26 | .
27 | ├── Natarajan - Bone Cancer.pdf
28 | ├── Taylor - Aspects of sauropod dinosaurs.pdf
29 | ├── Taylor - Sauropods Giraffes.pdf
30 | └── Wedel - Posteranial Pneumacity in Dinosaurs.pdf
46 |
47 | $ tree dinosaurs-pdfs/
48 | dinosaurs-pdfs/
49 | ├── Natarajan - Bone Cancer
50 | │   └── fulltext.pdf
51 | ├── Taylor - Aspects of sauropod dinosaurs
52 | │   └── fulltext.pdf
53 | ├── Taylor - Sauropods Giraffes
54 | │   └── fulltext.pdf
55 | └── Wedel - Posteranial Pneumacity in Dinosaurs
56 |     └── fulltext.pdf
57 |
58 | 4 directories, 4 files
74 |
75 | $ tree dinosaurs-pdfs/
76 | dinosaurs-pdfs/
77 | ├── Natarajan - Bone Cancer
78 | │   ├── fulltext.pdf
79 | │   └── fulltext.pdf.txt
80 | ├── Taylor - Aspects of sauropod dinosaurs
81 | │   ├── fulltext.pdf
82 | │   └── fulltext.pdf.txt
83 | ├── Taylor - Sauropods Giraffes
84 | │   ├── fulltext.pdf
85 | │   └── fulltext.pdf.txt
86 | └── Wedel - Posteranial Pneumacity in Dinosaurs
87 |     ├── fulltext.pdf
88 |     └── fulltext.pdf.txt
89 |
90 | 4 directories, 8 files
106 |
107 | $ for fname in *.pdf; do
108 |     filename=$(basename "$fname");
109 |     filename="${filename%.*}";
110 |     mkdir "$filename";
111 |     mv "$fname" "$filename"/fulltext.pdf;
112 | done;
113 | cd ..
131 |
132 | $ norma \
133 |     -q dinosaurs-pdfs \
134 |     -i fulltext.pdf \
135 |     -o fulltext.pdf.txt \
136 |     --transform pdf2txt
--------------------------------------------------------------------------------
/assets/images/software/norma/pdf2txt.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/pdf2txt.png
--------------------------------------------------------------------------------
/assets/images/software/norma/xml2shtml.graphml:
--------------------------------------------------------------------------------
23 | $ tree dinosaurs-xmls
24 | dinosaurs-xmls/
25 | ├── eupmc_results.json
26 | ├── fulltext_html_urls.txt
27 | ├── PMC3893193
28 | │   └── fulltext.xml
29 | ├── PMC3893247
30 | │   └── fulltext.xml
31 | ...
54 |
55 | getpapers
70 |
71 | $ tree dinosaurs-xmls
72 | dinosaurs-xmls/
73 | ├── eupmc_results.json
74 | ├── fulltext_html_urls.txt
75 | ├── PMC3893193
76 | │   ├── fulltext.xml
77 | │   └── scholarly.html
78 | ├── PMC3893247
79 | │   ├── fulltext.xml
80 | │   └── scholarly.html
81 | ...
97 |
98 | $ getpapers \
99 |     -q dinosaurs-xmls \
100 |     -o dinosaurs -x
117 |
118 | $ norma \
119 |     -q dinosaurs-xmls \
120 |     -i fulltext.xml \
121 |     -o scholarly.html \
122 |     --transform nlm2html
--------------------------------------------------------------------------------
/assets/images/software/norma/xml2shtml.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/norma/xml2shtml.png
--------------------------------------------------------------------------------
/assets/images/software/quickscrape/quickscrape-url.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/quickscrape/quickscrape-url.png
--------------------------------------------------------------------------------
/assets/images/software/shell/cat.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cat.png
--------------------------------------------------------------------------------
/assets/images/software/shell/cd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cd.png
--------------------------------------------------------------------------------
/assets/images/software/shell/cp.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/cp.png
--------------------------------------------------------------------------------
/assets/images/software/shell/head.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/head.png
--------------------------------------------------------------------------------
/assets/images/software/shell/ls.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/ls.png
--------------------------------------------------------------------------------
/assets/images/software/shell/mkdir.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/mkdir.png
--------------------------------------------------------------------------------
/assets/images/software/shell/mv.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/mv.png
--------------------------------------------------------------------------------
/assets/images/software/shell/rm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/rm.png
--------------------------------------------------------------------------------
/assets/images/software/shell/tail.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/tail.png
--------------------------------------------------------------------------------
/assets/images/software/shell/tree.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/tree.png
--------------------------------------------------------------------------------
/assets/images/software/shell/wc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/shell/wc.png
--------------------------------------------------------------------------------
/assets/images/software/vms/desktop.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/desktop.png
--------------------------------------------------------------------------------
/assets/images/software/vms/memorysettings.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/memorysettings.png
--------------------------------------------------------------------------------
/assets/images/software/vms/starting-vm.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/starting-vm.png
--------------------------------------------------------------------------------
/assets/images/software/vms/terminal.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/terminal.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-crunchbang.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-crunchbang.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-error.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-error.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-icon.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-icon.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-installer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-installer.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-popup.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-popup.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-poweredoff.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-poweredoff.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-running.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-running.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-startscreen.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-startscreen.png
--------------------------------------------------------------------------------
/assets/images/software/vms/vm-virtualbox-download.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/assets/images/software/vms/vm-virtualbox-download.png
--------------------------------------------------------------------------------
/software-tutorials/README.md:
--------------------------------------------------------------------------------
2 |
3 | The tutorials teach the ContentMine tools step by step. They describe what each tool does, what results to expect, and how to link the different elements of the content mining pipeline. They are based on the ContentMine virtual machine, which has all necessary software pre-installed.
4 |
5 | The tutorials can be used in workshops as well as for self-guided learning.
6 |
7 | # Table of contents
8 |
9 | 1. [Purpose and installation of the ContentMine-VirtualMachine](vms/README.md)
A VirtualBox image contains all necessary software as well as sample datasets for getting started with content mining. This tutorial explains how to install the ContentMine-VM and use it as a sandbox environment.
10 |
11 | 1. [Introduction to the command line interface](shell/README.md)
12 | This tutorial introduces the basic UNIX commands and shows how to navigate folders and handle files.
13 |
14 | 1. [Getting started with getpapers](getpapers/README.md)
15 | This tutorial demonstrates how to create an initial corpus for fact extraction.
16 |
17 | 1. [Getting started with quickscrape](quickscrape/README.md)
18 | This tutorial introduces quickscrape and shows how to use it to extract semi-structured information from web pages.
19 |
20 | 1. [Create your own scraper definition](journal-scrapers/README.md)
21 | This tutorial shows how to contribute to and extend the ContentMine scraper collection. If you need a specific definition for use with quickscrape, you can learn here how to create it.
22 |
23 | 1. [Normalizing scholarly literature](norma/README.md)
24 | This tutorial shows how to normalize scientific literature into a unified format which can be processed by machines.
25 |
26 | 1. [ContentMine data structure: CProject](cproject/README.md)
27 | This tutorial gives an overview of the data structure used, and how it can be integrated in your analysis.
28 |
29 | 1. [Extracting facts with AMI-plugins](ami/README.md)
30 | This tutorial demonstrates how to extract, aggregate, and filter facts from scholarly.html.
31 |
--------------------------------------------------------------------------------
/software-tutorials/ami/README.md:
--------------------------------------------------------------------------------
1 | # ami
3 |
4 | ## Table of Contents
5 |
6 | 1. [Description](#description)
7 | 1. [Preparations](#preparations)
8 | 1. [Input data](#input-data)
9 | 1. [Tutorial](#tutorial)
10 | 1. [ami2-species](#ami2-species)
11 | 1. [ami2-gene](#ami2-gene)
12 | 1. [ami2-sequence](#ami2-sequence)
13 | 1. [ami2-regex](#ami2-regex)
14 | 1. [ami2-word](#ami2-word)
15 | 1. [Summarization of results](#summarization-of-results)
16 | 1. [Summary](#summary)
17 | 1. [Next Steps](#next-steps)
18 | 1. [Further Materials](#further-materials)
19 |
20 | ## Description
21 |
22 | **What does ami do?**
23 |
24 | ami is a collection of plugins that extract pieces of information ('facts') from structured documents. It uses dictionaries or regular expressions to look up various fact sets, such as species names and gene or protein sequences.
25 | At this stage, individual facts that are spread across the corpus of papers are extracted and collected into machine-readable XML files, which can then be used for visualization or statistics.
26 | We are integrating new dictionaries to cover different types of encoded knowledge. Please drop us a message if you have an idea or resource for a new fact extractor!
27 |
28 | **Why do we need ami?**
29 |
30 | ami helps researchers with repetitive cognitive tasks such as pattern matching, for example when keeping up to date with new species occurrences or tracking one specific gene. ami also helps to cope with large amounts of literature that have to be filtered, e.g. for meta-studies, by highlighting the papers that actually need to be read.
31 |
32 | **How can I use ami?**
33 |
34 | ami is used after the papers have been normalized: you take the output of [norma](../norma/README.md) and apply a plugin.
35 |
36 | **What you will learn here**
37 |
38 | This tutorial shows you:
39 | - how to use the species plugin
40 | - how to use the genes plugin
41 | - how to use the sequences plugin
42 | - how to use regular expressions
43 | - what to do with the extracted facts
44 |
45 | **How to use the tutorial**
46 |
47 | The following conventions are used throughout the tutorial:
48 | - Placeholder variables are always written in capitals, like NAME, YOURDIRECTORY, etc.
49 |
50 |
51 | **Glossary**
52 |
53 |
54 | ## Preparations
55 | ### Pre-Requisites
56 |
57 | ### Used Software
58 | - [Future TDM Virtual Machine](LINK)
59 | - ami2-species
60 | - ami2-gene
61 | - ami2-sequence
62 | - ami2-regex
63 |
64 | ### Installation
65 |
66 | ami is already installed on the ContentMine-VM. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform.
67 |
68 | You can find the technical documentation for `ami` in its [repository](https://github.com/ContentMine/ami-plugin).
69 |
70 | ## Input data
71 |
72 | The input for ami is always a [CProject](../cproject) containing documents in [scholarly HTML](../sHTML), i.e. `scholarly.html` files. ami applies one of the available plugins, extracts the relevant content from the sHTML, and stores the results in the respective paper folder. To create the `scholarly.html` files, please follow the instructions for [norma](../norma/README.md). Your project directory should then look like this:
73 |
74 | ```bash
75 | tree ursus
76 | ursus/
77 | ├── eupmc_results.json
78 | ├── fulltext_html_urls.txt
79 | ├── PMC3893193
80 | │   ├── fulltext.xml
81 | │   └── scholarly.html
82 | ├── PMC3893247
83 | │   ├── fulltext.xml
84 | │   └── scholarly.html
85 | ├── PMC3898307
86 | │   ├── fulltext.xml
87 | │   └── scholarly.html
88 | ...
89 | ```
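Before running a plugin, it can help to check that every paper folder really contains a `scholarly.html`. The following is a minimal sketch, not part of ami: the function name `missing_shtml` is hypothetical, and it assumes that every immediate subdirectory of the project is a CTree.

```bash
# List CTree folders in a CProject that lack scholarly.html.
# Assumption: every immediate subdirectory of the project is a CTree.
missing_shtml() {
  project="$1"
  for ctree in "$project"/*/; do
    # Skip the unexpanded pattern when the project has no subdirectories.
    [ -d "$ctree" ] || continue
    if [ ! -f "${ctree}scholarly.html" ]; then
      basename "$ctree"
    fi
  done
}

# Usage: missing_shtml ursus
```

If the function prints nothing, all paper folders are ready for ami. Any folder it lists would need to be normalized with norma first.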
90 |
91 | ## Tutorial
92 |
93 | This tutorial is based on release [0.2.24](https://github.com/ContentMine/ami/releases).
94 |
95 | ### ami2-species
96 |
97 | You can search for all occurrences of a species name with
98 | ```bash
99 | ami2-species --project CPROJECTFOLDER -i INPUTFILE --sp.species --sp.type SPECIESTYPE
100 | ```
101 |
102 | You have to choose one of three types of species terms for ```SPECIESTYPE```:
103 | - ```genus```, which will extract terms like *Brachiosaurus* ([more details](https://en.wikipedia.org/wiki/Genus))
104 | - ```binomial```, which will extract terms like *B. altithorax* ([more details](https://en.wikipedia.org/wiki/Binomial_nomenclature))
105 | - or ```genussp```, which will extract terms like *Bacillus sp* or *Ursus spp.*
106 |
107 | ```bash
108 | ami2-species --project ursus/ -i scholarly.html --sp.species --sp.type genus
109 | ```
110 |
111 | 
112 |
113 | After the run, the output of `tree ursus` should look like the following. If no matches were found, an **empty.xml** is created to indicate that the plugin has been run on this particular paper without results. If matches were found, a **results.xml** is created.
114 |
115 | 
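To see at a glance how many papers produced matches, the two marker files can be counted. This is a rough sketch under one assumption that the text above does not state: that `empty.xml` is written to the same folder that would otherwise hold `results.xml` (here `results/species/genus/`). The function name `count_outputs` is hypothetical.

```bash
# Count papers with matches (results.xml) and without (empty.xml) for one plugin run.
# Assumption: empty.xml sits in PROJECT/CTREE/results/PLUGIN/TYPE/, like results.xml.
count_outputs() {
  project="$1"; plugin="$2"; kind="$3"
  with=$(find "$project" -path "*/results/$plugin/$kind/results.xml" | wc -l)
  without=$(find "$project" -path "*/results/$plugin/$kind/empty.xml" | wc -l)
  # Arithmetic expansion strips the padding some wc implementations add.
  echo "with matches: $((with)), without matches: $((without))"
}

# Usage: count_outputs ursus species genus
```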
116 |
117 | A for-loop performs all extractions in sequence:
118 | ```bash
119 | for type in genus binomial genussp; do
120 |     ami2-species --project ursus/ -i scholarly.html --sp.species --sp.type "$type";
121 | done
122 | ```
123 |
124 | The results are all saved as XML in the corresponding CTree inside `results/species`, with one folder per type named after the SPECIESTYPE.
125 |
126 | ```
127 | tree ursus
128 | ursus
129 | ├── eupmc_results.json
130 | ├── fulltext_html_urls.txt
131 | ├── PMC3893193
132 | │   ├── fulltext.xml
133 | │   ├── results
134 | │   │   └── species
135 | │   │       ├── binomial
136 | │   │       │   └── results.xml
137 | │   │       ├── genus
138 | │   │       │   └── results.xml
139 | │   │       └── genussp
140 | │   │           └── results.xml
141 | │   └── scholarly.html
142 | ├── PMC3893247
143 | │   ├── fulltext.xml
144 | │   ├── results
145 | │   │   └── species
146 | │   │       ├── binomial
147 | │   │       │   └── results.xml
148 | │   │       ├── genus
149 | │   │       │   └── results.xml
150 | │   │       └── genussp
151 | │   │           └── results.xml
152 | │   └── scholarly.html
153 | ...
154 | ```
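Because the folder layout encodes the plugin and the type, a short pipeline can summarize how many papers have a `results.xml` for each combination. This is only a sketch that relies on the `results/PLUGIN/TYPE/results.xml` layout shown above; the function name `summarise_results` is hypothetical.

```bash
# Count, per plugin/type pair, how many CTrees contain a results.xml.
# Relies on the layout PROJECT/CTREE/results/PLUGIN/TYPE/results.xml.
summarise_results() {
  find "$1" -path '*/results/*/*/results.xml' \
    | awk -F/ '{ print $(NF-2) "/" $(NF-1) }' \
    | sort | uniq -c
}

# Usage: summarise_results ursus
```

Each output line holds a count followed by a pair such as `species/genus`.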
155 |
156 | ```bash
157 | cat ursus/PMC4349051/results/species/genussp/results.xml
158 | ```
159 |
160 | ```xml
161 |
162 |
163 |
164 |
165 |
166 |
167 |
168 | ```
169 |
170 | The `results.xml` contains a varying number of lines. Every line represents one extracted fact as an XML element with the following attributes:
171 | - `exact`: the exact match, i.e. the fact itself, e.g. `exact="Parasaurolophus sp"` for the fact "Parasaurolophus sp"
172 | - `pre`: 99 characters before the match
173 | - `post`: 99 characters after the match
174 |
175 | In practice, this is what fact extraction yields in the end: a list of terms extracted from the literature, each with the characters before and after it as context.
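To get a quick overview of which terms were found in one paper, the `exact` values can be pulled out with standard shell tools. This is a minimal sketch that assumes each match is written as `exact="..."`, as in the example above; it uses plain text matching rather than a real XML parser, and the function name `count_exact` is hypothetical.

```bash
# Print how often each matched term occurs in one results.xml, most frequent first.
# Assumption: every fact carries its match in an attribute written as exact="...".
count_exact() {
  grep -o 'exact="[^"]*"' "$1" \
    | sed 's/^exact="//; s/"$//' \
    | sort | uniq -c | sort -rn
}

# Usage: count_exact ursus/PMC4349051/results/species/genussp/results.xml
```

For anything beyond a quick look, a proper XML parser is the safer choice, since text matching breaks on unusual quoting.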
176 |
177 |
178 | ### ami2-gene
179 |
180 | The search for genes works in the same way, just with another command:
181 | ```bash
182 | ami2-gene --project CPROJECTFOLDER -i INPUTFILE --g.gene --g.type GENETYPE
183 | ```
184 |
185 | At the moment there is only one GENETYPE available (`human`). Results are again stored within the CTree.
186 |
187 | ```bash
188 | ami2-gene --project ursus/ -i scholarly.html --g.gene --g.type human
189 | ```
190 |
191 | This then creates a folder gene/human inside results next to the results of the species module.
192 |
193 | ```
194 | tree ursus
195 | ursus
196 | ├── eupmc_results.json
197 | ├── fulltext_html_urls.txt
198 | ├── PMC3893193
199 | │   ├── fulltext.xml
200 | │   ├── results
201 | │   │   └── gene
202 | │   │       └── human
203 | │   │           └── results.xml
204 | │   └── scholarly.html
205 | ├── PMC3893247
206 | │   ├── fulltext.xml
207 | │   ├── results
208 | │   │   └── gene
209 | │   │       └── human
210 | │   │           └── results.xml
211 | │   └── scholarly.html
212 | ...
213 | ```
214 |
215 | The `results.xml` has a similar structure: a results-tag with pre, post, and exact attributes.
216 |
217 | ```
218 | cat ursus/PMC4454486/results/gene/human/results.xml
219 | ```
220 |
221 | ```xml
222 |
223 |
224 |
225 |
226 |
227 | ```
228 |
229 | ### ami2-sequence
230 |
231 | The search for sequences follows the same structure:
232 | ```bash
233 | ami2-sequence --project CPROJECTFOLDER -i INPUTFILE --sq.sequence --sq.type SEQUENCETYPE
234 | ```
235 |
236 | - SEQUENCETYPE is one of `dna rna prot prot3 carb3`.
237 |
238 | You can run one type with this query:
239 | ```bash
240 | ami2-sequence --project ursus/ -i scholarly.html --sq.sequence --sq.type rna
241 | ```
242 |
243 | You can also run all types in sequence with this loop:
244 | ```bash
245 | for type in dna rna prot prot3 carb3; do
246 | ami2-sequence --project ursus -i scholarly.html --sq.sequence --sq.type $type;
247 | done
248 | ```
249 |
250 | This creates a separate folder `sequence/SEQUENCETYPE` inside `results`. The results generally have the same structure as before, with an additional attribute `xpath` that gives the location of the match within the HTML structure of `scholarly.html`.
251 |
252 | ```bash
253 | cat ursus/PMC4447998/results/sequence/rna/results.xml
254 | ```
255 |
256 | ```xml
257 |
258 |
259 |
260 |
261 |
262 |
263 | ```
264 |
265 | ### ami2-regex
266 |
267 | Regex is short for [regular expression](https://en.wikipedia.org/wiki/Regular_expression), a way to describe search patterns in text. A regex can be a basic string like "ursus", but it also allows far more complex patterns. For example, `[Uu]rsus` matches both an upper-case and a lower-case first letter, so it finds Ursus and ursus at the same time.
268 |
269 | Digits can be matched with e.g. `[0-9]`, which matches any single digit, or as fixed sequences such as `(00111001)`. Characters that have a special meaning in a regex, such as `.`, have to be escaped with `\`, so if you want to search for the number 3.14 explicitly, the regex looks like `(3\.14)`.
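
These basics can be tried out directly in the terminal before writing a regex file. The sketch below uses `grep -E` as a stand-in regex engine; the helper `count_matches` and the sample sentences are made up for illustration, and ami itself may differ in details of its regex dialect.

```bash
# Count how many input lines match an extended regular expression.
# $1: the regex; the text is read from standard input.
count_matches() {
  grep -c -E -- "$1"
}

sample='Ursus maritimus lives in the Arctic.
The genus ursus includes several bears.
Pi is roughly 3.14 in most textbooks.
The value 3514 is not pi.'

ursus_lines=$(printf '%s\n' "$sample" | count_matches '[Uu]rsus')   # both spellings
digit_lines=$(printf '%s\n' "$sample" | count_matches '[0-9]')      # any digit
pi_lines=$(printf '%s\n' "$sample" | count_matches '3\.14')         # literal dot
unescaped_lines=$(printf '%s\n' "$sample" | count_matches '3.14')   # dot = any character

echo "$ursus_lines $digit_lines $pi_lines $unescaped_lines"
```

The last two counts differ because an unescaped `.` also matches the `5` in `3514`, which is why the dot has to be escaped.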
270 |
271 | We will now look at a few regexes and how they are used in an XML file.
272 |
273 | **How to use regex with ami**
274 |
275 |
276 | ```bash
277 | ami2-regex --project CPROJECTFOLDER -i INPUTFILE --context PRE POST --r.regex REGEXFILE.xml
278 | ```
279 |
280 | - `PRE`: tells ami how many characters before a match should be captured
281 | - `POST`: tells ami how many characters after a match should be captured
282 | - `REGEXFILE`: location of your regex XML file. It contains all regex queries.
283 |
284 | **How to create a custom regex XML file**
285 |
286 | The contents of REGEXFILE.xml need to be wrapped in an opening and a closing tag. The `TITLE` sets the name of the folder where the output is stored.
287 |
288 | ```xml
289 |
290 |
291 | ```
292 |
293 | In the regex XML, each regex query is written on a new line and is enclosed in opening and closing regex tags. The opening tag must declare two attributes:
294 | - `weight`: the relative importance given to each match (influences indexing engines). The default value is `1.0`.
295 | - `fields`: specifies the name of the query
296 |
297 | ```xml
298 |
299 |
300 |
301 |
302 | ```
303 |
304 | What is missing now is the regex query itself. It is placed between the opening and closing regex tags and is enclosed in round brackets `()`.
305 |
306 | In line two, one field ("food") is defined. We want to match both upper and lower case, and `[Ff]` matches either `F` or `f`: `([Ff]ood)`. The following characters `ood` are fixed for this query and have to match exactly.
307 |
308 | For the second query, we want to find all mentions of "predator regime" or "predator regimes". For this we need `\s`, a special sequence that stands for a whitespace character. The question mark in `[s]?` makes the "s" optional: `([Pp]redator\sregime[s]?)`.
309 |
310 | ```xml
311 |
312 | ([Ff]ood)
313 | ([Pp]redator\sregime[s]?)
314 |
315 | ```
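
Before running ami, the two queries can be checked against a sample sentence. This is a rough sketch with `grep -o -E`, not ami's own matcher: the sample text is invented, and `[[:space:]]` is used in place of `\s` because not every `grep` supports `\s`.

```bash
# The two queries from the regex file, written for grep -E.
# Assumption: [[:space:]] replaces \s, which POSIX grep does not guarantee.
food_regex='([Ff]ood)'
regime_regex='([Pp]redator[[:space:]]regime[s]?)'

text='Food availability changed. Different predator regimes alter how food is used.
One Predator regime was studied.'

# -o prints only the matched part, one match per line.
food_hits=$(printf '%s\n' "$text" | grep -o -E "$food_regex")
regime_hits=$(printf '%s\n' "$text" | grep -o -E "$regime_regex")

printf '%s\n' "$food_hits" "$regime_hits"
```

Both spellings of "food" and both the singular and plural form of "predator regime" are picked up, which is the behaviour the two queries were written for.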
316 |
317 | We now create a regex XML file like this one (any text editor will do) and place it in our project folder as `ursusfood.xml`. We then run ami with it and, because we want some context around our matches, add the `--context PRE POST` option, which captures characters before and after each match.
318 |
319 | ```bash
320 | ami2-regex --project ursus/ -i scholarly.html --r.regex ursus/ursusfood.xml --context 50 50
321 | ```
322 |
323 | The output contains 50 characters before (`pre`) and 50 characters after (`post`) the match (`value0`), as well as the `xpath` of the match in `scholarly.html`.
324 |
325 | ### ami2-word
326 |
327 | Word frequency can be used to categorize documents. The simplest approach is to count the words in documents, or within chunks of documents.
328 |
329 | ```bash
330 | ami2-word --project ursus/ -i scholarly.html --w.words wordFrequencies
331 | ```
332 | creates
333 | ```
334 | eupmc
335 | ├── eupmc_results.json
336 | ├── fulltext_html_urls.txt
337 | ├── PMC2275095
338 | │   ├── fulltext.xml
339 | │   ├── results
340 | │   │   └── word
341 | │   │       └── frequencies
342 | │   │           ├── results.html
343 | │   │           └── results.xml
344 | │   └── scholarly.html
345 | ├── PMC2586803
346 | │   ├── fulltext.xml
347 | │   ├── results
348 | │   │   └── word
349 | │   │       └── frequencies
350 | │   │           ├── results.html
351 | │   │           └── results.xml
352 | ```
353 |
354 | The first file, `results.xml`, lists word frequencies as:
355 |
356 | ```
357 |
358 |
359 |
360 |
361 |
362 |
363 |
364 |
365 |
366 |
367 | ```
368 |
369 | and the second, `results.html`, shows a "Word Cloud"-like HTML display with the most frequent words in order and with font sizes proportional to the count. This mainly reflects word frequency in the English language, so we can remove the most common words by using *stopwords*. We have a range of stopword files in different languages. It is also possible to create your own files and add them.
370 |
371 | ```bash
372 | ami2-word --project ursus --w.words wordFrequencies --w.stopwords STOPWORDS.txt
373 | ```
374 |
375 | The format is a simple list of words:
376 | ```
377 | a
378 | about
379 | above
380 | across
381 | after
382 | afterwards
383 | again
384 | against
385 | ```
386 |
387 | and the file can be referenced either by URL or by a relative or absolute filename.
388 |
389 | ```bash
390 | ami2-word --project ursus/ -i scholarly.html --w.words wordFrequencies --w.stopwords stopwords.txt
391 | ```
392 |
393 | gives `results.xml` as
394 | ```
395 |
396 |
397 |
398 |
399 |
400 |
401 |
402 |
403 |
404 |
405 |
406 |
407 | ```
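
The effect of a stopword list can be illustrated with a small pipeline of standard shell tools. This is only a sketch of the idea of counting words and dropping stopwords, not how ami2-word is implemented; the helper `word_frequencies`, the sample text and the short stopword file are assumptions.

```bash
# A short, hypothetical stopword file in the one-word-per-line format.
tmpdir=$(mktemp -d)
printf '%s\n' a about the of > "$tmpdir/stopwords.txt"

# Print "count word" lines, most frequent first, skipping stopwords.
# $1: stopword file; the text is read from standard input.
word_frequencies() {
  tr -cs '[:alpha:]' '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | grep -v -x -F -f "$1" \
    | grep -v '^$' \
    | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{print $1, $2}'
}

text='The bear ate the fish. A bear sleeps. The fish of the sea.'
freqs=$(printf '%s\n' "$text" | word_frequencies "$tmpdir/stopwords.txt")
printf '%s\n' "$freqs"
```

Without the stopword filter, "the" would top the list, which says more about English than about the document.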
408 |
409 | ### Summarization of results
410 |
411 | The ami plugins can automatically aggregate results. To aggregate the results over all plugins, use:
412 |
413 | ```
414 | ami2-sequence --project ursus --filter file\(\*\*/results.xml\) -o sequencesfiles.xml
415 | ```
416 |
417 | It is possible to only aggregate results for a specific plugin and option, e.g. for `dna`:
418 | ```
419 | ami2-sequence --project ursus --filter file\(\*\*/dna/results.xml\)xpath\(//result\) -o dnasnippets.xml
420 | ```
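
To see which files such a filter would pick up, the glob patterns can be approximated with `find`. The sketch below builds a tiny throw-away project first; the PMC identifiers are invented, and `find` only mimics the file selection, not the aggregation that ami performs.

```bash
# Build a tiny, hypothetical CProject so the sketch is self-contained.
project=$(mktemp -d)
for d in PMC0000001/results/sequence/dna \
         PMC0000001/results/sequence/rna \
         PMC0000002/results/sequence/dna; do
  mkdir -p "$project/$d"
  printf '<results/>\n' > "$project/$d/results.xml"
done

# Rough equivalents of file(**/results.xml) and file(**/dna/results.xml).
all_results=$(find "$project" -type f -name results.xml | wc -l | tr -d ' ')
dna_results=$(find "$project" -type f -path '*/dna/results.xml' | wc -l | tr -d ' ')

echo "all: $all_results, dna only: $dna_results"
```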
421 |
422 |
423 | ## Summary
424 |
425 | * A project folder containing ctrees is always the input.
426 | * Each plugin is a separate piece of software with its own command.
427 | * Results/facts are stored within the ctree in a plugin-specific folder.
428 | * Results also store the context of 99 characters before and after the fact.
429 |
430 | **Next steps**
431 |
432 | * Back to the [tutorial overview](..)
433 |
434 | ## Further material
435 |
436 | **ContentMine**
437 | - [contentmine.org](http://contentmine.org)
438 | - office ett contentmine dot org
439 | - [@TheContentMine](http://twitter.com/thecontentmine)
440 |
441 | **Slides**
442 |
443 |
444 | **Assets**
445 |
446 | **Videos**
447 |
448 |
449 | **Learning Materials**
450 |
451 | **www**
452 |
453 |
454 | **Papers**
455 |
--------------------------------------------------------------------------------
/software-tutorials/canary/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/software-tutorials/canary/README.md
--------------------------------------------------------------------------------
/software-tutorials/cat/README.md:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/software-tutorials/cat/README.md
--------------------------------------------------------------------------------
/software-tutorials/cproject/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # What is a CProject?
4 |
5 | This page documents the current implementation of a CProject as of May 16, 2016.
6 |
7 | The CProject is a naming convention for interacting with input and output files of ContentMine tools. A CProject is a filesystem tree containing a set of folders and files with reserved names. Any part of the toolkit (quickscrape, getpapers, norma, ami, or custom downstream processing tools) will read and write to a CProject.
8 |
9 | The structure of a CProject is being [discussed here](https://github.com/ContentMine/cmine/issues/10). Please join the debate with your suggestions and observations.
10 |
11 | # What does a CProject look like?
12 |
13 | #### Quickscrape
14 |
15 | An example output folder after scraping two random papers, one open access and one not.
16 |
17 | ```
18 | output
19 | ├── http_ijs.microbiologyresearch.org_content_journal_ijsem_10.1099_ijsem.0.001085
20 | │   └── results.json
21 | └── https_elifesciences.org_content_5_e10647v3
22 |     ├── fulltext.pdf
23 |     ├── fulltext.xml
24 |     └── results.json
25 | ```
26 |
27 | #### getpapers
28 |
29 | An example output folder showing two papers from EuPMC. Notice that in this case the results file eupmc_results.json is a big blob for the whole project rather than one file in each folder as in quickscrape. I'm not sure if this is a good thing, because it means that the per-paper folder structure is missing a chunk of information. Each folder has to be kept in the right CProject so that this data isn't lost. The data could instead be split up and placed in a results.json per paper, as in quickscrape. It also means that the data is stored in a file whose name changes depending on which API it was obtained from.
30 |
31 | ```
32 | output
33 | ├── eupmc_fulltext_html_urls.txt
34 | ├── eupmc_results.json
35 | ├── PMC4683095
36 | │   └── fulltext.xml
37 | └── PMC4690148
38 |     └── fulltext.xml
39 | ```
40 |
41 | #### ami
42 |
43 | If you run a plugin (e.g. `ami2-gene --g.gene --g.type human --project malaria`) it results in a tree like
44 |
45 | ```
46 | └── PMC4831192
47 |     ├── fulltext.xml
48 |     ├── results
49 |     │   └── gene
50 |     │       └── human
51 |     │           └── empty.xml OR results.xml
52 |     └── scholarly.html
53 | ```
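
Because every plugin writes to a predictable location, the state of a CProject can be checked with a few lines of shell. The sketch below is not a ContentMine tool. It assumes that `empty.xml` means the plugin ran without finding anything, and the helper `plugin_status` and the PMC identifiers are hypothetical.

```bash
# Build a small, hypothetical CProject with three CTrees.
project=$(mktemp -d)
mkdir -p "$project/PMC0000001/results/gene/human" \
         "$project/PMC0000002/results/gene/human" \
         "$project/PMC0000003"
: > "$project/PMC0000001/results/gene/human/results.xml"
: > "$project/PMC0000002/results/gene/human/empty.xml"

# Print one line per CTree: "<ctree> <status>" for a plugin/option pair.
# $1: CProject folder, $2: plugin/option path such as gene/human
plugin_status() {
  for ctree in "$1"/*/; do
    name=$(basename "$ctree")
    dir="$ctree/results/$2"
    if [ -f "$dir/results.xml" ]; then
      echo "$name results"
    elif [ -f "$dir/empty.xml" ]; then
      echo "$name empty"    # assumption: plugin ran, nothing found
    else
      echo "$name not-run"
    fi
  done
}

status=$(plugin_status "$project" gene/human)
printf '%s\n' "$status"
```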
54 |
55 | After running all plugins from `ami` via the `cmine` command, the full extent of a CProject looks like this:
56 |
57 | ```
58 | ├── PMC4831192
59 | │   ├── fulltext.xml
60 | │   ├── gene.human.count.xml
61 | │   ├── gene.human.snippets.xml
62 | │   ├── results
63 | │   │   ├── gene
64 | │   │   │   └── human
65 | │   │   │       └── empty.xml
66 | │   │   ├── sequence
67 | │   │   │   └── dnaprimer
68 | │   │   │       └── empty.xml
69 | │   │   ├── species
70 | │   │   │   ├── binomial
71 | │   │   │   │   └── results.xml
72 | │   │   │   └── genus
73 | │   │   │       └── empty.xml
74 | │   │   └── word
75 | │   │       └── frequencies
76 | │   │           ├── results.html
77 | │   │           └── results.xml
78 | │   ├── scholarly.html
79 | │   ├── sequence.dnaprimer.count.xml
80 | │   ├── sequence.dnaprimer.snippets.xml
81 | │   ├── species.binomial.count.xml
82 | │   ├── species.binomial.snippets.xml
83 | │   ├── species.genus.count.xml
84 | │   ├── species.genus.snippets.xml
85 | │   ├── word.frequencies.count.xml
86 | │   └── word.frequencies.snippets.xml
87 | ├── sequence.dnaprimer.count.xml
88 | ├── sequence.dnaprimer.documents.xml
89 | ├── sequence.dnaprimer.snippets.xml
90 | ├── species.binomial.count.xml
91 | ├── species.binomial.documents.xml
92 | ├── species.binomial.snippets.xml
93 | ├── species.genus.count.xml
94 | ├── species.genus.documents.xml
95 | ├── species.genus.snippets.xml
96 | ├── word.frequencies.count.xml
97 | ├── word.frequencies.documents.xml
98 | └── word.frequencies.snippets.xml
99 | ```
100 |
--------------------------------------------------------------------------------
/software-tutorials/getpapers/README.md:
--------------------------------------------------------------------------------
1 | # getpapers
2 |
3 | This tutorial covers the installation of getpapers, demonstrates how to construct simple and more complex queries, and shows what output can be expected from getpapers.
4 |
5 | ## Table of contents
6 |
7 | 1. [Description](#description)
8 | 1. [Preparations](#preparations)
9 | 1. [Installation](#installation)
10 | 1. [Input data](#input-data)
11 | 1. [Tutorial](#tutorial)
12 | 1. [Construct a simple query and compare results](#construct-a-simple-query-and-compare-results)
13 | 1. [Getting pdfs and other files](#getting-pdfs-and-other-files)
14 | 1. [Complex queries for EuropePMC](#complex-queries-for-europepmc)
15 | 1. [Summary and next steps](#summary-and-next-steps)
16 |
17 | ## Description
18 |
19 | **What is getpapers?**
20 |
21 | **getpapers** is one of the entry points of the ContentMine pipeline. It is designed for use in content mining, but you may also find it useful for quickly acquiring large numbers of papers for reading, or metadata for bibliometrics. getpapers accesses publisher APIs, queries them for search terms and returns metadata, PDFs or XMLs, and if available supplementary information. In contrast, the other entry point [quickscrape](../quickscrape/README.md) takes URLs as input and scrapes the whole page.
22 |
23 | **How does getpapers help me?**
24 |
25 | getpapers automates the search and download process and helps in building an initial corpus of documents for content mining.
26 |
27 | **Glossary**
28 | - JSON ([Wikipedia](https://en.wikipedia.org/wiki/JSON))
29 | - XML ([Wikipedia](https://en.wikipedia.org/wiki/XML))
30 | - API ([Wikipedia](https://en.wikipedia.org/wiki/API))
31 |
32 | ## Preparations
33 |
34 | ### Prerequisites
35 |
36 | ### Installation
37 |
38 | On the ContentMine-VM getpapers is already provided. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform.
39 |
40 | If you are a developer you can find the technical documentation for `getpapers` in its [repository](https://github.com/ContentMine/getpapers).
41 |
42 | ### Input data
43 |
44 | In this tutorial we will mainly use open access literature from [Europe PMC](http://europepmc.org/). We can search within their database of 3.5 million fulltext papers from the life sciences. About one million of these are Open Access. Please refer to [Europe PMC-Data](http://europepmc.org/FtpSite) for details.
45 |
46 | ## Tutorial
47 |
48 | After starting the VM, we open a terminal and start in the workshop folder.
49 |
50 | 
51 |
52 | ### Construct a simple query and compare results
53 |
54 | We will now check how many results we can expect for a search for `ursus maritimus` on Europe PMC. If no API is specified, getpapers will choose Europe PMC as the default data source. The query structure is the same as would be entered in the search field on the Europe PMC website. With the `-n` flag getpapers will report how many results match the query, but it will not actually download anything. `-o OUTPUTFOLDER` tells getpapers where to store results.
55 |
56 | ```
57 | getpapers -q 'ursus maritimus' -n -o ursus
58 | getpapers -q ursus maritimus -n -o ursus
59 | ```
60 | 
61 |
62 | Please note the different result numbers. With quotation marks, `-q 'ursus maritimus'` searches for the exact phrase, while `-q ursus maritimus` is less constrained and also returns matches for only `ursus` or `maritimus`.
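
The difference comes from the shell, which splits an unquoted query into separate words before getpapers ever sees it. The following sketch makes this visible with a small stand-in function, `count_args`, that is not part of getpapers and only reports how many arguments it received.

```bash
# Print how many separate arguments the shell hands to a command.
count_args() {
  echo "$#"
}

quoted=$(count_args -q 'ursus maritimus' -n -o ursus)
unquoted=$(count_args -q ursus maritimus -n -o ursus)

echo "quoted: $quoted arguments, unquoted: $unquoted arguments"
```

With quotes the two words travel together as one argument; without them they arrive as two independent words.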
63 |
64 | In the next step we will download the metadata of our results. To keep waiting times low we use the narrower search. If you chose too large a result set and don't want to wait, you can abort the query with `Ctrl + C`.
65 | ```
66 | getpapers -q 'ursus maritimus' -o ursus
67 | ```
68 | 
69 |
70 | This query creates two files, `eupmc_fulltext_html_urls.txt` and `eupmc_results.json`, and stores them in the `ursus` folder. Not all search results necessarily have a downloadable HTML version of the paper on Europe PMC. In rare cases no HTML fulltexts at all are available for the query, in which case the file `eupmc_fulltext_html_urls.txt` will not be created. We now have a look at the contents of the `eupmc_fulltext_html_urls.txt` file, which contains a list of HTML sources for the results.
71 | ```
72 | ls ursus
73 | wc -l ursus/eupmc_fulltext_html_urls.txt
74 | head ursus/eupmc_fulltext_html_urls.txt
75 | ```
76 | 
77 |
78 | The `eupmc_results.json` file contains metadata about each search result in the [JSON](https://en.wikipedia.org/wiki/JSON) format. Metadata include e.g. the DOI, the title, the authors, and additional bibliographic data.
79 |
80 | ### Getting pdfs and other files
81 |
82 | Until now, our queries only resulted in metadata and a list of URLs. PDF files can be retrieved by adding a `-p` flag to the query. Please note that a **generic query** will return a **large number of results** and lead to long processing times. You can cancel an unintended search with `Ctrl+C` in the command line.
83 |
84 | ```bash
85 | getpapers -q 'ursus maritimus' -o ursus -p
86 | ```
87 |
88 | For every PDF found, getpapers creates a subfolder named after the Europe PMC paper ID, which holds the fulltext.pdf and any future additional files. To have a look at the folder and file structure, use the `tree` command.
89 |
90 | ```
91 | tree ursus
92 | ```
93 |
94 | Not all results include a PDF, so we now run another query with the `-x` flag for XML results. In contrast to PDF, XML is a format that machines can understand well, and XML enables better mining results further down the pipeline.
95 |
96 | ```bash
97 | getpapers -q 'ursus maritimus' -o ursus -x
98 | ```
99 |
100 | The existing result folders are updated; no results are lost or overwritten.
101 |
102 | ```
103 | tree ursus
104 | ```
105 |
106 | As a last step, we will download supplementary information with the `-s` flag. Europe PMC provides supplementary information in compressed ZIP-files.
107 |
108 | ```bash
109 | getpapers -q 'ursus maritimus' -o ursus -s
110 | ```
111 |
112 | ```
113 | tree ursus
114 | ```
115 | 
116 |
117 | 
118 |
119 | This is the [CProject](../cproject/README.md)-structure, which is the main data structure of the ContentMine pipeline, and any further operations are going to be centered around the CProject.
120 |
121 | ### Complex queries for EuropePMC
122 |
123 | Queries are directed to the [Europe PMC API](http://europepmc.org/RestfulWebService). In their simplest form, they can be free text, like the one we executed before (`getpapers -q 'ursus maritimus' -o ursus -x`).
124 |
125 | Using the EuropePMC webservice's query language we can construct much more detailed queries. A selection of the most commonly useful search fields is explained [here](getpapers-eupmc-queries.md), and a complete documentation of possible queries is in Appendix I of the [EuropePMC reference PDF](http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf).
126 |
127 | For example we can restrict our search to only papers that mention 'ursus maritimus' in the abstract.
128 |
129 | ```bash
130 | getpapers -q ABSTRACT:ursus maritimus -o ursus -n
131 | getpapers -q ABSTRACT:'ursus maritimus' -o ursus -n
132 | ```
133 | 
134 |
135 | Please compare the result numbers again, with and without the quotation marks.
136 |
137 | We can use the logical operations `AND` and `OR`, and can group operations using brackets. Please note that in the shell we have to encapsulate the query with `'` when we use brackets and use double quotations for inner groupings.
138 |
139 | ```bash
140 | getpapers -q '(LICENSE:"cc by" OR LICENSE:"cc-by") AND ABSTRACT:"ursus maritimus"' -o ursus -n
141 | ```
142 |
143 | Search for papers which contain the phrase "ursus maritimus" in the introduction section and the phrase "survey" in the methods section.
144 | ```bash
145 | getpapers -q 'INTRO:"ursus maritimus" AND METHODS:survey' -o ursus -n
146 | ```
147 |
148 | Search for papers where the authors contain "Smith" and which were published in either "Biology" or "Cell". You can look up journals on the [Europe PMC journal list](http://europepmc.org/journalList?journals) by clicking on the magnifying glass in the column "Search this Journal".
149 | ```bash
150 | getpapers -q 'AUTH:Smith AND (JOURNAL:biology OR JOURNAL:cell)' -o ursus -n
151 | ```
152 |
153 | Download XML and PDF files for papers that contain "ursus maritimus" in the abstract and were published between 2010 and 2013.
154 | ```bash
155 | getpapers -q 'ABSTRACT:"ursus maritimus" AND PUB_YEAR:[2010 TO 2013]' -o ursus -p -x
156 | ```
157 |
158 | Search for papers that contain "ursus maritimus" in the title and were first published between July 2009 and June 2013.
159 | ```bash
160 | getpapers -q 'TITLE:"ursus maritimus" AND FIRST_PDATE:[2009-07-01 TO 2013-06-30]' -o ursus -n
161 | ```
162 |
163 | Search for papers about "ursus maritimus" where the European Research Council ("ERC") is mentioned in the acknowledgements section.
164 | ```bash
165 | getpapers -q '"ursus maritimus" AND (ACK_FUND:ERC OR ACK_FUND:"European Research Council")' -o ursus -n
166 | ```
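
Longer queries are easier to get right when they are assembled in a shell variable first. The sketch below uses a hypothetical helper, `field`, that is not part of getpapers; it only builds the query text, and the getpapers call itself is left as a comment.

```bash
# Hypothetical helper: wrap a phrase in double quotes and prefix it with a field name.
field() {
  printf '%s:"%s"' "$1" "$2"
}

query="$(field ABSTRACT 'ursus maritimus') AND PUB_YEAR:[2010 TO 2013]"
printf '%s\n' "$query"

# Passing the variable in double quotes hands getpapers one single -q argument:
# getpapers -q "$query" -o ursus -n
```

Printing the query before running it is a cheap way to spot a missing quote or bracket.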
167 |
168 | You can find some common query options [here](getpapers-eupmc-queries.md).
169 |
170 | ## Summary and next steps
171 |
172 | A getpapers query must consist of `-q QUERY -o OUTPUTDIRECTORY`.
173 | * QUERY: the term(s) you want to look for.
174 | * OUTPUTDIRECTORY: folder in which you want the output files and directories. The folder will be created if it does not already exist.
175 | * Use `-x` for machine-readable fulltext results (preferred, because XML-files provide better mining results in later stages of the tool chain).
176 | * Use `-p` if you want to retrieve human-readable fulltexts in PDF-format.
177 | * Use `-s` to download supplementary information.
178 | * Use `-n` to only check how many results can be expected without downloading anything.
179 | * By default, only Open Access papers will be returned.
180 | * Each API has a different native query language; please refer to the documentation ([EUPMC](getpapers-eupmc-queries.md), [ArXiv](getpapers-arxiv-queries.md), [IEEE](getpapers-ieee-queries.md)).
181 |
182 | **Next steps**
183 | * Back to the [tutorial overview](..)
184 | * Continue to [quickscrape](../quickscrape/README.md) for an introduction to scraping.
185 | * Continue to [norma](../norma/README.md) for the next step of the ContentMine pipeline.
186 | * Continue to [cproject](../cproject/README.md) for an introduction to the main datastructure.
187 |
--------------------------------------------------------------------------------
/software-tutorials/getpapers/getpapers-arxiv-queries.md:
--------------------------------------------------------------------------------
1 | [Official documentation](http://arxiv.org/help/api/user-manual)
2 |
3 | ### Restrict by fields
4 |
5 | | fields | description |
6 | |-----------|---------------|
7 | | `ti` | Title |
8 | | `au` | Author |
9 | | `abs` | Abstract |
10 | | `co` | Comment |
11 | | `jr` | Journal Reference |
12 | | `cat` | Subject Category |
13 | | `rn` | Report Number |
14 | | `id` | Id (use id_list instead) |
15 | | `all` | All of the above |
16 |
17 | ### Boolean operators
18 |
19 | ```
20 | AND # logical "and"
21 | OR # logical "or"
22 | ANDNOT # logical "exclude"
23 | ```
24 |
25 | ### Grouping operators
26 |
27 | | operator | description |
28 | |-----------|---------------|
29 | | `()` | Used to group Boolean expressions for Boolean operator precedence. |
30 | | `""` double quotes | Used to group multiple words into phrases to search a particular field. |
--------------------------------------------------------------------------------
/software-tutorials/getpapers/getpapers-eupmc-queries.md:
--------------------------------------------------------------------------------
1 | [Official Documentation, Appendix 1](http://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf)
2 |
3 | ### Restrict search by bibliographic metadata
4 |
5 | | Field | Description | Example |
6 | |-----------|---------------|---------------|
7 | | `PMCID:` | Search for a publication by its PubMed Central ID, where applicable (i.e. available as full text) | `PMCID:PMC1287967` |
8 | | `TITLE:` | Search for a term or terms in publication titles | `TITLE:aspirin`, `TITLE:"protein knowledgebase"` |
9 | | `ABSTRACT:` | Search for a term or terms in publication abstracts | `ABSTRACT:malaria`, `ABSTRACT:"chicken pox"` |
10 | | `AUTH:` | Search for a surname and (optionally) initial(s) in publication author lists | `AUTH:einstein`, `AUTH:"Smith AB"` |
11 | | `JOURNAL:` | Journal title, searchable either in full or abbreviated form | `JOURNAL:"biology letters"`, `JOURNAL:"biol lett"` |
12 | | `LICENSE:` | Search for content according to the assigned Creative Commons license (where provided). | `LICENSE:"cc by" OR LICENSE:"cc-by"`, `LICENSE:cc` |
13 | | `PUB_YEAR:` | Search by year of publication in YYYY format; note syntax for range searching. | `PUB_YEAR:2000`, `PUB_YEAR:[2000 TO 2001]` |
14 | | `E_PDATE:` | Electronic publication date, when an article was first published online. | `E_PDATE:2013-12-15`, `E_PDATE:20070930`, `E_PDATE:[2000-12-18 TO 2014-12-30]`, `E_PDATE:[20040101 TO 20140101]` |
15 | | `FIRST_PDATE:` | The date of first publication, whichever is first, electronic or print publication. Where a date is not fully available e.g. year only, an algorithm is applied to determine the value. | `FIRST_PDATE:1995-02-01`, `FIRST_PDATE:20000101`, `FIRST_PDATE:[2000-10-14 TO 2010-11-15]`, `FIRST_PDATE:[20040101 TO 20140101]` |
16 | | `P_PDATE:` | Print publication date of journal issue, when an article appeared in print format. | `P_PDATE:1982-10-01`, `P_PDATE:20140101`, `P_PDATE:[2000-12-18 TO 2014-12-30]`, `P_PDATE:[20031114 TO 20141115]` |
17 |
18 |
19 | ### Restrict by article metadata
20 |
21 | | Field | Description | Example |
22 | |---------------|--------------------------------------------------|-------------------------------------------------------|
23 | | `DISEASE:` | Search for mined diseases | `DISEASE:dysthymias` |
24 | | `GENE_PROTEIN:` | Search for records that have GENE_PROTEINS mined | `GENE_PROTEIN:gng11` |
25 | | `GOTERM:` | Search for records that have GOTERM mined | `GOTERM:apoptosis` |
26 | | `CHEM:` | Limit your search by MeSH substance | `CHEM:propantheline`, `CHEM:"protein kinases"` |
27 | | `ORGANISM:` | Search for mined organisms | `ORGANISM:terebratulide` |
28 | | `PUB_TYPE:` | Limit your search by publication type | `PUB_TYPE:review`, `PUB_TYPE:"retraction of publication"` |
29 |
30 | ### Section-level search
31 |
32 | | Field | Description | Example |
33 | |------------|----------------------------------------------------------------------|--------------------------------|
34 | | `INTRO:` | Find articles with a phrase in the Introduction & Background section | `INTRO:"protein interactions"` |
35 | | `METHODS:` | Find articles with a phrase in the Materials & Methods section | `METHODS:"yeast two-hybrid"` |
36 | | `RESULTS:` | Find articles with a phrase in the Results section | `RESULTS:"in vivo"` |
37 | | `DISCUSS:` | Find articles with a phrase in the Discussion section | `DISCUSS:cardiovascular` |
38 | | `ACK_FUND:` | Find articles with a phrase in the Acknowledgements & Funding section | `ACK_FUND:ERC` |
39 |
--------------------------------------------------------------------------------
/software-tutorials/getpapers/getpapers-ieee-queries.md:
--------------------------------------------------------------------------------
1 | [Official documentation](http://ieeexplore.ieee.org/gateway/)
2 |
3 | ### Search in Fields
4 |
5 | | field | description |
6 | |-----------|---------------|
7 | | `au` | Terms to search for in Author |
8 | | `ti` | Terms to search for in Document Title |
9 | | `ab` | Terms to search for in Abstract |
10 | | `doi` | Terms to search for in DOI |
11 | | `cs` | Terms to search for in Affiliations |
12 | | `jn` | Terms to search for in Publication Title |
13 | | `isbn` | Terms to search for in isbn |
14 | | `issn` | Terms to search for in issn |
15 | | `py` | Terms to search for in Publication Year |
16 | | `partnum` | Terms to search by Part Number |
17 | | `thsrsterms` | Terms to search for in Thesaurus Terms |
18 | | `cntrlterms` | Terms to search for in Controlled Index Terms |
19 | | `idxterms` | Terms to search for in Index Terms |
20 | | `md` | Terms to search for in all configured metadata fields and abstract. Accepts complex queries involving field names and boolean operators. |
21 | | `querytext` | Terms to search for in all configured metadata fields, abstract and document text. Accepts complex queries involving field names and boolean operators. |
22 |
23 |
24 | ### Filtering parameters
25 |
26 | | parameter | description |
27 | |-----------|---------------|
28 | | `oa` | Open Access only (1 - true; 0 - false) |
29 | | `pn` | Publication number |
30 | | `pys` | Start value of Publication Year to restrict results by. |
31 | | `pye` | End value of Publication Year to restrict results by. |
32 | | `pu` | Publisher. One of: IEEE/AIP/IET/AVS/IBM |
33 | | `ctype` | Content Type. One of: Conferences/Journals/Books/Early Access/Standards/Educational Courses |
34 |
35 | ### Paging parameters
36 |
37 | | parameter | description |
38 | |-----------|---------------|
39 | | `hc` | Number of records to fetch. Default: 25; Maximum: 1000 |
--------------------------------------------------------------------------------
/software-tutorials/installation/README.md:
--------------------------------------------------------------------------------
1 | # Installation
2 |
3 | The installation instructions have been moved; please look [here](http://contentmine.github.io/) for installing on your platform.
4 |
--------------------------------------------------------------------------------
/software-tutorials/norma/README.md:
--------------------------------------------------------------------------------
1 | # norma
2 |
3 |
4 | ## Table of contents
5 |
6 | 1. [Description](#description)
7 | 1. [Preparations](#preparations)
8 | 1. [Data Used](#data-used)
9 | 1. [Tutorial](#tutorial)
10 | 1. [XML to sHTML](#xml-to-shtml)
11 | 1. [PDF collections to TXT](#pdf-to-txt)
12 | 1. [Troubleshooting](#troubleshooting)
13 | 1. [Summary](#summary)
14 | 1. [Next Tutorial](#next-tutorial)
15 | 1. [Further Materials](#further-materials)
16 |
17 | ## Description
18 |
19 | **What norma does**
20 |
21 | Norma transforms different file formats such as PDF, XML or HTML provided by publishing websites and APIs into our internal standard for scientific literature called [scholarly HTML](../sHTML/). Norma can furthermore reverse engineer graphs and, for example, extract data points from a timeline.
22 |
23 | **Why do we need norma?**
24 |
25 | Norma parses and merges different formats and file standards of scientific literature, and offers a unified output which can be used for further processing.
26 |
27 | **How can I use norma?**
28 |
29 | Norma offers paths from three different input streams to sHTML:
30 | * from XML-files collected by an API-query of [getpapers](../getpapers/README.md)
31 | * from HTML-files collected by a URL-scrape of [quickscrape](../quickscrape/README.md)
32 | * from an existing collection of PDFs
33 |
34 | **What you will learn here**
35 |
36 | This tutorial shows you how to
37 | * normalize XML-files or HTML-files into sHTML
38 | * normalize a collection of PDFs into simple text files
39 |
40 | **How to use the tutorial**
41 |
42 | We use some conventions throughout the tutorial.
43 | - Placeholder variables are always written in capitals, like NAME, YOURDIRECTORY etc.
44 |
45 | **Glossary**
46 | - API
47 | - PDF
48 | - XML
49 | - HTML
50 | - Normalizing
51 | - Parsing
52 | - Scraping
53 |
54 |
55 | ## Preparations
56 | ### Prerequisites
57 |
58 | ### Used Software
59 | - [Future TDM Virtual Machine](LINK)
60 | - norma 0.1.4
61 |
62 | ### Installation
63 |
64 | On the ContentMine-VM norma is already provided. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform.
65 |
66 | You can find the technical documentation for `norma` in its [repository](https://github.com/ContentMine/norma).
67 |
68 | ## Data used
69 |
70 | 
71 |
72 | Norma can take four different file formats for the publications (XML, PDF, HTML, XHTML), plus additional files for supplementary materials and PNGs. getpapers or quickscrape can be used to create a corpus of papers, but you can also use your own PDFs or HTML. The task is to unify the different input streams into a single data format.
73 |
74 | ## Tutorial
75 |
76 | This tutorial is based on release [0.2.26](https://github.com/ContentMine/norma/releases).
77 |
78 | ### XML to sHTML
79 |
80 | 
81 |
82 | We continue after retrieving a number of papers with `getpapers -q 'ursus maritimus' -o ursus -x`.
83 |
84 | getpapers returns a [CProject](../cproject)-folder. It contains one subfolder per paper, holding the paper's associated files: fulltexts in PDF or XML format and, if requested, supplementary files. Just a quick reminder of what it may look like:
85 |
86 | 
87 |
88 | 
89 |
90 | We now transform the fulltext.xml files into sHTML. This can be done in bulk by passing the project folder with `--project`. The input/output parameters `-i` and `-o` name the file to read and the file to write in each paper folder. The parameter `--transform nlm2html` selects a specific transformation from one format to another.
91 |
92 | ```bash
93 | norma --project ursus -i fulltext.xml -o scholarly.html --transform nlm2html
94 | ```
95 |
96 | 
97 |
98 | norma will print a lot of process information in the form of format warnings.
99 |
100 | If you inspect the folder with `tree ursus`, you will see that norma has added a `scholarly.html` file to each paper folder where a format conversion was possible.
101 |
102 | 
103 |
104 | ### PDF to text
105 |
106 | PDF is a notoriously bad format for automatic processing. While understandable for the human reader, PDF is a real obstacle for computers, and therefore for content mining. This lies in the nature of the document, which, from a machine's perspective, is essentially a two-dimensional plane with symbols on it. The only information that a machine readily knows about any symbol is its x- and y-location on the plane. Meaning, relations to other symbols, and logical concepts are not present in a PDF and have to be reconstructed from outside information.
107 |
108 | This process is still at an experimental stage. It is best suited if you have an application for plain text analysis and want to extract the *raw content* of a PDF.
109 |
110 | Before we can apply norma to a collection of PDFs, we have to move them into a CProject structure. First place all target PDFs into a folder, and [cd](../shell) into it. You can then use the following code to create a folder named after each PDF, move each PDF into its folder, and rename it to fulltext.pdf.
111 |
112 | ```bash
113 | for fname in *.pdf; do
114 | filename=$(basename "$fname");
115 | filename="${filename%.*}";
116 | mkdir "$filename";
117 | mv "$fname" "$filename"/fulltext.pdf;
118 | done
119 | ```
120 |
121 | This is now our CProject folder. We then move one level back in the hierarchy with `cd ..`, in order to run norma from outside the folder.
122 |
123 | The folder structure should then look like this:
124 |
125 | ```
126 | tree my-pdfs
127 | my-pdfs
128 | ├── foldername1
129 | │ └── fulltext.pdf
130 | ├── foldername2
131 | │ └── fulltext.pdf
132 | ├── foldername3
133 | │ └── fulltext.pdf
134 | └── foldername4
135 | └── fulltext.pdf
136 |
137 | 4 directories, 4 files
138 | ```
139 |
140 | Finally, to convert the PDFs into text files, run the following norma command:
141 |
142 | ```bash
143 | norma --project my-pdfs/ -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt
144 | ```
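
To get a first impression of the results, you can count the words in each extracted text file; an empty or very short file suggests that the conversion did not work well for that PDF. This is only a sketch: `txt_word_counts` is a hypothetical helper, and it assumes norma wrote a `fulltext.pdf.txt` next to each `fulltext.pdf`, as requested with `-o` above.

```bash
# Hypothetical helper (not part of norma): print "<word count><TAB><path>"
# for every fulltext.pdf.txt found in the paper folders of a CProject.
txt_word_counts() {
  project="$1"
  for txt in "$project"/*/fulltext.pdf.txt; do
    [ -f "$txt" ] || continue                      # skip if nothing matched
    # tr strips the padding that some wc implementations add
    printf '%s\t%s\n' "$(wc -w < "$txt" | tr -d '[:space:]')" "$txt"
  done
}

txt_word_counts my-pdfs
```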
145 |
146 | ## Summary
147 |
148 | * use publications downloaded via getpapers and convert the XML to uniform sHTML with `norma --project PROJECTFOLDER -i fulltext.xml -o scholarly.html --transform nlm2html`
149 | * convert PDFs to text files with `norma --project PROJECTFOLDER -i fulltext.pdf -o fulltext.pdf.txt --transform pdf2txt`
150 |
151 | ## Next Tutorial
152 | * Back to the [tutorial overview](..)
153 | * Continue to [sHTML](../sHTML) if you want to learn more about scholarly HTML.
154 | * Continue to [ami](../ami) for the next step of the ContentMine pipeline.
155 |
--------------------------------------------------------------------------------
/software-tutorials/norma/notes.txt:
--------------------------------------------------------------------------------
1 | * mention CProject
2 | * mention it everywhere along the pipeline
3 | * catch: redirect to getpapers - carry out steps blabla
4 | * catch: redirect to quickscrape
5 | * check norma-logs for success/errors
6 | * test norma-logging levels
7 | * delete norma-paths
8 | * delete explanations that are peripheral
9 | * sHTML tutorial
10 | * delineate PDF more clearly: much worse quality, no sHTML
11 | * state the experimental status of PDF2text
12 | * explain PDF2text results
--------------------------------------------------------------------------------
/software-tutorials/quickscrape/README.md:
--------------------------------------------------------------------------------
1 | # quickscrape
2 |
3 | ## Table of content
4 |
5 | 1. [Description](#description)
6 | 1. [Installation](#installation)
7 | 2. [Scraper definitions](#scraper-definitions)
8 | 3. [Scraping](#scraping)
9 | 4. [Other sources](#other-sources)
10 | 5. [Summary and next steps](#summary-and-next-steps)
11 |
12 | ## Description
13 | **What is quickscrape?**
14 |
15 | Together with [getpapers](../getpapers/README.md), `quickscrape` is one of the entry points of the ContentMine pipeline. It is designed to enable large-scale content mining by retrieving PDFs, images and fulltext HTML files of scientific literature.
16 |
17 | **Why do we need quickscrape?**
18 |
19 | **How can I use quickscrape?**
20 |
21 |
22 | ## Installation
23 |
24 | On the ContentMine-VM quickscrape is already provided. If you want to install it locally or update it on the VM, please refer to the [installation instructions](http://contentmine.github.io/) for your platform.
25 |
26 | You can find the technical documentation for `quickscrape` in its [repository](https://github.com/ContentMine/quickscrape).
27 |
28 | ## Scraper definitions
29 |
30 | **Quickscrape is incomplete without the scraper definitions**. They are developed individually for each journal to account for different page layouts or HTML tags. Scraper definitions are maintained in a [separate repository](https://github.com/ContentMine/journal-scrapers.git), and it is possible to [create your own definitions](../journal-scrapers/README.md) for a journal.
31 |
32 | At the moment, definitions exist for the following publishers/journals:
33 | * BMC
34 | * PLoS
35 | * PeerJ
36 | * PNAS
37 | * elife
38 |
39 | You can download the newest scraper definitions with this command:
40 | ```bash
41 | git clone https://github.com/ContentMine/journal-scrapers.git
42 | ```
43 | The scraper definitions will then be found in `your_path/journal-scrapers/scrapers/`. Remember your path, because it will be needed later on.
44 |
45 | ## Scraping
46 |
47 | There are two possible inputs for quickscrape: a single URL or a list of URLs. A single URL can be passed directly as a parameter on the command line. A list of URLs can be collected manually or taken from a basic getpapers query. Quickscrape will then visit every URL in this list and grab everything it can. This includes sections according to tags, images or tables. This process depends heavily on the format that is provided by the publisher.
48 |
49 | ### Single URLs
50 |
51 | A minimum query consists of a URL (or URL list) and the path to a specific scraper (or a folder containing scraper definitions). You pass the URL with `-u` or `--url`, the scraper definition with `-s` or `--scraper`, and the output directory with `-o` or `--output`.
52 |
53 | ```bash
54 | quickscrape -u url -s journal-scrapers/scrapers/scraper.json -o test_folder
55 | ```
56 |
57 | The scraper you use should correspond to the URL you provide. If none is available, choose `generic_open.json`.
58 |
59 | ```bash
60 | quickscrape \
61 | --url https://peerj.com/articles/384 \
62 | --scraper journal-scrapers/scrapers/peerj.json \
63 | --output peerj-384
64 | ```
65 |
66 | 
67 |
68 | Quickscrape now creates a subfolder for each search result, named after the article source. It contains a fulltext.html with the scraping results and a results.json with metadata of the article, e.g. authors, title, abstract and bibliographic data. It may include other files such as fulltext PDFs, fulltext XMLs, or scraped images. This is also one of the starting points for a [ctree](../ctree/README.md), the main data structure of the ContentMine pipeline.
69 |
70 | ```bash
71 | tree peerj-384/
72 | peerj-384/
73 | └── https_peerj.com_articles_384
74 | ├── fig-1-full.png
75 | ├── fulltext.html
76 | ├── fulltext.pdf
77 | ├── fulltext.xml
78 | └── results.json
79 |
80 | 1 directory, 5 files
81 | ```
82 |
83 | ### getpapers URL-lists
84 |
85 | In the next example we take the output we get from a [basic getpapers query](../getpapers/README.md#construct-a-simple-query_and-compare-results), e.g. `getpapers -q 'dinosaurs' --api eupmc -o test_eupmc`. This returns two files in a search results folder: an *apiname*_results.json, which contains metadata about the search results, and a fulltext_html_urls.txt, which contains a list of URLs of fulltext papers. A *valid list of URLs* is a text file with exactly one valid URL per line. A *valid URL* is a URL that leads to a fulltext page, e.g. [https://peerj.com/articles/384](https://peerj.com/articles/384).
86 |
87 | ```bash
88 | tree test_eupmc
89 | test_eupmc
90 | ├── eupmc_results.json
91 | └── fulltext_html_urls.txt
92 |
93 | cat test_eupmc/fulltext_html_urls.txt
94 | http://europepmc.org/articles/PMC4040045
95 | http://europepmc.org/articles/PMC4055607
96 | http://europepmc.org/articles/PMC4022087
97 | http://europepmc.org/articles/PMC3394943
98 | http://europepmc.org/articles/PMC4448809
99 | http://europepmc.org/articles/PMC3115283
100 | ```
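
If you compile a URL list by hand, it can help to check its form before scraping. The sketch below uses `grep` to report every line that is not a single `http://` or `https://` URL; `check_url_list` is a hypothetical helper, not a quickscrape feature. It only checks the syntax of each line and cannot tell whether a URL really leads to a fulltext page.

```bash
# Hypothetical helper (not part of quickscrape): report lines of a URL list
# that are not exactly one http(s) URL. Prints "<line number>:<line>" for
# each offending line and returns 1 if any were found.
check_url_list() {
  [ -r "$1" ] || return 2                          # file missing or unreadable
  if grep -n -v -E '^https?://[^[:space:]]+$' "$1"; then
    return 1                                       # grep found bad lines
  fi
  return 0
}

# Example call:
# check_url_list test_eupmc/fulltext_html_urls.txt
```

Empty lines are reported as well, since they break the one-URL-per-line rule.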
101 |
102 | We now take the fulltext_html_urls.txt as input for quickscrape. quickscrape will choose a scraper automatically, if one is available. If not, a scraper with very generic definitions will be used, and the result will not be as precise.
103 |
104 | ```bash
105 | quickscrape -r test_eupmc/fulltext_html_urls.txt -d journal-scrapers/scrapers/ -o test_eupmc
106 | tree test_eupmc
107 | test_eupmc
108 | ├── eupmc_results.json
109 | ├── fulltext_html_urls.txt
110 | ├── http_europepmc.org_articles_PMC1234567
111 | │ ├── fulltext.html
112 | │ └── results.json
113 | ├── http_europepmc.org_articles_PMC1234568
114 | │ ├── fig-1-full.png
115 | │ ├── fig-2-full.png
116 | │ ├── fulltext.html
117 | │ ├── fulltext.pdf
118 | │ ├── fulltext.xml
119 | │ └── results.json
120 | ├── ...
121 | ...
122 | └── http_europepmc.org_articles_PMC4448809
123 | ├── fulltext.html
124 | ├── fulltext.pdf
125 | └── results.json
126 |
127 | 19 directories, 43 files
128 |
129 | ```
130 |
131 | ## Other sources
132 | From other searches or citation data you may have a list of DOIs ([Digital Object Identifier](https://en.wikipedia.org/wiki/Digital_object_identifier)), such as `https://dx.doi.org/10.7717/peerj.384`. This is not a valid URL input for quickscrape. You must first resolve the DOI, in this case [https://dx.doi.org/10.7717/peerj.384](https://dx.doi.org/10.7717/peerj.384) leads to [https://peerj.com/articles/384/](https://peerj.com/articles/384/). Other examples are [http://dx.doi.org/10.4103%2F1817-1745.131497](http://dx.doi.org/10.4103%2F1817-1745.131497) which leads to the [article on pediatricneurosciences.com](http://www.pediatricneurosciences.com/article.asp?issn=1817-1745;year=2014;volume=9;issue=1;spage=79;epage=81;aulast=Vitaliti), or [http://dx.doi.org/10.1074%2Fmcp.M111.014167](http://dx.doi.org/10.1074%2Fmcp.M111.014167) which leads to the [article on MCPOnline.org](http://www.mcponline.org/content/11/7/M111.014167). It is important to distinguish between the DOI and the landing page. **Content can only be scraped from the landing page.**
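
One way to resolve a DOI on the command line is to let `curl` follow the redirects and print the URL it ends up at. This is a sketch under assumptions: `resolve_url` is a hypothetical helper name, and it assumes `curl` is installed (it is not mentioned elsewhere in this tutorial). Some publishers redirect further with JavaScript or cookies, in which case the printed URL may not be the final landing page.

```bash
# Hypothetical helper: follow HTTP redirects and print the final URL.
#   -L  follow redirects
#   -s  silent, no progress output
#   -o /dev/null  discard the page content
#   -w '%{url_effective}\n'  print the URL that was fetched last
resolve_url() {
  curl -Ls -o /dev/null -w '%{url_effective}\n' "$1"
}

# Example call (needs network access):
# resolve_url https://dx.doi.org/10.7717/peerj.384
```

The printed URL, not the DOI, is what you pass to quickscrape with `-u`.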
133 |
134 | ## Summary and next steps
135 |
136 | * A minimum query consists of a URL (or URL-list) and the path to a specific scraper (or a folder containing scraper definitions).
137 | * Please be a respectful and responsible miner and apply a reasonable rate limit with `-i` or `--ratelimit` (recommended between 3 and 6 scrapes per minute). Note that `-r` is the option for the URL list, not for the rate limit.
138 | * The result will be a collection of [ctrees](../ctree/README.md) containing fulltexts in various formats (PDF, XML), a results.json with metadata, and possibly images.
139 |
140 | **Next steps**
141 | * Continue to [journal-scrapers](../journal-scrapers/README.md) if you want to define your own scraper.
142 | * Continue to [norma](../norma/README.md) for the next step of the ContentMine pipeline.
143 | * Continue to [ctree](../ctree/README.md) for an introduction of the main datastructure.
144 |
--------------------------------------------------------------------------------
/software-tutorials/sHTML/README.md:
--------------------------------------------------------------------------------
1 | See [the sHTML spec](https://github.com/ScholarlyHTML/spec/).
2 |
3 | Breaking news - 2015-12 - There is now a very exciting [new W3C development in ScholarlyHTML](https://github.com/w3c/scholarly-html), which we are
4 | deeply involved in helping to create.
5 |
--------------------------------------------------------------------------------
/software-tutorials/shell/README.md:
--------------------------------------------------------------------------------
1 | .png)
2 |
3 | ### Basic shell commands
4 |
5 | In general: autocompletion with 'tab' may save you a lot of typing. If you want to interrupt and cancel the execution of a command, press ```CTRL``` + ```c```.
6 |
7 | **ls**: **l**i**s**ts files and directories, taking the current working directory as starting point. You can also look into the content of subdirectories by extending the path ```ls dir/nested_dir/nested_dir2``` ([wikipedia](https://en.wikipedia.org/wiki/Ls)). We inspect which files and folders are present in our current working directory and in the `shell-example` folder:
8 | ```
9 | ls
10 | ls shell-example
11 | ```
12 | 
13 |
14 |
15 | **tree**: The tree command provides a hierarchical overview of a folder's content, including subfolders. Please note that if you work on your local machine and not within our virtual machine, you may have to [install tree first](https://askubuntu.com/questions/431251/how-to-print-the-directory-tree-in-terminal).
16 | ```
17 | tree shell-example
18 | ```
19 | 
20 |
21 |
22 | **cd**: **c**hange **d**irectory, moves the working directory to the target location ([wikipedia](https://en.wikipedia.org/wiki/Cd_(command))). You can move back up in the directory hierarchy with `cd ..`. If you want to navigate to an absolute path, you have to start with a "/", e.g. `cd /home/workshop/workshop/`.
23 | ```
24 | cd shell-example
25 | ls
26 | cd ..
27 | ```
28 | 
29 |
30 |
31 | **mkdir**: **m**a**k**e **dir**ectory, creates a new directory ([wikipedia](https://en.wikipedia.org/wiki/Mkdir)).
32 | ```
33 | mkdir new-folder
34 | ls
35 | ```
36 | 
37 |
38 |
39 | **mv**: **m**o**v**es files and directories from the first location to the second. You can move them further down into already existing directories, but also up with ```mv dir ..```, and into the current directory with ```mv lower_dir ./new_lower_dir``` ([wikipedia](https://en.wikipedia.org/wiki/Mv_(Unix))). mv is also used to rename files or folders, e.g. ```mv old_filename.txt new_filename.txt```.
40 | ```
41 | mv new-folder shell-example
42 | ls shell-example
43 | mv shell-example/new-folder ./
44 | ls
45 | ```
46 | 
47 |
48 |
49 | **cp**: **c**o**p**ies files from the first location to the second ([wikipedia](https://en.wikipedia.org/wiki/Cp_(Unix))). If you want to copy a folder, you have to use ```cp -r source_dir target_dir``` where ```-r``` stands for recursive.
50 | ```
51 | cp -r shell-example shell-example-2
52 | ls
53 | tree shell-example
54 | tree shell-example-2
55 | ```
56 | 
57 |
58 |
59 | **rm**: **r**e**m**oves the specified file ([wikipedia](https://en.wikipedia.org/wiki/Rm_(Unix))). If you want to remove a directory, use ```rm -r dir``` but be sure you want this.
60 | ```
61 | rm -r shell-example-2
62 | ls
63 | ```
64 | 
65 |
66 |
67 | **cat**: con**cat**enates the content of one or more files and prints it to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Cat_%28Unix%29)).
68 | ```
69 | cat shell-example/lines.txt
70 | ```
71 | 
72 |
73 |
74 | **head**: prints the first *n* lines to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Head_(Unix))). Default is 10.
75 | ```
76 | head -5 shell-example/lines.txt
77 | ```
78 | 
79 |
80 |
81 | **tail**: prints the last *n* lines to the terminal ([wikipedia](https://en.wikipedia.org/wiki/Tail_(Unix))). Default is 10.
82 | ```
83 | tail -5 shell-example/lines.txt
84 | ```
85 | 
86 |
87 |
88 | **wc**: Counts different things in a file, e.g. words, lines, or bytes ([wikipedia](https://en.wikipedia.org/wiki/Wc_%28Unix%29)). **wc -l filename** counts lines.
89 | ```
90 | wc -l shell-example/lines.txt
91 | wc -w shell-example/lines.txt
92 | ```
93 | 
94 |
95 |
96 | * [Return to tutorial overview](..)
97 | * [Proceed to getpapers-tutorial](../getpapers)
--------------------------------------------------------------------------------
/software-tutorials/vms/README.md:
--------------------------------------------------------------------------------
1 | # ContentMine Virtual Machines
2 |
3 | ## Table of Contents
4 |
5 | 1. [Description](#description)
6 | 2. [Installation](#installation)
7 | 3. [Troubleshooting](#troubleshooting)
8 | 4. [Components](#components)
9 |
10 | ## DESCRIPTION
11 |
12 | **What is a virtual machine?**
13 | A virtual machine is a simulated operating system 'within' your operating system (think Inception for operating systems). It consists of two parts:
14 | * [VirtualBox](https://www.virtualbox.org/): the software which runs the virtual machine
15 | * [ContentMine virtual machine image](): an image in which the whole operating system with its configuration and our ContentMine software packages are located
16 |
17 | **Why does ContentMine use a virtual machine?**
18 | Virtual machines make it easy to use pre-configured software environments on different operating systems. In our case, this allows us to run the ContentMine software easily on all kinds of operating systems (Linux, Windows and Mac). This is used mostly for hands-on workshops and allows all attendees to run the software without having to modify their own systems. We can then start content mining quickly and smoothly, with a minimum of fuss.
19 |
20 | **How can I use the virtual machine?**
21 | For this, you have to install [VirtualBox](https://www.virtualbox.org/) and start the [ContentMine virtual machine image]() from it. You will find more details in the installation section.
22 |
23 | ## INSTALLATION
24 |
25 | ### Install VirtualBox
26 | VirtualBox runs on Windows, Linux, Macintosh, and Solaris and is licensed under the GNU General Public License (GPL) version 2.
27 |
28 | **Requirements**
29 | * Reasonably powerful x86 hardware. Any recent Intel or AMD processor should do.
30 | All other requirements depend on the virtual machine image; you can find them below.
31 |
32 | 1. Download the VirtualBox platform installer for your operating system from the [VirtualBox website](https://www.virtualbox.org/wiki/Downloads).
33 | 2. Run the installer and follow the on-screen instructions.
34 |
35 | ### Installing the ContentMine Virtual Machine image
36 |
37 | **Requirements**
38 | * 64-bit architecture (unless otherwise stated)
39 | * 3 GB RAM
40 | * Adequate hard drive space for the VM (at least 5 GB)
41 |
42 | 1. Download the required ContentMine virtual machine image
43 | * 'Current VM' as at [May 2016](https://drive.google.com/open?id=0B7pJKedx9b97QVJWUzNnMGdPb00)
44 | * getpapers v0.4.5
45 | * quickscrape v0.4.7
46 | * norma v0.2.26
47 | * ami v0.2.24
48 | * 'Contentmine-FTDM' VM for Cambridge workshop/s [2015-12-10/11](http://contentmine.org/wp-content/uploads/static/contentmine-VM.ova) (3534356480 bytes on MAC-OSX)
49 | * 'Biology' VM for University of Bath workshop [28/07/15](https://onedrive.live.com/redir?resid=1652077CF1AA4E9F!1280&authkey=!AGyzu9zuzzKeJok&ithint=file%2cova)
50 | * 'Neuro' VM for Edinburgh Neuroscience hack [26/05/15] - [direct link](https://www.dropbox.com/s/yes9af47fn8vnz7/ContentMine-VM.ova?dl=0)
51 | * 'Cochrane' VM for Oxford Cochrane centre workshop [2015/03/15] - [direct link](https://drive.google.com/file/d/0B6ChGXuXmOEDemRtb1JBakREYWc/view?usp=sharing)
52 | * 'Playground' VM for [EBI workshop](https://github.com/ContentMine/EBI_workshop_20150330) [2015/03/30] - [direct link](https://drive.google.com/uc?export=download&confirm=dp8f&id=0B6ChGXuXmOEDNWx2d0EwbDkyY00) - [installation instructions](https://github.com/ContentMine/EBI_workshop_20150330/blob/master/docs/pre-workshop_installation.pdf)
53 |
54 | The image is fairly large (>1GB, now ca 3.3GB). Depending on your connection, the download can take between 10 and 60 minutes (or much longer if you have a very slow connection).
55 | 2. Double-click on the downloaded file (`.ova` file extension). VirtualBox will open and offer to import the virtual machine. Please follow the on-screen instructions to complete the import.
56 | 3. Configure the import of the image
57 |
58 |
59 | This is a series of screenshots to show what you should be seeing when you first install the Virtual Box and Virtual Machine. These are for a MAC-OSX and there will be minor differences for other OS.
60 |
61 | * Virtual Box Download (e.g. from https://www.virtualbox.org/wiki/Downloads);
62 |
63 |
Pick your operating system
64 |
65 |
66 | * Virtual Box Installation:
67 | (MAC-OSX) click on downloaded file (creates
68 |
69 |
70 | Then click on the package/box icon (1) and it should install in `Applications | VirtualBox.app`
71 |
72 | * Possible error (ignore). You might see:
73 |
74 |
75 | If so, click the "Do not show this message again" and continue.
76 |
77 |
78 | ### Starting the ContentMine Virtual Machine image
79 |
80 | After installing VirtualBox and importing the virtual machine image you can select the machine from the VirtualBox interface.
81 |
82 | 
83 |
84 | Please go to "Settings" first and make sure you allocate at least 2000MB RAM to the Base Memory. Then click OK.
85 |
86 | 
87 |
88 | To start the virtual machine, select your image and click the "Start" button.
89 |
90 | After a few seconds you land on the desktop.
91 |
92 | 
93 |
94 | You can shut down the VM by right-clicking and then choosing "Exit" and "Power off".
95 |
96 | **If you attend a workshop: Please try to get the virtual machine running _before_ the workshop. There will be little time on the day to help with VirtualBox issues!**
97 |
98 |
99 | * Starting the VM. (MAC-OSX) Click on the `Applications | VirtualBox.app` and you may see
100 |
101 |
102 |
103 | * When the VM is ready you will see
104 |
105 |
106 | (and on MAC-OSX the icons:
107 |
)
108 |
109 | There can be more than one VM - we release different ones for different tutorials, and you can switch between them on the LH side.
110 |
111 | * when the VM is running you should see a screen such as:
112 |
113 |
114 |
115 | * Right-click on the main window and get a popup:
116 |
117 |
118 |
119 |
120 | * Select `terminal` and you will get:
121 |
122 |
123 |
124 | * Try ```ls -lt``` at the command line. If it comes out with a German ß, type
125 | ```setxkbmap gb``` to switch the keyboard to GB or ```setxkbmap us``` for US.
126 |
127 | ### Usage
128 |
129 | Basic entry to the different applications starts with a right click on the desktop. The following options are of interest to us:
130 | * Terminal: command line interface. This is the basic way to operate the ContentMine software. It opens a text-based interface, from where we can navigate folders, look into files, and interact with the ContentMine software.
131 | * File Manager: visually navigate through the folders
132 | * Web Browser: go onto the web.
133 |
134 | **Terminal**
135 | The execution of the updates can take a while (depending on your internet connection). After everything is checked, you should see a window like this:
136 |
137 | 
138 |
139 | You can maximize it to fullscreen by double clicking on the title bar.
140 |
141 | **Command Line / Shell**
142 |
143 | The command line is going to be the main interface with ContentMine. Some basic commands for using and navigating the command line are documented [here](../shell/README.md), please have a look if you are new to using the command line.
144 |
145 | **Copy & Paste**
146 |
147 | The settings of the virtual machine allow you to share the clipboard with the host machine. That means you can copy any text from the host machine (e.g. browser, text file, terminal) and paste it into the application you are using in the virtual machine, and vice versa.
148 |
149 | If you want to paste something into the command line, this is possible with right-click+"Paste", or ```Ctrl+Shift+V```. If you want to copy something out of the command line, e.g. an error message, highlight the message with the cursor, and then either right-click+"Copy" or use ```Ctrl+Shift+C```.
150 |
151 | **File import/export between the host system and the vm**
152 |
153 | If you want to transfer files between the host system and the VM, you have to set up a shared folder. This has to be done in VirtualBox before starting the VM. Go to "Settings-Shared Folders".
154 |
155 | ## TROUBLESHOOTING
156 |
157 | The most common error is an incomplete download of the large VM image file. Please verify that the download has completed successfully and _fully_.
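
Where a byte size is listed for an image (for example 3534356480 bytes for the 'Contentmine-FTDM' image above), you can compare it with the size of your downloaded file. The following is a minimal sketch; `check_download_size` is a hypothetical helper, and the file name in the example call is taken from the download link and may differ on your machine.

```bash
# Hypothetical helper: compare a file's size in bytes with an expected value.
# Returns 0 on a match, 1 on a mismatch, 2 if the file does not exist.
check_download_size() {
  file="$1"
  expected="$2"
  [ -f "$file" ] || { echo "missing: $file"; return 2; }
  # tr strips the padding that some wc implementations add
  actual=$(wc -c < "$file" | tr -d '[:space:]')
  if [ "$actual" -eq "$expected" ]; then
    echo "ok: $file has $actual bytes"
  else
    echo "size mismatch: $file has $actual bytes, expected $expected"
    return 1
  fi
}

# Example call:
# check_download_size contentmine-VM.ova 3534356480
```

A mismatch usually means the download was interrupted and should be repeated.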
158 |
159 | If you have any problems getting VirtualBox, or downloading and starting the virtual image, don't worry. Please contact the workshop organizer for support, or ask us on our [website](http://contentmine.org/contact).
160 |
161 | Note that some VMs may have a German locale and you may need to issue ```setxkbmap gb ``` or ```setxkbmap us``` in the terminal to change the keyboard settings.
162 |
163 | ## COMPONENTS
164 |
165 | The VM comes with the following packages and environments installed:
166 |
167 | Environments:
168 | - node.js v0.10.24
169 | - npm 1.3.21
170 | - zsh 4.3.17
171 |
172 | ContentMine tools:
173 | - getpapers - version
174 | - quickscrape - version
175 | - norma - version
176 | - AMI-plugins - version
177 |
178 | Data analysis packages:
179 | - [anaconda](http://continuum.io/downloads#py34)
180 | - [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
181 | - [R 3.2]
182 |
--------------------------------------------------------------------------------
/training-guidelines/README.md:
--------------------------------------------------------------------------------
1 | # Facilitator Materials
2 |
3 | This directory contains resources to **organize**, **communicate**, **evaluate** and/or **facilitate a ContentMine training**. The following files are available to use:
4 |
5 | * [General guidelines](general.md)
6 | What should the setting and atmosphere be like? What should your mindset as a facilitator be? The general guidelines offer ideas about giving a workshop. A 'lessons learned' section describes solutions to avoidable mistakes in planning or running a workshop.
7 | * [Workflow](workflow.md)
8 | The workflow gives facilitators and organizers some orientation on which steps to perform when, and which dependencies exist. It also includes a checklist which helps facilitators and organizers prepare their sessions.
9 | * [Evaluation & Assessment](evaluation-assessment.md)
10 | What to prepare for? What can be improved? This document describes how to identify the needs of your audience before the workshop and how to evaluate the workshop afterwards. Improving the materials and guidelines through GitHub issues or other feedback helps facilitators create better workshops and gives participants a better experience.
11 | * [Teaching resources](teaching.md)
12 | This contains a description of the methods used in the workshop as well as links to helpful external teaching resources.
13 |
--------------------------------------------------------------------------------
/training-guidelines/evaluation-assessment.md:
--------------------------------------------------------------------------------
1 | # Assessment and Evaluation
2 |
3 | Finding out the needs and the prior knowledge of the participants (assessment) and how well the workshop met them (evaluation) are two important steps for improving the quality of the trainings, the training materials and the experience for participants (and facilitators). Allocate time and resources before, during and after a workshop to identify the needs of your audience and check how well they were met afterwards.
4 |
5 | ## Pre-Training Assessment
6 |
7 | Pre-training assessments should fill the gaps in the facilitators' knowledge about the participants, so that the training can be prepared as well as possible for them. This leads to better trainings with fewer problems, but you should still decide for each training whether the effort is worth it. This depends mostly on organizational parameters, such as how much time is left in advance and how well you already know your target audience.
8 |
9 | Steps to do:
10 | 1. Create a registration form with an email field
11 | 2. Think about which variables regarding your participants are uncertain. What do you want to know from them? => Ask yourself who they are.
12 | 3. Create a survey
13 | 4. Send out the link to the survey at least one week in advance
14 | 5. Evaluate the survey before the training
15 | 6. Implement the outcomes of the evaluation in the training concept and content
16 |
17 |
18 | The survey should answer questions like:
19 | * Who is participating?
20 | * Profession
21 | * Skill levels: TDM, Programming, Research, (Newbie, advanced or expert)
22 | * Technical systems: OS, Software they use, etc.
23 | * What is the prior knowledge regarding the content of the training?
24 | * What do the participants expect (given what was communicated about the session in advance)?
25 |
26 | It is also important to think about questions you want to re-validate after the workshop, so you can check the participants' progress in the skills and their confidence in applying them.
27 |
28 | ## Post-Training Evaluation
29 |
30 | The survey after the training should evaluate the progress achieved by the training, the satisfaction of the participants with the training, and possible further improvements.
31 |
32 | Steps to do:
33 | 1. Set up the evaluation form
34 | 2. Send out an email with the link
35 | 3. Send a reminder two weeks later to those who have not yet responded
36 | 4. Evaluate the results
37 | 5. Incorporate the outcomes into the training materials, the training concepts and the strategy
38 |
39 | For effective training and learning evaluation, the principal questions should be:
40 |
41 | * To what extent were the identified training needs objectives achieved by the programme?
42 | * To what extent were the learners' objectives achieved?
43 | * What specifically did the learners learn or was useful to be reminded of?
44 | * What was the participants' prior expertise, and what is their expertise in certain fields after the training (programming, legal issues, ContentMine project)?
45 | * What commitment have the learners made about the learning they are going to implement on their return to work?
46 | * What do they want to learn next (in general, in terms of content mining)?
47 | * What can be improved?
48 |
49 |
--------------------------------------------------------------------------------
/training-guidelines/how-to-setup-a-training.md:
--------------------------------------------------------------------------------
1 | # How to set up a Training
2 |
3 | 1. [Goals and Strategy](#goals-and-strategy)
4 |
5 | 2. [Work out the Training](#work-out-the-training)
6 |
7 | 3. [Preparations](#preparations)
8 |
9 | 4. [On Spot](#on-spot)
10 |
11 | 5. [Follow Up](#follow-up)
12 |
13 | 6. [Lessons Learned](#lessons-learned)
14 |
15 | This document should help you to work out the training materials, develop a proper training concept and execute it on the spot. Additionally, use the [checklist.md](LINK), which supports you through this process.
16 |
17 | This document works together with [checklist.md](LINK). Here you get practical support on how to work out a training concept and what to do in the main steps toward it. The [checklist.md](LINK) is an easy-to-use list for checking whether you have done the most important steps, which are described in more detail in this document.
18 |
19 | ## 1. Goals and Strategy
20 |
21 | A training can be quite easy if you know exactly what you want to do and what your goal is. But there is no generic answer, so you have to think this through from the beginning for every training, every host and every group of participants. Also, don't underestimate the opportunity to improve your own training skills.
22 |
23 | The first task is to identify the goal(s) and work out a proper strategy to reach them with the training. This is the shortest section in the guideline, but if you spend too little time on it, the outcome will most likely not satisfy you and a lot of time may be spent heading in the wrong direction.
24 |
25 | Take some time and think in depth about the following points:
26 | * Who is your target audience? Think in detail: age, expertise, prior knowledge, expectations, diversity, etc.
27 | * What should be the concrete outcome for the participants, for the hosts and for ContentMine?
28 | * What is needed to achieve the outcomes?
29 | * What is the problem you want to solve with the things you teach?
30 | * What does the host expect?
31 |
32 | **Think especially about a strong narrative and a strong use-case you want to offer the participants.**
33 |
34 | Prepare short, concise answers to these questions:
35 | * What is a fact?
36 | * What is content mining?
37 | * What is ContentMine?
38 | * What is content mining good for?
39 | * What is ContentMine good for?
40 | * What does it have to do with the TDM exception?
41 |
42 | And to sum it up: **Make everyone happy: Hosts, participants and you, the facilitator!**
43 |
44 | ## 2. Work out the Training
45 |
46 | ### Workflow
47 | * plan time for changing the room setup
48 | * think about participants who arrive late: how can you include them most easily, and how often will this happen?
49 | * use the [checklist.md](LINK) and add your specific tasks, with time estimates, to it.
50 | * check previous trainings: GitHub repository, documents, follow-ups, lessons learned
51 | * work in groups as often as possible. Peer learning is a great way to have fun and learn something
52 | * check whether people are excluded from some activities, e.g. non-devs, people without a laptop, etc.
53 | * prepare offline alternatives; WIFI can always fail.
54 | * Have a good balance between hands-on sessions, social interactive sessions and theoretical talks.
55 | * Check for time pressure anywhere in the timeline, both while planning and after finishing it. Keep free space for everyone to relax. People should never think “I need a break!”.
56 | * Identify the central transitions from one block to the next and think about what is necessary for them, e.g. from presentation to hack session, or from world cafe group to the big circle
57 | * think about the transition from the last session and into the next session
58 |
59 | ### Culture
60 | * create an inclusive and diverse space: no racism, no sexism, etc.
61 | * have fun
62 | * be positive
63 | * help people
64 | * don't blame them for not knowing something!
65 | * be patient
66 |
67 | ### Content
68 | * be clear about problems, limitations and advantages, but always offer a solution
69 | * Is there some pre-knowledge assumed?
70 | * Is the content too difficult?
71 | * do you use too many technical terms?
72 | * don't overload participants with too much information
73 |
74 | ### Session Formats
75 |
76 | Here are some common types of sessions that can be combined into a full training schedule.
77 | * always sum up after each block of content what you did and why
78 |
79 | #### Introduction
80 | * tell your name and your background: why are you interested in Content-Mining?
81 | * tell about ContentMine: website, the idea, Shuttleworth Fellowship, Peter Murray-Rust, team
82 | * use established narratives [narratives.md](LINK)
83 | * give an overview of what will happen in the training
84 | * tell what problem we want to solve and why CM is important
85 | * short introduction round: why are they here, name, activities/profession
86 |
87 | #### Presentation
88 | This is a list of basic tips on how to present well. If you are looking for how to create slides, see "Create Training Materials" > "Slides" below.
89 | * ask questions to the audience: have they heard of it? etc.
90 | * tell a strong story
91 | * ask yourself: what is the purpose of the talk?
92 | * breathe out before you start and pause for 2-3 seconds
93 | * stand still: imagine your feet growing into the floor like the roots of a tree to give you a stable stance
94 | * relax and try to find something positive/funny about the presentation when you are nervous
95 |
96 | #### Hack Session
97 | A Hack Session is when the participants, together with the facilitators, hack on their own laptops in an open setting. Sessions can be short (5-15 minutes) or longer (over an hour). This guide offers some basic support, but a lot depends on which software, use case and data you use. The sessions can be done individually, but they usually work better in groups of 2-3 people.
98 |
99 | Hack Sessions can be combined easily with On-Screen sessions. After the introduction on the screen, people can hack on a sandbox or a small worked-out task by themselves in the open.
100 |
101 | Hack Sessions work best for around 10 to 20 participants with 2 facilitators. For every additional 10 participants, another facilitator is necessary or at least helpful.
102 |
103 | **Purpose**
104 |
105 |
106 | * let people do something by themselves
107 | * let everyone create individual results if possible
108 | * let them learn in groups
109 | * guide them to something helpful for their own interests
110 | * learning by doing
111 | * show them the power and magic of content mining in practice
112 |
113 | **Preparations**
114 |
115 |
116 | * work out tasks for the participants
117 | * small tasks of 10 minutes are often better than a one-hour slot for one big project. Consider splitting a big slot into 4 smaller chunks
118 | * find information xyz
119 | * we have x, we want z, please do y
120 | * tell them the time slot and the goal they should achieve
121 | * Prepare an offline alternative
122 |
123 | **Timeline**
124 | * explain the session: what is the goal, what do they have to do for it, why is this of interest to them, when will it end
125 | * show a demo (optional): sometimes it can be helpful to offer some concrete ideas
126 | * ask if every group has a laptop, the data and if the software is running
127 | * ask advanced participants to help others once they have finished
128 | * ask if everything is clear
129 | * start hacking session
130 | * walk around: talk and help people
131 | * inform 5min before session ends
132 | * end hacking session
133 | * let people present their results and interpretations. This is where the real magic happens and their own thoughts get connected with those of others.
134 | * what have you found?
135 | * what were the problems?
136 | * did you solve them? yes, how? no, what barrier?
137 | * discuss/compare outcomes
138 | * sum up at the end
139 |
140 | **Possible Problems**
141 | * WIFI is not working
142 | * too many participants: form new groups of up to 4 people, but not more.
143 |
144 | #### On Screen Tutorials
145 |
146 | This session shows hands-on material on the screen and optionally lets everyone in the training follow the steps shown (e.g. explaining an IPython notebook).
147 |
148 | If you want to know more about coding stuff for workshops, look under Preparations -> Software.
149 |
150 | **Purpose**
151 |
152 |
153 | * show the audience some use cases which can be done with ContentMine.
154 | * Introduce the audience to the ContentMine software
155 | * offer some pre-coded functionality to make more sophisticated analyses and applications easier for the audience.
156 |
157 | **Preparations**
158 |
159 |
160 | * keep in mind that the beamer may have a lower resolution
161 | * Execute the tutorials in a sequential way
162 | * plan breaks when blocks are done
163 | * look for supporters: because one person is presenting at the front the whole time, there is little capacity to support participants and solve problems on their machines
164 |
165 | **Timeline**
166 |
167 | * explain the things you will do
168 | * ask people to help each other
169 | * check if the system is running
170 | * start the presentation
171 | * sum up at the end
172 |
173 | **Didactics**
174 |
175 | * take it easy and make intentional breaks to let users catch up.
176 | * take your time to type and execute commands. Be aware that after a command is executed, the command itself may disappear from the screen, so it is no longer visible to the participants. When you want to use shortcuts like tab completion, explain and show them.
177 |
178 |
179 | **Possible Problems**
180 |
181 |
182 | * people get stuck and fall behind: plan breaks and have someone available to support and help
183 | * the data gets changed
184 |
185 | #### Discussions
186 |
187 | A discussion is a familiar format, so not much needs to be said about it. This section mostly offers some hints for moderating discussions and getting them started.
188 |
189 | **Purpose**
190 |
191 |
192 | * let everyone speak
193 | * offer an inclusive environment
194 | * let people connect with each other
195 |
196 | **Preparations**
197 |
198 | * prepare a backup plan, if no one wants to start.
199 | * prepare 2-3 questions, especially one to kick-start the discussion
200 | * try to identify in advance people who are likely to start
201 |
202 | **Timeline**
203 |
204 | * explain the goal of the discussion
205 | * tell duration
206 | * start discussion: ask a general question. If no one starts, ask a person directly
207 | * moderate the comments. Everyone should say something
208 | * stop discussion: when time is over, or ask participants if they want to extend by 5-15min (but not more!)
209 |
210 | **Possible Problems**
211 | * no one says anything: wait, stay relaxed and offer a friendly and welcoming space. After 1min of waiting (but before it gets too awkward), ask a person directly again
212 |
213 | #### World Cafe
214 |
215 | **Concept**
216 |
217 | A World Cafe is a discussion format that brings people and ideas together in small, intimate groups of 3-5 people. The groups discuss several topics proposed by the facilitators in a defined amount of time. Normally you run 3 rounds of around 10-20 minutes each. The length and the number of rounds depend mostly on the time you have, but also on what you want to achieve with your questions.
218 |
219 | **Purpose**
220 | * Connect with the people in a meaningful way
221 | * Let people interact on a personal and direct level
222 | * Create a social atmosphere
223 | * Get straight into the thoughts and ideas of the people, no talking around issues!
224 |
225 | **Application**
226 | * After people did some content mining, more towards the end of a workshop
227 | * useful to wrap up things and get a meta-perspective
228 | * After theoretical, one-to-many block
229 |
230 | **Preparation**
231 | * Virtual bell: to stop the rounds
232 | * post-its
233 | * Pens and Sheets with questions on them
234 | * Tixu/Pins to hang up the sheets after the World Cafe
235 | * Think of 3-4 questions that open up a discussion, e.g.
236 | * What are opportunities created through content mining?
237 | * What are barriers to content mining?
238 | * How can content mining help you?
239 |
240 | **Timeline**
241 | 1. Explain the next session:
242 | * “Who knows what a world cafe is?”
243 | * Art of Hosting method
244 | * tell the number of rounds and length
245 | * tell about meta harvest if done at the end
246 | * tell the questions to discuss
247 | * document on sheets offered by facilitators
248 | * groups should get straight into the topics!
249 | * separate into groups of 3-5 people
250 | * ask for help to re-structure the room if necessary
251 | * start the room setup: hand out pens and sheets
252 | * Sum up: get in your round, discuss and document things
253 | 2. Get in groups.
254 | 3. Discuss and document first question
255 | * Ask if everyone has a sheet and 2-3 pens
256 | 4. Warn one minute before the end
257 | 5. Ring virtual bell after round ended
258 | 6. Group change: one person stays (each table decides for itself who does what; if a meta-harvest is used, the person who stays should report at the end in the big circle), everyone else should look for a totally new group
259 | 7. Repeat steps 3 to 6 until the last question has been discussed
260 | 8. Collect sheets and pin them on the wall
261 | 9. Circle (optional)
262 | * The Circle sits all people in a circle to start a meta-harvest of the world cafe rounds.
263 | * In this the main topics of the groups get presented and discussed.
264 | * From every group at least one person should speak about what happened and what the main points were.
265 | * A new sheet should be prepared for this.
266 | * This can take from 10 to 30 minutes and even more, if people have something to tell.
267 | * if there is not enough time to allow everyone to speak, just ask one person per table to present the issues
268 |
269 | **Possible Problems**
270 | * not enough participants: reduce the number of groups to the smallest number necessary to get 3-5 people together.
271 | * too many participants: scale up the group size to a maximum of 6 people. Look for additional tables. Have more sheets and pens prepared than you expect to need.
272 |
273 | ### Examples
274 | **This will come later**
275 |
276 | * offer some examples of slots in a workshop in general and for CM in specific. Which formats are available?
277 | * how do facilitators organize their activities during the training? notes, laptop docs??
278 | * narrative: offer information how to create a strong and compelling narrative in your training. this is a very crucial point
279 | * link to older training repos
280 |
281 | ## 4. Preparation
282 |
283 | ### Room
284 | * Pin sheet with bit.ly shorturl to github repo on the main wall
285 | * Pin sheet with most important steps for newbies on the wall
286 | * Check if WIFI is available
287 | * Beamer: check resolution
288 | * check for plugs
289 | * check for additional laptops for users who have none, or on whose machines the software does not work
290 |
291 | ### Software / VM
292 | * if you work with data, check if it is okay to share it with the participants (privacy, copyright)
293 | * think about different system requirements: OS, CPU power, internet bandwidth, storage, RAM, conflicts with pre-installed packages
294 | * offer a sandbox (optional): for their first steps, before getting deeper into the software (e.g. creating a folder, opening a file, printing your first fact, etc.).
295 | * document code well
296 | * offer data examples (optional)
297 | * offer code examples (optional)
298 | * comment code so people understand processes
299 | * tell when computation may take a while
300 | * explain errors and how to handle them
301 | * give them information how to solve problems themselves
302 | * prepare a data set that can easily be reset after participants have changed it
303 | * test your system: on different operating systems, on different hardware
304 | * when you test out your concept
305 | * when your software is finished
306 |
307 | ### Online
308 | * GitHub repository
309 | * wordpress event page [.md](LINK)
310 | * Tweet about it
311 |
312 | ### Disseminate
313 | Share the content with the community, the hosts and associated people.
314 |
315 | ### Follow Ups
316 | This is a very important point: what do you want the participants to follow up on and take home from the training? Offer some (not more than 3!) concrete ways to follow up with the project or with content mining after the training. It is also very important to get in touch with the most motivated people and find out whether they are interested in doing more around the topic and the project.
317 |
318 | Here are some examples:
319 | * Make Pull Request with own hacks/code
320 | * Communication: Email, Twitter, Mailinglist
321 | * Discourse
322 | * Invite to another training or related event
323 | * Write direct email with specific topic
324 |
325 | **Evaluation**
326 | * work out [evaluation](LINK)
327 |
328 | ### Create Training Materials
329 | * think about color-blindness
330 |
331 | #### Slides
332 | Some basic hints for creating and sharing slides
333 |
334 | * avoid lists unless the content is obviously a list
335 | * use a big enough font size
336 | * use high-contrast colors to make the slides easier to read
337 | * keep in mind that the beamer may have a lower resolution
338 | * use a strong visual approach
339 | * Slideshare is the preferred way to share your slides
340 |
341 | #### Screencasts
342 | Depending on your operating system, you can/should use:
343 | * Ubuntu/Linux: RecordMyDesktop
344 | * MacOS:
345 | * Windows:
346 |
347 | #### Hand Outs
348 | * Add a bit.ly to the overall materials
349 |
350 |
351 | #### GitHub repository
352 | * offer participants ways to get to the necessary level in advance: install sw, learn stuff, read things
353 |
354 | ## 5. On Spot
355 |
356 | * Check WIFI connection
357 | * Disconnect interactive applications like Skype, Slack, Tweetdeck etc. so incoming messages don't disturb the sessions.
358 | * Open necessary folders, slides, applications, notes and browser tabs
359 | * mention follow ups, especially evaluation
360 | * mention online materials, especially the github repository
361 | * Don't pressure yourself and the participants: trainings must be easy going
362 | * Ask whether people are held back from implementing what they learned by things outside their control
363 | * check for plugs
364 |
365 | ### Dealing with problems
366 |
367 | * Offer advice for offline alternatives: Art of Hosting, always prepare something!
368 | * Offer hints for testing the software, the data and the slides for readiness
369 | * Social interaction: offer some basic methods and concrete examples for social interactions. Give hints for common problems (people not interested, people start to fight, people blame you for failing/weak training, etc.)
370 | * Which information needs to be available all the time, especially for newly arriving people?
371 | * Think through what happens if far more or far fewer people arrive: will the sessions still work? what could help, what do you have to change?
372 | * How to treat racism, sexism, etc. Think about this when inviting people in advance and put focus on this
373 | * Work out alternative strategies for 3 and for 50 persons
374 |
375 | ## 6. Follow Up
376 |
377 | See also the follow-up notes in the "Preparation" section above.
378 |
379 | **Participants**
380 | * Send out Email with follow ups and evaluation form
381 |
382 | **Hosts**
383 | * get some feedback from them
384 |
385 | **Online**
386 | * write blog post
387 | * social media
388 |
389 | **Facilitators**
390 | * Document lessons learned in [doc.md](LINK).
391 |
392 | ## 7. Lessons Learned
393 |
394 | This list describes common problems that may occur and how they can be solved.
395 |
396 | | PROBLEMS | LESSONS LEARNED |
397 | |----------|-----------------|
398 | | The central message (vertical integration, ease of access and scalability, sectioning of papers, use of supplementary information) does not come across. | Put focus on unique points of interaction (sHTML, ctree, results) and not the technical details. |
399 | | The use case demonstrated is not really appropriate for audience. | Start with powerful demo first, then go into technical details |
400 | | Too much content in too little time, not enough time for teaching/delivering the key message | Start low but have more in-depth stuff available. |
401 | | Missing narrative, difficult transitions between sessions | Change preparation time allocation from 80% tech/ 20% storyline to 50/50. |
402 | | Early technological problems (keyboard locales, VM on windows) lose large parts of the group, and it is difficult to recover from there. | Have backup plans and material: reserve more time for error handling and if something does not work. Define alternative tasks while solving problems. |
403 | | Getting started with the VM could be problematic, especially for newcomers. | People should not have to worry about the CLI other than copying the relevant commands. Everything should be within one folder, which is also the starting folder of the command line. |
404 | | Missing central documentation to direct people to. | Prepare a github repo and an etherpad. |
405 | | Long URLs are difficult to type. | When URLs are used, shorten them with bit.ly |
406 | | People start wandering off on their own e.g. when some are still stuck at installing, and others already completed the task. | Define clearer “group stages” (Now we’re installing for 5 min; then we’ll begin with notebook; play around with facts for X min; then talk about results) |
407 | | When a session loses focus, it is difficult to catch people's attention again and pull them back together. | Try to provide "a social solution for a technical problem" - clear statements what to do if something fails; have people group up in teams of two with different OS and experience level, so that they move more together. |
408 |
--------------------------------------------------------------------------------
/training-guidelines/session-formats.md:
--------------------------------------------------------------------------------
1 | # Teaching resources
2 |
3 | This contains a description of the methods used in the workshop as well as links to helpful external teaching resources.
4 |
5 | 1. [Presentation](#presentation)
6 |
7 | 2. [World Cafe](#world-cafe)
8 |
9 | 3. [Hands on](#hands-on)
10 |
11 | 4. [Hacking session](#hacking-session)
12 |
13 |
14 | ### Presentation
15 |
16 | ##### What is it about?
17 |
18 | ##### When can I use it?
19 |
20 | ##### How to setup / frame it
21 |
22 | ##### resources needed
23 |
24 | ##### min / max time allocation
25 |
26 | ##### min / max nr. of participants
27 |
28 | ##### possible problems
29 |
30 | * bad beamer resolution:
31 | * no fineprint / high-res images
32 | * bad beamer colors:
33 | * clearer contrasting slide colors
34 |
35 |
36 | ##### Links
37 |
38 |
39 |
40 | ### World Cafe
41 |
42 | ##### What is it about?
43 |
44 | * Connect with the people in a meaningful way
45 | * Let people interact on a very personal and direct level
46 | * Create a socially pleasant atmosphere towards the end so people have good memories of the session
47 | * Get straight into it, no bullshit around!
48 | * Discuss 3 questions in small groups (~5), sum up in large circle
49 |
50 | ##### When can I use it?
51 |
52 | * After people did some content mining, more towards the end of a workshop
53 | * useful to wrap up things and get a meta-perspective
54 |
55 | ##### How to setup / frame it
56 |
57 |
58 | **Before workshop**
59 |
60 | 1. Think of 3-4 questions that open up a discussion, e.g.
61 | * What are opportunities created through content mining?
62 | * What are barriers to content mining?
63 | * How can content mining help you?
64 | 1. write these questions down on flipchart
65 | 1. rearrange room / tables to allow for groups of 4-5 people to sit together
66 |
67 | **At workshop**
68 |
69 | 1. Explain the next session (2min)
70 | * “Who knows what a world cafe is?”
71 | * Art of Hosting Method
72 | * 3 rounds of 5min with one question for each.
73 | * Groups of 4-5 people
74 | * Discuss at a table. Start straight into it. 5min is not much, so don't waste time.
75 | * Write down your thoughts. For this you have pens and a sheet.
76 | * After the bell rings, one person stays and everyone else moves to another table. Each table decides for itself who does what.
77 | * At the end, after the 3 rounds, we will discuss the outcome in a big circle, where we are going to harvest on a meta level.
78 | * Sum Up: Get in your round, discuss the questions and document main points on your sheet.
79 | * “Questions?”
80 | 1. Set up space (1min)
81 | * “So, let’s start it. Please get together in groups of 4-5 people.”
82 | * give sheets and pens to groups
83 | * Everyone at a table, a sheet and 1-2 pens?
84 | 1. three rounds (5min + 1min moving):
85 | * Check the tables: "everything ok?"
86 | * signal after 4min: "1 minute left"
87 | * Finish with virtual bell
88 | * Switch tables
89 | 1. Switch to big circle (1min)
90 | * “Let’s rearrange the tables and chairs to one big circle.”
91 | * if there is not enough space for everyone, form a 2nd and 3rd row
92 | * Circle with Harvesting (10min)
93 | * explain it: everyone can go into the middle and speak about something he/she experienced. Meta level. We collect this on a separate sheet. The circle is the perfect form to bring things together.
94 | * if not enough time for all: ask one person from each table to summarize for meta-harvest after the three rounds.
95 |
96 | ##### resources needed
97 |
98 | * a bell - real or virtual to signal start/end of rounds
99 | * Flipchart pens and sheets
100 | * sticky tape
101 | * post-its
102 |
103 | ##### min / max time allocation
104 |
105 | * 45min
106 |
107 | ##### min / max nr. of participants
108 |
109 | ##### possible problems
110 |
111 | * too many participants (>20):
112 | * create additional tables for more rotation
113 | * not enough participants (<6):
114 | * do the rounds together
115 | * no one starts talking in the big circle: give quick own impression
116 |
117 | ##### Links
118 |
119 |
120 |
121 | ### Hands-on
122 |
123 | ##### What is it about?
124 |
125 | * let people do something by themselves and create individual results
126 | * let them learn in groups
127 | * learning by doing
128 | * show them the magic of content mining
129 |
130 |
131 | ##### When can I use it?
132 |
133 | * as often as possible
134 |
135 | ##### How to setup / frame it
136 |
137 | 1. let people form teams of two (1min)
138 | * condition: have at least one working laptop per two
139 |
140 | 1. make sure every team has the software and data (5-10min)
141 |
142 | 1. explain what's going to happen and what they will be able to do after the session (use case)
143 |
144 | 1. introduce them to the concept to learn by demoing an example
145 |
146 | 1. have them reproduce the example
147 |
148 | 1. repeat: what is the use case, what is the problem, how are we going to solve it?
149 |
150 | 1. go through code / explain raw data
151 |
152 | 1. provide some small task/quiz:
153 | * find information xyz
154 | * we have x, we want z, please do y
155 |
156 | 1. enable sandbox environment:
157 | * let people explore on their own
158 | * go around, check if help needed
159 |
160 | 1. summarize:
161 | * what have you found?
162 | * what were the problems?
163 | * did you solve them? yes, how? no, what barrier? - should be identified during sandboxing
164 | * discuss/compare outcomes
165 |
166 |
167 | ##### resources needed
168 |
169 | * laptops
170 | * code examples
171 | * data examples
172 |
173 |
174 | ##### min / max time allocation
175 |
176 | 45min - 90min
177 |
178 | ##### min / max nr. of participants
179 |
180 | 2 - 15
181 |
182 | ##### possible problems
183 |
184 | * high participant / facilitator ratio:
185 | * ask advanced participants to help others once they have finished
186 |
187 | ##### Links
188 |
189 |
190 |
191 | ### Hacking session
192 |
193 | ##### What is it about?
194 |
195 | ##### When can I use it?
196 |
197 | ##### How to setup / frame it
198 |
199 | ##### resources needed
200 |
201 | ##### min / max time allocation
202 |
203 | ##### min / max nr. of participants
204 |
205 | ##### possible problems
206 |
207 | ##### Links
208 |
209 |
210 |
211 | ## additional facilitator resources
212 |
213 |
214 |
--------------------------------------------------------------------------------
/training-guidelines/workflow.md:
--------------------------------------------------------------------------------
1 | # General workflow
2 |
3 | This should give facilitators and organizers some orientation on which steps to perform when. The **days ahead** estimations roughly indicate when this process should start, since there are dependencies later on.
4 |
5 | The major points can be raised as personal issues for the facilitators/organizers, with the subpoints as issue notes. This helps keeping track of progress. Also, each major point should be discussed directly between facilitators/organizers at least once.
6 |
7 |
8 | 1. (**4-8 weeks ahead**): Logistics
9 | - [ ] Liaise with local org on basic info, decide on:
10 | - [ ] Date, Venue
11 | - [ ] Team
12 | - [ ] Hashtag
13 | - [ ] Duration and timeframe (start - end, lunch breaks)
14 | - [ ] check local tech situation (internet speed, beamer,...)
15 | - [ ] Define target audience
16 | - [ ] Find place for social event after workshop (bar, restaurant)
17 | - [ ] start with PR (together with host?):
18 | - [ ] How?
19 | - [ ] When?
20 | - [ ] Where?
21 | - [ ] Partners: Media, Organizers, Domain specific, institutions
22 | - [ ] Send to @GrahamSteel to create event page on contentmine.org
23 | 1. (**4-6 weeks ahead**): Framing
24 | - [ ] Think/talk about what people want to learn, check if any of our major use cases applies
25 | - [ ] Think about requirements: skills, material, preparation
26 | - [ ] Think about participant preparation: what people need to do in advance
27 | 1. (**4-6 weeks ahead, optional**): Guest speaker
28 | - [ ] ask for attendance
29 | - [ ] ask whether the date works for them
30 | - [ ] send out email:
31 | - [ ] event link
32 | - [ ] hard facts: date, location, meeting time, participants,
33 | - [ ] ask for support in advertisement: personally and institutionally
34 | - [ ] questions?
35 | - [ ] agenda
36 | - [ ] short description of event
37 | 1. (**3-5 weeks ahead**):
38 | - [ ] Identify knowledge gaps and uncertainties and create pre-training assessment.
39 | - [ ] **optional, 4 weeks ahead:** Send out pre-training assessment
40 | 1. (**28d ahead**) Set up repo via template [LINK]()
41 | - [ ] Short description
42 | - [ ] Time-Schedule
43 | - [ ] Trainers with links to personal pages or Twitter
44 | - [ ] Preparatory material: links to learn in advance
45 | - [ ] Travelling information and directions
46 | 1. (**18d ahead**, optional) Remind participants of pre-training assessment
47 | 1. (**14d ahead**) Draft session schedule
48 | - [ ] Choose modules according to target audience after closing pre-training assessment. We have three module categories
49 | - A: Introduction sessions
50 | - B: application of CMtools to do content mining
51 | - C: more hacky sessions, working together on user problems, downstream analysis
52 | - The session schedule should include short breaks after each session, and possibly time and space for a social event in the evening.
53 | - think about: framing, target audience, purpose for project and participants
54 | - discuss what should not / can not be done and why
55 | - potential technical issues -> prepare solution (additional documentation)
56 | - potential content questions / problems -> prepare answers
57 | - offer some social event in the evening
58 | - follow-ups
59 | 1. (**14d ahead**): Update repo
60 | - [ ] schedule
61 | - [ ] pre-requisites
62 | - [ ] downloads / installations / datasets / tech requirements
63 | - [ ] Ask everyone involved in the workshop preparation for feedback and corrections on the facts mentioned in the event invitation.
64 | 1. (**10-14d ahead**) Prepare materials:
65 | - [ ] prepare pad
66 | - [ ] workflow
67 | - [ ] commands
68 | - [ ] links
69 | - [ ] bit.ly for github-repo
70 | - [ ] check technical readiness:
71 | - [ ] canary-workspace
72 | - [ ] jupyter-notebook
73 | - [ ] if using VM prepare sticks
74 | - [ ] readable by Mac/Windows/Linux
75 | - [ ] delocalize VM (language, host system)
76 | - [ ] check content readiness:
77 | - [ ] training github pages repo
78 | - [ ] slides up to date
79 | - [ ] adapted to local audience
80 | - [ ] dataset compiled
81 | - [ ] personal notes
82 | - [ ] handouts / posters / flipchart workshop sheets
83 | - [ ] if more than one facilitator: assign roles
84 | - [ ] who does facilitating?
85 | - [ ] who does tech support?
86 | 1. (**5d ahead**) Prepare follow ups
87 | - Set up post-training evaluation form
88 | - Set up email for participants: post-training survey, follow ups
89 | 1. (**3d ahead**) Connect with participants:
90 | - [ ] welcome email with installation instructions and material
91 | - [ ] event link
92 | - [ ] hard facts again: date, location
93 | - [ ] offer single point of contact
94 | 1. Do workshop:
95 | - [ ] Coordinate with local support:
96 | - [ ] technological support
97 | - [ ] some refreshments (coffee, snacks)
98 | - [ ] check beamer resolution, adjust screen for visibility
99 | - [ ] Ask for permission -> Take photos
100 | - [ ] Tweet and encourage others to
101 | - prepare locality (beamer, seating arrangement, material for harvesting)
102 | - [ ] Let participants group up (2-3-4), target: 4-6 groups
103 | - [ ] Ensure most (if not all) participants have required software installed
104 | - [ ] Introduce the pad
105 | - [ ] ask about skill level and test crucial skills (shell, Python, statistics, etc.) with easy steps of gradually increasing difficulty, to find out who your audience is
106 | - [ ] go through scheduled sessions
107 | 1. (**1d later**) Send out email to participants with:
108 | - [ ] feedback forms - read the [guidance](https://github.com/ContentMine/workshop-resources/blob/master/training-guidelines/evaluation-assessment.md#evaluation) on evaluation, copy and edit [template feedback form](https://docs.google.com/forms/d/1nCaM6_sA-clrWDoNzdua5Luxg8bV7dcBMj82pERIIpQ/edit).
109 | - [ ] github-repo
110 | - [ ] follow-ups
111 | 1. (**1d later** ) Document the workshop
112 | - [ ] document what you have done compared to the initial plan. This can then be used to interpret the evaluations properly
113 | - [ ] Review: document the lessons learned in a separate doc
114 | 1. (**3d later, optional**) create additional media
115 | - [ ] Blog post with images
116 | - [ ] add photos to Flickr, videos to youtube
117 | 1. (**3d later**) Get feedback from host
118 | 1. (**asap**) update materials:
119 | - [ ] repo with content generated during workshop
120 | - [ ] Add latest VM to [VM repo](https://github.com/ContentMine/workshops)
121 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/README.md:
--------------------------------------------------------------------------------
1 | ## About ContentMine
2 |
3 | ### Contents
4 | * what is content mining in general, what is CM specifically?
5 | * Describe the different types of content that might be mined
6 | * Describe the standard procedure for mining (Crawl -> Scrape -> Extract)
7 | * Describe some of the legal and technical barriers to content mining
8 | * Describe 2-3 examples of content mining.
9 | * Describe what we can offer
10 |
11 | ### Learning Goals
12 |
13 | * what is the vision and purpose of content mining
14 | * get an overview about the products of ContentMine
15 | * how to relate content mining to their own work
16 |
17 | ### Activities and Methods
18 |
19 | * presentation
20 | * demo of CM product
21 | * other demos
22 | * Q&A
23 |
24 | ### Duration
25 |
26 | * 45min
27 | * 20min introduction presentation
28 | * 25min discussion of participants' interests/visions
29 |
30 | ### Prerequisites
31 |
32 | * none
33 |
34 | ### Resources
35 |
36 | * slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-About-ContentMine/about-contentmine.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-About-ContentMine/about-contentmine.pdf))
37 | * online demos for amusement
38 | - [IUCN redlist](iucn/README.md)
39 | - [Fact of the Day](fotd/README.md)
40 | - [Chemistry](chemistry/README.md)
41 | - [bubbles](bubbles/README.md)
42 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/about-contentmine.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/about-contentmine.odp
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/about-contentmine.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/about-contentmine.pdf
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/README.md:
--------------------------------------------------------------------------------
1 | # Bubbles
2 |
3 | A modern approach to showing document structure.
4 |
5 | ## online demo
6 |
7 | see http://bubbles.contentmine.org .
8 |
14 |
15 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles0.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles1.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles3.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles4.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/bubbles/bubbles5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/bubbles/bubbles5.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/README.md:
--------------------------------------------------------------------------------
1 | # ChemicalTagger
2 |
3 | A tool that completely parses chemical text and interprets synthetic procedures.
4 | ## online demo
5 |
6 | see http://chemicaltagger.ch.cam.ac.uk .
7 |
8 | * before annotation:
9 |
10 |
11 | * immediate effect of annotation after clicking "Process Text"
12 |
13 |
14 | * showing just one
15 |
17 |
18 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/chemicaltagger0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger0.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/chemicaltagger1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger1.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/chemicaltagger2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger2.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/chemicaltagger3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger3.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/chemistry/chemicaltagger4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/chemistry/chemicaltagger4.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/fotd/README.md:
--------------------------------------------------------------------------------
1 | # Fact of the Day
2 |
3 | Every day the documents are searched for Facts, such as species or genes. There are some false positives, but on the whole it's quite fun. Where possible we link to an image from Wikipedia.
4 |
5 | ## online demo
6 |
7 | see http://fotd.contentmine.org . Each Fact points to the previous and the next. There are some places with a few blank days, but generally you can find some nice pictures.
8 |
11 |
12 |
13 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/fotd/fotd.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/fotd/fotd.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/fotd/fotd1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/fotd/fotd1.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/README.md:
--------------------------------------------------------------------------------
1 | # IUCN
2 |
3 | See http://iucn.org for general background
4 |
5 | ## online demo
6 |
7 | Facts are harvested daily and filtered against IUCN species; results are displayed at http://iucn.contentmine.org .
8 |
9 | * The landing page:
10 | 
11 |
12 | * we search for `ursus`, and note the bear species
13 |
14 | * When entered we get a list of species with `ursus`
15 |
16 |
17 | we'll select `Ursus maritimus`:
18 |
19 | * All the hits for "ursus": on the left, in textual context; on the right, the title of the article they were located in
20 |
21 | * and the actual article
22 |
23 |
24 |
25 |
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn1.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn2.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn3.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn4.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn4.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn5.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn5.png
--------------------------------------------------------------------------------
/training-modules/A-About-ContentMine/iucn/iucn6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-About-ContentMine/iucn/iucn6.png
--------------------------------------------------------------------------------
/training-modules/A-CM-Facts/README.md:
--------------------------------------------------------------------------------
1 | ## What is a ContentMine Fact
2 |
3 | ### Contents
4 |
5 | * introduce concept of fact from a technical perspective
6 | * walk through the complete pipeline of CM, from inputs to outputs
7 | * which steps are necessary in a content mining workflow
8 |
9 | ### Learning Goals
10 |
11 | * Understand the impact of crawlers on websites
12 | * Describe steps one can take to limit impact of crawling e.g. limit requests
13 |
14 | ### Activities and Methods
15 |
16 | * paper markup exercise
17 | * Hand-out and follow the [worksheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/manual_markup.pdf)
18 | * Provide 1-2 papers and ensure there is at least one copy of one paper per participant (paper links to be added here and to resource section)
19 | * step-by-step walkthrough: showing an example input, and what happens after CM is applied to it
20 |
21 | ### Duration
22 |
23 | * 45min
24 | * 15min introduction
25 | * 30min manual markup exercise
26 |
27 | ### Prerequisites
28 |
29 | * none
30 |
31 | ### Resources
32 |
33 | * readymade executable examples
34 | * slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/contentmine-facts.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/contentmine-facts.pdf))
35 | * Manual markup [worksheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-CM-Facts/manual_markup.pdf) (need at least one copy per two participants)
36 | * Manual markup papers (links to be added here)
37 |
38 | ### Watch for
39 |
40 | * country-specificity
41 |
--------------------------------------------------------------------------------
/training-modules/A-CM-Facts/contentmine-facts.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/contentmine-facts.odp
--------------------------------------------------------------------------------
/training-modules/A-CM-Facts/contentmine-facts.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/contentmine-facts.pdf
--------------------------------------------------------------------------------
/training-modules/A-CM-Facts/iucn_walkthrough.md:
--------------------------------------------------------------------------------
1 | # Walkthrough of interactive use of IUCN
2 |
3 | This file describes the current IUCN demonstration on contentmine.org
4 |
5 | **WARNING**: this may be unstable against changes to the actual implementation/service
6 |
--------------------------------------------------------------------------------
/training-modules/A-CM-Facts/manual_markup.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-CM-Facts/manual_markup.pdf
--------------------------------------------------------------------------------
/training-modules/A-Canary/README.md:
--------------------------------------------------------------------------------
1 | ## Canary
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * what is the purpose of canary?
8 | * how does it work?
9 | * how to create own workspace
10 | * how to input URLs
11 | * how to execute queries
12 | * how to extract facts
13 | * how to access facts
14 |
15 | ### Activities and Methods
16 |
17 | * present a selection of possible applications, including network analysis, indexing and search, re-using extracted facts
18 | * guided software development
19 | * pair programming
20 |
21 | ### Duration
22 |
23 | * 45min
24 | * 5min intro
25 | * 5min create own workspace, explain options
26 | * 10min execute query: finding papers, removing failures, run plugins (do it with a narrow query)
27 | * 5min error handling
28 | * 10min exploring facts: context, source, options
29 | * 10min own exploration: show documentation, then hands on, while answering questions
30 |
31 | ### Prerequisites
32 |
33 | * CM tools and workflow
34 | * basic programming skills helpful
35 | * version control helpful
36 |
37 | ### Resources
38 |
39 | * canary.contentmine.org
40 |
41 | ### Watch for
42 |
43 |
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/README.md:
--------------------------------------------------------------------------------
1 | ## Legal Aspects and Responsible Mining
2 |
3 | ### Contents
4 |
5 | * Under what circumstances is it legal to content mine, and how?
6 | * copyright
7 | * publisher licences
8 | * country-specific legislation
9 | * legal areas that impact content mining
10 | * specify licensing /status of resulting facts
11 | * introduce concept of fact from a legal perspective
12 |
13 | ### Learning Goals
14 |
15 | * learn about the impact of copyright on the possibilities of CM
16 | * learn the differences between licences
17 | * learn about current legal developments regarding text and data mining, including updates on H2020, EU copyright revisions
18 | * learn about barriers erected by publishers, including restrictive licences, API ToS
19 | * learn about current consultations and plans in the EU around TDM
20 | * learn about copyright restrictions (or exceptions) in your jurisdiction
21 | * Understand the impact of crawlers on websites
22 | * Describe steps one can take to limit impact of crawling e.g. limit requests
23 |
24 | ### Activities and Methods
25 |
26 | * warm-up: legal quiz
27 | * presentation
28 | * Q&A
29 |
30 | ### Duration
31 |
32 | * 50min
33 | * 10min presentation about legal aspects
34 | * 15min quiz + discussion of answers
35 | * 10min wrap up and questions
36 | * 10min presentation about responsible mining
37 | * 5min discussion / questions
38 |
39 | ### Prerequisites
40 |
41 | * none
42 |
43 | ### Resources
44 |
45 | * Up-to-date [legal quiz handouts](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal_true_false.pdf), one per participant.
46 | * Legal quiz [answer sheet](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/true_false_answers.pdf) for trainer
47 | * Slides ([.odp](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal-responsible.odp) and [PDF](https://github.com/ContentMine/workshop-resources/blob/master/training-modules/A-Legal-Responsible/legal-responsible.pdf))
48 | * [Article on Responsible Content Mining](http://contentmine.org/wp-content/uploads/2015/06/responsible-content-mining-1.pdf)
49 | * [Charles Oppenheim slides at JISC event](jisctdm.pptx)
50 |
51 |
52 | ### Watch for
53 |
54 | * country-specificity
55 |
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/jisctdm.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/jisctdm.pdf
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/jisctdm.pptx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/jisctdm.pptx
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/legal-responsible.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal-responsible.odp
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/legal-responsible.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal-responsible.pdf
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/legal_true_false.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/legal_true_false.pdf
--------------------------------------------------------------------------------
/training-modules/A-Legal-Responsible/true_false_answers.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/A-Legal-Responsible/true_false_answers.pdf
--------------------------------------------------------------------------------
/training-modules/A-Participate-Contribute/README.md:
--------------------------------------------------------------------------------
1 | ## Participate and Contribute
2 |
3 | ### Contents
4 |
5 | * CMunities
6 | * discourse
7 | * github
8 | * Cambridge-Meetups
9 |
10 | ### Learning Goals
11 |
12 | * how to continue to participate
13 | * how to contribute to code
14 | * learn about version control and how to contribute to CM (scrapers, code, plugins) but also through issues and bug reports
15 |
16 | ### Activities and Methods
17 |
18 | * raise issue together
19 | * exemplary coding and contribution to github-repo, e.g. workshop or setup extra test repo
20 | * hands-on work on own code/repo
21 |
22 | ### Duration
23 |
24 | * 15min
25 | * 10min presentation
26 | * 5min questions
27 |
28 | ### Prerequisites
29 |
30 | * github-account
31 | * basic command line knowledge
32 |
33 | ### Resources
34 |
35 | * slides [LINK]
36 |
37 | ### Watch for
38 |
39 |
--------------------------------------------------------------------------------
/training-modules/B-Architecture/README.md:
--------------------------------------------------------------------------------
1 | ## Architecture
2 |
3 | ### Contents
4 |
5 | * Overview of tool chain
6 | * What happens where, where can I jump in?
7 |
8 | ### Learning Goals
9 |
10 | * get a first intuition about the building blocks of ContentMine
11 | * learn about the different steps of content mining
12 | * learn about the source of papers and facts
13 | * learn about intermediaries like scholarly.html and CProject/CTree
14 | * learn how this leads to the outcome of content mining: a fact in context
15 |
16 | ### Activities and Methods
17 |
18 | * opening, reading and interpreting the intermediaries of ContentMine
19 |
20 | ### Duration
21 |
22 | * 45min
23 | * 7min presentation
24 | * 7min looking at publisher APIs and scope of data sources
25 | * 7min looking at CProject/CTree
26 | * 7min looking at a fulltext.xml / scholarly.html
27 | * 7min looking at facts in context
28 | * 10min reserve for questions
29 |
30 | ### Prerequisites
31 |
32 | * none
33 |
34 | ### Resources
35 |
36 | * slides [LINK]
37 | * a precompiled dataset with ami-plugins run
38 | * links and documentation to 2-3 publisher APIs (EUPMC, ...)
39 | * an example CProject/CTree
40 | * an example fact + surrounding context
41 |
42 | ### Watch for
43 |
44 |
--------------------------------------------------------------------------------
/training-modules/B-Architecture/contentmine-architecture.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Architecture/contentmine-architecture.odp
--------------------------------------------------------------------------------
/training-modules/B-Architecture/contentmine-architecture.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Architecture/contentmine-architecture.pdf
--------------------------------------------------------------------------------
/training-modules/B-Building-corpus/README.md:
--------------------------------------------------------------------------------
1 | ## Building a corpus with getpapers
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * learn about the challenges posed by different publisher APIs
8 | * learn how to use getpapers
9 | * learn how to create your own collection of papers
10 |
11 | ### Activities and Methods
12 |
13 | * write search queries
14 | * compare website source with scraper definition and subsequent scraping result
15 | * handling errors
16 | * look at found results
17 |
18 | ### Duration
19 |
20 | * 50min
21 | * 10min presentation
22 | * 5min looking at API
23 | * 5min demo of getpapers-query
24 | * 20min hands-on of querying
25 | * 10min reserve for error handling
26 |
27 | ### Prerequisites
28 |
29 | * command line
30 |
31 | ### Resources
32 |
33 | * [Tutorial on getpapers](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/getpapers)
34 | * [Screencast on getpapers](https://youtu.be/H2ESPjihnDA)
35 |
36 | ### Watch for
37 |
38 | * too many big queries crashing the connection
39 |
--------------------------------------------------------------------------------
/training-modules/B-Fact-Extraction/README.md:
--------------------------------------------------------------------------------
1 | ## Extracting Facts with ami-plugins
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * get an overview of possible plugins and what outcome to expect
8 | * run ami-plugins on your data
9 | * learn how to interpret results.xml
10 |
11 | ### Activities and Methods
12 |
13 | * demonstration of ami-plugins
14 | * discussion of use-cases
15 |
16 | ### Prerequisites
17 |
18 | * command line
19 |
20 | ### Duration
21 |
22 | * 60min
23 | * 10min presentation
24 | * 5min discussion of interesting use case
25 | * 5min demo of running the chosen ami-plugin
26 | * 15min hands on for everyone
27 | * 10min reserve for error handling
28 | * 10min looking at a results.xml in-depth
29 | * 5min reserve for questions
30 |
31 | ### Resources
32 |
33 | * [Tutorial on ami](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/ami)
34 | * prepared, normalized dataset
35 |
36 | ### Watch for
37 |
38 | * community-specific use case
39 |
--------------------------------------------------------------------------------
/training-modules/B-Normalization/README.md:
--------------------------------------------------------------------------------
1 | ## Normalization
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * understand the difficulties of working with unstructured information
8 | * understand why a normalized version is desirable and what scholarly HTML is
9 | * understand the process of normalization within CM
10 |
11 | ### Activities and Methods
12 |
13 | * demonstration of norma
14 | * compare one input document with an output document in detail
15 |
16 | ### Duration
17 |
18 | * 30min
19 | * 10min presentation
20 | * 5min hands on with norma
21 | * 5min buffer for problems
22 | * 10min discussion / questions
23 |
24 | ### Prerequisites
25 |
26 | * command line
27 |
28 | ### Resources
29 |
30 | * [Tutorial on norma](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/norma)
31 |
32 | ### Watch for
33 |
34 |
--------------------------------------------------------------------------------
/training-modules/B-Scraping/README.md:
--------------------------------------------------------------------------------
1 | ## Scraping
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * understand how scraping works in general, including the structure of a website
8 | * how to identify specific data
9 | * how to reorganize it into a reusable format
10 | * learn about the challenges posed by different publishing formats
11 | * learn how to use quickscrape with your own URLs
12 |
13 | ### Activities and Methods
14 |
15 | * compare website source with scraper definition and subsequent scraping result
16 | * write a scraper
17 |
18 | ### Duration
19 |
20 | * 45min
21 | * 10min presentation
22 | * 5min demo of quickscrape
23 | * 20min hands on example, comparing source-html with scraper-definition and output
24 | * 10min reserve for questions
25 |
26 | ### Prerequisites
27 |
28 | * command line
29 |
30 | ### Resources
31 |
32 | * readymade executable examples
33 | * two or three tested links with example output
34 | * slides [LINK]
35 |
36 | ### Watch for
37 |
38 |
--------------------------------------------------------------------------------
/training-modules/B-Scraping/scraping.odp:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Scraping/scraping.odp
--------------------------------------------------------------------------------
/training-modules/B-Scraping/scraping.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ContentMine/workshop-resources/f289d7ee6384ab8f7c4ae0a103407891f3bb4f70/training-modules/B-Scraping/scraping.pdf
--------------------------------------------------------------------------------
/training-modules/B-VM-Commandline/README.md:
--------------------------------------------------------------------------------
1 | ## How to work with ContentMine tools
2 |
3 | ### Contents
4 |
5 | How to set up the working environment of the Virtual Machine
6 |
7 | ### Learning Goals
8 |
9 | * how to install
10 | * how to get updates
11 | * basic steps on the command line
12 |
13 | ### Activities and Methods
14 |
15 | * installing software together
16 | * getting to know command line
17 |
18 | ### Duration
19 |
20 | * 45min
21 | * 10min installation
22 | * 5min reserve for error handling
23 | * 15min hands on example of command line
24 | * 10min showing how to keep software updated
25 | * 5min reserve for questions
26 |
27 | ### Prerequisites
28 |
29 | * command line
30 |
31 | ### Resources
32 |
33 | * readymade executable examples
34 | * links to stable versions of software
35 | * [Tutorial on VirtualMachines](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/vms)
36 | * [Tutorial on shell](https://github.com/ContentMine/workshop-resources/tree/master/software-tutorials/shell)
37 | * slides [LINK]
38 |
39 | ### Watch for
40 |
41 | * installation problems on different OS
42 |
--------------------------------------------------------------------------------
/training-modules/B-Working-with-Facts/README.md:
--------------------------------------------------------------------------------
1 | ## Working with ContentMine Facts
2 |
3 | ### Contents
4 |
5 | ### Learning Goals
6 |
7 | * show that real science can be done (e.g. supertree)
8 | * introduce ipython-notebooks
9 | * how to read the results.xml files
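A notebook cell for reading extracted facts could look like the sketch below. It is illustrative only: the inline `SAMPLE` document is a hypothetical stand-in for a real results.xml, and the element and attribute names (`results`, `result`, `pre`, `exact`, `post`) are assumptions about the file layout, so check them against an actual output file before reusing the code.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for one results.xml file. The element and attribute
# names are assumptions made for this illustration.
SAMPLE = """<results title="species">
  <result pre="infection with " exact="Zika virus" post=" was confirmed"/>
  <result pre="the vector " exact="Aedes aegypti" post=" is widespread"/>
  <result pre="spread of " exact="Zika virus" post=" continued"/>
</results>"""

root = ET.fromstring(SAMPLE)

# Count how often each matched term occurs in this document.
counts = Counter(result.get("exact") for result in root.iter("result"))

for term, n in counts.most_common():
    print(term, n)
```

Counting the matched terms per paper is a simple first step towards comparing facts across a whole corpus.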
10 |
11 | ### Activities and Methods
12 |
13 | * very hacky
14 |
15 | ### Prerequisites
16 |
17 | * command line
18 | * at least one other programming language
19 | * domain knowledge
20 |
21 | ### Duration
22 |
23 | * 60min
24 | * 5min introduce notebooks
25 | * 10min getting notebooks to run
26 | * 5min reserve for error handling
27 | * 30min going through notebook workflow
28 | * 10min discussion / questions
29 |
30 | ### Resources
31 |
32 | * prepared ipython notebook
33 | * [pyCProject](https://github.com/ContentMine/pyCProject/blob/master/source/readctree.py)
34 |
35 | ### Watch for
36 |
37 | * community-specific use case
38 |
--------------------------------------------------------------------------------
/training-modules/C-Own-Usecase/README.md:
--------------------------------------------------------------------------------
1 | ## How to apply ContentMine tools to my own problem
2 |
3 | ### Contents
4 |
5 | * very hacky
6 | * interactive
7 | * explorative
8 |
9 | ### Learning Goals
10 |
11 | * how to use the CM tool chain to solve one self-defined problem
12 | * develop a custom application/tool/script/data pipeline to solve a problem
13 |
14 | ### Activities and Methods
15 |
16 | * present a selection of possible applications, including network analysis, indexing and search, re-using extracted facts
17 | * guided software development
18 | * pair programming
19 |
20 | ### Duration
21 |
22 | * 60min - 180min
23 |
24 | ### Prerequisites
25 |
26 | * CM tools and workflow
27 | * basic programming skills helpful
28 | * version control helpful
29 |
30 | ### Resources
31 |
32 | ### Watch for
33 |
34 |
--------------------------------------------------------------------------------
/training-modules/C-Regex/README.md:
--------------------------------------------------------------------------------
1 | ## Regex in ami
2 |
3 | ### Contents
4 |
5 | * what is a regex, why is it useful for me?
6 |
7 | ### Learning Goals
8 |
9 | * learn how to create your own regex queries
10 | * be able to interpret why something is not as expected
11 | * how to input a regex into ami and run it
12 | * work with results
13 |
14 | ### Activities and Methods
15 |
16 | * working on examples
17 | * guided hacking
18 | * pair programming
19 |
20 | ### Duration
21 |
22 | * 60min - 90min
23 |
24 | ### Prerequisites
25 |
26 | * CM tools and workflow
27 | * basic programming skills helpful
28 | * version control helpful
29 |
32 | ### Resources
33 |
34 | * http://regexr.com/
35 | * an example dataset
36 |
37 | ### Watch for
38 |
39 |
--------------------------------------------------------------------------------
/training-modules/C-Writing-scrapers/README.md:
--------------------------------------------------------------------------------
1 | ## Writing scraper definitions
2 |
3 | ### Contents
4 |
5 | * What is a scraper definition?
6 | * What do I need it for?
7 | * How can I build one?
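The idea behind a scraper definition can be sketched in a few lines of Python: a mapping from field names to selectors, applied to a page. This is a simplified illustration and not quickscrape's actual definition format. The page snippet, the `DEFINITION` layout and the `scrape` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet standing in for a journal article page.
PAGE = """<html><head>
<meta name="citation_title" content="An example article"/>
<meta name="citation_author" content="A. Author"/>
<meta name="citation_author" content="B. Author"/>
</head><body/></html>"""

# Hypothetical definition: each field names a selector for the elements to
# find and the attribute that holds the value.
DEFINITION = {
    "title": {"selector": ".//meta[@name='citation_title']", "attribute": "content"},
    "authors": {"selector": ".//meta[@name='citation_author']", "attribute": "content"},
}


def scrape(page, definition):
    """Apply every selector in the definition and collect the attribute values."""
    root = ET.fromstring(page)
    return {
        field: [el.get(rule["attribute"]) for el in root.findall(rule["selector"])]
        for field, rule in definition.items()
    }


result = scrape(PAGE, DEFINITION)
print(result)
```

Keeping the selectors in data rather than in code is what makes it possible to support a new journal by writing a new definition instead of a new program.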
8 |
9 | ### Learning Goals
10 |
11 | * learn how to create your own scraper definitions
12 | * be able to interpret why something is not as expected
13 | * how to add a new definition to quickscrape and run it
14 | * get an intuition for results
15 |
16 | ### Activities and Methods
17 |
18 | * working on examples
19 | * guided hacking
20 | * pair programming
21 |
22 | ### Duration
23 |
24 | * 60min - 120min
25 |
26 | ### Prerequisites
27 |
28 | * CM tools and workflow
29 | * basic programming skills helpful
30 | * version control helpful
31 |
34 | ### Resources
35 |
36 | * an example dataset
37 |
38 | ### Watch for
39 |
40 |
--------------------------------------------------------------------------------
/training-modules/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 | # Content of this folder
4 |
5 | This folder contains learning modules which can be combined in any desired way by workshop facilitators.
6 | Each module contains a description of learning goals, links to additional resources like slides or software tutorials, and a detailed description of steps.
7 |
8 | Modules beginning with "A" contain a general introduction to content mining and the ContentMine project. They include an overview of legal and practical issues, as well as an introduction to core concepts of content mining.
9 |
10 | Modules beginning with "B" contain practical examples and demonstrations of ContentMine tools, and are designed as more interactive sessions.
11 |
12 | Modules beginning with "C" build on what has been learned in "B". They are designed as hacky, guided applications of ContentMine tools, and enable participants to explore and work on their own use cases with the help of our facilitators.
13 |
14 |
15 | | Module | Description | Estimated length |
16 | |--------|------------|-------------------|
17 | | A | **Introduction sessions** | 195min |
18 | | A1 | [What is content mining in general, what does ContentMine do specifically?](A-About-ContentMine) | 45min |
19 | | A2 | [Under what circumstances is it legal to content mine, and how?](A-Legal-Responsible) | 45min |
20 | | A3 | [What is a fact, where does it come from, what can I do with it?](A-CM-Facts) | 45min |
21 | | A4 | [Participate and contribute to the ContentMine](A-Participate-Contribute) | 15min |
22 | | A5 | [Getting an easy start with Canary](A-Canary) | 45min |
23 | |----|------------------------------------------------|------|
24 | | B | **Practical sessions** | 335min |
25 | | B1 | [Overview of tool chain: What happens how, where can I jump in?](B-Architecture) | 45min |
26 | | B2 | [How to work with ContentMine tools](B-VM-Commandline) | 45min |
27 | | B3 | [Building a corpus with getpapers](B-Building-corpus) | 50min |
28 | | B4 | [Scraping with quickscrape](B-Scraping) | 45min |
29 | | B5 | [Normalizing the literature](B-Normalization) | 30min |
30 | | B6 | [Extracting facts with AMI](B-Fact-Extraction) | 60min |
31 | | B7 | [Putting facts in context with jupyter](B-Working-with-Facts) | 60min |
32 | |----|------------------------------------------------|------|
33 | | C | **Hacking sessions** | 180min - 390min |
34 | | C1 | [ami-regex](C-Regex) | 60min - 90min |
35 | | C2 | [Writing scraper definitions](C-Writing-scrapers) | 60min - 120min |
36 | | C3 | [Explore own use case](C-Own-Usecase) | 60min - 180min |
--------------------------------------------------------------------------------