├── .gitignore
├── README.textile
├── _config.yml
├── _layouts
│   └── default.html
├── budget.md
├── commercial.md
├── details.md
├── dni.md
├── fonts
│   ├── copse-regular-webfont.eot
│   ├── copse-regular-webfont.svg
│   ├── copse-regular-webfont.ttf
│   ├── copse-regular-webfont.woff
│   ├── quattrocentosans-bold-webfont.eot
│   ├── quattrocentosans-bold-webfont.svg
│   ├── quattrocentosans-bold-webfont.ttf
│   ├── quattrocentosans-bold-webfont.woff
│   ├── quattrocentosans-bolditalic-webfont.eot
│   ├── quattrocentosans-bolditalic-webfont.svg
│   ├── quattrocentosans-bolditalic-webfont.ttf
│   ├── quattrocentosans-bolditalic-webfont.woff
│   ├── quattrocentosans-italic-webfont.eot
│   ├── quattrocentosans-italic-webfont.svg
│   ├── quattrocentosans-italic-webfont.ttf
│   ├── quattrocentosans-italic-webfont.woff
│   ├── quattrocentosans-regular-webfont.eot
│   ├── quattrocentosans-regular-webfont.svg
│   ├── quattrocentosans-regular-webfont.ttf
│   └── quattrocentosans-regular-webfont.woff
├── images
│   ├── background.png
│   ├── body-background.png
│   ├── bullet.png
│   ├── hr.gif
│   └── octocat-logo.png
├── index.md
├── innovation.md
├── javascripts
│   └── main.js
├── kpi.md
├── opensource.md
├── overview.md
├── params.json
├── plan.md
├── related.md
├── roadmap.md
└── stylesheets
    ├── github-dark.css
    ├── normalize.css
    └── styles.css
/.gitignore:
--------------------------------------------------------------------------------
1 | *.bak
2 | _site/
3 |
--------------------------------------------------------------------------------
/README.textile:
--------------------------------------------------------------------------------
1 | h1. Pages about the IPTC EXTRA project
2 |
3 | The HTML rendition of the pages is available at http://iptc.github.io/extra/
4 |
5 |
--------------------------------------------------------------------------------
/_config.yml:
--------------------------------------------------------------------------------
1 | name: IPTC EXTRA
2 | markdown: kramdown
3 |
--------------------------------------------------------------------------------
/_layouts/default.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
45 |
46 |
47 | {{ content }}
48 |
49 |
50 |
54 |
55 |
56 |
57 |
58 |
--------------------------------------------------------------------------------
/budget.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: IPTC EXTRA
4 | ---
5 | ## EXTRA Budget
6 |
7 | The main project cost is the labour required to build and distribute a solid working version of the EXTRA platform, to create two initial rule sets that demonstrate the power and potential of the platform, and to put in place the core technical and marketing foundation for a sustainable open source project. The grant of 50,000 EUR will fund the work of:
8 |
9 | * Developers
10 | * Linguists
11 | * Marketing
12 |
--------------------------------------------------------------------------------
/commercial.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: IPTC EXTRA
4 | ---
5 | ## Commercial Considerations
6 |
7 | The core EXTRA platform will be [open source](opensource.html) and will not be directly monetized by the IPTC.
8 |
9 | However, the EXTRA platform does create opportunities for monetization, including companies offering support services, hosting the EXTRA software as a service and integrating the EXTRA classification engine within commercial Content Management Systems.
10 |
11 | There is also an opportunity for the development and maintenance of the rule sets applied by the EXTRA platform, either as consulting services or as commercial offerings. In particular, there is the opportunity to create rule sets targeted at particular taxonomies and content sets, including beyond the general news domain covered by the IPTC’s Media Topics.
12 |
--------------------------------------------------------------------------------
/details.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: IPTC EXTRA
4 | ---
5 | ## EXTRA in Detail
6 |
7 | The aim of the EXTRA project is to build and freely distribute an initial version of EXTRA, an open source rules-based classifier. In addition, we plan to create two rule sets which make use of the EXTRA platform, which will also be free and open source. Finally, our goal is to put in place the core technical and marketing foundation to make EXTRA a sustainable open source project.
8 |
9 | To provide intelligent search, high-quality topical aggregations, subject-specific alerts and content analytics, many modern news publishers tag content items with relevant topics. To achieve scale and consistency in tagging operations, publishers often employ rule-based tagging software. Such software tags relevant topics by analyzing the text of a document using a set of human authored rules. For example, to identify the topic "Kate Middleton (Person)" a publisher might use the rule:
10 |
11 | > apply tag "Kate Middleton (Person)" if the document text contains any of the following phrases: "Duchess of Cambridge", "Catherine Middleton", "Catherine Elizabeth Middleton"
12 |
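The rule quoted above can be sketched in a few lines of Python. This is only an illustration of the idea, not the EXTRA rules language: the function names `matches_any_phrase` and `tag_document` are hypothetical, and case-insensitive substring matching is an assumption the rule itself does not state.

```python
# Illustrative sketch of the phrase rule quoted above.
# All names here are hypothetical; case-insensitive matching is an assumption.

KATE_TAG = "Kate Middleton (Person)"
KATE_PHRASES = [
    "Duchess of Cambridge",
    "Catherine Middleton",
    "Catherine Elizabeth Middleton",
]


def matches_any_phrase(text: str, phrases: list[str]) -> bool:
    """Return True if the text contains at least one of the phrases."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def tag_document(text: str) -> list[str]:
    """Apply the single example rule and return the tags that fired."""
    tags = []
    if matches_any_phrase(text, KATE_PHRASES):
        tags.append(KATE_TAG)
    return tags
```

Calling `tag_document("The Duchess of Cambridge opened a new school.")` would return a list containing the one tag, while a text with none of the phrases yields an empty list.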
13 | News publishers have invested an enormous amount of manual effort to create, manage, and maintain sets of these kinds of rules. For example, over the last fifteen years, The New York Times metadata services team has created a rule set containing over half a million manually-crafted rules.
14 |
15 | Creating and deploying such rule-sets requires significant investments in both costly software and specialized personnel. As such, only the largest publishers can afford to acquire and maintain such systems. A freely available open source rule-based information extraction and classification toolkit would – for the first time – put a powerful knowledge management tool into the hands of small-to-medium sized publishers and create a marketplace for the decades-long investment made by larger publishers in their rule-sets.
16 |
17 | For this reason, the International Press Telecommunications Council (IPTC) proposes to build EXTRA. EXTRA is a rules-based, open source, multilingual information extraction platform. Additionally, to make EXTRA immediately useful to the news publishing community, the IPTC further proposes to create two suites of rules for classifying news documents into the IPTC Media Topics Taxonomy, aimed at two of the languages supported by the Media Topics. Developed over many years by the IPTC and used by several leading news providers, the IPTC Media Topics is an industry-standard taxonomy for classifying news documents by subject. The Media Topics are available in English, French, Spanish and German.
18 |
19 | To accomplish these goals, the IPTC proposes to hire both a software development contractor and a linguist. The software contractor will develop the EXTRA engine, a software component that takes as input EXTRA rules and a text document and produces as output a list of rules matched by the text document and their corresponding topics. In developing the rules engine, every effort will be made to identify and build upon existing open source components. The IPTC believes that Elasticsearch Percolator shows great potential to be one such open source component. Other open source frameworks that may be relevant are Apache’s UIMA and Sheffield’s GATE (the General Architecture for Text Engineering). Similarly, the IPTC will explore how to make EXTRA compatible with modern cloud architectures, to simplify the deployment of the system for small-to-medium sized publishers.
20 |
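The engine contract described above (rules and a text document in, matched rules and their topics out) might look roughly like the following Python sketch. It is a guess at the shape of the interface only: the `Rule` class, the `run_engine` function, the rule identifiers and the topic labels are all hypothetical, and it does not use Elasticsearch Percolator, UIMA or GATE.

```python
# Hypothetical sketch of the engine's input/output contract.
# Rule, run_engine, the rule ids and the topic labels are assumptions,
# not part of any EXTRA specification.
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str              # hypothetical identifier for the rule
    topic: str                # topic assigned when the rule matches
    phrases: tuple[str, ...]  # the rule matches if any phrase occurs


def run_engine(rules: list[Rule], document: str) -> list[tuple[str, str]]:
    """Return (rule_id, topic) pairs for every rule the document matches."""
    lowered = document.lower()
    matches = []
    for rule in rules:
        if any(phrase.lower() in lowered for phrase in rule.phrases):
            matches.append((rule.rule_id, rule.topic))
    return matches


# Example rule set with made-up topic labels.
EXAMPLE_RULES = [
    Rule("r1", "example-topic-economy", ("interest rate", "central bank")),
    Rule("r2", "example-topic-sport", ("world cup", "grand slam")),
]
```

With these example rules, a document mentioning a central bank would produce a single `("r1", "example-topic-economy")` pair. A real engine would need a much richer rules language than plain phrase lists, which is what the formal specification is meant to define.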
21 | The software contractor will also develop and deliver a formal specification for the EXTRA rules language. The linguist will then, based on this formal specification, develop two collections of rules for classifying documents into IPTC Media Topics. All of these items, rules engine, language specification and classification rules, will be openly developed on github.com and released under a permissive [open source license](opensource.html).
22 |
23 | More broadly, it is the IPTC's hope that this project will catalyze a migration in the news publishing community away from expensive proprietary document classification systems and towards a common industry-wide open source platform. The IPTC further hopes that the broad adoption of a common rules-based document classification platform will create a marketplace for the many rule sets developed by news publishers over the last several years. Lastly, the IPTC believes that a freely available document classification platform will provide great benefit to small-to-medium sized publishers. The cost of existing document classification technology and the lack of freely available classification rule sets make it extremely difficult for all but the largest publishers to leverage this technology in their operations. As such, small-to-medium sized publishers face a challenge in providing their readership with the kinds of search and aggregation experiences typical of their larger peers.
24 |
--------------------------------------------------------------------------------
/dni.md:
--------------------------------------------------------------------------------
1 | ---
2 | layout: default
3 | title: IPTC EXTRA
4 | ---
5 | ## EXTRA DNI
6 |
7 | The goal of the [DNI](https://www.digitalnewsinitiative.com/fund) is to stimulate innovation among European news organizations. There are multiple funding rounds.
8 |
9 | IPTC applied during the first round in December 2015 for the “EXTRA” project and was awarded a prototype project amount of 50,000 EUR.
10 |
--------------------------------------------------------------------------------
/fonts/copse-regular-webfont.eot:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iptc/extra/a2e1024d1f968cf3ac672e3529ce1b45ceaff186/fonts/copse-regular-webfont.eot
--------------------------------------------------------------------------------
/fonts/copse-regular-webfont.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iptc/extra/a2e1024d1f968cf3ac672e3529ce1b45ceaff186/fonts/copse-regular-webfont.ttf
--------------------------------------------------------------------------------
/fonts/copse-regular-webfont.woff:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iptc/extra/a2e1024d1f968cf3ac672e3529ce1b45ceaff186/fonts/copse-regular-webfont.woff
--------------------------------------------------------------------------------
/fonts/quattrocentosans-bold-webfont.eot:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/iptc/extra/a2e1024d1f968cf3ac672e3529ce1b45ceaff186/fonts/quattrocentosans-bold-webfont.eot
--------------------------------------------------------------------------------
/fonts/quattrocentosans-bold-webfont.svg:
--------------------------------------------------------------------------------
1 |
2 |
3 |