├── README.md
└── doc
    ├── README.txt
    ├── api_concept.txt
    ├── api_contribution.txt
    ├── api_historization.txt
    ├── api_maintenance.txt
    ├── api_retrieval.txt
    └── use_cases.txt

/README.md:
--------------------------------------------------------------------------------
This is (for now) the git repository of the #openwebindex API architect, https://twitter.com/0rb1t3r/
The following is a copy of the text from http://www.suma-lab.de/webindex/

EUROPE NEEDS AN OPEN INDEX OF THE WEB

Initiators:

1. Prof. Dr. Dirk Lewandowski, Hochschule für Angewandte Wissenschaften Hamburg
2. Prof. Dr. Volker Grassmuck, Leuphana Universität Lüneburg
3. Dr. Philipp Mayr, GESIS – Leibniz Institute for the Social Sciences
4. Sebastian Sünkler, Hochschule für Angewandte Wissenschaften Hamburg
5. Agata Królikowski, Leuphana Universität Lüneburg
6. Lambert Heller, Technische Informationsbibliothek Hannover
7. Dr. Wolfgang Sander-Beuermann, SUMA-EV – Verein für Freien Wissenszugang

1. ONE SEARCH ENGINE IS NOT ENOUGH!

Europe’s digital economy and civil society are virtually dependent on non-European businesses. This is particularly evident with regard to search engines, the cornerstone of our digital information infrastructure. Google currently dominates the market, leading to dependencies and economic damage that are no longer acceptable.

If we were to apply the present situation in the digital world to the mass media, we would find ourselves with only one television channel as the sole source of information for the public. Businesses would also be dependent on this channel, as it would be the only available outlet for their advertising.

Such a situation contradicts the pluralism of our Western democratic societies. Pluralism must also be reflected in a diversity of information systems.

The market has failed in this respect. For more than ten years, we have been dependent on a single search engine, and no other company has been able to challenge it. We do not foresee the market regulating itself in the future.

2. AN OPEN INDEX OF THE WEB WOULD SET THE STAGE FOR INFORMATION AUTONOMY IN EUROPE’S ECONOMY AND SOCIETY

Restoring choice to the search engine market will mean putting prerequisites in place at the European level as a foundation for pluralism and competing search engines. Merely establishing a publicly funded competing search engine would not be appropriate, as this would only create a further monolithic structure.

The open source, open access and open data communities were essential to enabling the web as we know it today and have become the main drivers of the digital economy. A further key element that we currently lack is open access to the information distributed across the web.

To be clear, we are not seeking a government-funded alternative search engine – we want to enable innovation in the business world and civil society by providing searchable web data.

3. OUR OBJECTIVE: AN EU-FUNDED GLOBAL INDEX OF THE WEB

The key to a European digital information infrastructure would be an EU-funded, global, searchable index of the web open to competing companies, institutions and civil-society actors.

There is no alternative to public funding for such a project.
Unlike in the early years of the web, the present volume of data and the growing complexity of the internet mean that even major corporations and organizations do not have the financial resources to establish such an index.

The new index could form the basis for general and specialized search engines, analysis tools and many other applications. Any system based on the index would be free to develop its own business model.

As the key to tapping the world’s collective knowledge, the index must be set up and provided as a universally accessible element of public information infrastructure unaffected by commercial interests, not unlike public broadcasting.

Once it is in place, institutions, businesses and civil-society actors will be able to provide innovative services based on the index and compete in delivering the best ideas for its use. The search engine landscape would thus be transformed from the monopoly of a private company to a pluralistic cooperation that would not be at the mercy of a national government or a single business entity.

The signatories call on all actors of the European Union to jointly create the preconditions for independence, diversity and autonomy in Europe’s information infrastructure through an open index of the web.
--------------------------------------------------------------------------------
/doc/README.txt:
--------------------------------------------------------------------------------
The documents in this folder are work in progress.
They are not at all finalized and may be in a very early stage of development.

The specifications in the documents are open for discussion.
Everyone is invited to make suggestions or extensions:
contact the author @0rb1t3r or make a pull request from a git clone.

To start reading the documents, first read the file api_concept.txt, which
refers to the other documents.
--------------------------------------------------------------------------------
/doc/api_concept.txt:
--------------------------------------------------------------------------------
#
# api_concept.txt
# (C) 11.02.2015 [CC-by-SA] Michael Christen, @0rb1t3r
#

The following points were agreed upon in the meeting of 09.02.2015:

A1: The API is open
The api must accept requests from all users without any authentication.
No authorization is required; access is open to everyone.
An exception applies if the request frequency is too high.
For high-volume users, there will be a payment option for mass retrieval
which then actually requires an account and authorization at the api.


A2: The API provides privacy
The api software should not store and/or log any IP addresses permanently.
However, to detect mass retrieval and to protect itself it is necessary to
store IP addresses temporarily in RAM (a minimal sketch of such a mechanism
is given at the end of this document).


A3: Retrieval Functions
The api provides information retrieval functions to search the documents
(web pages etc.).
The api may also provide internal information about the link structure
and crawl success details.
# for details see: api_retrieval.txt


A4: Contribution Functions
The api should provide push functions to enable users to insert information.
Pushed information can be submitted on a per-document basis (which requires
either authentication or fraud detection or both) or on a per-host basis
(which could start a web crawl based on the crawl history of the domain and
the domain size).
# for details see: api_contribution.txt


A5: Maintenance Functions
The api should provide a function to report non-existent documents.
This may cause the document to be deleted from the index if it actually
no longer exists as a document on the web. This requires fraud detection
to prevent the function from being misused as a censorship tool.
# for details see: api_maintenance.txt


A6: Historization and cache access
The api should provide access to cached documents, including different
versions of the document from the past.
# for details see: api_historization.txt
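
The following is a minimal, non-normative sketch (in Python) of how an api
node could detect mass retrieval as described in A1 and A2 while keeping IP
addresses only temporarily in RAM; the window size and request limit are
arbitrary example values, not agreed-upon figures.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # example value only
MAX_REQUESTS = 100    # example value only

# request timestamps per IP address, kept in RAM only and never written to disk
_requests = defaultdict(deque)

def allow_request(ip_address):
    """Return True if the request frequency of this IP address is acceptable."""
    now = time.time()
    timestamps = _requests[ip_address]
    # forget timestamps older than the window, so IP data is stored only temporarily
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    if len(timestamps) >= MAX_REQUESTS:
        # too frequent: refer the client to the paid, authorized mass-retrieval option
        return False
    timestamps.append(now)
    return True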
--------------------------------------------------------------------------------
/doc/api_contribution.txt:
--------------------------------------------------------------------------------
#
# api_contribution.txt
# (C) 11.02.2015 [CC-by-SA] Michael Christen, @0rb1t3r
#
# this document is a sub-specification of api_concept.txt
#

This document contains the api essentials for several services. Each service
is implemented as an http(s) servlet. Every servlet shall support the
transmission of parameters via GET and POST operations. POST provides more
security because the parameter values are not transmitted in the request URL
and therefore do not appear in http/proxy logs, while GET operations are more
convenient for development and testing.

JSONP compliance:
All json servlets shall be callable with a GET/POST attribute 'callback'. If
a servlet is called using that attribute, the whole result response is
encapsulated into a JavaScript function call using the callback value as the
function name.
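
As a non-normative illustration, a servlet could apply this JSONP wrapping
roughly as in the following Python sketch; the helper name wrap_jsonp is an
assumption and not part of the specification.

import json

def wrap_jsonp(payload, callback=None):
    """Return (body, content_type) for a servlet response.

    If a 'callback' attribute was supplied, the json result is encapsulated
    into a JavaScript function call; otherwise plain json is returned.
    """
    body = json.dumps(payload)
    if callback:
        return "%s(%s);" % (callback, body), "application/javascript"
    return body, "application/json"

# /api/push.json?...&callback=handleResult would then return something like:
#   handleResult({"success": "true", ...});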

Servlets described in this document:
/api/push.json
/api/crawl.json


#
# servlet /api/push.json
#

This servlet makes user-generated content available to the openwebindex
platform. To protect the platform from being spammed with false content,
access must be restricted to authorized clients only. In this context, false
content means that the resource key within the pushed content (the URL) points
to a place where the pushed text is not available. False content must be
avoided, and it is therefore recommended to verify pushed content even if the
authenticated client is authorized.


* GET/POST parameters (GET should only be used for testing).
All parameters are required, none is optional.

- 'count'
An integer number denoting the number of documents pushed within this
POST request.

- 'auth'
This attribute must be filled with a value for authentication, e.g. an OAuth
token. Search results which do not require authorization must be identical
for requests with and without the auth value, to ensure transparency about
the use of authorized results.

A set of documents with their metadata must also be attached.
The documents are numbered from 0 to count-1. This document number X is used
to name the attribute fields for each of the documents. In the following
list, replace the character X with the number of the respective document:

- 'url-X'
The URL which is used in a search result as the link to the submitted document.

- 'data-X'
This is the binary data of the document.

- 'collection-X'
A name for a collection which is assigned to the document. This can be an
arbitrary word or a comma-separated list of terms. These words will be listed
in the collection navigation for a search facet (if that facet is switched on).

- 'responseHeader-X'
An HTTP response header line. This can be used to submit all kinds of metadata
which YaCy is able to process.

You should submit the Content-Type and Last-Modified header fields using the
responseHeader-X post attribute. This would look like this:

responseHeader-X=Last-Modified:<Date>
The <Date> is the date which is assigned to the document.
The date format must be according to RFC 1123, like
"EEE, dd MMM yyyy HH:mm:ss Z", with a time zone indicator according to RFC 5322.

responseHeader-X=Content-Type:<Mime-Type>
The <Mime-Type> is the mime type of the document.

Because media-type documents do not have a textual component which can be used
for searching, it is possible to attach a title and keywords to the media
document as well. To do this, the extra http header fields X-YaCy-Media-Title
and X-YaCy-Media-Keywords can be used:

responseHeader-X=X-YaCy-Media-Title:<Title>
The <Title> will be used as the document title.

responseHeader-X=X-YaCy-Media-Keywords:<Keywords>
<Keywords> is a list of keywords, separated by space characters.

* Returned file:
The servlet then returns a json result which explains how successful the
transmission was. A typical result looks like:
{
  "count":"1",
  "successall": "true",
  "item-0":{
    "item":"0",
    "url":"http://nowhere.cc/example.txt",
    "success": "true",
    "message": "[a text representing a link which can be used to verify the success of the operation, like a query url]"
  },
  "countsuccess":1,
  "countfail":0
}

The "message" attribute contains a link to a search result which shows the
pushed document in indexed metadata format. In case a push is not successful,
the "success" attribute turns to "false" and the "message" field contains the
reason for the failure.
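
The following is a minimal, non-normative sketch of a push request using
Python's 'requests' library; the host name api.example.org and the auth token
are placeholders and not part of this specification.

import requests

# parameters as a list of tuples so that 'responseHeader-0' can be repeated
payload = [
    ("count", "1"),
    ("auth", "<auth token>"),                      # placeholder value
    ("url-0", "http://nowhere.cc/example.txt"),
    ("data-0", "This is the text of the example document."),
    ("collection-0", "examples"),
    ("responseHeader-0", "Content-Type:text/plain"),
    ("responseHeader-0", "Last-Modified:Tue, 10 Feb 2015 12:00:00 +0000"),
]
r = requests.post("https://api.example.org/api/push.json", data=payload)
result = r.json()
print(result["successall"], result["item-0"]["message"])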


#
# servlet /api/crawl.json
#

This servlet orders the OpenWebIndex platform to load the requested content
from the web itself. A request can ask to load only the specified page, or
also pages linked from it which match a given pattern.

- 'auth' (optional, should be required for depth != 0)
This attribute must be filled with a value for authentication, e.g. an OAuth
token. Search results which do not require authorization must be identical
for requests with and without the auth value, to ensure transparency about
the use of authorized results.

- 'url' (required)
The url which shall be loaded by the crawler.

- 'depth' (optional, by default '0')
The recursion number for the crawler, which denotes the depth of the crawl.
If this number is '0', then only the start document (given in the url
parameter) is loaded. It may be reasonable to allow crawling of single pages
without authentication, while authorization for large crawls should be
granted to authenticated clients only.

- 'collection' (optional)
A name for a collection which is assigned to the documents. This can be an
arbitrary word or a comma-separated list of terms. These words will be listed
in the collection navigation for a search facet (if that facet is switched on).

- 'url-must-match' (optional, by default '.*')
- 'url-must-not-match' (optional, by default '')
- 'content-must-match' (optional, by default '.*')
- 'content-must-not-match' (optional, by default '')
Regular expressions which allow steering of the crawler. For example, it is
possible to restrict a crawl to a specific domain using the regular
expression .*domain.*

* Returned file:
The servlet then returns a json result which explains how successful the
crawl request was. Because a crawl is a long-running job, the result is
returned asynchronously to the crawl process and cannot report the outcome
of the crawl job itself. A typical result looks like:
{
  "success": "true",
  "message": "[a text explaining the success value]"
}
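
A minimal, non-normative sketch of a crawl request using Python's 'requests'
library; the host name api.example.org and the auth token are placeholders
and not part of this specification.

import requests

params = {
    "url": "http://nowhere.cc/",
    "depth": "1",                        # start page plus directly linked pages
    "url-must-match": ".*nowhere.cc.*",  # keep the crawler on one domain
    "collection": "examples",
    "auth": "<auth token>",              # placeholder; required here since depth != 0
}
r = requests.post("https://api.example.org/api/crawl.json", data=params)
print(r.json())  # e.g. {"success": "true", "message": "..."}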
--------------------------------------------------------------------------------
/doc/api_historization.txt:
--------------------------------------------------------------------------------
#
# api_historization.txt
# (C) 11.02.2015 [CC-by-SA] Michael Christen, @0rb1t3r
#
# this document is a sub-specification of api_concept.txt
#
--------------------------------------------------------------------------------
/doc/api_maintenance.txt:
--------------------------------------------------------------------------------
#
# api_maintenance.txt
# (C) 11.02.2015 [CC-by-SA] Michael Christen, @0rb1t3r
#
# this document is a sub-specification of api_concept.txt
#
--------------------------------------------------------------------------------
/doc/api_retrieval.txt:
--------------------------------------------------------------------------------
#
# api_retrieval.txt
# (C) 11.02.2015 [CC-by-SA] Michael Christen, @0rb1t3r
#
# this document is a sub-specification of api_concept.txt
#

This document contains the api essentials for several services. Each service
is implemented as an http(s) servlet. Every servlet shall support the
transmission of parameters via GET and POST operations. POST provides more
security because the parameter values are not transmitted in the request URL
and therefore do not appear in http/proxy logs, while GET operations are more
convenient for development and testing.

JSONP compliance:
All json servlets shall be callable with a GET/POST attribute 'callback'. If
a servlet is called using that attribute, the whole result response is
encapsulated into a JavaScript function call using the callback value as the
function name.

Servlets described in this document:
/api/search.[xml|json]
/api/suggest.json


#
# servlet /api/search.[xml|json]
#

This is the access point for document searches. There exist standards for
this kind of API, e.g. SRU (http://www.loc.gov/standards/sru/) and
OpenSearch (http://www.opensearch.org/Specifications/OpenSearch/1.1/Draft_5).
We want to support both standards and provide a subset of their attributes.
Neither standard specifies enough detail to support faceted search; we will
extend the standards in such a way that the necessary data entities are
included while still maintaining compliance with the given standards.

* GET/POST parameters
- 'query' (required):
The query string, identical to the text which the user typed in. The query
string may contain search operators (like 'intitle:') very similar to common
operators used by other search engines.

- 'startRecord' (optional, default 1)
The first record of the result set. Counting starts at 1.

- 'maximumRecords' (optional, default 10)
The number of records to be returned. It could be wise to limit the requested
number to prevent excessive mass retrieval; a default limit may be e.g. 100.

- 'collection' (optional)
This may be used to restrict the results to a predefined set of documents.
The value is the name of such a subset. The name may be defined in the
context of search tenants ('mandants'). Such tenants can be defined e.g. in
the context of the contribution api (see api_contribution.txt), where users
can push content together with the name of their collection.

- 'order' (optional)
This is either the name of a pre-defined ranking method or a string which
represents the schema for a user-defined ranking. The details of such a
ranking schema text are not yet defined. It may contain methods such as field
and query boosts; both depend on the index field structure, which is also not
yet defined.

- 'filter' (optional, default empty)
This shall be used to create result subsets based on facets. The format of
the value is undefined, but it must be filled with a string which is named by
the facet in the search result. Several filters can be combined by separating
the individual filter strings with a space character. All filters operate in
conjunction; the result is the intersection of the individual filter results.

- 'domain' (optional, default 'all')
The search result may contain links to documents, images, audio files and
video files. This attribute can select one of these document types or all of
them. Possible values are: 'doc', 'image', 'audio', 'video', 'all'

- 'near' (optional, default 1)
Some search engines do not create results based only on exact matches of the
search query but also enrich the result with fuzzy matches on the search
string literals. Using this option, the result may be exact (value = 1) or
fuzzy (0 < value < 1).

- 'op' (optional, default 'and')
If the query string contains several terms, the terms are matched as if they
were combined either with an AND operator or with an OR operator. This
attribute selects the default matching mode. Possible values are 'and' and
'or'. If 'or' is selected, the ranking prefers documents which match more
terms of the search query.

- 'auth' (optional, default empty, can only be used with a POST http request)
If an authorization for the query is necessary (e.g. for mass retrieval),
this attribute can be filled with a value for authentication, e.g. an OAuth
token. Search results which do not require authorization must be identical
for requests with and without the auth value, to ensure transparency about
the use of authorized results.
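
A minimal, non-normative sketch of a query in json format using the
parameters above, with Python's 'requests' library; the host name
api.example.org is a placeholder and not part of this specification.

import requests

params = {
    "query": "test",
    "startRecord": 1,
    "maximumRecords": 10,
    "domain": "doc",   # text documents only, no media files
    "op": "and",
}
r = requests.get("https://api.example.org/api/search.json", params=params)
results = r.json()     # json structure modelled after the opensearch DOM (see below)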

* Returned file:
The extension of the search servlet can be either xml or json. While we
prefer the usage of json, the xml format is supported to return data in
opensearch format. The json format follows the DOM of the opensearch api and
models the json structure accordingly.

XML return format:
We extend opensearch with attributes to support search facets. Facet types
may be defined separately or extended in the future, but we suggest providing
at least facets for file types (the extension of the file name), facets for
the domain name and facets for the document date.
This is an example output for a call to /api/search.xml?query=test

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/api/search.xsl' version='1.0'?>
<rss version="2.0"
     xmlns:openwebindex="" <!-- TO BE DEFINED -->
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
>
<channel>
  <title>OpenWebSearch RSS Result for query: test</title>
  <description>Search for test</description>
  <link>/api/search.html?query=test</link>
  <opensearch:startIndex>0</opensearch:startIndex>
  <opensearch:itemsPerPage>10</opensearch:itemsPerPage>