├── .gitignore
├── .travis.yml
├── Academic Docs IPFS gateway.isf
├── Academic Docs IPFS gateway.pdf
├── Classes.md
├── Extending.md
├── HTTPAPI.md
├── LICENSE
├── Metadata.md
├── README.md
├── Usecases.md
├── cron_ipfs.py
├── etc_ferm_input_nginx
├── etc_supervisor_conf.d_dweb.conf
├── load_ipfs.py
├── nginx
│   ├── README.md
│   ├── dweb.archive.org
│   ├── dweb.me
│   ├── gateway.dweb.me
│   ├── ipfs.dweb.me
│   ├── ipfsconvert.dweb.me
│   └── www.dweb.me
├── python
│   ├── Archive.py
│   ├── Btih.py
│   ├── ContentStore.py
│   ├── DOI.py
│   ├── Errors.py
│   ├── HashResolvers.py
│   ├── HashStore.py
│   ├── KeyPair.py
│   ├── LocalResolver.py
│   ├── Multihash.py
│   ├── NameResolver.py
│   ├── OutputFormat.py
│   ├── ServerBase.py
│   ├── ServerGateway.py
│   ├── SmartDict.py
│   ├── Transport.py
│   ├── TransportHTTP.py
│   ├── TransportIPFS.py
│   ├── TransportLocal.py
│   ├── __init__.py
│   ├── config.py
│   ├── elastic_schema.json
│   ├── maintenance.py
│   ├── miscutils.py
│   ├── requirements.txt
│   └── test
│       ├── __init__.py
│       ├── _utils.py
│       ├── test_LocationService.py
│       ├── test_archive.py
│       ├── test_doi.py
│       ├── test_local.py
│       └── test_multihash.py
├── rungate.py
├── scripts
│   ├── install.sh
│   ├── reset_ipfs.sh
│   ├── temp.sh
│   └── tests.sh
└── temp.py
/.gitignore:
--------------------------------------------------------------------------------
1 | idents_files_urls.sqlite
2 | *.pyc
3 | # pycharm
4 | .idea
5 | .cache
6 | python/.cache
7 | idents_files_urls_sqlite
8 | dweb-gateway.log
9 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 |
3 | python:
4 | - "2.7"
5 | - "3.6"
6 |
7 | before_install: cd python
8 |
9 | install:
10 | - pip install -r requirements.txt
11 |
12 | services:
13 | - redis-server
14 |
15 | script:
16 | - python -m pytest test/
17 |
--------------------------------------------------------------------------------
/Academic Docs IPFS gateway.isf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/Academic Docs IPFS gateway.isf
--------------------------------------------------------------------------------
/Academic Docs IPFS gateway.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/Academic Docs IPFS gateway.pdf
--------------------------------------------------------------------------------
/Classes.md:
--------------------------------------------------------------------------------
1 | # dweb-gateway - Classes
2 | A decentralized web gateway for open academic papers on the Internet Archive
3 |
4 | ## Important editing notes
5 | * Names might not be consistent below as it gets edited and code built.
6 | * Please edit to match names in the code as you notice conflicts.
7 | * A lot of this file will be moved into actual code as the skeleton gets built, just leaving summaries here.
8 |
9 | ## Other Info Links
10 |
11 | * [Main README](./README.md)
12 | * [Use Cases](./Usecases.md)
13 | * [Classes](./Classes.md) << You are here
14 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
15 | * [Proposal for metadata](./Metadata.md) - first draft
16 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
17 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
18 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
19 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
20 | So for example: curl https://gateway.dweb.me/info
21 |
22 | ## Overview
23 |
24 | This gateway sits between a decentralized web server running locally
25 | (in this case a Go-IPFS server) and the Archive.
26 | It will expose a set of services to the server.
27 |
28 | The data is stored in a sqlite database that maps DOIs to hashes of the files we know of,
29 | and the URLs to retrieve them.
30 |
31 | Note the data is multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
32 | various forms and formats (e.g. PDF, Doc; Final, Preprint).
33 |
34 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
35 |
36 | Especially see the main [README](./README.md) and [Use Cases](./Usecases.md)
37 |
38 | ## Structure high level
39 |
40 | Those services will be built from a set of microservices which may or may not be exposed.
41 |
42 | All calls to the gateway will come through a server that routes to individual services.
43 |
44 | Server URLs have a consistent form
45 | /outputformat/namespace/namespace-dependent-string
46 |
47 | Where:
48 | * outputformat: Extensible format wanted e.g. [IPLD](#IPLD) or [nameresolution](#nameresolution)
49 | * namespace: is an extensible descriptor for name spaces e.g. "doi"
50 | * namespace-dependent-string: is a string that may contain additional "/" dependent on the namespace.
51 |
52 | This is implemented as a sequence of steps:
53 | first the name is passed to a class representing the name space,
54 | then the resulting object is passed to a class for the outputformat that can interpret it,
55 | and finally a "content" method is called to output something for the client.
56 |
57 | See [HTTPServer](httpserver) for how this is processed in an extensible form.
58 |
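To make the two-step dispatch concrete, here is a minimal, runnable sketch. The class names echo this document but are stand-ins, and the real routing lives in [ServerGateway.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ServerGateway.py).

```python
# Illustrative stand-ins only - the real classes live in this repo's python/ directory.

class DOIResolver:                            # stands in for a NameResolver subclass
    def __init__(self, namespace, *args):
        self.name = "/".join(args)

    @classmethod
    def new(cls, namespace, *args, **kwargs):
        return cls(namespace, *args)          # real code would fetch metadata here


class MetadataOutput:                         # stands in for a GatewayOutput subclass
    contenttype = "application/json"

    def __init__(self, obj):
        self.obj = obj

    def content(self):                        # something to return to the client
        return '{"name": "%s"}' % self.obj.name


NAMESPACES = {"doi": DOIResolver}             # extensible namespace table
OUTPUTFORMATS = {"metadata": MetadataOutput}  # extensible outputformat table


def handle(outputformat, namespace, *rest, **kwargs):
    obj = NAMESPACES[namespace].new(namespace, *rest, **kwargs)  # step 1: name -> object
    return OUTPUTFORMATS[outputformat](obj).content()            # step 2: object -> output


print(handle("metadata", "doi", "10.1234", "abc-def"))  # {"name": "10.1234/abc-def"}
```
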
59 | ## Microservices
60 |
61 | ### Summary
62 |
63 | * HTTP Server: Routes queries to handlers based on the first part of the URL, pretty generic (code done, needs pushing)
64 | * Name Resolvers: A group of classes that recognize names and connect to internal resources
65 | * NameResolver: Superclass of each Name resolution class
66 | * NameResolverItem: Superclass to represent a file in a NameResolver
67 | * NameResolverShard: Superclass to represent a shard of a NameResolverItem
68 | * DOIResolver: A NameResolver that knows about DOI's
69 | * DOIResolverFile: A NameResolverItem that knows about files in a DOI
70 | * ContentHash: A NameResolverItem that knows about hashes of content
71 | * GatewayOutput: A group of classes handling converting data for output
72 | * IPLD: Superclass for different ways to build IPFS specific IPLDs
73 | * IPLDfiles: Handles IPLD that contain a list of other files
74 | * IPLDshards: Handles IPLD that contain a list of shards of a single file
75 | * Storage and Retrieval: A group of classes and services that store in DB or disk.
76 | * Hashstore: Superclass for a generic hash store to be built on top of REDIS
77 | * LocationService: A hashstore mapping multihash => location
78 | * ContentStore: Maps multihash <=> bytes. Might be on disk or REDIS
79 | * Services Misc
80 | * Multihash58: Converts hashes to base58 multihash form, and SHA-hashes content.
81 |
82 |
83 | ### HTTP Server
84 | Routes queries to handlers based on the first part of the URL (the output format);
85 | that routine will create an object by calling the constructor for the namespace, and
86 | then do whatever is needed to generate the output format (typically calling a method on the created
87 | object, or invoking a constructor for that format).
88 |
89 | `GET '/outputformat/namespace/namespace_dependent_string?aaa=bbb,ccc=ddd'`
90 |
91 | details moved to [ServerGateway.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ServerGateway.py)
92 |
93 | ## Name Resolvers
94 | The NameResolver group of classes manages recognizing a name and connecting it to resources
95 | we have at the Archive.
96 |
97 | ### NameResolver superclass
98 |
99 | The NameResolver class is the superclass of each name resolver;
100 | it specifies a set of methods we expect to be able to call on a subclass,
101 | and may have default code based on assumptions about the data structure of subclasses.
102 |
103 | Logically it can represent one or multiple files.
104 |
105 | details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
106 |
107 | ### NameResolverDir superclass
108 |
109 | * Superclass for items that represent multiple files,
110 | * e.g. a directory, or the files that contain a DOI
111 | * Its files() method iterates over them, returning NameResolverFile
112 | * details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
113 |
114 | ### NameResolverFile superclass
115 |
116 | * Superclass for items in a NameResolverDir,
117 | * for example a subclass would be specific PDFs containing a DOI.
118 | * It contains enough information to allow for retrieval of the file e.g. HTTP URL, or server and path. Also can have byterange,
119 | * And meta-data such as date, size
120 | * Its shards() method iterates over the shards stored.
121 | * details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
122 |
123 | ### NameResolverShard superclass
124 |
125 | * Superclass for references to Shards in a NameResolverItem
126 | * Returned by the shards() iterator in NameResolverItem
127 | * details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
128 |
129 | ### DOI Resolver
130 |
131 | Implements name resolution of the DOI namespace, via a sqlite database (provided by Brian)
132 |
133 | * URL: `/xxx/doi/10.pub_id/pub_specific_id` (forwarded here by HTTPServer)
134 | Resolves a DOI-specific name such as 10.nnn/zzzz.
135 |
136 | * details moved to [DOI.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/DOI.py)
137 | * Future Project to preload the different stores from the sqlite.
138 |
139 | ### DOIResolverFile
140 | * Subclass of NameResolverFile that holds metadata from the sqlite database
141 | * details moved to [DOI.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/DOI.py)
142 |
143 | ### ContentHash
144 | Subclass of NameResolverItem.
145 | Looks up the multihash in the Location Service to find where it can be retrieved from.
146 | * details moved to [ContentHash.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ContentHash.py)
147 |
148 | ## Gateway Outputs
149 | The Gateway Output group of classes manages producing derived content for sending back to requesters.
150 |
151 | ### GatewayOutput superclass
152 | Superclass of IPLD. Nothing useful defined here currently, but might be!
153 |
154 | Each subclass must implement:
155 | * content(): Return content suitable for returning via the HTTP Server
156 | * contenttype: String suitable for Content-Type HTTP field, e.g. "application/json"
157 |
158 | ##### Future Work
159 | Support streams as a return content type, both here and in server base class.
160 |
161 | ### IPLD superclass
162 | Subclass of GatewayOutput; Superclass for IPLDdir and IPLDshards
163 |
164 | This is a format specified by IPFS,
165 | see [IPLD spec](https://github.com/ipld/specs/tree/master/ipld)
166 |
167 | Note there are two principal variants of IPLD from our perspective,
168 | one provides a list of files (like a directory listing),
169 | the other provides a list of shards, that together create the desired file.
170 |
171 | Note that structurally this is similar, but not identical, to the data held in the DOIResolver format.
172 | A mapping will be required, especially as the IPLD spec is incomplete and subject to a new version,
173 | which is overdue and probably (Kyle to confirm) doesn't accurately match how IPLDs exist in current
174 | IPFS code (based on the various variations I see).
175 |
176 | Potential mapping:
177 |
178 | * Convert Python dates or ISO strings into the (undefined) format in IPFS (it's unclear
179 | why a standard like ISO wasn't used by IPFS). See [IPLD#46](https://github.com/ipld/specs/issues/46)
180 | * Possibly replacing links - it's unclear in the spec if IPLD wants a string like /ipfs/Q... or just the multihash.
181 | * Possibly removing fields we don't want to expose (e.g. the URL)
182 |
183 | Methods:
184 | * multihash58(): returns the hash of the results of content() using the multihash58 service.
185 |
186 |
187 |
188 | ### IPLDfiles
189 | Subclass of IPLD where we want to return a directory, or a list of choices
190 | - for example each of the PDFs & other files available for a specific DOI
191 | * IPLDfiles(NameResolver) *{Expand}* load IPLD with meta-data from NR and iterate through it loading own list.
192 | * content() - *{Expand}* Return internal data structure as a JSON
193 |
194 | ### IPLDshards
195 | Subclass of IPLD where we want to return a list of subfiles, that are the shards that together
196 | make up the result.
197 | * IPLDshards(NameResolverItem) *{Expand}* load IPLD with meta-data from NR and iterate through it loading own list.
198 | * content() - *{Expand}* Return internal data structure as a JSON
199 |
200 | The constructor is potentially complex.
201 | * Read metadata from NR and store in appropriate format (esp format of RawHash not yet defined)
202 | * Loop through an iterator on the NR which will return shards.
203 | * Write each of them to the IPLDshards' data structure.
204 | * Then write result of content() to the Content_store getting back a multihash (called ipldhash)
205 | * Store ipldhash to location_store with pointer to the Content_store.
206 | * And locationstore_push(contenthash, { ipld: ipldhash }) so that contenthash maps to ipldhash
207 |
208 |
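A hedged sketch of that constructor flow: plain dicts stand in for the Content and Location stores, and a SHA-256 hexdigest stands in for the base58 multihash.

```python
import hashlib
import json

content_store = {}    # multihash -> bytes; stands in for ContentStore
location_store = {}   # multihash -> location/metadata; stands in for LocationService

def build_ipldshards(metadata, contenthash, shards):
    # Read metadata from the NR and loop through its shard iterator
    ipld = {"metadata": metadata, "shards": list(shards)}
    data = json.dumps(ipld).encode()                     # the result of content()
    ipldhash = hashlib.sha256(data).hexdigest()          # real code uses a base58 multihash
    content_store[ipldhash] = data                       # write content(), getting back ipldhash
    location_store[ipldhash] = {"contentstore": True}    # ipldhash points at the Content_store
    location_store[contenthash] = {"ipld": ipldhash}     # so contenthash maps to ipldhash
    return ipldhash

print(build_ipldshards({"size": 2}, "Qcontent", ["Qshard1", "Qshard2"]))
```
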
209 | ## Storage and Retrieval
210 | Services for writing and reading to disk or database.
211 | They are intentionally separated out so that in the future the location of storage could change;
212 | for example, metadata could be stored with Archive:item or with WayBack:CDX.
213 |
214 | Preferably these will be implemented as classes, with the interface documentation below adjusted to match.
215 |
216 | ### Hashstore
217 | Stores and retrieves metadata for hashes. NOTE the hash is NOT the hash of the metadata, and the metadata is mutable.
218 | The fields essentially allow for independent indexes.
219 | It should be a simple shim around Redis. Note we will have to combine multihash and field to get a Redis "key": if we
220 | used the multihash alone as the key, with field as one field of a Redis dict type, then we wouldn't be able to "push" to it.
221 | Note we can assume that this is used in a consistent fashion, e.g. we won't do hash_store then hash_push on the same key, which would be invalid.
222 |
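A minimal sketch of such a shim, assuming a local Redis; the function names are illustrative, not the actual HashStore.py API.

```python
import redis

r = redis.StrictRedis()                  # assumes a local Redis server, as the gateway uses

def _key(multihash, field):
    return multihash + "." + field       # combine multihash + field into a single Redis key

def hash_store(multihash, field, value):     # plain set; never mixed with push on the same key
    r.set(_key(multihash, field), value)

def hash_get(multihash, field):
    return r.get(_key(multihash, field))

def hash_push(multihash, field, value):      # list push; works because the combined key is plain
    r.rpush(_key(multihash, field), value)
```
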
223 | ### Location Service
224 | Maps hashes to locations.
225 | The multihash represents a file or a part of a file. Built upon Hashstore.
226 | It is split out because this could be a useful service on its own.
227 |
228 | ### Content Store
229 | Store and retrieve content by its hash.
230 |
231 | ## Services
232 | Any services not bound to a class or group of classes.
233 |
234 | ### Multihash58
235 | Converts a file or hash to a base58 multihash.
236 |
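A hedged sketch of what this service computes: the 0x12/0x20 bytes are the standard sha2-256 multihash header, and `base58` is a library this repo already imports.

```python
import hashlib
import base58

def multihash58(data):
    digest = hashlib.sha256(data).digest()   # SHA the content
    mh = b"\x12\x20" + digest                # 0x12 = sha2-256, 0x20 = 32-byte digest length
    encoded = base58.b58encode(mh)           # bytes or str depending on the base58 version
    return encoded.decode() if isinstance(encoded, bytes) else encoded

print(multihash58(b"hello world"))           # a Qm... style base58 multihash
```
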
--------------------------------------------------------------------------------
/Extending.md:
--------------------------------------------------------------------------------
1 | # Dweb Gateway - Extending
2 |
3 | This document is a work in progress on how to extend the Dweb gateway
4 |
5 | ## Adding a new data type / resolver
6 |
7 | * Create a new file (or sometimes class in existing file)
8 | * Create a class in that file
9 | * If the class conceptually holds multiple objects (like a directory or collection) subclass NameResolverDir
10 | * If just one file, subclass NameResolverFile
11 | * See SEE-OTHERNAMESPACE in Python (and potentially in clients) for places to hook in.
12 | * Add required / optional methods
13 | * new(cls, namespace, *args, **kwargs) - which is passed everything from the HTTPS request except the outputtype
14 |
15 |
16 |
17 | ### Add the following methods
18 |
19 | __init__(self, namespace, *args, **kwargs)
20 | * Create and initialize an object, often the superclass's method is used and work done in "new"
21 |
22 | @property mimetype
23 | * The mimetype string for the content
24 |
25 | @classmethod new(cls, namespace, *args, **kwargs)
26 | * Create a new object, initialize from args & kwargs, often does a metadata fetch etc.
27 |
28 | retrieve(verbose)
29 | * returns: "content" of an object, typically a binary such as a PDF
30 |
31 | content(verbose)
32 | * returns: "content" encapsulated for return to server
33 | * default: encapsulates retrieve() with mimetype
34 |
35 |
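Putting the pieces together, a hypothetical (untested) resolver might look like the sketch below; the class name and data are illustrative, and real examples live in DOI.py and Archive.py. It would then be hooked into the namespace table in ServerGateway.py (search for SEE-OTHERNAMESPACE).

```python
# Hypothetical sketch only - names and data here are illustrative.
from python.NameResolver import NameResolverFile

class ExampleThing(NameResolverFile):
    @classmethod
    def new(cls, namespace, *args, **kwargs):
        obj = cls(namespace, *args, **kwargs)  # often the superclass's __init__ is enough
        obj._data = b"%PDF-..."                # real code would fetch metadata/content here
        return obj

    @property
    def mimetype(self):
        return "application/pdf"               # the mimetype string for the content

    def retrieve(self, verbose=False):
        return self._data                      # raw bytes; default content() wraps this with mimetype
```
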
--------------------------------------------------------------------------------
/HTTPAPI.md:
--------------------------------------------------------------------------------
1 | # DWEB HTTPS API
2 |
3 | This doc describes the API exposed by the Internet Archive's DWEB server.
4 |
5 | Note this server is experimental, it could change without notice.
6 | Before using it for anything critical please contact mitra@archive.org.
7 |
8 | ## Overview
9 |
10 | The URLs have a consistent structure, except for a few odd cases - see [Odd cases](#odd-cases) below.
11 | ```
12 | https://dweb.me/outputtype/itemtype/itempath
13 | ```
14 | Where:
15 |
16 | * dweb.me is the HTTPS server. Any other server running this code should give the same output.
17 | * outputtype: is the type of result requested e.g. metadata or content
18 | * itemtype: is the kind of item being inquired about e.g. doi, contenthash
19 |
20 | The outputtype and itemtype are in most cases orthogonal, i.e. any outputtype SHOULD work with any itemtype.
21 | In practice some combinations don't make any sense.
22 |
23 | ## Output Types
24 |
25 | * content: Return the file itself
26 | * contenthash: Return the hash of the content, suitable for a contenthash/xyz request
27 | * contenturl: Return a URL that could be used to retrieve the content
28 | * metadata: Return a JSON with metadata about the file - its format is type dependent
29 | * void: Return emptiness
30 |
31 | ## Item Types
32 |
33 | * contenthash: The hash of the content, in base58 multihash form - typically Q123abc or z123abc depending on which hash is used.
34 | It returns files from Archive.org and will be expanded to cover more collections over time.
35 | * doi: Digital Object Identifier e.g. 10.1234/abc-def. The standard identifier for academic papers.
36 | * sha1hex: Sha1 expressed as a hex string e.g. a1b2c3
37 | * rawstore: The data provided with a POST is to be stored
38 | * rawfetch: Equivalent to contenthash except only retrieves from a local data store (so is faster)
39 | * rawadd: Adds a JSON data structure to a named list e.g. rawadd/Q123
40 | * rawlist: Returns an array of data structures added to a list with rawadd
41 | * archiveid: An item (a collection of related files) represented by an Archive.org itemid.
42 | * advancedsearch: A collection of items returned by a search on archive.org
43 |
44 | (Note this set is mapped in ServerGateway.py to the classes that serve them)
45 |
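For example, a metadata request for the illustrative DOI above, using only the Python stdlib:

```python
import json
from urllib.request import urlopen

# outputtype = "metadata", itemtype = "doi", itempath = "10.1234/abc-def"
with urlopen("https://dweb.me/metadata/doi/10.1234/abc-def") as resp:
    print(json.load(resp))   # JSON metadata; the format is type dependent
```
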
46 |
47 | ## Odd cases
48 |
49 | * info - returns a JSON describing the server - the format will change, except that it always contains { type: "gateway" }
50 |
51 |
--------------------------------------------------------------------------------
/Metadata.md:
--------------------------------------------------------------------------------
1 | # Dweb Gateway - Metadata
2 |
3 |
4 | Metadata changes - in brief….
5 |
6 | See [https://dweb.me/arc/archive.org/metadata/commute](https://dweb.me/arc/archive.org/metadata/commute) for example
7 | ```
8 | {
9 |   collection_titles: {
10 |     artsandmusicvideos: "Arts & Music"  # Maps collection name to the title in the UI
11 |   },
12 |   files: [
13 |     {
14 |       contenthash: contenthash:/contenthash/
15 |       magnetlink: "magnet …. /"
16 |     }
17 |   ],
18 |   metadata: {
19 |     magnetlink: "magnet ….",
20 |     thumbnaillinks: [
21 |       "ipfs:/ipfs/",  # IPFS link (lazy added if not already in Redis)
22 |       "http://dweb.me/arc/archive.org/thumbnail/commute"  # Direct http link
23 |     ]
24 |   }
25 | }
26 | ```
25 |
26 | [https://dweb.me/arc/archive.org/metadata/commute/commute.avi](https://dweb.me/arc/archive.org/metadata/commute/commute.avi)
27 | expands on the file's metadata to add:
28 | ```
29 | {
30 |   contenthash: contenthash:/contenthash/
31 |   ipfs: ipfs:/ipfs/  # Adds an IPFS link after lazily adding the file to IPFS (only done at this point because of the speed of adding)
32 |   magnetlink: "magnet …. /"
33 | }
34 | ```
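
For instance, a client can pull the decentralized links out of the expanded file metadata (a hedged stdlib example; which fields are present depends on what has been lazily added):

```python
import json
from urllib.request import urlopen

url = "https://dweb.me/arc/archive.org/metadata/commute/commute.avi"
f = json.load(urlopen(url))
print(f.get("contenthash"), f.get("ipfs"), f.get("magnetlink"))
```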
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | # dweb-gateway
3 | A decentralized web gateway for open academic papers on the Internet Archive
4 |
5 | ## NOTE: THIS REPO IS NO LONGER MAINTAINED. IT HAS ALL MOVED TO ia-dweb AND www-dweb ON AN INTERNAL SYSTEM,
6 | ## WHICH JUST CALLS THE dweb-archivecontroller routing.js
7 |
8 | ## Important editing notes
9 | * Names might not be consistent below as it gets edited and code built.
10 | * Please edit to match names in the code as you notice conflicts.
11 | * A lot of this file will be moved into actual code as the skeleton gets built, just leaving summaries here.
12 |
13 | ## Other Info Links
14 |
15 | * [Main README](./README.md) << You are here
16 | * [Use Cases](./Usecases.md)
17 | * [Classes](./Classes.md)
18 | * [HTTP API](./HTTPAPI.md)
19 | * [Extending](./Extending.md)
20 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
21 | * [Proposal for metadata](./Metadata.md) - first draft
22 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
23 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
24 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
25 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
26 | So for example: curl https://gateway.dweb.me/info
27 |
28 | ## Overview
29 |
30 | This gateway sits between a decentralized web server running locally
31 | (in this case a Go-IPFS server) and the Archive.
32 | It will expose a set of services to the server.
33 |
34 | The data is stored in a sqlite database that maps DOIs to hashes of the files we know of,
35 | and the URLs to retrieve them.
36 |
37 | Note the data is multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
38 | various forms and formats (e.g. PDF, Doc; Final, Preprint).
39 |
40 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
41 |
42 | ## Structure high level
43 |
44 | Those services will be built from a set of microservices which may or may not be exposed.
45 |
46 | All calls to the gateway will come through a server that routes to individual services.
47 |
48 | Server URLs have a consistent form
49 | /outputformat/namespace/namespace-dependent-string
50 |
51 | Where:
52 | * outputformat: Extensible format wanted e.g. [IPLD](#IPLD) or [nameresolution](#nameresolution)
53 | * namespace: is an extensible descriptor for name spaces e.g. "doi"
54 | * namespace-dependent-string: is a string that may contain additional "/" dependent on the namespace.
55 |
56 | This is implemented as a sequence of steps:
57 | first the name is passed to a class representing the name space,
58 | then the resulting object is passed to a class for the outputformat that can interpret it,
59 | and finally a "content" method is called to output something for the client.
60 |
61 | See [HTTPServer](httpserver) for how this is processed in an extensible form.
62 |
63 | See [UseCases](./Usecases.md) and [Classes](./Classes.md) for expansion of this
64 |
65 | See [HTTP API](./HTTPAPI.md) for the API exposed by the URLs.
66 |
67 | ## Installation
68 |
69 | This should work; someone please confirm on a clean(er) machine and remove this comment.
70 |
71 | You'll first need REDIS & Supervisor to be installed
72 | ### On a Mac
73 | ```bash
74 | brew install redis
75 | brew services start redis
76 | brew install supervisor
77 | ```
78 |
79 | ### On a Linux
80 |
81 | Supervisor install details are in [https://pastebin.com/ctEKvcZt](https://pastebin.com/ctEKvcZt) and [http://supervisord.org/installing.html](http://supervisord.org/installing.html)
82 |
83 | It's unclear to me how to install Redis; it's been on every machine I've used.
84 |
85 | ### Python gateway:
86 | #### Installation
87 | ```bash
88 | # Note it uses the #deployable branch, #master may have more experimental features.
89 | cd /usr/local # On our servers it's always in /usr/local/dweb-gateway and there may be dependencies on this
90 | git clone http://github.com/internetarchive/dweb-gateway.git
91 |
92 | ```
93 | Run this complex install script; if it fails, check the configuration at the top and rerun. It will:
94 |
95 | * Do the pip install (it's all Python 3)
96 | * Update from the repo's #deployable branch (and push back any locally made changes) to #deployed
97 | * Pull a sqlite file that isn't actually used any more (it was for academic docs in the first iteration of the gateway)
98 | * Check the NGINX files match what I expect and (if run as `install.sh NGINX`) copy them over if you have permissions
99 | * Restart the service via supervisorctl; it does NOT set up supervisor
100 |
101 | There are zero guarantees that changing the config will not cause it to fail!
102 | ```bash
103 | cd dweb-gateway
104 | scripts/install.sh
105 | ```
106 | In addition
107 | * Check and copy etc_supervisor_conf.d_dweb.conf to /etc/supervisor/conf.d/dweb.conf or server-specific location
108 | * Check and copy etc_ferm_input_nginx to /etc/ferm/input/nginx or server specific location
109 |
110 | #### Update
111 | `cd /usr/local/dweb-gateway; scripts/install.sh`
112 | should update from the repo and restart
113 |
114 | #### Restart
115 | `supervisorctl restart dweb:dweb-gateway`
116 |
117 | ### Gun, Webtorrent Seeder; Webtorrent-tracker
118 | #### Installation
119 | They are all in the dweb-transport repo so ...
120 | ```bash
121 | cd /usr/local # There are probably dependencies on this location
122 | git clone http://github.com/internetarchive/dweb-transport.git
123 | cd dweb-transport && npm install
124 | # Supervisorctl, nginx and ferm should have been setup above.
125 | supervisorctl start dweb:dweb-gun
126 | supervisorctl start dweb:dweb-seeder
127 | supervisorctl start dweb:dweb-tracker
128 | ```
129 | #### Update
130 | ```bash
131 | cd /usr/local/dweb-transport
132 | git pull
133 | npm update
134 | supervisorctl restart dweb:*
135 | sleep 10 # Give it time to start and not quickly exit
136 | supervisorctl status
137 | ```
138 |
139 | #### Restart
140 | `supervisorctl restart dweb:*` will restart these, and the python gateway and IPFS
141 | or restart `dweb:dweb-gun` or `dweb:dweb-seeder` or `dweb:dweb-tracker` individually.
142 |
143 | ### IPFS
144 | #### Installation
145 | Was done by Protocol Labs, and I'm not 100% sure of the full set of things done to set up the repo in a slightly non-standard way.
146 |
147 | In particular, I know there is a command that has to be run once to enable the 'urlstore' functionality.
148 |
149 | And there may be something needed to enable WebSockets connections (they are enabled in the gateway's nginx files).
150 |
151 | There is a cron task running every 10 minutes that calls one of the scripts and works around an IPFS problem that should be fixed at some point, but not necessarily soon.
152 | ```
153 | 3,13,23,33,43,53 * * * * python3 /usr/local/dweb-gateway/cron_ipfs.py
154 | ```
155 |
156 | #### Update
157 | ```bash
158 | ipfs update install latest
159 | supervisorctl restart dweb:dweb-ipfs
160 | ```
161 | Should work, but there have been issues with IPFS's update process in the past with non-automatic revisions of the IPFS repo.
162 |
163 | #### Restart
164 | ```
165 | supervisorctl restart dweb:dweb-ipfs
166 | ```
167 |
168 | ### dweb.archive.org UI
169 | ```bash
170 | cd /usr/local && git clone http://github.com/internetarchive/dweb-archive.git
171 | cd /usr/local/dweb-archive && npm install
172 | ```
--------------------------------------------------------------------------------
/Usecases.md:
--------------------------------------------------------------------------------
1 | # dweb-gateway - Use Cases
2 | A decentralized web gateway for open academic papers on the Internet Archive
3 |
4 | An outline of Use Cases for the gateway
5 |
6 | ## Important editing notes
7 | * Names might not be consistent below as it gets edited and code built.
8 | * Please edit to match names in the code as you notice conflicts.
9 |
10 | ## Other Info Links
11 |
12 | * [Main README](./README.md)
13 | * [Use Cases](./Usecases.md) << You are here
14 | * [Classes](./Classes.md)
15 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
16 | * [Proposal for metadata](./Metadata.md) - first draft
17 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
18 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
19 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
20 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
21 | So for example: curl https://gateway.dweb.me/info
22 |
23 | ## Overview
24 |
25 | This gateway sits between a decentralized web server running locally
26 | (in this case a Go-IPFS server) and the Archive.
27 | It will expose a set of services to the server.
28 |
29 | The data is stored in a sqlite database that maps DOIs to hashes of the files we know of,
30 | and the URLs to retrieve them.
31 |
32 | Note the data is multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
33 | various forms and formats (e.g. PDF, Doc; Final, Preprint).
34 |
35 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
36 |
37 | Please see the main [README](./README.md) for the overall structure and [Classes](./Classes.md) for the class overview.
38 |
39 | ## Use Case examples
40 |
41 | ### Retrieving a document starting with DOI
42 |
43 |
44 | #TODO: Copy the use case from the [google doc with the previous architecture version](https://docs.google.com/document/d/1FO6Tdjz7A1yi4ABcd8vDz4vofRDUOrKapi3sESavIcc/edit#)
45 | with edits to match current names etc. in Microservices below. Below is a draft.
46 |
47 | ##### Retrieval of file by content hash
48 | * IPFS Gateway
49 | * Receives a request by contenthash
50 | * Requests GET //gateway.dweb.me/ipldfile/contenthash/Qm.....
51 | * Gateway Server/Service gateway.dweb.me
52 | * Calls ContentHash(Qm...)
53 | * ContentHash(Qm...)
54 | * (ContentHash is subclass of NameResolverFile)
55 | * Locates file in the sqlite
56 | * Loads meta-data for that file
57 | * Gateway Server
58 | * Passes ContentHash object to ipldfile(CH)
59 | * IPLDfile calls CH.shards() as an iterator on CH to read each shard
60 | * ContentHash.shards()
61 | * Is an iterator that iterates over shards (or chunks) of the file. For each shard:
62 | * It reads a chunk of bytes from the file (using a byterange in a HTTP call)
63 | * It hashes those bytes
64 | * Stores the hash and the URL + Byterange in the location service
65 | * Returns the metadata & hash to IPLDfile
66 | * IPLDfile
67 | * Combines the return into the IPLD variant for shards,
68 | * and adds metadata, especially the contenthash
69 | * returns to NameServer
70 | * Gateway Server > IPFS > client
71 | * Calls IPLDfile.content() to get the ipld file to return to IPFS Gateway
72 | * IPFS Gateway
73 | * Pins the hash of the IPLD and each of the shards, and returns to client
74 |
75 | ##### File retrieval
76 | * IPFS Client
77 | * Having retrieved the IPLDfile, iterates over the shards
78 | * For each shard it tries to retrieve the hash
79 | * IPFS Gateway node
80 | * Recognizes the shard, and calls gateway.dweb.me/content/multihash/Q...
81 | * Gateway server
82 | * Routes to multihash("multihash", Q...)
83 | * Multihash("multihash", Q...)
84 | * Looks up the multihash in the location service
85 | * Discovers the location is a URL + Byterange
86 | * Gateway server
87 | * Calls content method on multihash
88 | * Multihash.content()
89 | * Retrieves the bytes (from elsewhere in Archive) and returns to Gateway Server
90 | * Gateway Server > IPFS Gateway > Client
91 |
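A hedged sketch of the shards() flow above; the shard size and the dict standing in for the location service are assumptions.

```python
import hashlib
from urllib.request import Request, urlopen

SHARD_SIZE = 256 * 1024   # assumed chunk size
location_store = {}       # stands in for the real LocationService

def shards(url, size):
    """Yield one hash per chunk, recording hash -> (url, byterange) as we go."""
    for start in range(0, size, SHARD_SIZE):
        end = min(start + SHARD_SIZE, size) - 1
        req = Request(url, headers={"Range": "bytes=%d-%d" % (start, end)})
        data = urlopen(req).read()                # read one chunk via an HTTP byterange
        h = hashlib.sha256(data).hexdigest()      # real code uses a base58 multihash
        location_store[h] = (url, start, end)     # store the hash and URL + byterange
        yield h
```
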
--------------------------------------------------------------------------------
/cron_ipfs.py:
--------------------------------------------------------------------------------
1 | import logging
2 | # This is run every 10 minutes by Cron (10 * 58 = 580 ~ 10 hours)
3 | from python.config import config
4 | import redis
5 | import base58
6 | from python.HashStore import StateService
7 | from python.TransportIPFS import TransportIPFS
8 | from python.maintenance import resetipfs
9 |
10 | logging.basicConfig(**config["logging"]) # For server
11 | resetipfs(announcedht=True)
12 |
13 | # To fully reset IPFS need to also ...
14 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where leafs are stored - these refer to IPFS hashes for metadata
15 |
--------------------------------------------------------------------------------
/etc_ferm_input_nginx:
--------------------------------------------------------------------------------
1 | proto tcp dport 80 ACCEPT;
2 | proto tcp dport 443 ACCEPT;
3 | proto tcp dport 4001 ACCEPT;
4 | proto tcp dport 4245 ACCEPT;
5 | proto tcp dport 4246 ACCEPT;
6 | proto tcp dport 8080 ACCEPT;
7 | proto tcp dport 6881 ACCEPT;
8 | proto tcp dport 6969 ACCEPT;
9 | proto udp dport 6969 ACCEPT;
10 |
--------------------------------------------------------------------------------
/etc_supervisor_conf.d_dweb.conf:
--------------------------------------------------------------------------------
1 | [group:dweb]
2 | programs=dweb-gateway,dweb-ipfs,dweb-gun,dweb-tracker,dweb-seeder
3 |
4 | [program:dweb-gateway]
5 | command=/usr/bin/python3 -m python.ServerGateway
6 | directory = /usr/local/dweb-gateway
7 | user = mitra
8 | stdout_logfile = /var/log/dweb/dweb-gateway
9 | stdout_logfile_maxbytes=500MB
10 | redirect_stderr = True
11 | autostart = True
12 | autorestart = True
13 | environment=USER=mitra,PYTHONUNBUFFERED=TRUE
14 | exitcodes=0
15 |
16 | [program:dweb-ipfs]
17 | command=/usr/local/bin/ipfs daemon --enable-gc --migrate=true
18 | directory = /usr/local/dweb-gateway
19 | user = ipfs
20 | stdout_logfile = /var/log/dweb/dweb-ipfs
21 | stdout_logfile_maxbytes=500MB
22 | redirect_stderr = True
23 | autostart = True
24 | autorestart = True
25 | environment=USER=ipfs
26 | exitcodes=0
27 |
28 | [program:dweb-gun]
29 | command=node ./gun_https_archive.js 4246
30 | directory = /usr/local/dweb-transport/gun
31 | user = gun
32 | stdout_logfile = /var/log/dweb/dweb-gun
33 | stdout_logfile_maxbytes=500MB
34 | redirect_stderr = True
35 | autostart = True
36 | autorestart = True
37 | environment=GUN_ENV=false
38 | exitcodes=0
39 |
40 | [program:dweb-tracker]
41 | command=node index.js
42 | directory = /usr/local/dweb-transport/tracker
43 | user = mitra
44 | stdout_logfile = /var/log/dweb/dweb-tracker
45 | stdout_logfile_maxbytes=500MB
46 | redirect_stderr = True
47 | autostart = True
48 | autorestart = True
49 | exitcodes=0
50 |
51 | [program:dweb-seeder]
52 | command=node index.js
53 | directory = /usr/local/dweb-transport/seeder
54 | user = mitra
55 | stdout_logfile = /var/log/dweb/dweb-seeder
56 | stdout_logfile_maxbytes=500MB
57 | redirect_stderr = True
58 | autostart = True
59 | autorestart = True
60 | environment=DEBUG=*
61 | exitcodes=0
62 |
--------------------------------------------------------------------------------
/load_ipfs.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | import logging
4 | import sys
5 | import redis
6 |
7 | from python.config import config
8 | from python.Archive import ArchiveItem, ArchiveFile
9 |
10 | logging.basicConfig(**config["logging"]) # On server logs to /var/log/dweb/dweb-gateway
11 |
12 | #print(config);
13 | logging.debug("load_ipfs args={}".format(sys.argv)) # sys.argv[1] is first arg (0 is this script)
14 | if (len(sys.argv) > 1) and ("/" in sys.argv[1]):
15 | args = sys.argv[1].split('/')
16 | else:
17 | args = sys.argv[1:]
18 |
19 | # Can override args while testing
20 | #args = ["commute"]
21 | #args = ["commute", "commute.avi"]
22 | #args = ["commute", "closeup.gif"]
23 |
24 | # Set one of these to True based on whether want to use IPFS add or IPFS urlstore
25 | forceadd = True
26 | forceurlstore = False
27 | # Set to true if want each ipfs hash added to DHT via DHT provide
28 | announcedht = False
29 |
30 | obj = ArchiveItem.new("archiveid", *args, wanttorrent=False)
31 | print('"URL","Add/Urlstore","Hash","Size","Announced"')
32 | if isinstance(obj, ArchiveFile): # TODO-PERMS-OK cache_ipfs should be checking perms
33 | obj.cache_ipfs(url = obj.archive_url, forceadd=forceadd, forceurlstore=forceurlstore, verbose=False, printlog=True, announcedht=announcedht, size=int(obj._metadata["size"]))
34 | else:
35 | obj.cache_ipfs(forceurlstore=forceurlstore, forceadd=forceadd, verbose=False, announcedht=announcedht, printlog=True) # Will Loop through all files in Item
36 |
37 | #print("---FINISHED ---")
38 |
--------------------------------------------------------------------------------
/nginx/README.md:
--------------------------------------------------------------------------------
1 | # NGINX configuration
2 |
3 | These files are to support tracking changes to nginx,
4 |
5 | For now these are just being tracked manually, i.e. editing here won't change anything, but you can add any changes back to git by ...
6 | ```bash
7 | cd /usr/local/dweb-gateway/nginx
8 | cp /etc/nginx/sites-enabled/* .
9 | git commit -a
10 | git push
11 | ```
12 |
13 | TODO - this could use some support in the install scripts etc.
14 |
15 |
16 | ## SUMMARY
17 | * https://dweb.me (secure server) proxypass http://dweb.me (research server)
18 | * http://dweb.me/ -> https://dweb.me/
19 | * http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
20 | * https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
21 | * https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
22 | * https://{gateway.dweb.me, dweb.me, dweb.archive.org}/examples -> file
23 | * http://dweb.archive.org/{details,search} -> bootloader
24 | * https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
25 | * https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
26 |
27 |
28 | The main differences between the different domains are ...
29 |
30 | * dweb.archive.org answers on http, because it's the end-point of a proxypass from https://dweb.archive.org (another machine)
31 | * dweb.me forces https by redirecting http to https
32 | * gateway.dweb.me provides access to the python server for any URL at http://gateway.dweb.me:80
33 | * dweb.me and gateway.dweb.me forward the exact root URL '/' to https://dweb.archive.org/
34 | * dweb.archive.org forwards /{details,search}; dweb.me & gateway.dweb.me forward /arc/archive.org/{details,search} to bootloader.html
35 |
36 |
37 | ## URLS of interest
38 | dweb.archive.org|dweb.me or gateway.dweb.me|Action
39 | ----------------|--------------------------|------
40 | /|/|Archive home page via bootloader
41 | /search?q=xyz|/arc/archive.org/search/q=xyz|search page via bootloader
42 | /details/foo|/arc/archive.org/details/foo|details page via bootloader
43 | /ipfs/Q1234|/ipfs/Q1234|IPFS result
44 | /metadata/foo|/arc/archive.org/metadata/foo|cache and return metadata JSON
45 | /leaf/foo|/arc/archive.org/leaf/foo|cache and return leaf record JSON (for naming)
46 | /download/foo/bar|/arc/archive.org/download/foo/bar|return file bar from item foo
47 | n/a|/add,list,store etc|access gateway functionality
48 |
--------------------------------------------------------------------------------
/nginx/dweb.archive.org:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
25 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
26 | ##
27 |
28 | ####
30 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
30 | ####
31 |
32 | server {
33 | # dweb.me and gateway.dweb.me on 443
34 | # dweb.archive.org answers on 80 because its the end-point from port 443 proxypass on the real dweb.archive.org
35 | listen 80;
36 |
37 | root /var/www/html;
38 |
39 | # Add index.php to the list if you are using PHP
40 | index index.html index.htm index.nginx-debian.html;
41 |
42 | server_name dweb.archive.org;
43 |
44 | # Forward details and search -> bootloader
45 | # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}, on dweb.archive.org it's at /{details,search}
46 | # Load bootloader which will examine the URL
47 | location ~ ^/$ {
48 | add_header Access-Control-Allow-Origin *;
49 | try_files /archive/bootloader.html =404;
50 | }
51 |
52 | # Catch /download/foo - displayed on details page; rather than /download/foo/bar which goes to gateway
53 | location ~ ^/download/[^/]*$ {
54 | rewrite ^/download/([^/]*)$ /details/$1&download=1 redirect;
55 | }
56 |
57 | location /details {
58 | add_header Access-Control-Allow-Origin *;
59 | try_files /archive/bootloader.html =404;
60 | }
61 | location /search {
62 | add_header Access-Control-Allow-Origin *;
63 | try_files /archive/bootloader.html =404;
64 | }
65 |
66 | # Handle archive.org urls that can't currently be done with Dweb
67 | location ~ ^/(about|bookmarks|donate|projects) {
68 | return 302 https://archive.org$request_uri;
69 | }
70 |
71 | # Not yet working forward, unclear if should be /ws or /wss
72 | location /ws {
73 | proxy_pass http://localhost:4002;
74 | proxy_http_version 1.1;
75 | proxy_set_header Upgrade $http_upgrade;
76 | proxy_set_header Connection "upgrade";
77 | }
78 |
79 | location /wss {
80 | proxy_pass http://localhost:4002;
81 | proxy_http_version 1.1;
82 | proxy_set_header Upgrade $http_upgrade;
83 | proxy_set_header Connection "upgrade";
84 | }
85 |
86 | location /favicon.ico {
87 | add_header Access-Control-Allow-Origin *;
88 | # First attempt to serve request as file, then
89 | # as directory, then fall back to displaying a 404.
90 | try_files $uri $uri/ =404;
91 | }
92 |
93 | location /examples {
94 | add_header Access-Control-Allow-Origin *;
95 | # First attempt to serve request as file, then
96 | # as directory, then fall back to displaying a 404.
97 | try_files $uri $uri/ =404;
98 | }
99 |
100 | location /archive {
101 | add_header Access-Control-Allow-Origin *;
102 | # First attempt to serve request as file, then
103 | # as directory, then fall back to displaying a 404.
104 | try_files $uri $uri/ =404;
105 | }
106 |
107 | location /dweb-serviceworker-bundle.js {
108 | add_header Access-Control-Allow-Origin *;
109 | # First attempt to serve request as file, then
110 | # as directory, then fall back to displaying a 404.
111 | try_files /examples/$uri =404;
112 | }
113 |
114 | location /ipfs {
115 | rewrite ^/ipfs/(.*) /ipfs/$1 break;
116 | proxy_set_header Host $host;
117 | proxy_set_header X-Real-IP $remote_addr;
118 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
119 | proxy_set_header X-Forwarded-Proto $scheme;
120 | proxy_pass http://localhost:8080/;
121 | proxy_read_timeout 600;
122 | }
123 |
124 | # # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so client uses https, most of these are going to be crawlers/hackers
125 | # location ~ ^/$ {
126 | # return 301 https://dweb.archive.org/;
127 | # }
128 |
129 | location ~ ^.+/dweb-(transports|objects)-bundle.js { # contains dweb-*-bundles, but doesn't start with it - probably a relative url in bootstrap.html
130 | rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
131 | }
132 |
133 | location / {
134 | add_header Access-Control-Allow-Origin *;
135 | try_files $uri /archive$uri @gateway;
136 | }
137 |
138 | # On dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, as it assumes /arc/archive.org
139 | location @gateway {
140 | rewrite (.*) /arc/archive.org$1 break; # On dweb.archive.org rewrite it to /arc/archive.org/...
141 | proxy_set_header Host $host;
142 | proxy_set_header X-Real-IP $remote_addr;
143 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
144 | proxy_set_header X-Forwarded-Proto $scheme;
145 | proxy_pass http://localhost:4244;
146 | proxy_read_timeout 600;
147 | }
148 |
149 | }
150 |
--------------------------------------------------------------------------------
/nginx/dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
25 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
26 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
27 | ##
28 |
29 | ####
30 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
31 | ####
32 |
33 | server {
34 | listen 0.0.0.0:80;
35 | server_name dweb.me;
36 | rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
37 | }
38 |
39 | # Default server configuration
40 | #
41 | server {
42 | listen 443 http2;
43 |
44 | # SSL configuration
45 | #
46 | # listen 443 ssl default_server;
47 | # listen [::]:443 ssl default_server;
48 | #
49 | # Note: You should disable gzip for SSL traffic.
50 | # See: https://bugs.debian.org/773332
51 | #
52 | # Read up on ssl_ciphers to ensure a secure configuration.
53 | # See: https://bugs.debian.org/765782
54 | #
55 | # Self signed certs generated by the ssl-cert package
56 | # Don't use them in a production server!
57 | #
58 | # include snippets/snakeoil.conf;
59 |
60 | root /var/www/html;
61 |
62 | # Add index.php to the list if you are using PHP
63 | index index.html index.htm index.nginx-debian.html;
64 |
65 | server_name dweb.me;
66 | ssl on;
67 | ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
68 | ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
69 |
70 | # Catch /download/foo - displayed on details page; rather than /download/foo/bar which goes to gateway
71 | location ~ ^/arc/archive.org/download/[^/]*$ {
72 | rewrite ^/arc/archive.org/download/([^/]*)$ /arc/archive.org/details/$1&download=1 redirect;
73 | }
74 |
75 | # Forward details and search -> bootloader
76 | # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}, on dweb.archive.org it's at /{details,search}
77 | location /arc/archive.org/details {
78 | add_header Access-Control-Allow-Origin *;
79 | try_files /archive/bootloader.html =404;
80 | }
81 | location /arc/archive.org/search {
82 | add_header Access-Control-Allow-Origin *;
83 | try_files /archive/bootloader.html =404;
84 | }
85 |
86 | # Handle archive.org urls that can't currently be done with Dweb
87 | location /arc/archive.org/about {
88 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
89 | }
90 | location /arc/archive.org/donate {
91 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
92 | }
93 | location /arc/archive.org/projects {
94 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
95 | }
96 |
97 | # Not yet working forward, unclear if should be /ws or /wss
98 | location /ws {
99 | proxy_pass http://localhost:4002;
100 | proxy_http_version 1.1;
101 | proxy_set_header Upgrade $http_upgrade;
102 | proxy_set_header Connection "upgrade";
103 | }
104 |
105 | location /wss {
106 | proxy_pass http://localhost:4002;
107 | proxy_http_version 1.1;
108 | proxy_set_header Upgrade $http_upgrade;
109 | proxy_set_header Connection "upgrade";
110 | }
111 |
112 | location /favicon.ico {
113 | add_header Access-Control-Allow-Origin *;
114 | # First attempt to serve request as file, then
115 | # as directory, then fall back to displaying a 404.
116 | try_files $uri $uri/ =404;
117 | }
118 |
119 | location /examples {
120 | add_header Access-Control-Allow-Origin *;
121 | # First attempt to serve request as file, then
122 | # as directory, then fall back to displaying a 404.
123 | try_files $uri $uri/ =404;
124 | }
125 |
126 | location /archive {
127 | add_header Access-Control-Allow-Origin *;
128 | # First attempt to serve request as file, then
129 | # as directory, then fall back to displaying a 404.
130 | try_files $uri $uri/ =404;
131 | }
132 |
133 | location /dweb-serviceworker-bundle.js {
134 | add_header Access-Control-Allow-Origin *;
135 | # First attempt to serve request as file, then
136 | # as directory, then fall back to displaying a 404.
137 | try_files /examples/$uri =404;
138 | }
139 |
140 | location /ipfs {
141 | rewrite ^/ipfs/(.*) /ipfs/$1 break;
142 | proxy_set_header Host $host;
143 | proxy_set_header X-Real-IP $remote_addr;
144 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
145 | proxy_set_header X-Forwarded-Proto $scheme;
146 | proxy_pass http://localhost:8080/;
147 | proxy_read_timeout 600;
148 | }
149 |
150 | # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so client uses https, most of these are going to be crawlers/hackers
151 | location ~ ^/$ {
152 | return 301 https://dweb.archive.org/;
153 | }
154 | location ~ ^.+/dweb-(transports|objects)-bundle.js { # contains dweb-*-bundles, but doesn't start with it - probably a relative url in bootstrap.html
155 | rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
156 | }
157 |
158 | location / {
159 | add_header Access-Control-Allow-Origin *;
160 | try_files $uri /archive$uri @gateway;
161 | }
162 |
163 | # On dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, as it assumes /arc/archive.org
164 | location @gateway {
165 | proxy_set_header Host $host;
166 | proxy_set_header X-Real-IP $remote_addr;
167 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
168 | proxy_set_header X-Forwarded-Proto $scheme;
169 | proxy_pass http://localhost:4244;
170 | proxy_read_timeout 600;
171 | }
172 |
173 | }
174 |
175 | server {
176 | listen 4245;
177 | root /var/www/html;
178 | server_name dweb.me;
179 | ssl on;
180 | ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
181 | ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
182 |
183 | # Tunnel websocket
184 | location / {
185 | proxy_pass http://localhost:4002;
186 | proxy_http_version 1.1;
187 | proxy_set_header Upgrade $http_upgrade;
188 | proxy_set_header Connection "upgrade";
189 | }
190 | }
191 |
--------------------------------------------------------------------------------
/nginx/gateway.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
25 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
26 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
27 | ##
28 |
29 | ####
30 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
31 | ####
32 |
33 | # TODO gateway.dweb.me handles / as if it were on port 4244; dweb.me instead passes all to https - it's not clear this difference is needed (dweb.me's forward to https preferred)
34 |
35 | server {
36 | # dweb.archive.org answers on 80, dweb.me and gateway.dweb.me on 443
37 |
38 | listen 0.0.0.0:80;
39 | root /var/www/html;
40 | server_name gateway.dweb.me;
41 |
42 | #rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
43 |
44 | # Exact root URL - Forward root URL to dweb.archive.org; shouldn't be hitting it here - probably a crawler
45 | location ~ ^/$ {
46 | return 301 https://dweb.archive.org/;
47 | }
48 |
49 | location / {
50 | proxy_set_header Host $host;
51 | proxy_set_header X-Real-IP $remote_addr;
52 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
53 | proxy_set_header X-Forwarded-Proto $scheme;
54 |
55 | proxy_pass http://localhost:4244/;
56 | proxy_read_timeout 600;
57 | }
58 | }
59 | server {
60 | listen 0.0.0.0:80;
61 | server_name dweb.me;
62 | rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
63 | }
64 | # Default server configuration
65 | #
66 | server {
67 | listen 443 http2;
68 |
69 | # SSL configuration
70 | #
71 | # listen 443 ssl default_server;
72 | # listen [::]:443 ssl default_server;
73 | #
74 | # Note: You should disable gzip for SSL traffic.
75 | # See: https://bugs.debian.org/773332
76 | #
77 | # Read up on ssl_ciphers to ensure a secure configuration.
78 | # See: https://bugs.debian.org/765782
79 | #
80 | # Self signed certs generated by the ssl-cert package
81 | # Don't use them in a production server!
82 | #
83 | # include snippets/snakeoil.conf;
84 |
85 | root /var/www/html;
86 |
87 | # Add index.php to the list if you are using PHP
88 | index index.html index.htm index.nginx-debian.html;
89 |
90 | server_name gateway.dweb.me;
91 | ssl on;
92 | ssl_certificate /etc/letsencrypt/live/gateway.dweb.me/fullchain.pem;
93 | ssl_certificate_key /etc/letsencrypt/live/gateway.dweb.me/privkey.pem;
94 |
95 | # Catch /download/foo - displayed on details page; rather than /download/foo/bar which goes to gateway
96 | location ~ ^/arc/archive.org/download/[^/]*$ {
97 | rewrite ^/arc/archive.org/download/([^/]*)$ /arc/archive.org/details/$1&download=1 redirect;
98 | }
99 |
100 | # Forward details and search -> bootloader
101 | # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}; on dweb.archive.org it's at /{details,search}
102 | location /arc/archive.org/details {
103 | add_header Access-Control-Allow-Origin *;
104 | try_files /archive/bootloader.html =404;
105 | }
106 | location /arc/archive.org/search {
107 | add_header Access-Control-Allow-Origin *;
108 | try_files /archive/bootloader.html =404;
109 | }
110 |
111 | # Handle archive.org urls that can't currently be done with Dweb
112 | location /arc/archive.org/about {
113 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
114 | }
115 | location /arc/archive.org/donate {
116 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
117 | }
118 | location /arc/archive.org/projects {
119 | rewrite /arc/archive.org/(.*) https://archive.org/$1;
120 | }
121 |
122 | # Websocket forward - not yet working; unclear whether it should be /ws or /wss
123 | location /ws {
124 | proxy_pass http://localhost:4002;
125 | proxy_http_version 1.1;
126 | proxy_set_header Upgrade $http_upgrade;
127 | proxy_set_header Connection "upgrade";
128 | }
129 |
130 | location /wss {
131 | proxy_pass http://localhost:4002;
132 | proxy_http_version 1.1;
133 | proxy_set_header Upgrade $http_upgrade;
134 | proxy_set_header Connection "upgrade";
135 | }
136 |
137 | location /favicon.ico {
138 | add_header Access-Control-Allow-Origin *;
139 | # First attempt to serve request as file, then
140 | # as directory, then fall back to displaying a 404.
141 | try_files $uri $uri/ =404;
142 | }
143 |
144 | location /examples {
145 | add_header Access-Control-Allow-Origin *;
146 | # First attempt to serve request as file, then
147 | # as directory, then fall back to displaying a 404.
148 | try_files $uri $uri/ =404;
149 | }
150 |
151 | location /archive {
152 | add_header Access-Control-Allow-Origin *;
153 | # First attempt to serve request as file, then
154 | # as directory, then fall back to displaying a 404.
155 | try_files $uri $uri/ =404;
156 | }
157 |
158 | location /dweb-serviceworker-bundle.js {
159 | add_header Access-Control-Allow-Origin *;
160 | # First attempt to serve request as file, then
161 | # as directory, then fall back to displaying a 404.
162 | try_files /examples/$uri =404;
163 | }
164 |
165 | location /ipfs {
166 |         rewrite ^/ipfs/(.*) /ipfs/$1 break;  # no-op rewrite + break so the full /ipfs/... path is passed upstream (the trailing slash on proxy_pass would otherwise strip the prefix)
167 | proxy_set_header Host $host;
168 | proxy_set_header X-Real-IP $remote_addr;
169 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
170 | proxy_set_header X-Forwarded-Proto $scheme;
171 | proxy_pass http://localhost:8080/;
172 | proxy_read_timeout 600;
173 | }
174 |
175 | # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so the client uses https; most of these are going to be crawlers/hackers
176 | location ~ ^/$ {
177 | return 301 https://dweb.archive.org/;
178 | }
179 |     location ~ ^.+/dweb-(transports|objects)-bundle.js { # contains dweb-*-bundle but doesn't start with it - probably a relative url in bootstrap.html
180 | rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
181 | }
182 |
183 | location / {
184 | try_files $uri /archive$uri @gateway;
185 | }
186 |
187 | # on dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, which assumes /arc/archive.org
188 | location @gateway {
189 | proxy_set_header Host $host;
190 | proxy_set_header X-Real-IP $remote_addr;
191 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
192 | proxy_set_header X-Forwarded-Proto $scheme;
193 | proxy_pass http://localhost:4244;
194 | proxy_read_timeout 600;
195 | }
196 |
197 | }
198 |
--------------------------------------------------------------------------------
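A quick way to sanity-check the routing summarised at the top of gateway.dweb.me is to issue requests and inspect the redirects without following them. A minimal sketch, assuming the Python `requests` library and that the hosts above are still live ("commute" is just an example itemid, borrowed from the HashStore docstring):

```python
import requests

# Exact root URL on port 80 should 301 to https://dweb.archive.org/ ("location ~ ^/$")
r = requests.get("http://gateway.dweb.me/", allow_redirects=False)
assert r.status_code == 301
assert r.headers["Location"] == "https://dweb.archive.org/"

# /arc/archive.org/download/xxx (no file part) should redirect to the details page with download=1
r = requests.get("https://gateway.dweb.me/arc/archive.org/download/commute", allow_redirects=False)
assert r.status_code in (301, 302)
assert "details/commute" in r.headers["Location"] and "download=1" in r.headers["Location"]
```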
/nginx/ipfs.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | ##
13 | server {
14 | listen 0.0.0.0:80;
15 | server_name ipfs.dweb.me;
16 | rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
17 | }
18 |
19 |
20 | # Default server configuration
21 | #
22 | server {
23 | listen 443;
24 |
25 | # SSL configuration
26 | #
27 | # listen 443 ssl default_server;
28 | # listen [::]:443 ssl default_server;
29 | #
30 | # Note: You should disable gzip for SSL traffic.
31 | # See: https://bugs.debian.org/773332
32 | #
33 | # Read up on ssl_ciphers to ensure a secure configuration.
34 | # See: https://bugs.debian.org/765782
35 | #
36 | # Self signed certs generated by the ssl-cert package
37 | # Don't use them in a production server!
38 | #
39 | # include snippets/snakeoil.conf;
40 |
41 | root /var/www/html;
42 |
43 | # Add index.php to the list if you are using PHP
44 | index index.html index.htm index.nginx-debian.html;
45 |
46 | server_name ipfs.dweb.me;
47 | ssl on;
48 | ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
49 | ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
50 |
51 |
52 | location / {
53 | # First attempt to serve request as file, then
54 | # as directory, then fall back to displaying a 404.
55 | proxy_set_header Host $host;
56 | proxy_set_header X-Real-IP $remote_addr;
57 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
58 | proxy_set_header X-Forwarded-Proto $scheme;
59 |
60 | proxy_pass http://localhost:8080/;
61 | proxy_read_timeout 600;
62 | }
63 |
64 | # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
65 | #
66 | #location ~ \.php$ {
67 | # include snippets/fastcgi-php.conf;
68 | #
69 | # # With php7.0-cgi alone:
70 | # fastcgi_pass 127.0.0.1:9000;
71 | # # With php7.0-fpm:
72 | # fastcgi_pass unix:/run/php/php7.0-fpm.sock;
73 | #}
74 |
75 | # deny access to .htaccess files, if Apache's document root
76 | # concurs with nginx's one
77 | #
78 | #location ~ /\.ht {
79 | # deny all;
80 | #}
81 | }
82 |
83 |
84 | # Virtual Host configuration for example.com
85 | #
86 | # You can move that to a different file under sites-available/ and symlink that
87 | # to sites-enabled/ to enable it.
88 | #
89 | #server {
90 | # listen 80;
91 | # listen [::]:80;
92 | #
93 | # server_name example.com;
94 | #
95 | # root /var/www/example.com;
96 | # index index.html;
97 | #
98 | # location / {
99 | # try_files $uri $uri/ =404;
100 | # }
101 | #}
102 |
--------------------------------------------------------------------------------
/nginx/ipfsconvert.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | ##
13 | server {
14 | listen 0.0.0.0:80;
15 | server_name ipfsconvert.dweb.me;
16 |
17 | # SSL configuration
18 | #
19 | # listen 443 ssl default_server;
20 | # listen [::]:443 ssl default_server;
21 | #
22 | # Note: You should disable gzip for SSL traffic.
23 | # See: https://bugs.debian.org/773332
24 | #
25 | # Read up on ssl_ciphers to ensure a secure configuration.
26 | # See: https://bugs.debian.org/765782
27 | #
28 | # Self signed certs generated by the ssl-cert package
29 | # Don't use them in a production server!
30 | #
31 | # include snippets/snakeoil.conf;
32 |
33 | root /var/www/html;
34 |
35 | # Add index.php to the list if you are using PHP
36 | index index.html index.htm index.nginx-debian.html;
37 |
38 | # ssl on;
39 | # ssl_certificate /etc/letsencrypt/live/ipfsconvert.dweb.me/fullchain.pem;
40 | # ssl_certificate_key /etc/letsencrypt/live/ipfsconvert.dweb.me/privkey.pem;
41 |
42 |
43 | location / {
44 | # First attempt to serve request as file, then
45 | # as directory, then fall back to displaying a 404.
46 | proxy_set_header Host $host;
47 | proxy_set_header X-Real-IP $remote_addr;
48 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
49 | proxy_set_header X-Forwarded-Proto $scheme;
50 |
51 | proxy_pass http://localhost:4245/;
52 | proxy_read_timeout 600;
53 | }
54 |
55 | # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
56 | #
57 | #location ~ \.php$ {
58 | # include snippets/fastcgi-php.conf;
59 | #
60 | # # With php7.0-cgi alone:
61 | # fastcgi_pass 127.0.0.1:9000;
62 | # # With php7.0-fpm:
63 | # fastcgi_pass unix:/run/php/php7.0-fpm.sock;
64 | #}
65 |
66 | # deny access to .htaccess files, if Apache's document root
67 | # concurs with nginx's one
68 | #
69 | #location ~ /\.ht {
70 | # deny all;
71 | #}
72 | }
73 |
74 |
75 | # Virtual Host configuration for example.com
76 | #
77 | # You can move that to a different file under sites-available/ and symlink that
78 | # to sites-enabled/ to enable it.
79 | #
80 | #server {
81 | # listen 80;
82 | # listen [::]:80;
83 | #
84 | # server_name example.com;
85 | #
86 | # root /var/www/example.com;
87 | # index index.html;
88 | #
89 | # location / {
90 | # try_files $uri $uri/ =404;
91 | # }
92 | #}
93 |
--------------------------------------------------------------------------------
/nginx/www.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | ##
13 | server {
14 | server_name www.dweb.me;
15 | return 301 $scheme://dweb.me$request_uri;
16 | }
17 |
--------------------------------------------------------------------------------
/python/Btih.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from .NameResolver import NameResolverDir
3 | from .Errors import CodingException, ToBeImplementedException, NoContentException
4 | from .HashStore import MagnetLinkService
5 | from .miscutils import httpget, loads
6 | from .config import config
7 | from .Archive import ArchiveItem
8 | from magneturi import bencode
9 |
10 |
11 | class BtihResolver(NameResolverDir):
12 | """
13 | Resolve BitTorrent Hashes
14 | Fields:
15 | btih # BitTorrent hash - in ascii B32 format
16 |
17 | This could also easily be extended to
18 | Support "magnetlink" as the thing being looked for (and return btih and other outputs)
19 | Support outputs of itemid, metadata (of item)
20 |
21 | """
22 |     namespace = "btih"  # Namespace this resolver handles (checked in __init__)
23 |
24 | def __init__(self, namespace, hash, **kwargs):
25 | """
26 | Creates the object
27 |
28 | :param namespace: "btih"
29 | :param hash: Hash representing the object - format is specified by namespace
30 | :param kwargs: Any other args to the URL, ignored for now.
31 | """
32 | verbose=kwargs.get("verbose")
33 | if verbose:
34 | logging.debug("{0}.__init__({1}, {2}, {3})".format(self.__class__.__name__, namespace, hash, kwargs))
35 | if namespace != self.namespace: # Checked though should be determined by ServerGateway mapping
36 | raise CodingException(message="namespace != "+self.namespace)
37 |         super(BtihResolver, self).__init__(namespace, hash, **kwargs)  # NameResolver.__init__ ignores the extra args
38 | self.btih = hash # Ascii B32 version of hash
39 |
40 | @classmethod
41 | def new(cls, namespace, hash, *args, **kwargs):
42 | """
43 | Called by ServerGateway to handle a URL - passed the parts of the remainder of the URL after the requested format,
44 |
45 | :param namespace:
46 | :param args:
47 | :param kwargs:
48 | :return:
49 | :raise NoContentException: if cant find content directly or via other classes (like DOIfile)
50 | """
51 | verbose=kwargs.get("verbose")
52 | ch = super(BtihResolver, cls).new(namespace, hash, *args, **kwargs) # By default (on NameResolver) calls cls() which goes to __init__
53 | return ch
54 |
55 | def itemid(self, verbose=False, **kwargs):
56 | searchurl = config["archive"]["url_btihsearch"] + self.btih
57 | searchres = loads(httpget(searchurl))
58 | if not searchres["response"]["numFound"]:
59 | return None
60 | return searchres["response"]["docs"][0]["identifier"]
61 |
62 | def retrieve(self, verbose=False, **kwargs):
63 | """
64 |         Fetch the content and return it (typically called by NameResolver.content())
65 | TODO - if needed can retrieve the torrent file here - look at HashStore for example of getting from self.url
66 |
67 | :returns: content - i.e. bytes
68 | """
69 |         raise ToBeImplementedException(name="btih retrieve")  # MyBaseException subclasses take keyword args only
70 |
71 | def content(self, verbose=False, **kwargs):
72 | """
73 | :returns: content - i.e. bytes
74 | """
75 | data = self.retrieve()
76 | if verbose: logging.debug("Retrieved doc size={}".format(len(data)))
77 | return {'Content-type': self.mimetype,
78 | 'data': data,
79 | }
80 |
81 | def metadata(self, headers=True, verbose=False, **kwargs):
82 | """
83 | :param verbose:
84 | :return:
85 | """
86 | raise ToBeImplementedException(name="btih.metadata()")
87 |
88 | def magnetlink(self, verbose=False, headers=False, **kwargs):
89 | magnetlink = MagnetLinkService.btihget(self.btih)
90 | data = magnetlink or "" # Current paths mean we should have it, but if not we'll return "" as we have no way of doing that lookup
91 | return {"Content-type": "text/plain", "data": data} if headers else data
92 |
93 |     def torrenturl(self, verbose=False):  # TODO-PERMS only used in torrent() below, which doesn't use the result, so this routine could be deleted
94 | itemid = self.itemid(verbose=verbose)
95 | if not itemid:
96 | raise NoContentException()
97 | return "https://archive.org/download/{}/{}_archive.torrent".format(itemid, itemid)
98 |
99 | def torrent(self, verbose=False, headers=False, **kwargs):
100 | torrenturl = self.torrenturl(verbose=verbose) # NoContentException if not found # TODO-PERMS unused can delete this line?
101 | data = bencode.bencode(ArchiveItem.modifiedtorrent(self.itemid(), wantmodified=True, verbose=verbose))
102 | mimetype = "application/x-bittorrent"
103 | return {"Content-type": mimetype, "data": data} if headers else data
104 |
105 |
--------------------------------------------------------------------------------
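A hypothetical usage sketch for BtihResolver, assuming the gateway's Redis (backing MagnetLinkService) and config are available and the import is run from the repo root; the info-hash below is a placeholder, not a real torrent:

```python
from python.Btih import BtihResolver

# Placeholder BitTorrent info-hash in the ascii form used by the btih namespace
btih = "0123456789abcdef0123456789abcdef01234567"
resolver = BtihResolver.new("btih", btih, verbose=True)

print(resolver.magnetlink())              # bare magnet link, or "" if not cached in Redis
print(resolver.magnetlink(headers=True))  # {"Content-type": "text/plain", "data": ...}
print(resolver.itemid())                  # looks the hash up via config["archive"]["url_btihsearch"]
```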
/python/ContentStore.py:
--------------------------------------------------------------------------------
1 | from .HashStore import HashStore
2 |
3 |
4 | class ContentStore(HashStore):
5 |     """
6 |     Store and retrieve content by its hash.
7 |     Could use REDIS or just store in a file - see rawstore and rawfetch in https://github.com/mitra42/dweb/blob/master/dweb/TransportLocal.py for an example
8 |     * rawstore(bytes) => multihash
9 |     * rawfetch(multihash) => bytes
10 |     * Consumes: multihash; hashstore
11 |
12 |     Notes: The names are for compatibility with a separate client library project.
13 |     For now this could use the hashstore or it could use the file system (have code for this)
14 |     """
--------------------------------------------------------------------------------
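The notes above sketch the intended rawstore/rawfetch API without implementing it. Below is a minimal file-backed sketch of that API - an assumption for illustration, not the project's implementation (which could equally sit on the Redis-backed HashStore):

```python
import os
from .Multihash import Multihash  # assumes this sketch lives alongside Multihash.py in this package


class FileContentStore(object):
    """Hypothetical content-addressed store: rawstore(bytes) => multihash58, rawfetch(multihash58) => bytes."""

    def __init__(self, dir=".contentstore"):
        self.dir = dir
        os.makedirs(dir, exist_ok=True)

    def rawstore(self, data):
        # Name the file by the base58 multihash of its content
        multihash58 = Multihash(data=data, code=Multihash.SHA2_256).multihash58
        with open(os.path.join(self.dir, multihash58), "wb") as f:
            f.write(data)
        return multihash58

    def rawfetch(self, multihash58):
        with open(os.path.join(self.dir, multihash58), "rb") as f:
            return f.read()
```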
/python/Errors.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 |
3 | class MyBaseException(Exception):
4 | """
5 | Base class for Exceptions
6 |
7 | Create subclasses with parameters in their msg e.g. {message} or {name}
8 | and call as in: raise NewException(name="Foo");
9 |
10 | msgargs Arguments that slot into msg
11 | __str__ Returns msg expanded with msgparms
12 | """
13 | errno=0
14 | httperror = 500 # See BaseHTTPRequestHandler for list of errors
15 | msg="Generic Model Exception" #: Parameterised string for message
16 | def __init__(self, **kwargs):
17 |         self.msgargs = kwargs  # Store arbitrary dict of message args (can be used to output msg from template)
18 |
19 | def __str__(self):
20 | try:
21 | return self.msg.format(**self.msgargs)
22 |         except Exception:
23 |             return self.msg + " UNFORMATTABLE ARGS:" + repr(self.msgargs)
24 |
25 | class ToBeImplementedException(MyBaseException):
26 | """
27 | Raised when some code has not been implemented yet
28 | """
29 | httperror = 501
30 | msg = "{name} needs implementing"
31 |
32 | # Note TransportError is in Transport.py
33 |
34 | class IPFSException(MyBaseException):
35 | httperror = 500
36 | msg = "IPFS Error: {message}"
37 |
38 | class CodingException(MyBaseException):
39 | httperror = 501
40 | msg = "Coding Error: {message}"
41 |
42 | class SignatureException(MyBaseException):
43 | httperror = 501
44 | msg = "Signature Verification Error: {message}"
45 |
46 | class EncryptionException(MyBaseException):
47 | httperror = 500 # Failure in the encryption code other than lack of authentication
48 | msg = "Encryption error: {message}"
49 |
50 | class ForbiddenException(MyBaseException):
51 |     httperror = 403  # Forbidden - WWW authentication won't help; there is no real HTTP error code for a non-HTTP authentication failure
52 | msg = "Not allowed: {what}"
53 |
54 | class AuthenticationException(MyBaseException):
55 | """
56 |     Raised when authentication fails
57 | """
58 | httperror = 403 # Forbidden - this should be 401 except that requires extra headers (see RFC2616)
59 | msg = "Authentication Exception: {message}"
60 |
61 | class IntentionallyUnimplementedException(MyBaseException):
62 | """
63 |     Raised when functionality is intentionally not implemented
64 | """
65 | httperror = 501
66 | msg = "Intentionally not implemented: {message}"
67 |
68 | class DecryptionFailException(MyBaseException):
69 | """
70 |     Raised if decryption failed - this could be because it's the wrong (e.g. old) key
71 | """
72 | httperror = 500
73 | msg = "Decryption fail"
74 |
75 | class SecurityWarning(MyBaseException):
76 | msg = "Security warning: {message}"
77 |
78 |
79 | class AssertionFail(MyBaseException): #TODO-BACKPORT - console.assert on JS should throw this
80 | """
81 |     Raised when something that should be True isn't - usually a coding failure or some change not propagated fully
82 | """
83 | httperror = 500
84 | msg = "{message}"
85 |
86 | class TransportURLNotFound(MyBaseException):
87 | httperror = 404
88 | msg = "{url} not found"
89 |
90 | class NoContentException(MyBaseException):
91 | httperror = 404
92 | msg = "No content found"
93 |
94 | class MultihashError(MyBaseException):
95 | httperror = 500
96 | msg = "Multihash error {message}"
97 |
98 | class SearchException(MyBaseException):
99 | httperror = 404
100 | msg = "{search} not found"
101 |
102 | class TransportFileNotFound(MyBaseException):
103 | httperror = 404
104 | msg = "file {file} not found"
105 |
106 | """
107 |
108 | # Following are currently obsolete - not being used in Python or JS
109 |
110 | class PrivateKeyException(MyBaseException):
111 | #Raised when some code has not been implemented yet
112 | httperror = 500
113 | msg = "Operation requires Private Key, but only Public available."
114 |
115 | """
116 |
--------------------------------------------------------------------------------
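The MyBaseException docstring above describes the intended pattern: subclasses set `httperror` and a parameterised `msg`, and callers raise with keyword arguments that slot into the template. A tiny sketch (FooNotFound is hypothetical, not part of the codebase):

```python
from python.Errors import MyBaseException


class FooNotFound(MyBaseException):
    httperror = 404
    msg = "foo {name} not found"


try:
    raise FooNotFound(name="bar")
except FooNotFound as e:
    print(e.httperror, str(e))  # 404 foo bar not found
```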
/python/HashResolvers.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from .NameResolver import NameResolverFile
3 | from .miscutils import loads, dumps, httpget
4 | from .Errors import CodingException, NoContentException, ForbiddenException
5 | from .HashStore import LocationService, MimetypeService
6 | from .LocalResolver import LocalResolverFetch
7 | from .Multihash import Multihash
8 | from .DOI import DOIfile
9 | from .Archive import ArchiveItem, ArchiveFile
10 | from .config import config
11 |
12 |
13 | class HashResolver(NameResolverFile):
14 | """
15 | Base class for Sha1Hex and ContentHash - used where we are instantiating something of unknown type from a hash of some form.
16 |
17 | Sha1Hex & ContentHash are classes for retrieval by a hash
18 | typically of form sha1hex/1a2b3c for SHA1
19 |
20 | Implements name resolution of the ContentHash namespace, via a local store and any other internal archive method
21 |
22 | Future Work
23 | * Build way to preload the hashstore with the hashes and URLs from various parts of the Archive
24 | """
25 | namespace = None # Defined in subclasses
26 | multihashfield = None # Defined in subclasses
27 | archivefilemetadatafield = None # Defined in subclasses
28 |
29 | def __init__(self, namespace, hash, **kwargs):
30 | """
31 | Creates the object
32 |
33 | :param namespace: "contenthash"
34 | :param hash: Hash representing the object - format is specified by namespace
35 | :param kwargs: Any other args to the URL, ignored for now.
36 | """
37 | """
38 | Pseudo-code
39 | Looks up the multihash in Location Service to find where can be retrieved from, does not retrieve it.
40 | """
41 | verbose = kwargs.get("verbose")
42 | if verbose:
43 | logging.debug("{0}.__init__({1}, {2}, {3})".format(self.__class__.__name__, namespace, hash, kwargs))
44 | if namespace != self.namespace: # Defined in subclasses
45 | raise CodingException(message="namespace != "+self.namespace)
46 |         super(HashResolver, self).__init__(namespace, hash, **kwargs)  # NameResolver.__init__ ignores the extra args
47 | self.multihash = Multihash(**{self.multihashfield: hash})
48 | self.url = LocationService.get(self.multihash.multihash58, verbose) #TODO-FUTURE recognize different types of location, currently assumes URL
49 | #logging.debug("XXX@HashResolver.__init__ setting {} .url = {}".format(self.multihash.multihash58, self.url))
50 | self.mimetype = MimetypeService.get(self.multihash.multihash58, verbose) # Should be after DOIfile resolution, which will set mimetype in MimetypeService
51 | self._metadata = None # Not resolved yet
52 | self._doifile = None # Not resolved yet
53 |
54 | # noinspection PyMethodOverriding
55 | @classmethod
56 | def new(cls, namespace, hash, *args, **kwargs):
57 | """
58 | Called by ServerGateway to handle a URL - passed the parts of the remainder of the URL after the requested format,
59 |
60 | :param namespace:
61 | :param hash: hash or next part of name within namespace
62 | :param args: rest of path
63 | :param kwargs:
64 | :return:
65 | :raise NoContentException: if cant find content directly or via other classes (like DOIfile)
66 | """
67 | verbose = kwargs.get("verbose")
68 | if hash == HashFileEmpty.emptymeta[cls.archivefilemetadatafield]:
69 | return HashFileEmpty(verbose) # Empty file
70 | ch = super(HashResolver, cls).new(namespace, hash, *args, **kwargs) # By default (on NameResolver) calls cls() which goes to __init__
71 | if not ch.url:
72 | if verbose: logging.debug("No URL, looking on archive for {0}.{1}".format(namespace, hash))
73 | try:
74 | #!SEE-OTHERHASHES -this is where we look things up in the DOI.sql etc essentially cycle through some other classes, asking if they know the URL
75 | # ch = DOIfile(multihash=ch.multihash).url # Will fill in url if known. Note will now return a DOIfile, not a Sha1Hex
76 |                 return ch.searcharchivefor(verbose=verbose)  # Will now be an ArchiveFile
77 |             except NoContentException:
78 |                 pass  # If it isn't or can't be found on the archive, we can still check locally
79 | if not kwargs.get("nolocal") and (not ch.url or ch.url.startswith("local:")):
80 | ch = LocalResolverFetch.new("rawfetch", hash, **kwargs)
81 | if not (ch and ch.url):
82 | raise NoContentException()
83 | return ch
84 |
85 | def push(self, obj):
86 | """
87 | Add a Shard to a ContentHash -
88 | :return:
89 | """
90 | pass # Note could probably be defined on NameResolverFile class
91 |
92 | def retrieve(self, verbose=False, **kwargsx):
93 | """
94 |         Fetch the content and return it (typically called by NameResolver.content())
95 |
96 | :returns: content - i.e. bytes
97 | :raise: TransportFileNotFound, ForbiddenException, HTTPError if cant find url
98 | """
99 | # TODO-STREAMS future work to return a stream
100 | if not self.url:
101 | raise NoContentException()
102 | if self.url.startswith("local:"):
103 |             raise CodingException(message="Shouldn't get here, should convert to LocalResolver in HashResolver.new: {0}".format(self.url))
104 | """
105 | u = self.url.split('/')
106 | if u[1] == "rawfetch":
107 | assert(False) # hook to LocalResolver/rawfetch if need this
108 | else:
109 | raise CodingException(message="unsupported for local: {0}".format(self.url))
110 | """
111 | else:
112 | return httpget(self.url) # Err TransportFileNotFound or HTTPError
113 |
114 | def searcharchivefor(self, multihash=None, verbose=False, **kwargs):
115 | # Note this only works on certain machines
116 |         # And will return an ArchiveFile
117 |         # Multihash must already be sha1, or convertible into it.
118 | mh = multihash or self.multihash
119 | if mh.code != mh.SHA1: #Can only search on sha1 currently
120 | raise NoContentException()
121 | searchurl = config["archive"]["url_sha1search"] + (multihash or self.multihash).sha1hex
122 | res = loads(httpget(searchurl))
123 | #logging.info("XXX@searcharchivefor res={}".format(res))
124 | if res.get("error"):
125 | # {"error": "internal use only"}
126 |             raise ForbiddenException(what="SHA1 search from this machine unless its IP is whitelisted by Aaron: " + res.get("error"))
127 | if not res["hits"]["total"]:
128 | # {"key":"sha1","val":"88d4b0d91acd3c25139804afbf4aef4e675bef63","hits":{"total":0,"matches":[]}}
129 | raise NoContentException()
130 | # {"key": "sha1", "val": "88...2", "hits": {"total": 1, "matches": [{"identifier": [""],"name": [""]}]}}
131 | firstmatch = res["hits"]["matches"][0]
132 |         logging.info("ArchiveFile.new({},{},{})".format("archiveid", firstmatch["identifier"][0], firstmatch["name"][0]))
133 |         return ArchiveItem.new("archiveid", firstmatch["identifier"][0], firstmatch["name"][0], verbose=True)  # Note uses ArchiveItem because it needs to retrieve item-level metadata as well
134 |
135 | def content(self, verbose=False, **kwargs):
136 | """
137 | :returns: content - i.e. bytes
138 | """
139 | data = self.retrieve()
140 | if verbose: logging.debug("Retrieved doc size={}".format(len(data)))
141 | return {'Content-type': self.mimetype,
142 | 'data': data,
143 | }
144 |
145 | def metadata(self, headers=True, verbose=False, **kwargs):
146 | """
147 | :param verbose:
148 | :param headers: True if caller wants HTTP response headers
149 | :return:
150 | """
151 | #logging.info("XXX@HR.metadata m={}, u={}".format(self._metadata, self.url))
152 | if not self._metadata:
153 | try:
154 | if not self._doifile:
155 | self._doifile = DOIfile(multihash=self.multihash, verbose=verbose) # If not found, dont set url/metadata etc raises NoContentException
156 | self._metadata = self._metadata or (
157 | self._doifile and self._doifile.metadata(headers=False, verbose=verbose))
158 | except NoContentException as e:
159 | pass # Ignore absence of DOI file, try next
160 | if not self._metadata and self.url and self.url.startswith(config["archive"]["url_download"]):
161 | u = self.url[len(config["archive"]["url_download"]):].split('/') # [ itemid, filename ]
162 |             self._metadata = ArchiveItem.new("archiveid", *u).metadata(headers=False)  # Note will return an ArchiveFile since passing the filename
163 | mimetype = 'application/json' # Note this is the mimetype of the response, not the mimetype of the file
164 | return {"Content-type": mimetype, "data": self._metadata} if headers else self._metadata
165 |
166 | # def canonical - not needed as already in a canonical form
167 |
168 |
169 | class Sha1Hex(HashResolver):
170 | """
171 | URL: `/xxx/sha1hex/Q...` (forwarded here by ServerGateway methods)
172 | """
173 | namespace = "sha1hex"
174 | multihashfield = "sha1hex" # Field to Multihash.init
175 | archivefilemetadatafield = "sha1"
176 |
177 |
178 | class ContentHash(HashResolver):
179 | """
180 | URL: `/xxx/contenthash/Q...` (forwarded here by ServerGateway methods)
181 | """
182 | namespace = "contenthash"
183 | multihashfield = "multihash58" # Field to Multihash.init
184 |     archivefilemetadatafield = "multihash58"  # Not quite true - really combined into the URL for "contenthash", but this is used for detecting the empty hash
185 |
186 | class HashFileEmpty(HashResolver):
187 | # Catch special case of an empty file and deliver an empty file
188 | emptymeta = { # Example from archive.org/metadata/AboutBan1935/AboutBan1935.asr.srt
189 | "name": "emptyfile.txt",
190 | "source": "original",
191 | "format": "Unknown",
192 | "size": "0",
193 | "md5": "d41d8cd98f00b204e9800998ecf8427e",
194 | "crc32": "00000000",
195 | "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709",
196 | "contenthash": "contenthash:/contenthash/5dtpkBuw5TeS42SJSTt33HCE3ht4rC",
197 | "multihash58": "5dtpkBuw5TeS42SJSTt33HCE3ht4rC",
198 | }
199 |
200 | # noinspection PyMissingConstructor
201 | def __init__(self, verbose=False):
202 | # Intentionally not calling superclass's init.
203 | self.mimetype = "application/octet-stream"
204 |
205 | def retrieve(self, _headers=None, verbose=False, **kwargs):
206 |         # Return an empty file
207 |         return b''  # bytes, consistent with the other retrieve() implementations
208 |
209 | def metadata(self, headers=None, verbose=False, **kwargs):
210 | mimetype = 'application/json' # Note this is the mimetype of the response, not the mimetype of the file
211 | return {"Content-type": mimetype, "data": self.emptymeta } if headers else self.emptymeta
212 |
--------------------------------------------------------------------------------
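A hypothetical sketch of resolving content by SHA1 with the classes above; it assumes Redis is running for LocationService/MimetypeService. The da39... hash is the empty-file SHA1 from HashFileEmpty.emptymeta, which short-circuits in HashResolver.new():

```python
from python.HashResolvers import Sha1Hex

# Empty-file special case - returns a HashFileEmpty without touching Redis or the archive
empty = Sha1Hex.new("sha1hex", "da39a3ee5e6b4b0d3255bfef95601890afd80709")
print(empty.metadata(headers=False)["size"])  # "0"

# Any other sha1 consults LocationService, then falls back to the archive sha1
# search (whitelisted machines only), then to the local block store:
# f = Sha1Hex.new("sha1hex", "88d4b0d91acd3c25139804afbf4aef4e675bef63", verbose=True)
```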
/python/HashStore.py:
--------------------------------------------------------------------------------
1 | """
2 | Hash Store set of classes for storage and retrieval
3 | """
4 | import redis
5 | import logging
6 | from .Errors import CodingException
7 | from .TransportIPFS import TransportIPFS
8 | from .miscutils import loads, dumps
9 |
10 | class HashStore(object):
11 | """
12 | Superclass for key value storage, a shim around REDIS intended to be subclassed (see LocationService for example)
13 |
14 | Will tie to a REDIS database initially.
15 |
16 | Class Fields:
17 | _redis: redis object Redis Connection object once connection to redis once established,
18 |
19 | Fields:
20 | redisfield: string name of field in redis store being used.
21 |
22 | Class methods:
23 | redis() Initiate connection to redis or return already open one.
24 |
25 | Instance methods:
26 | hash_set(multihash, field, value, verbose=False) Set Redis.multihash.field to value
27 | hash_get(multihash, field, verbose=False) Retrieve value of Redis.multihash.field
28 |         set(multihash, value, verbose=False)                Set Redis.multihash.<redisfield> = value
29 |         get(multihash, verbose=False)                       Retrieve Redis.multihash.<redisfield>
30 |
31 | Delete and Push are not supported but could be if required.
32 |
33 |     Subclasses map as follows:
34 |
35 | Note Contenthash = multihash base58 of content (typically SHA1 on IA at present)
36 | itemid = archive's item id, e.g. "commute"
37 |
38 | Class StoredAt Maps To
39 | StateService __STATE__. field arbitraryvalue For global state
40 |     StateService        __STATE__.LastDHTround  number?        Used by cron_ipfs.py to track what's up next
41 | LocationService .location url As returned by rawstore or url of content on IA
42 | MimetypeService .mimetype mimetype
43 | IPLDService Not used currently
44 | IPLDHashService .ipld IPFS hash e.g. Q123 or z123 (hash of the IPLD)
45 | ThumbnailIPFSfromItemIdService .thumbnailipfs ipfsurl e.g. ipfs:/ipfs/Q1…
46 | MagnetLinkService bits:.magnetlink magnetlink
47 | MagnetLinkService archived:.magnetlink magnetlink
48 |     TitleService        archived:.title       title            Used to map collection items to their titles (caches search query)
49 | """
50 |
51 | _redis = None # Will be connected to a redis instance by redis()
52 | redisfield = None # Subclasses define this, and use set & get
53 |
54 | @classmethod
55 | def redis(cls):
56 | if not HashStore._redis:
57 | logging.debug("HashStore connecting to Redis")
58 | HashStore._redis = redis.StrictRedis( # Note uses HashStore cos this connection is shared across subclasses
59 | host="localhost",
60 | port=6379,
61 | db=0,
62 | decode_responses=True
63 | )
64 | return HashStore._redis
65 |
66 | def __init__(self):
67 | raise CodingException(message="It is meaningless to instantiate an instance of HashStore, its all class methods")
68 |
69 | @classmethod
70 | def hash_set(cls, multihash, field, value, verbose=False):
71 | """
72 | :param multihash:
73 | :param field:
74 | :param value:
75 | :return:
76 | """
77 | if verbose: logging.debug("Hash set: {0} {1}={2}".format(multihash, field, value))
78 | cls.redis().hset(multihash, field, value)
79 |
80 | @classmethod
81 | def hash_get(cls, multihash, field, verbose=False):
82 | """
83 |
84 | :param multihash:
85 | :param field:
86 | :return:
87 | """
88 | res = cls.redis().hget(multihash, field)
89 | if verbose: logging.debug("Hash found: {0} {1}={2}".format(multihash, field, res))
90 | return res
91 |
92 | @classmethod
93 | def set(cls, multihash, value, verbose=False):
94 | """
95 |
96 | :param multihash:
97 | :param value: What we want to store in the redisfield
98 | :return:
99 | """
100 | return cls.hash_set(multihash, cls.redisfield, value, verbose)
101 |
102 | @classmethod
103 | def get(cls, multihash, verbose=False):
104 | """
105 |
106 | :param multihash:
107 | :return: string stored in Redis
108 | """
109 | return cls.hash_get(multihash, cls.redisfield, verbose)
110 |
111 |
112 | @classmethod
113 | def archiveidget(cls, itemid, verbose=False):
114 | return cls.get("archiveid:"+itemid)
115 |
116 | @classmethod
117 | def archiveidset(cls, itemid, value, verbose=False):
118 | return cls.set("archiveid:" + itemid, value)
119 |
120 | @classmethod
121 | def btihget(cls, btihhash, verbose=False):
122 | return cls.get("btih:"+btihhash)
123 |
124 | @classmethod
125 | def btihset(cls, btihhash, value, verbose=False):
126 | return cls.set("btih:"+btihhash, value)
127 |
128 | class StateService(HashStore):
129 | """
130 | Store some global state for the server
131 |
132 | Field Value Means
133 | LastDHTround ?? Used by cron_ipfs.py to record which part of hash table it last worked on
134 | """
135 |
136 | @classmethod
137 | def set(cls, field, value, verbose=False):
138 | """
139 | Store to global state
140 | field: Name of field to store
141 | value: Content to store
142 | """
143 | return cls.hash_set("__STATE__", field, dumps(value), verbose)
144 |
145 | @classmethod
146 | def get(cls, field, verbose=False):
147 | """
148 |         Retrieve from global state
149 | :param field:
150 | :return: string stored in Redis
151 | """
152 | res = cls.hash_get("__STATE__", field, verbose)
153 | if res is None:
154 | return None
155 | else:
156 | return loads(res)
157 |
158 | class LocationService(HashStore):
159 | """
160 | OLD NOTES
161 | Maps hashes to locations
162 | * set(multihash, location)
163 | * get(multihash) => url (currently)
164 | * Consumes: Hashstore
165 | * ConsumedBy: DOI Name Resolver
166 |
167 | The multihash represents a file or a part of a file. Build upon hashstore.
168 | It is split out because this could be a useful service on its own.
169 | """
170 | redisfield = "location"
171 |
172 |
173 | class MimetypeService(HashStore):
174 | # Maps contenthash58 to mimetype
175 | redisfield = "mimetype"
176 |
177 |
178 | class IPLDService(HashStore):
179 | # TODO-IPFS may need to move this to ContentStore (which needs implementing)
180 | # Note this doesnt appear to be used except by IPLDFile/IPLDdir which themselves arent used
181 | redisfield = "ipld"
182 |
183 |
184 | class IPLDHashService(HashStore):
185 | # Maps contenthash58 to IPLD's multihash CIDv0 or CIDv1
186 | redisfield = "ipldhash"
187 |
188 | class ThumbnailIPFSfromItemIdService(HashStore):
189 | # Maps itemid to IPFS URL (e.g. ipfs:/ipfs/Q123...)
190 | redisfield = "thumbnailipfs"
191 |
192 | class MagnetLinkService(HashStore):
193 | # uses archiveidset/get
194 | redisfield = "magnetlink"
195 |
196 | class TitleService(HashStore):
197 |     # Cache collection names, they don't change often enough to worry
198 |     # uses archiveidset/get
199 |     # TODO-REDIS note this is caching forever, which is generally a bad idea! Should figure out how to make Redis expire this cache every few days
200 | redisfield = "title"
201 |
--------------------------------------------------------------------------------
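A round-trip sketch of the set/get shim, assuming a local Redis on the default port as wired up in HashStore.redis(); the multihash58 is the empty-file value from HashResolvers, while the URL and mimetype are invented examples:

```python
from python.HashStore import LocationService, MimetypeService

mh58 = "5dtpkBuw5TeS42SJSTt33HCE3ht4rC"  # empty-file multihash58, from HashFileEmpty.emptymeta

# Stores HSET <mh58> location <url> and HSET <mh58> mimetype <type> in Redis
LocationService.set(mh58, "https://archive.org/download/commute/commute.avi", verbose=True)
MimetypeService.set(mh58, "video/x-msvideo")

assert LocationService.get(mh58) == "https://archive.org/download/commute/commute.avi"
print(MimetypeService.get(mh58))  # video/x-msvideo
```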
/python/LocalResolver.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from .TransportLocal import TransportLocal
3 | from .NameResolver import NameResolverFile
4 | from .HashStore import LocationService
5 | from .Multihash import Multihash
6 | from .miscutils import loads, dumps
7 | from .Errors import TransportFileNotFound
8 |
9 | #TODO add caching to headers returned so not repeatedly pinged for same file
10 |
11 | class LocalResolver(NameResolverFile):
12 | """
13 | Subclass of NameResolverFile to resolve hashes locally
14 |
15 | Attributes:
16 | _contenthash Multihash of content
17 |
18 | Supports
19 | contenthash via NameResolver default
20 | """
21 |
22 | @classmethod
23 | def new(cls, namespace, *args, **kwargs): # Used by Gateway
24 | if kwargs.get("verbose"):
25 | logging.debug("{0}.new namespace={1} args={2} kwargs={3}"
26 | .format(cls.__name__, namespace, args, kwargs))
27 | return super(LocalResolver, cls).new(namespace, *args, **kwargs) # Calls __init__() by default
28 |
29 | @staticmethod
30 | def transport(verbose=False):
31 | return TransportLocal(options={"local": {"dir": ".cache"}},
32 | verbose=verbose) # TODO-LOCAL move to options at higher level
33 |
34 | class LocalResolverStore(LocalResolver):
35 |
36 | @classmethod
37 | def new(cls, namespace, *args, **kwargs): # Used by Gateway
38 | verbose = kwargs.get("verbose")
39 | obj = super(LocalResolverStore, cls).new(namespace, *args, **kwargs) # Calls __init__() by default
40 | res = cls.transport(verbose=verbose).rawstore(data=kwargs["data"], returns="contenthash,url")
41 |         obj._contenthash = res["contenthash"]  # Returned via contenthash() in NameResolver
42 |         obj.url = res["url"]  # TODO-LOCAL this is going to be wrong - it's currently local:/rawfetch/Q...
43 | LocationService.set(obj._contenthash.multihash58, obj.url, verbose=verbose) # Let LocationService know we have it locally
44 | return obj
45 |
46 |
47 |
48 | class LocalResolverFetch(LocalResolver):
49 | @classmethod
50 | def new(cls, namespace, *args, **kwargs): # Used by Gateway
51 | verbose = kwargs.get("verbose")
52 | obj = super(LocalResolverFetch, cls).new(namespace, *args, **kwargs) # Calls __init__() by default
53 | obj._contenthash = Multihash(multihash58=args[0])
54 | # Not looking up URL in LocationService yet, will look up if needed
55 | # Not fetching data, will be retrieved by content() method etc
56 | obj.url = cls.transport(verbose).url(multihash=obj._contenthash)
57 | return obj
58 |
59 | @property
60 | def mimetype(self):
61 | return "application/octet-stream" # By default we don't know what it is #TODO-LOCAL look up in MimetypeService just in case ...
62 |
63 | def retrieve(self, verbose=False, **kwargs):
64 | try:
65 | return self.transport(verbose=verbose).rawfetch(multihash=self._contenthash)
66 | except TransportFileNotFound as e1: # Not found in block store, lets try contenthash
67 | logging.debug("LocalResolverFetch.retrieve: err={}".format(e1))
68 | try:
69 | from .HashResolvers import ContentHash # Avoid a circular reference
70 | contenthash = self._contenthash.multihash58
71 | logging.debug("LocalResolverFetch.retrieve falling back to contenthash: {}".format(contenthash))
72 | return ContentHash.new("contenthash", contenthash, verbose=verbose, nolocal=True).retrieve(verbose=verbose)
73 | except Exception as e:
74 | logging.debug("Fallback failed, raising original error")
75 | raise e1
76 |
77 | class LocalResolverAdd(LocalResolver):
78 |
79 | @classmethod
80 | def new(cls, namespace, url, *args, data=None, **kwargs): # Used by Gateway
81 | verbose = kwargs.get("verbose")
82 | obj = super(LocalResolverAdd, cls).new(namespace, *args, **kwargs) # Calls __init__() by default
83 | if isinstance(data, (str, bytes)): # Assume its JSON
84 | data = loads(data) # HTTP just delivers bytes
85 | cls.transport(verbose=verbose).rawadd(url, data)
86 | return obj
87 |
88 | class LocalResolverList(LocalResolver):
89 |
90 | @classmethod
91 | def new(cls, namespace, hash, *args, data=None, **kwargs): # Used by Gateway
92 | verbose = kwargs.get("verbose")
93 | obj = super(LocalResolverList, cls).new(namespace, hash, *args, **kwargs) # Calls __init__() by default
94 | obj._contenthash = Multihash(multihash58=hash)
95 | return obj
96 |
97 | def metadata(self, headers=True, verbose=False, **kwargs):
98 | data = self.transport(verbose=verbose).rawlist(self._contenthash.multihash58, verbose=verbose)
99 |         mimetype = 'application/json'
100 | return {"Content-type": mimetype, "data": data} if headers else data
101 |
102 | class KeyValueTable(LocalResolver):
103 | @classmethod
104 | def new(cls, namespace, database, table, *args, data=None, **kwargs): # Used by Gateway
105 | verbose = kwargs.get("verbose")
106 | obj = super(KeyValueTable, cls).new(namespace, database, table, *args, **kwargs) # Calls __init__() by default
107 | obj.database = database
108 | obj.table = table
109 | #obj.data = data # For use by command
110 | #obj.args = args
111 | #obj.kwargs = kwargs # Esp key=a or key=[a,b,c]
112 | return obj
113 |
114 | def set(self, verbose=False, headers=False, data=None, **kwargs): # set/table/
115 | #TODO check pubkey or have transport do it - and save with it
116 | if isinstance(data, (str, bytes)): # Assume its JSON
117 | data = loads(data) # HTTP just delivers bytes
118 | self.transport(verbose=verbose).set(database=self.database, table=self.table, keyvaluelist=data, value=None, verbose=verbose)
119 |
120 | def get(self, verbose=False, headers=False, **kwargs): # set/table/
121 | # TODO check pubkey or have transport do it - and save with it
122 | res = self.transport(verbose=verbose).get(database=self.database, table=self.table, keys = kwargs["key"] if isinstance(kwargs["key"], list) else [kwargs["key"]], verbose=verbose)
123 | return { "Content-type": "application/json", "data": res} if headers else res
124 |
125 |
126 | def delete(self, verbose=False, headers=False, **kwargs): # set/table/
127 | # TODO check pubkey or have transport do it - and save with it
128 | self.transport(verbose=verbose).delete(database=self.database, table=self.table, keys = kwargs["key"] if isinstance(kwargs["key"], list) else [kwargs["key"]], verbose=verbose)
129 |
130 | def keys(self, verbose=False, headers=False, **kwargs): # set/table/
131 | # TODO check pubkey or have transport do it - and save with it
132 | res = self.transport(verbose=verbose).keys(database=self.database, table=self.table, verbose=verbose)
133 | return { "Content-type": "application/json", "data": res} if headers else res
134 |
135 | def getall(self, verbose=False, headers=False, **kwargs): # set/table/
136 | # TODO check pubkey or have transport do it - and save with it
137 | res = self.transport(verbose=verbose).getall(database=self.database, table=self.table, verbose=verbose)
138 | return { "Content-type": "application/json", "data": res} if headers else res
139 |
140 |
141 |
--------------------------------------------------------------------------------
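A hypothetical sketch of the KeyValueTable flow above; the namespace, database and table names are invented, and the JSON body shape for set() is an assumption about what TransportLocal expects. Assumes the .cache directory used by transport() is writable:

```python
from python.LocalResolver import KeyValueTable
from python.miscutils import dumps

kvt = KeyValueTable.new("table", "exampledb", "exampletable", verbose=True)

# set() loads() a JSON body and hands it to the transport as a keyvaluelist
kvt.set(data=dumps([{"key": "colour", "value": "blue"}]))

print(kvt.get(key="colour"))                  # bare result from the transport
print(kvt.get(key=["colour"], headers=True))  # {"Content-type": "application/json", "data": ...}
print(kvt.keys())                             # all keys in the table
```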
/python/Multihash.py:
--------------------------------------------------------------------------------
1 | """
2 | A set of classes to hold different kinds of hashes etc and convert between them,
3 |
4 | Much of this was adapted from https://github.com/tehmaze/python-multihash,
5 | which seems to have evolved from the pip3 multihash, which is seriously broken.
6 | """
7 |
8 | import hashlib
9 | import struct
10 | import sha3
11 | import pyblake2
12 | import base58
13 | import binascii
14 | import logging
15 |
16 | from sys import version as python_version
17 | if python_version.startswith('3'):
18 | from urllib.parse import urlparse
19 | else:
20 | from urlparse import urlparse # See https://docs.python.org/2/library/urlparse.html
21 | from .Errors import MultihashError
22 |
23 | class Multihash(object):
24 | """
25 | Superclass for all kinds of hashes, this is for convenience in passing things around between some places that want binary, or
26 | multihash or hex.
27 |
28 | core storage is as a multihash_binary i.e. [ code, length, digest...]
29 |
30 | Each instance:
31 |         code = SHA1, SHA256 etc (uses integer conventions from multihash)
32 | """
33 |
34 | # Constants
35 | # 0x01..0x0F are app specific (unused)
36 | SHA1 = 0x11
37 | SHA2_256 = 0x12
38 | SHA2_512 = 0x13
39 | SHA3 = 0x14
40 | BLAKE2B = 0x40
41 | BLAKE2S = 0x41
42 |
43 | FUNCS = {
44 | SHA1: hashlib.sha1,
45 | SHA2_256: hashlib.sha256,
46 | # Alternative use nacl.hash.sha256(data, encoder=nacl.encoding.RawEncoder) which has different footprint
47 | SHA2_512: hashlib.sha512,
48 | SHA3: lambda: hashlib.new('sha3_512'),
49 | BLAKE2B: lambda: pyblake2.blake2b(),
50 | BLAKE2S: lambda: pyblake2.blake2s(),
51 | }
52 | LENGTHS = {
53 | SHA1: 20,
54 | SHA2_256: 32,
55 | SHA2_512: 64,
56 | SHA3: 64,
57 | BLAKE2B: 64,
58 | BLAKE2S: 32,
59 | }
60 |
61 | def assertions(self, code=None):
62 | if code and code != self.code:
63 | raise MultihashError(message="Expecting code {}, got {}".format(code, self.code))
64 | if self.code not in self.FUNCS:
65 | raise MultihashError(message="Unsupported Hash type {}".format(self.code))
66 | if (self.digestlength != len(self.digest)) or (self.digestlength != self.LENGTHS[self.code]):
67 | raise MultihashError(message="Invalid lengths: expect {}, byte {}, len {}"
68 | .format(self.LENGTHS[self.code], self.digestlength, len(self.digest)))
69 |
70 |     def __init__(self, multihash58=None, sha1hex=None, data=None, code=None, url=None):
71 |         """
72 |         Accept a variety of parameters; specify the hash one way:
73 |         multihash58 (base58 multihash), sha1hex (hex SHA1 digest), data plus code (content to hash),
74 |         or url (whose last path component is a base58 multihash).
75 |         """
76 | digest = None
77 |
78 | if url: # Assume its of the form somescheme:/somescheme/Q...
79 | logging.debug("url={} {}".format(url.__class__.__name__,url))
80 | if isinstance(url, str) and "/" in url: # https://.../Q...
81 | url = urlparse(url)
82 | if not isinstance(url, str):
83 | multihash58 = url.path.split('/')[-1]
84 | else:
85 | multihash58 = url
86 | if multihash58[0] not in ('5','Q'): # Simplistic check that it looks ok-ish
87 | raise MultihashError(message="Invalid hash portion of URL {}".format(multihash58))
88 | if multihash58:
89 | self._multihash_binary = base58.b58decode(multihash58)
90 | if sha1hex:
91 | if python_version.startswith('2'):
92 | digest = sha1hex.decode('hex') # Python2
93 | else:
94 | digest = bytes.fromhex(sha1hex) # Python3
95 | code = self.SHA1
96 | if data and code:
97 | digest = self._hash(code, data)
98 | if digest and code:
99 | self._multihash_binary = bytearray([code, len(digest)])
100 | self._multihash_binary.extend(digest)
101 | self.assertions() # Check consistency
102 |
103 | def _hash(self, code, data):
104 | if not code in self.FUNCS:
105 | raise MultihashError(message="Cant encode hash code={}".format(code))
106 | hashfn = self.FUNCS.get(code)() # Note it calls the function in that strange way hashes work!
107 | if isinstance(data, bytes):
108 | hashfn.update(data)
109 | elif isinstance(data, str):
110 | # In Python 3 this is ok, would be better if we were sure it was utf8
111 | # raise MultihashError(message="Should be passing bytes, not strings as could encode multiple ways") # TODO can remove this if really need to handle UTF8 strings, but better to push conversion upstream
112 | hashfn.update(data.encode('utf-8'))
113 | return hashfn.digest()
114 |
115 | def check(self, data):
116 |         assert self.digest == self._hash(self.code, data), "Hash doesn't match expected"
117 |
118 | @property
119 | def code(self):
120 | return self._multihash_binary[0]
121 |
122 | @property
123 | def digestlength(self):
124 | return self._multihash_binary[1]
125 |
126 | @property
127 | def digest(self):
128 | """
129 | :return: bytes, the digest part of any multihash
130 | """
131 | return self._multihash_binary[2:]
132 |
133 | @property
134 | def sha1hex(self):
135 | """
136 | :return: The hex of the sha1 (as used in DOI sqlite tables)
137 | """
138 | self.assertions(self.SHA1)
139 |         return binascii.hexlify(self.digest).decode('utf-8')  # The decode turns bytes b'a1b2' into str 'a1b2'
140 |
141 | @property
142 | def multihash58(self):
143 |         foo = base58.b58encode(bytes(self._multihash_binary))  # Documentation says it returns bytes, on Mac it returns str; we want str
144 | if isinstance(foo,bytes):
145 | return foo.decode('ascii')
146 | else:
147 | return foo
--------------------------------------------------------------------------------
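A round-trip sketch tying the hash forms together; the expected base58 value for the empty-file SHA1 is the one recorded in HashResolvers.HashFileEmpty.emptymeta:

```python
from python.Multihash import Multihash

# hex SHA1 digest -> base58 multihash
m = Multihash(sha1hex="da39a3ee5e6b4b0d3255bfef95601890afd80709")
assert m.code == Multihash.SHA1
assert m.multihash58 == "5dtpkBuw5TeS42SJSTt33HCE3ht4rC"

# and back: base58 multihash -> hex digest
assert Multihash(multihash58=m.multihash58).sha1hex == "da39a3ee5e6b4b0d3255bfef95601890afd80709"

# hashing raw data directly (data must be non-empty - note the `if data and code` test in __init__)
m3 = Multihash(data=b"hello", code=Multihash.SHA1)
assert Multihash(multihash58=m3.multihash58).sha1hex == m3.sha1hex
```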
/python/NameResolver.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import requests
3 | from urllib.parse import urlparse
4 | from .Errors import ToBeImplementedException, NoContentException, IPFSException, CodingException
5 | from .Multihash import Multihash
6 | from .HashStore import LocationService, MimetypeService, IPLDHashService
7 | from .config import config
8 | from .miscutils import httpget
9 | from .TransportIPFS import TransportIPFS
10 |
11 |
12 |
13 | class NameResolver(object):
14 | """
15 | The NameResolver group of classes manage recognizing a name, and connecting it to resources
16 | we have at the Archive.
17 |
18 | These are base classes for specific name resolvers like DOI
19 |
20 | it specifies a set of methods we expect to be able to do on a subclass,
21 | and may have default code for some of them based on assumptions about the data structure of subclasses.
22 |
23 | Each subclass of NameResolver must support:
24 | content() Generate an output to return to a browser. (Can be a dict, array or string, later will add Streams) (?? Not sure if should implement content for dirs)
25 |
26 | Each subclass of NameResolver can provide, but can also use default:
27 | contenthash() The hash of the content
28 |
29 | Logically it can represent one or multiple files depending on subclass
30 |
31 | Attributes reqd:
32 | name: Name of the object being retrieved (short string)
33 | namespace: Store the namespace here.
34 |
35 | A subclass can have any meta-data fields, recommended ones include.
36 | contentSize: The size of the content in brief (compatible with Schema.org, not compatible with standard Archive metadata)
37 | contentType: The mime-type of the content, (TODO check against schema.org), not compatible with standard Archive metadata which uses three letter types like PNG
38 | """
39 |
40 |     def __init__(self, namespace, *args, **kwargs):  # Careful if changing - note it's the default __init__ for NameResolverDir, NameResolverFile, NameResolverSearch etc
41 |         self._list = []
42 |         self._contenthash = None  # Checked lazily by contenthash() and contenturl()
43 | @classmethod
44 | def new(cls, namespace, *args, **kwargs):
45 | """
46 | Default creation of new obj, returns None if not found (to allow multiple attempts to instantiate)
47 |
48 | :param namespace:
49 | :param args:
50 | :param kwargs:
51 | :return:
52 | """
53 | try:
54 | return cls(namespace, *args, **kwargs)
55 | except NoContentException:
56 | return None
57 |
58 | def retrieve(self, _headers=None, verbose=False, **kwargs):
59 | """
60 |
61 | :return:
62 | """
63 | raise ToBeImplementedException(name=self.__class__.__name__+".retrieve()")
64 |
65 | def content(self, _headers=None, verbose=False, **kwargs):
66 | """
67 | Return the content, by default its just the result of self.retrieve() which must be defined in superclass
68 | Requires mimetype to be set in subclass
69 |
70 | :param verbose:
71 | :return:
72 | """
73 | return {"Content-type": self.mimetype, "data": self.retrieve(_headers=_headers)}
74 |
75 | def metadata(self, verbose=False, **kwargs):
76 | """
77 |
78 | :return:
79 | """
80 | raise ToBeImplementedException(name=self.__class__.__name__+".metadata()")
81 |
82 | def contenthash(self, verbose=False):
83 | """
84 | By default contenthash is the hash of the content.
85 |
86 | :return:
87 | """
88 | if not self._contenthash:
89 |             self._contenthash = Multihash(data=self.content()["data"], code=Multihash.SHA2_256)  # content() returns a dict; hash its data
90 | return {'Content-type': 'text/plain',
91 | 'data': self._contenthash.multihash58
92 | }
93 |
94 | def contenturl(self, verbose=False):
95 | """
96 |         URL on this gateway for retrieving the content by its hash.
97 |
98 | :return:
99 | """
100 | if not self._contenthash:
101 |             self._contenthash = Multihash(data=self.content()["data"], code=Multihash.SHA2_256)  # content() returns a dict; hash its data
102 | return {'Content-type': 'text/plain',
103 | 'data': "https://dweb.me/contenthash/"+self._contenthash.multihash58, # TODO parameterise server name, maybe store from incoming URL
104 | }
105 |
106 | def push(self, obj):
107 | """
108 | Add a NameResolverShard to a NameResolverFile or a NameResolverFile to a NameResolverDir - in both cases on _list field
109 |         Doesn't check the class of the added object, to allow for a variety of nested constructs.
110 |
111 | :param obj: NameResolverShard, NameResolverFile, or NameResolverDir
112 | :return:
113 | """
114 | self._list.append(obj)
115 |
116 | @classmethod
117 | def canonical(cls, namespace, *args, **kwargs):
118 | """
119 |         If this method isn't subclassed, then it's already a canonical form, so return the args joined with slashes
120 |
121 | :param cls:
122 | :param namespace:
123 | :param [args]: List of arguments to URL
124 | :return: Concatenated args with / by default (subclasses will override)
125 | """
126 |         return namespace, '/'.join(args)  # By default reconcatenate the args with /
127 |
128 |
129 | class NameResolverDir(NameResolver):
130 |
131 | """
132 | Represents a set of files,
133 |
134 | Attributes:
135 | _list: Hold data for a list of files (NameResolverFile) in the directory.
136 | files(): An iterator over _list - returns NameResolverFile
137 | name: Name of the directory
138 | """
139 | def files(self):
140 | return self._list
141 |
142 |
143 | class NameResolverFile(NameResolver):
144 | """
145 | Represents a single file, and its shards,
146 | It contains enough info for retrieval of the file e.g. HTTP URL, or server and path. Also can have byterange,
147 |
148 | Attributes:
149 | _list: Hold data for a list of shards in this file.
150 | shards(): An iterator over _list
151 | See NameResolver for other metadata fields
152 |
153 | TODO - define fields for location & byterange
154 |
155 | Any other field can be used as namespace specific metadata
156 | """
157 |     shardsize = 256000  # A default for shard size, TODO-IPLD determine best size; subclasses can override, or ignore for things like video.
158 |
159 | def shards(self):
160 | """
161 | Return an iterator that returns each of the NameResolverShard in the file's _list attribute.
162 | * Each time called, should:
163 | * read next `shardsize` bytes from content (either from a specific byterange, or by reading from an open stream)
164 | * Pass that through multihash58 service to get a base58 multihash
165 | * Return that multihash, plus metadata (size may be all required)
166 | * Store the mapping between that multihash, and location (inc byterange) in locationstore
167 | * May Need to cache the structure, but since the IPLD that calls this will be cached, that might not be needed.
168 | """
169 | raise ToBeImplementedException(name="NameResolverFile.shards")
170 |
171 |
172 | def cache_ipfs(self, url=None, data=None, forceurlstore=False, forceadd=False, printlog=False, announcedht=False, size=None, verbose=False ):
173 | """
174 | Cache in IPFS, will automatically select no action, urlstore or add unless constrained by forcexxx
175 | Before doing this, should have checked if IPLDHashService can return the hash already
176 |
177 | :param url: # If present is the url of the file
178 | :param data: # If present is the data for the file
179 | :param forceurlstore: # Override default and use urlstore
180 | :param forceadd: # Override default and use add
181 |         :raises: # IPFSException if IPFS is failing
182 | :return: # IPLDhash
183 |
184 | Logical combinations of arguments attempt to get the "right" result.
185 | forceurlstore && url => urlstore
186 | forceurlstore && !url => error
187 | forceadd && data => add
188 | forceadd && !data && url => fetch data then add
189 |         url && data && !forceurlstore && !forceadd => default to urlstore (ignore data)
190 | """
191 | # TODO-PERMS, this cant be caching to IPFS if dont have permission
192 | if not config["ipfs"].get("url_urlstore"): # If not running on machine with urlstore
193 | forceadd = True
194 |         if url and forceadd:  # To "add" from a URL we need to retrieve the data first
195 | (data, self.mimetype) = httpget(url, wantmime=True)
196 | if not self.multihash: # Since we've got the data, we can compute SHA1 from it
197 | if verbose: logging.debug("Computing SHA1 hash of url {}".format(url))
198 | self.multihash = Multihash(data=data, code=Multihash.SHA1)
199 | # Since we retrieved mimetype we can save it, since not set in metadata
200 | MimetypeService.set(self.multihash.multihash58, self.mimetype, verbose=verbose)
201 | if (url and not forceadd):
202 | did = "urlstore"
203 |             ipldurl = TransportIPFS().store(urlfrom=url, pinggateway=False, verbose=verbose)  # Can throw IPFSException
204 | elif data: # Either provided or fetched from URL
205 | did = "add"
206 | ipldurl = TransportIPFS().store(data=data, pinggateway=False, mimetype=self.mimetype, verbose=verbose)
207 | else:
208 | raise errors.CodingException(message="Invalid options to cache_ipfs forceurlstore={} forceadd={} url={} data len={}"\
209 | .format(forceurlstore, forceadd, url, len(data) if data else 0))
210 | # Each of the successful routes through above leaves us with ipldurl
211 | ipldhash = urlparse(ipldurl).path.split('/')[2]
212 | if announcedht:
213 | TransportIPFS().announcedht(ipldhash) # Let DHT know - dont wait for up to 10 hours for next cycle
214 | IPLDHashService.set(self.multihash.multihash58, ipldhash)
215 | #("URL", "Add/Urlstore", "Hash", "Size", "Announced")
216 |         if size and data and (len(data) != size):
217 |             size = "{}!={}".format(size, len(data))
218 |         if printlog: print('"{}","{}","{}","{}","{}"'.format(url, did, ipldhash, size, announcedht))  # CSV log line, only when requested
219 | return ipldhash
220 |
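    # Usage sketch (hedged - assumes a NameResolverFile subclass instance "f" with self.multihash
    # already set, and a reachable IPFS daemon; the URL is illustrative):
    #   ipldhash = f.cache_ipfs(url="https://archive.org/download/item/file.pdf", announcedht=False)
    # Uses "urlstore" when config["ipfs"]["url_urlstore"] is configured and a url is given,
    # otherwise fetches the data (if needed) and falls back to "add".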
221 |
222 | def cache_content(self, url, wantipfs=False, verbose=False):
223 | """
224 | Retrieve content from a URL, cache it in various places especially IPFS, and set tables so can be retrieved by contenthash
225 |
226 | Requires multihash to be set prior to this, if required it could be set from the retrieved data
227 |         Call path is ArchiveFile.metadata > ArchiveFile.cache_content > NameResolverFile.cache_content
228 |
229 | :param url: URL - typically inside archive.org of contents
230 |         :param wantipfs: True if the content should also be cached in IPFS
231 |                          (i.e. pushed via cache_ipfs and the result recorded in IPLDHashService)
232 | :param verbose:
233 | :return:
234 | """
235 | ipldhash = self.multihash and IPLDHashService.get(self.multihash.multihash58) # May be None, we don't know it
236 | if ipldhash:
237 | self.mimetype = MimetypeService.get(self.multihash.multihash58, verbose=verbose)
238 | ipldhash = IPLDHashService.get(self.multihash.multihash58, verbose=verbose)
239 | else:
240 | if wantipfs:
241 | #TODO could check sha1 here, but would be slow
242 | #TODO-URLSTORE delete old cache
243 | #TODO-URLSTORE - check dont need mimetype
244 | if not self.multihash:
245 | (data, self.mimetype) = httpget(url, wantmime=True) # SLOW - retrieval
246 | if verbose: logging.debug("Computing SHA1 hash of url {}".format(url))
247 | self.multihash = Multihash(data=data, code=Multihash.SHA1)
248 | ipldhash = self.multihash and IPLDHashService.get(self.multihash.multihash58) # Try again now have hash
249 | MimetypeService.set(self.multihash.multihash58, self.mimetype, verbose=verbose)
250 |             if not ipldhash:  # We might have got it now, especially for _files.xml if unchanged
251 | # Can throw IPFSException - ignore it
252 | try:
253 | # TODO-PERMS cache_ipfs should be checking permissions
254 |                     ipldhash = self.cache_ipfs(url=url, verbose=verbose, announcedht=False)  # Not announcing to DHT here - it's too slow (16+ seconds); better to let the first client fail, try the gateway, fail again, and have subsequent requests work.
255 | if verbose: logging.debug("ipfs pushed to: {}".format(ipldhash))
256 | except IPFSException as e:
257 |                     pass    # Ignore it - won't have an ipldhash, but usually don't care
258 | if self.multihash:
259 | LocationService.set(self.multihash.multihash58, url, verbose=verbose)
260 | return {"ipldhash": ipldhash}
261 |
262 |
263 | class NameResolverShard(NameResolver):
264 | """
265 | Represents a single shard returned by a NameResolverFile.shards() iterator
266 | Holds enough info to do a byte-range retrieval of just those bytes from a server,
267 | And a multihash that could be retrieved by IPFS for just this shard.
268 | """
269 | pass
270 |
271 |
272 | class NameResolverSearchItem(NameResolver):
273 | """
274 | Represents each element in a search
275 | """
276 | pass
277 |
278 |
279 | class NameResolverSearch(NameResolver):
280 | """
281 | Represents the results of a search
282 | """
283 | pass
284 |
--------------------------------------------------------------------------------
/python/OutputFormat.py:
--------------------------------------------------------------------------------
1 | class OutputFormat(object):
2 | pass
3 |
4 |
--------------------------------------------------------------------------------
/python/ServerBase.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import logging
3 | from .miscutils import dumps # Use our own version of dumps - more compact and handles datetime etc
4 | from json import loads # Not our own loads since dumps is JSON compliant
5 | from sys import version as python_version
6 | from cgi import parse_header, parse_multipart
7 | #from Dweb import Dweb # Import Dweb library (won't use for Academic project)
8 | #TODO-API needs writing up
9 | import html
10 | from http import HTTPStatus
11 | from .config import config
12 |
13 | """
14 | This file is intended to be application independent, i.e. not dependent on the Dweb library
15 | """
16 |
17 | if python_version.startswith('3'):
18 | from urllib.parse import parse_qs, parse_qsl, urlparse, unquote
19 | from http.server import BaseHTTPRequestHandler, HTTPServer
20 | from socketserver import ThreadingMixIn
21 | else: # Python 2
22 | from urlparse import parse_qs, parse_qsl, urlparse # See https://docs.python.org/2/library/urlparse.html
23 | from urllib import unquote
24 | from SocketServer import ThreadingMixIn
25 | import threading
26 | from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
27 | # See https://docs.python.org/2/library/basehttpserver.html for docs on how servers work
28 | # also /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/BaseHTTPServer.py for good error code list
29 |
30 | import traceback
31 |
32 | from .Errors import MyBaseException, ToBeImplementedException, TransportFileNotFound
33 | #from Transport import TransportBlockNotFound, TransportFileNotFound
34 | #from TransportHTTP import TransportHTTP
35 |
36 | class HTTPdispatcherException(MyBaseException):
37 | httperror = 400 # Unimplemented
38 | msg = "HTTP request {req} not recognized"
39 |
40 | class HTTPargrequiredException(MyBaseException):
41 | httperror = 400 # UnimplementedAccess
42 | msg = "HTTP request {req} requires {arg}"
43 |
44 | class DWEBMalformedURLException(MyBaseException):
45 | httperror = 400
46 | msg = "Malformed URL {path}"
47 |
48 | class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
49 | """Handle requests in a separate thread."""
50 |
51 | class MyHTTPRequestHandler(BaseHTTPRequestHandler):
52 | """
53 | Generic HTTPRequestHandler, extends BaseHTTPRequestHandler, to make it easier to use
54 | """
55 |     # Careful - do not define __init__ as it is run for each incoming request.
56 | # TODO-STREAMS add support for longer (streamed) files on both upload and download, allow a stream to be passed back from the subclasses routines.
57 |
58 | """
59 | Simple (standard) HTTPdispatcher,
60 | Subclasses should define "exposed" as a list of exposed methods
61 | """
62 | exposed = []
63 | protocol_version = "HTTP/1.1"
64 |     onlyexposed = False         # False = don't limit to @exposed functions (override in subclass if using @exposed)
65 |     defaultipandport = ('localhost', 8080)   # Must be a (host, port) tuple - serve_forever indexes [0] and [1] (override in subclass)
66 | expectedExceptions = () # List any exceptions that you "expect" (and dont want stacktraces for)
67 |
68 | @classmethod
69 | def serve_forever(cls, ipandport=None, verbose=False, **options):
70 | """
71 | Start a server,
72 | ERR: socket.error if address(port) in use.
73 |
74 | :param ipandport: Ip and port to listen on, else use defaultipandport
75 | :param verbose: If want debugging
76 | :param options: Stored on class for access by handlers
77 | :return: Never returns
78 | """
79 | cls.ipandport = ipandport or cls.defaultipandport
80 | cls.verbose = verbose
81 | cls.options = options
82 | #HTTPServer(cls.ipandport, cls).serve_forever() # Start http server
83 | logging.info("Server starting on {0}:{1}:{2}".format(cls.ipandport[0], cls.ipandport[1], cls.options or ""))
84 | ThreadedHTTPServer(cls.ipandport, cls).serve_forever() # OR Start http server
85 | logging.error("Server exited") # It never should
86 |
87 | def _dispatch(self, **postvars):
88 | """
89 |         HTTP dispatcher (replaced a more complex version Sept 2017)
90 |         URLs of form GET /foo/bar/baz?a=b&c=d
91 |         are passed to foo(bar, baz, a=b, c=d), which mirrors Python argument conventions, i.e. if def foo(bar, baz, **kwargs) then foo(aaa, bbb) == foo(bar=aaa, baz=bbb)
92 |         POST will pass a dictionary; if the body is just text or JSON it will be passed as a single value { data: body }
93 |         In case of conflict, postvars overwrite args in the query string, but you shouldn't be getting both in most cases.
94 |
95 |         :param postvars: dict of variables parsed from the POST body (see do_POST)
96 | :return:
97 | """
98 | # In documentation, assuming call with /foo/aaa/bbb?x=ccc,y=ddd
99 | try:
100 | # TODO-PERMS make sure we are getting X-ORIGINATING-IP or similar here then make sure passed all way thru to httpget callls
101 | logging.info("dispatcher: {0}".format(self.path)) # Always log URLs in
102 |             o = urlparse(self.path)        # Parsed URL e.g. {path: "/foo/aaa/bbb", query: "x=ccc&y=ddd"}
103 |
104 | # Get url args, remove HTTP quote (e.g. %20=' '), ignore leading / and anything before it. Will always be at least one item (empty after /)
105 | args = [ unquote(u) for u in o.path.split('/')][1:]
106 | cmd = args.pop(0) # foo
107 |             #kwargs = dict(parse_qsl(o.query)) # would give { x: "ccc", y: "ddd" } but can't handle repeated keys - see loop below
108 | kwargs = {}
109 | for (k,b) in parse_qsl(o.query):
110 | a = kwargs.get(k)
111 | kwargs[k] = b if (a is None) else a+[b] if (isinstance(a,list)) else [a,b]
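            # e.g. a query of "a=1&a=2&b=3" yields kwargs == {"a": ["1", "2"], "b": "3"}
            # (single values stay scalar; repeated keys accumulate into a list)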
112 | if cmd == "":
113 | cmd = config["httpserver"]["root_path"];
114 | # Drop through and parse that command
115 | if cmd == "favicon.ico": # May general case this for a set of top level links e.g. robots.txt
116 | self.send_response(301)
117 | self.send_header('Location',config["httpserver"]["favicon_url"])
118 | self.end_headers()
119 | elif cmd in config["ignoreurls"]: # Looks like hacking or ignorable e.g. robots.txt, note this just ignores /arc/archive.org/xyz
120 | raise TransportFileNotFound(file=o.path)
121 | else:
122 | kwargs.update(postvars)
123 |
124 | cmds = [self.command + "_" + cmd, cmd, self.command + "_" + cmd.replace(".","_"), cmd.replace(".","_")]
125 | try:
126 | func = next(getattr(self, c, None) for c in cmds if getattr(self, c, None))
127 | except StopIteration:
128 | func = None
129 | #func = getattr(self, self.command + "_" + cmd, None) or getattr(self, cmd, None) # self.POST_foo or self.foo (should be a method)
130 | if not func or (self.onlyexposed and not func.exposed):
131 | raise HTTPdispatcherException(req=cmd) # Will be caught in except
132 | res = func(*args, **kwargs)
133 | # Function should return
134 |
135 | # Send the content-type
136 | self.send_response(200) # Send an ok response
137 | contenttype = res.get("Content-type","application/octet-stream")
138 | self.send_header('Content-type', contenttype)
139 | if self.headers.get('Origin'): # Handle CORS (Cross-Origin)
140 | self.send_header('Access-Control-Allow-Origin', '*')
141 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didnt work
142 | data = res.get("data","")
143 |             if data or isinstance(data, (list, tuple, dict)):       # Allow empty arrays to return as [] or empty dicts as {}
144 | if isinstance(data, (dict, list, tuple)): # Turn it into JSON
145 | data = dumps(data) # Does our own version to handle classes like datetime
146 | #elif hasattr(data, "dumps"): # Unclear if this is used except maybe in TransportDist_Peer
147 | # raise ToBeImplementedException(message="Just checking if this is used anywhere, dont think so")
148 | # data = dumps(data) # And maype this should be data.dumps()
149 | if isinstance(data, str):
150 | #logging.debug("converting to utf-8")
151 | if python_version.startswith('2'): # Python3 should be unicode, need to be careful if convert
152 | if contenttype.startswith('text') or contenttype in ('application/json',): # Only convert types we know are strings that could be unicode
153 | data = data.encode("utf-8") # Needed to make sure any unicode in data converted to utf8 BUT wont work for intended binary -- its still a string
154 | if python_version.startswith('3'):
155 | data = bytes(data,"utf-8") # In Python3 requests wont work on strings, have to convert to bytes explicitly
156 | if not isinstance(data, (bytes, str)):
157 | #logging.debug(data)
158 | # Raise an exception - will not honor the status already sent, but this shouldnt happen as coding
159 | # error in the dispatched function if it returns anything else
160 | raise ToBeImplementedException(name=self.__class__.__name__+"._dispatch for return data "+data.__class__.__name__)
161 |             self.send_header('content-length', str(len(data)) if data else '0')
162 | self.end_headers()
163 | if data:
164 | self.wfile.write(data) # Write content of result if applicable
165 |                 # Throws BrokenPipeError if browser has gone away
166 | #self.wfile.close()
167 | except BrokenPipeError as e:
168 | logging.error("Broken Pipe Error (browser probably gave up waiting) url={}".format(self.path))
169 | # Don't send error as the browser has gone away
170 | except Exception as e: # Gentle errors, entry in log is sufficient (note line is app specific)
171 | # TypeError Message will be like "sandbox() takes exactly 3 arguments (2 given)" or whatever exception returned by function
172 | httperror = e.httperror if hasattr(e, "httperror") else 500
173 | if not (self.expectedExceptions and isinstance(e, self.expectedExceptions)): # Unexpected error
174 | logging.error("Sending Unexpected Error {0}:".format(httperror), exc_info=True)
175 | else:
176 | logging.info("Sending Error {0}:{1}".format(httperror, str(e)))
177 | #if self.headers.get('Origin'): # Handle CORS (Cross-Origin)
178 | #self.send_header('Access-Control-Allow-Origin', '*') # '*' didnt work
179 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didnt work
180 | self.send_error(httperror, str(e)) # Send an error response
181 |
182 |
183 | def do_GET(self):
184 | #logging.debug(self.headers)
185 | self._dispatch()
186 |
187 | def do_OPTIONS(self):
188 | #logging.info("Options request")
189 | self.send_response(200)
190 | self.send_header('Access-Control-Allow-Methods', "POST,GET,OPTIONS")
191 |         self.send_header('Access-Control-Allow-Headers', self.headers['Access-Control-Request-Headers'])    # Allow anything, but '*' doesn't work
192 | self.send_header('content-length','0')
193 | self.send_header('Content-Type','text/plain')
194 | if self.headers.get('Origin'):
195 | self.send_header('Access-Control-Allow-Origin', '*') # '*' didnt work
196 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didnt work
197 | self.end_headers()
198 |
199 | def do_POST(self):
200 | """
201 | Handle a HTTP POST - reads data in a variety of common formats and passes to _dispatch
202 |
203 | :return:
204 | """
205 | try:
206 | #logging.debug(self.headers)
207 | ctype, pdict = parse_header(self.headers['content-type'])
208 | #logging.debug("Contenttype={0}, dict={1}".format(ctype, pdict))
209 | if ctype == 'multipart/form-data':
210 | postvars = parse_multipart(self.rfile, pdict)
211 | elif ctype == 'application/x-www-form-urlencoded':
212 |                 # This route is taken by browsers using jquery, as there is no easy way to upload with octet-stream
213 | # If its just singular like data="foo" then return single values else (unusual) lists
214 | length = int(self.headers['content-length'])
215 | postvars = { p: (q[0] if (isinstance(q, list) and len(q)==1) else q) for p,q in parse_qs(
216 | self.rfile.read(length),
217 | keep_blank_values=1).items() } # In Python2 this was iteritems, I think items will work in both cases.
218 | elif ctype in ('application/octet-stream', 'text/plain'): # Block sends this
219 | length = int(self.headers['content-length'])
220 | postvars = {"data": self.rfile.read(length)}
221 | elif ctype == 'application/json':
222 | length = int(self.headers['content-length'])
223 | postvars = {"data": loads(self.rfile.read(length))}
224 | else:
225 | postvars = {}
226 | self._dispatch(**postvars)
227 | except Exception as e:
228 | #except ZeroDivisionError as e: # Uncomment this to actually throw exception (since it wont be caught here)
229 | # Return error to user, should have been logged already
230 | httperror = e.httperror if hasattr(e, "httperror") else 500
231 | self.send_error(httperror, str(e)) # Send an error response
232 |
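    # e.g. a POST with Content-Type application/json and body '{"x": 1}' reaches _dispatch(data={"x": 1});
    # an application/octet-stream body b"..." reaches _dispatch(data=b"...")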
233 | def send_error(self, code, message=None, explain=None):
234 | """
235 | THIS IS A COPY OF superclass's send_error with cors header added
236 | """
237 | """Send and log an error reply.
238 |
239 | Arguments are
240 | * code: an HTTP error code
241 | 3 digits
242 | * message: a simple optional 1 line reason phrase.
243 | *( HTAB / SP / VCHAR / %x80-FF )
244 | defaults to short entry matching the response code
245 | * explain: a detailed message defaults to the long entry
246 | matching the response code.
247 |
248 | This sends an error response (so it must be called before any
249 | output has been generated), logs the error, and finally sends
250 | a piece of HTML explaining the error to the user.
251 |
252 | """
253 |
254 | try:
255 | shortmsg, longmsg = self.responses[code]
256 | except KeyError:
257 | shortmsg, longmsg = '???', '???'
258 | if message is None:
259 | message = shortmsg
260 | if explain is None:
261 | explain = longmsg
262 | self.log_error("code %d, message %s", code, message)
263 | self.send_response(code, message)
264 | self.send_header('Connection', 'close')
265 |
266 | # Message body is omitted for cases described in:
267 | # - RFC7230: 3.3. 1xx, 204(No Content), 304(Not Modified)
268 | # - RFC7231: 6.3.6. 205(Reset Content)
269 | body = None
270 | if (code >= 200 and
271 | code not in (HTTPStatus.NO_CONTENT,
272 | HTTPStatus.RESET_CONTENT,
273 | HTTPStatus.NOT_MODIFIED)):
274 | # HTML encode to prevent Cross Site Scripting attacks
275 | # (see bug #1100201)
276 | content = (self.error_message_format % {
277 | 'code': code,
278 | 'message': html.escape(message, quote=False),
279 | 'explain': html.escape(explain, quote=False)
280 | })
281 | body = content.encode('UTF-8', 'replace')
282 | self.send_header("Content-Type", self.error_content_type)
283 | self.send_header('Content-Length', int(len(body)))
284 | self.send_header('Access-Control-Allow-Origin', '*')
285 | self.end_headers()
286 |
287 | if self.command != 'HEAD' and body:
288 | self.wfile.write(body)
289 |
290 |
291 | def exposed(func):
292 | def wrapped(*args, **kwargs):
293 | result = func(*args, **kwargs)
294 | return result
295 |
296 | wrapped.exposed = True
297 | return wrapped
298 |
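# Illustrative sketch (hedged - EchoHandler is made up for this example, not part of the gateway):
# it wires @exposed and _dispatch together so GET /echo/a/b?x=y dispatches as echo("a", "b", x="y").
#
# class EchoHandler(MyHTTPRequestHandler):
#     onlyexposed = True
#     @exposed
#     def echo(self, *args, **kwargs):
#         return {"Content-type": "application/json", "data": {"args": args, "kwargs": kwargs}}
#
# EchoHandler.serve_forever(ipandport=('localhost', 4244))    # port is illustrative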
--------------------------------------------------------------------------------
/python/SmartDict.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import dateutil.parser  # pip install python-dateutil
3 | from json import dumps, loads
4 | from .Errors import ToBeImplementedException, EncryptionException
5 | """
6 | from Dweb import Dweb
7 | from Transportable import Transportable
8 | """
9 |
10 | # THIS FILE IS COPIED FROM THE OLD DWEB repo IT IS NOT TESTED FULLY, SO BIG CHUNKS ARE COMMENTED OUT
11 | # - ONLY PARTS NEEDED FOR KEYPAIR ARE BACKPORTED FROM JS AND UNCOMMENTED
12 |
13 | class SmartDict(object):
14 |
15 | """
16 |     Stores a data structure, usually a single-layer dictionary (this class is ported from the Javascript SmartDict).
17 | SmartDict is intended to support the mechanics of storage and retrieval while being subclassed to implement functionality
18 | that understands what the data means.
19 |
20 | By default any fields not starting with “_” will be stored, and any object will be converted into its url.
21 |
22 | The hooks for encrypting and decrypting data are at this level, depending on the _acl field, but are implemented by code in KeyPair.
23 |
24 | _acl If set (on master) defines storage as encrypted
25 | """
26 | table = "sd"
27 |
28 | def __init__(self, data=None, verbose=False, **options):
29 | """
30 | Creates and initialize a new SmartDict.
31 |
32 | :param data: String|Object, If a string (typically JSON), then parse first.
33 | A object with attributes to set on SmartDict via _setdata
34 | :param options: Passed to _setproperties, by default overrides attributes set by data
35 | """
36 | # COPIED BACK FROM JS 2018-07-02
37 | self._urls = [] # Empty URLs - will be loaded by SmartDict.p_fetch if loading from an URL
38 | self._setdata(data) # The data being stored - note _setdata usually subclassed does not store or set _url
39 | self._setproperties(options) # Note this will override any properties set with data #TODO-SMARTDICT need this
40 |
41 | def __str__(self):
42 | return self.__class__.__name__+"("+str(self.__dict__)+")"
43 |
44 | def __repr__(self):
45 | return repr(self.__dict__)
46 |
47 | # Allow access to arbitrary attributes, allows chaining e.g. xx.data.len = foo
48 | def __setattr__(self, name, value):
49 |         # This code was running self.dirty() - problem is that it clears url during loading from the dWeb
50 |         if name[0] != "_":
51 |             if "date" in name and isinstance(value, str):   # NB: was basestring, which only exists in Python 2
52 | value = dateutil.parser.parse(value)
53 | return super(SmartDict, self).__setattr__(name, value) # Calls any property esp _data
54 |
55 | def _setproperties(self, options): # Call chain is ... onloaded or constructor > _setdata > _setproperties > __setattr__
56 | # Checked against JS 20180703
57 | for k in options:
58 | self.__setattr__(k, options[k])
59 |
60 | def __getattr__(self, name): # Need this in Python while JS supports foo._url
61 | return self.__dict__.get(name)
62 |
63 | """
64 |
65 | def preflight(self, dd):
66 | "-"-"
67 | Default handler for preflight, strips attributes starting “_” and stores and converts objects to urls.
68 | Subclassed in AccessControlList and KeyPair to avoid storing private keys.
69 | :param dd: dictionary to convert..
70 | :return: converted dictionary
71 | "-"-"
72 | res = {
73 | k: dd[k].store()._url if isinstance(dd[k], Transportable) else dd[k]
74 | for k in dd
75 | if k[0] != '_'
76 | }
77 | res["table"] = res.get("table",self.table) # Assumes if used table as a field, that not relying on it being the table for loading
78 | assert res["table"]
79 | return res
80 |
81 | def _getdata(self):
82 | "-"-"
83 | Prepares data for sending. Retrieves attributes, runs through preflight.
84 | If there is an _acl field then it passes data through it for encrypting (see AccessControl library)
85 | Exception: UnicodeDecodeError - if its binary
86 | :return: String suitable for rawstore
87 | "-"-"
88 | try:
89 | res = self.transport().dumps(self.preflight(self.__dict__.copy())) # Should call self.dumps below { k:self.__dict__[k] for k in self.__dict__ if k[0]!="_" })
90 | except UnicodeDecodeError as e:
91 | print "Unicode error in StructuredBlock"
92 | print self.__dict__
93 | raise e
94 | if self._acl: # Need to encrypt
95 | encdata = self._acl.encrypt(res, b64=True)
96 | dic = {"encrypted": encdata, "acl": self._acl._publicurl, "table": self.table}
97 | res = self.transport().dumps(dic)
98 | return res
99 |
100 | ABOVE HERE NOT BACKPORTED FROM JS
101 | """
102 |
103 | def _setdata(self, value):
104 | """
105 | Stores data, subclass this if the data should be interpreted as its stored.
106 | value Object, or JSON string to load into object.
107 | """
108 | # Note SmartDict expects value to be a dictionary, which should be the case since the HTTP requester interprets as JSON
109 | # Call chain is ... or constructor > _setdata > _setproperties > __setattr__
110 | # COPIED BACK FROM JS 2018-07-02
111 | value = loads(value) if isinstance(value, str) else value # Will throw exception if it isn't JSON
112 | if value and ("encrypted" in value):
113 | raise EncryptionException("Should have been decrypted in fetch")
114 | self._setproperties(value);
115 |
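    # Usage sketch (hedged - the field names are illustrative):
    #   sd = SmartDict(data='{"name": "foo", "somedate": "2018-07-02"}', extra=1)
    #   sd.name -> "foo"; sd.somedate -> a datetime (parsed because the field name contains "date");
    #   sd.extra -> 1 (options override/extend fields set from data); sd._urls -> []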
116 | """
117 | BELOW HERE NOT BACKPORTED FROM JS
118 |
119 | def _match(self, key, val):
120 | if key[0] == '.':
121 | return (key == '.instance' and isinstance(self, val))
122 | else:
123 | return (val == self.__dict__[key])
124 |
125 | def match(self, dict):
126 | "-"-"
127 | Checks if a object matches for each key:value pair in the dictionary.
128 | Any key starting with "." is treated specially esp:
129 | .instanceof: class: Checks if this is a instance of the class
130 | other fields will be supported here, any unsupported field results in a false.
131 |
132 | :returns: boolean, true if matches
133 | "-"-"
134 | return all([self._match(k, dict[k]) for k in dict])
135 |
136 |
137 | @classmethod
138 | def fetch(cls, url, verbose):
139 | "-"-"
140 | Fetches the object from Dweb, passes to decrypt in case it needs decrypting,
141 | and creates an object of the appropriate class and passes data to _setdata
142 | This should not need subclassing, (subclass _setdata or decrypt instead).
143 |
144 | :return: New object - e.g. StructuredBlock or MutableBlock
145 | :catch: TransportError - can probably, or should throw TransportError if transport fails
146 | :throws: TransportError if url invalid, Authentication Error
147 | "-"-"
148 | from letter2class.py import LetterToClass
149 | if verbose: print "SmartDict.fetch", url;
150 | data = super(SmartDict, cls).fetch(url, verbose) #Fetch the data Throws TransportError immediately if url invalid, expect it to catch if Transport fails
151 | data = Dweb.transport(url).loads(data) # Parse JSON //TODO-REL3 maybe function in Transportable
152 | table = data.table # Find the class it belongs to
153 | cls = LetterToClass[table] # Gets class name, then looks up in Dweb - avoids dependency
154 | if not cls:
155 | raise ToBeImplementedException("SmartDict.fetch: "+table+" isnt implemented in table2class")
156 | if not isinstance(Dweb.table2class[table], cls):
157 | raise ForbiddenException("Avoiding data driven hacks to other classes - seeing "+table);
158 | data = cls.decrypt(data, verbose) # decrypt - may return string or obj , note it can be suclassed for different encryption
159 | data["_url"] = url; # Save where we got it - preempts a store - must do this afer decrypt
160 | return cls(data)
161 |
162 | @classmethod
163 | def decrypt(data, verbose):
164 | "-"-"
165 | This is a hook to an upper layer for decrypting data, if the layer isn't there then the data wont be decrypted.
166 | Chain is SD.fetch > SD.decryptdata > ACL|KC.decrypt, then SD.setdata
167 |
168 | :param data: possibly encrypted object produced from json stored on Dweb
169 | :return: same object if not encrypted, or decrypted version
170 | "-"-"
171 | return AccessControlList.decryptdata(data, verbose)
172 |
173 | def dumps(self): # Called by json_default, but preflight() is used in most scenarios rather than this
174 | 1/0 # DOnt believe this is used
175 | return {k: self.__dict__[k] for k in self.__dict__ if k[0] != "_"} # Serialize the dict, excluding _xyz
176 |
177 | def copy(self):
178 | return self.__class__(self.__dict__.copy())
179 | """
--------------------------------------------------------------------------------
/python/Transport.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | from datetime import datetime
3 | import logging
4 |
5 | from .miscutils import dumps, loads
6 | from urllib.parse import urlparse
7 | from .Errors import ToBeImplementedException, MyBaseException, IntentionallyUnimplementedException, CodingException
8 |
9 | class TransportBlockNotFound(MyBaseException):
10 | httperror = 404
11 | msg = "{url} not found"
12 |
13 | class TransportURLNotFound(MyBaseException):
14 | httperror = 404
15 | msg = "{url}, {options} not found"
16 |
17 | class TransportFileNotFound(MyBaseException):
18 | httperror = 404
19 | msg = "{file} not found"
20 |
21 | class TransportPathNotFound(MyBaseException):
22 | httperror = 404
23 | msg = "{path} not found for obj {url}"
24 |
25 | class TransportUnrecognizedCommand(MyBaseException):
26 | httperror = 500
27 | msg = "Class {classname} doesnt have a command {command}"
28 |
29 |
30 | class Transport(object):
31 | """
32 | Setup the resource and open any P2P connections etc required to be done just once.
33 | In almost all cases this will call the constructor of the subclass
34 |     Returns an instance of the subclass (the Javascript original resolves a Promise to it)
35 |
36 | :param obj transportoptions: Data structure required by underlying transport layer (format determined by that layer)
37 | :param boolean verbose: True for debugging output
38 | :param options: Data structure stored on the .options field of the instance returned.
39 | :resolve Transport: Instance of subclass of Transport
40 | """
41 |
42 | def __init__(self, options, verbose):
43 | """
44 | :param options:
45 | """
46 |         raise ToBeImplementedException(name=self.__class__.__name__+".__init__")
47 |
48 | @classmethod
49 | def setup(cls, options, verbose):
50 | """
51 | Called to deliver a transport instance of a particular class
52 |
53 | :param options: Options to subclasses init method
54 | :return: None
55 | """
56 | raise ToBeImplementedException(name=cls.__name__+".setup")
57 |
58 |
59 | def _lettertoclass(self, abbrev):
60 | #TODO-BACKPORTING - check if really needed after finish port (was needed on server)
61 | from letter2class import LetterToClass
62 | return LetterToClass.get(abbrev, None)
63 |
64 | def supports(self, url, func=None): #TODO-API
65 | """
66 | Determine if this transport supports a certain set of URLs
67 |
68 | :param url: String or parsed URL
69 | :param func: Function being attempted on url
70 | :return: True if this protocol supports these URLs
71 | """
72 | if not url: return True # Can handle default URLs
73 |         if isinstance(url, str):    # NB: was basestring, which only exists in Python 2
74 | url = urlparse(url) # For efficiency, only parse once.
75 |         if not url.scheme: raise CodingException(message="url passed with no scheme (part before :): "+str(url))
76 | return (url.scheme in self.urlschemes) and (not func or func in self.supportFunctions) #Lower case, NO trailing : (unlike JS)
77 |
78 |
79 | def url(self, data):
80 | """
81 | Return an identifier for the data without storing
82 |
83 | :param string|Buffer data arbitrary data
84 | :return string valid id to retrieve data via rawfetch
85 | """
86 |         raise ToBeImplementedException(name=self.__class__.__name__+".url")
87 |
88 | def info(self, **options): #TODO-API
89 |         raise ToBeImplementedException(name=self.__class__.__name__+".info")
90 |
91 | def rawstore(self, data=None, verbose=False, **options):
92 |         raise ToBeImplementedException(name=self.__class__.__name__+".rawstore")
93 |
94 | def store(self, command=None, cls=None, url=None, path=None, data=None, verbose=False, **options):
95 | raise ToBeImplementedException(message="Backporting - unsure if needed - match JS Dweb"); # TODO-BACKPORTING
96 | #store(command, cls, url, path, data, options) = fetch(cls, url, path, options).command(data|data._data, options)
97 | #store(url, data)
98 | if not isinstance(data, basestring):
99 | data = data._getdata()
100 | if command:
101 | # TODO not so sure about this production, document any uses here if there are any
102 | obj = self.fetch(command=None, cls=None, url=url, path=path, verbose=verbose, **options)
103 | return obj.command(data=data, verbose=False, **options)
104 | else:
105 | return self.rawstore(data=data, verbose=verbose, **options)
106 |
107 | def rawfetch(self, url=None, verbose=False, **options):
108 | """
109 | Fetch data from a url and return as a (binary) string
110 |
111 | :param url:
112 | :param options: { ignorecache if shouldnt use any cached value (mostly in testing);
113 | :return: str
114 | """
115 |         raise ToBeImplementedException(name=self.__class__.__name__+".rawfetch")
116 |
117 | def fetch(self, command=None, cls=None, url=None, path=None, verbose=False, **options):
118 | """
119 |         More comprehensive fetch function, can be subclassed either by the objects being fetched or by the transport.
120 | Exceptions: TransportPathNotFound, TransportUnrecognizedCommand
121 |
122 | :param command: Command to be performed on the retrieved data (e.g. content, or size)
123 | :param cls: Class of object being returned, if None will return a str
124 | :param url: Hash of object to retrieve
125 | :param path: Path within object represented by url
126 | :param verbose:
127 | :param options: Passed to command, NOT passed to subcalls as for example mucks up sb.__init__ by dirtying - this might be reconsidered
128 | :return:
129 | """
130 | if verbose: logging.debug("Transport.fetch command={0} cls={1} url={2} path={3} options={4}".format(command, cls, url, path, options))
131 | #TODO-BACKPORTING see if needed after full port - hint it was used in ServerHTTP but not on client side
132 | if cls:
133 |             if isinstance(cls, str):        # Handle abbreviations for cls (was basestring, Python 2 only)
134 | cls = self._lettertoclass(cls)
135 | obj = cls(url=url, verbose=verbose).fetch(verbose=verbose)
136 | # Can't pass **options to cls as disrupt sb.__init__ by causing dirty
137 | # Not passing **options to fetch, but probably could
138 | else:
139 | obj = self.rawfetch(url, verbose=verbose) # Not passing **options, probably could but not used
140 | #if verbose: logging.debug("Transport.fetch obj={0}".format(obj))
141 | if path:
142 | obj = obj.path(path, verbose=verbose) # Not passing **options as ignored, but probably could
143 | #TODO handle not found exception
144 | if not obj:
145 | raise TransportPathNotFound(path=path, url=url)
146 | if not command:
147 | return obj
148 | else:
149 | if not cls:
150 | raise TransportUnrecognizedCommand(command=command, classname="None")
151 | func = getattr(obj, command, None)
152 | if not func:
153 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__)
154 | return func(verbose=verbose, **options)
155 |
156 | def rawadd(self, url, sig, verbose=False, subdir=None, **options):
157 |         raise ToBeImplementedException(name=self.__class__.__name__+".rawadd")
158 |
159 | def add(self, urls=None, date=None, signature=None, signedby=None, verbose=False, obj=None, **options ):
160 | #TODO-BACKPORTING check if still needed after Backport - not used in JS
161 | #add(dataurl, sig, date, keyurl)
162 |         if obj and not urls:
163 |             urls = obj._url
164 |         return self.rawadd(urls=urls, date=date, signature=signature, signedby=signedby, verbose=verbose, **options) # TODO would be better to store object
165 |
166 | def rawlist(self, url=None, verbose=False, **options):
167 |         raise ToBeImplementedException(name=self.__class__.__name__+".rawlist")
168 |
169 | def list(self, command=None, cls=None, url=None, path=None, verbose=False, **options):
170 | """
171 |
172 |         :param command: if found, apply list.command(...) to the result of list(cls, url, path)
173 | :param cls: if found (cls(l) for l in list(url)
174 | :param url: Hash of list to look up - usually url of private key of signer
175 | :param path: Ignored for now, unclear how applies
176 | :param verbose:
177 | :param options:
178 | :return:
179 | """
180 | raise ToBeImplementedException("Backporting - unsure if needed - match JS Dweb"); #TODO-BACKPORTING
181 |
182 |         res = self.rawlist(url, verbose=verbose, **options)
183 |         if cls:
184 |             if isinstance(cls, str):        # Handle abbreviations for cls (was basestring, Python 2 only)
185 |                 cls = self._lettertoclass(cls)
186 | res = [ cls(l) for l in res ]
187 | if command:
188 | func = getattr(CommonList, command, None) #TODO May not work, might have to turn res into CommonList first
189 | if not func:
190 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__)
191 | res = func(res, verbose=verbose, **options)
192 | return res
193 |
194 | def rawreverse(self, url=None, verbose=False, **options):
195 |         raise ToBeImplementedException(name=self.__class__.__name__+".rawreverse")
196 |
197 |
198 | def reverse(self, command=None, cls=None, url=None, path=None, verbose=False, **options):
199 | """
200 |
201 |         :param command: if found, apply reverse.command(...) to the result of reverse(cls, url, path)
202 | :param cls: if found (cls(l) for l in reverse(url)
203 | :param url: Hash of reverse to look up - usually url of data signed
204 | :param path: Ignored for now, unclear how applies
205 | :param verbose:
206 | :param options:
207 | :return:
208 | """
209 | raise ToBeImplementedException(message="Backporting - unsure if needed - match JS Dweb"); #TODO-BACKPORTING
210 |
211 |         res = self.rawreverse(url, verbose=verbose, **options)
212 |         if cls:
213 |             if isinstance(cls, str):        # Handle abbreviations for cls (was basestring, Python 2 only)
214 |                 cls = self._lettertoclass(cls)
215 | res = [ cls(l) for l in res ]
216 | if command:
217 | func = getattr(self, command, None)
218 | if not func:
219 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__)
220 | res = func(res, verbose=verbose, **options)
221 | return res
222 |
223 | #TODO-BACKPORT add listmonitor
224 |
--------------------------------------------------------------------------------
/python/TransportHTTP.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import logging
3 | from .Transport import Transport
4 | from .miscutils import dumps   # used by __repr__
5 | class TransportHTTP(Transport):
6 | """
7 | Subclass of Transport.
8 | Implements the raw primitives via an http API to a local IPFS instance
9 | Only partially complete TODO - get from old library
10 | """
11 |
12 | # urlschemes = ['http','https'] - subclasses as can handle all
13 | supportFunctions = ['set']
14 |
15 | def __init__(self, options=None, verbose=False):
16 | """
17 | Create a transport object (use "setup" instead)
18 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. cant make directory)
19 |
20 | :param dir:
21 | :param options:
22 | """
23 | self.options = options or {}
24 | pass
25 |
26 | def __repr__(self):
27 | return self.__class__.__name__ + " " + dumps(self.options)
28 |
29 | def supports(self, url, func):
30 |         return (func in self.supportFunctions) and (url.startswith('https:') or url.startswith('http:'))  # Handles http and https URLs
31 |
32 | def set(self, url, keyvalues, value, verbose):
33 | pass # TODO-DOMAIN complete
34 |
--------------------------------------------------------------------------------
/python/TransportIPFS.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import json
3 | import logging
4 | from .miscutils import loads, dumps
5 | from .Transport import Transport
6 | from .config import config
7 | import requests # HTTP requests
8 | from .miscutils import httpget
9 | from .Errors import IPFSException, ToBeImplementedException
10 |
11 |
12 | class TransportIPFS(Transport):
13 | """
14 | Subclass of Transport.
15 | Implements the raw primitives via an http API to a local IPFS instance
16 | Only partially complete
17 | """
18 |
19 | # urlschemes = ['ipfs'] - subclasses as can handle all
20 | supportFunctions = ['store','fetch']
21 |
22 | def __init__(self, options=None, verbose=False):
23 | """
24 | Create a transport object (use "setup" instead)
25 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. cant make directory)
26 |
27 | :param dir:
28 | :param options:
29 | """
30 | self.options = options or {}
31 | pass
32 |
33 | def __repr__(self):
34 | return self.__class__.__name__ + " " + dumps(self.options)
35 |
36 | def supports(self, url, func):
37 |         return url.startswith('ipfs:')      # Only handles ipfs: URLs
38 |
39 | #TODO-LOCAL - feed this back into ServerGateway.info
40 | def info(self, **options):
41 | return { "type": "ipfs", "options": self.options }
42 |
43 | def rawfetch(self, url=None, verbose=False, **options):
44 | """
45 | Fetch a block from IPFS
46 | Exception: TransportFileNotFound if file doesnt exist
47 | #TODO-STREAM make return stream to HTTP and so on
48 |
49 | :param url:
50 | :param multihash: a Multihash structure
51 | :param options:
52 | :return:
53 | """
54 | raise ToBeImplementedException(name="TransportIPFS.rawfetch")
55 |
56 | def pinggateway(self, ipldhash):
57 | """
58 | Pin to gateway or JS clients wont see it TODO remove this when client relay working (waiting on IPFS)
59 |         This next line is to get around a bug in IPFS propagation
60 | See https://github.com/ipfs/js-ipfs/issues/1156
61 | Feb2018: Note this is waiting on a workaround by IPFS (David > Kyle > Lars )
62 | : param ipldhash Hash of form z... or Q.... or array of ipldhash
63 | """
64 | if isinstance(ipldhash, (list,tuple,set)):
65 |             for i in ipldhash: self.pinggateway(i)
66 |             return  # each element handled individually; don't also ping with the list itself
67 | headers = { "Connection": "keep-alive"}
68 | ipfsgatewayurl = "https://ipfs.io/ipfs/{}".format(ipldhash)
69 | res = requests.head(ipfsgatewayurl, headers=headers); # Going to ignore the result
70 | logging.debug("Transportipfs.pinggateway workaround for JS-IPFS issue #1156 - pin gateway for {}".format(ipfsgatewayurl))
71 |
72 | def announcedht(self, ipldhash):
73 | """
74 | Periodically tell URLstore to announce blocks or JS clients wont see it
75 |         This next line is to get around a bug in IPFS propagation
76 | : param ipldhash Hash of form z... or Q.... or array of ipldhash
77 | """
78 | if isinstance(ipldhash, (list,tuple,set)):
79 |             for i in ipldhash: self.announcedht(i)
80 |             return  # each element handled individually; don't also announce the list itself
81 | headers = { "Connection": "keep-alive"}
82 | ipfsurl = config["ipfs"]["url_dht_provide"]
83 | res = requests.get(ipfsurl, headers=headers, params={'arg': ipldhash}) # Ignoring result
84 | logging.debug("Transportipfs.announcedht for {}?arg={}".format(ipfsurl, ipldhash)) # Log whether verbose or not
85 |
86 | def rawstore(self, data=None, verbose=False, returns=None, pinggateway=True, mimetype=None, **options):
87 | """
88 | Store the data on IPFS
89 | Exception: TransportFileNotFound if file doesnt exist
90 |
91 | :param data: opaque data to store (currently must be bytes, not str)
92 | :param returns: Comma separated string if want result as a dict, support "url","contenthash"
93 | :raises: IPFSException if cant reach server
94 | :return: url of data e.g. ipfs:/ipfs/Qm123abc
95 | """
96 | assert (not returns), 'Not supporting "returns" parameter to TransportIPFS.store at this point'
97 | ipfsurl = config["ipfs"]["url_add_data"]
98 | if verbose: logging.debug("Posting IPFS to {0}".format(ipfsurl))
99 | headers = { "Connection": "keep-alive"}
100 | try:
101 | res = requests.post(ipfsurl, headers=headers, params={ 'trickle': 'true', 'pin': 'true'}, files={'file': ('', data, mimetype)}).json()
102 | #except ConnectionError as e: # TODO - for some reason this never catches even though it reports "ConnectionError" as the class
103 | except requests.exceptions.ConnectionError as e: # Alternative - too broad a catch but not expecting other errors
104 | pass
105 | raise IPFSException(message="Unable to post to local IPFS at {} it is probably not running or wedged".format(ipfsurl))
106 | logging.debug("IPFS result={}".format(res))
107 | ipldhash = res['Hash']
108 | if pinggateway:
109 | self.pinggateway(ipldhash)
110 | return "ipfs:/ipfs/{}".format(ipldhash)
111 |
112 | def store(self, data=None, urlfrom=None, verbose=False, mimetype=None, pinggateway=True, returns=None, **options):
113 | """
114 | Higher level store semantics
115 |
116 | :param data:
117 | :param urlfrom: URL to fetch from for storage, allows optimisation (e.g. pass it a stream) or mapping in transport
118 | :param verbose:
119 | :param pinggateway: True (default) to ping ipfs.io so that it knows where to find, (alternative is to allow browser to ping it on failure to retrieve)
120 | :param mimetype:
121 | :param options:
122 | :raises: IPFSException if cant reach server or doesnt return JSON
123 | :return:
124 | """
125 | assert (not returns), 'Not supporting "returns" parameter to TransportIPFS.store at this point'
126 | try:
127 | headers = { "Connection": "keep-alive"}
128 | if urlfrom and config["ipfs"].get("url_urlstore"): # On a machine with urlstore and passed a url
129 | ipfsurl = config["ipfs"]["url_urlstore"]
130 | res = requests.get(ipfsurl, headers=headers, params={'arg': urlfrom, 'trickle': 'true', 'nocopy': 'true', 'cid-version':"1"}).json()
131 | ipldhash = res['Key']
132 | # Now pin to gateway or JS clients wont see it TODO remove this when client relay working (waiting on IPFS)
133 | # This next line is to get around bug in IPFS propogation
134 | # See https://github.com/ipfs/js-ipfs/issues/1156
135 | if pinggateway:
136 | self.pinggateway(ipldhash)
137 | url = "ipfs:/ipfs/{}".format(ipldhash)
138 | else: # Need to store via "add"
139 |                 if urlfrom and (not data or not mimetype):  # Fetch if data or mimetype is missing and we have a source URL (note the parentheses)
140 | (data, mimetype) = httpget(urlfrom, wantmime=True) # This is a fetch from somewhere else before putting to gateway
141 | if not isinstance(data, (str,bytes)): # We've got data, but if its an object turn into JSON, (example is name/archiveid which passes metadata)
142 | data = dumps(data)
143 | url = self.rawstore(data=data, verbose=verbose, returns=returns, mimetype=mimetype, pinggateway=pinggateway, **options) # IPFSException if down
144 | return url
145 | except (KeyError) as e:
146 | raise IPFSException(message="Bad format back from IPFS - no key field" + json.dumps(res))
147 | except (json.decoder.JSONDecodeError) as e:
148 | raise IPFSException(message="Bad format back from IPFS - not JSON;"+str(e))
149 | except (requests.exceptions.ConnectionError) as e:
150 | raise IPFSException(message="IPFS refused connection;"+str(e))
151 |
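# Usage sketch (hedged - assumes config["ipfs"]["url_add_data"] points at a running local IPFS
# daemon, e.g. "http://localhost:5001/api/v0/add"; the resulting hash shown is illustrative):
#   t = TransportIPFS()
#   url = t.store(data=b"hello world", mimetype="text/plain", pinggateway=False)
#   # -> "ipfs:/ipfs/Qm..." ; raises IPFSException if the daemon is unreachable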
152 |
--------------------------------------------------------------------------------
/python/TransportLocal.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import logging
3 | #from sys import version as python_version
4 | #if python_version.startswith('3'):
5 | # from urllib.parse import urlparse
6 | #else:
7 | # from urlparse import urlparse # See https://docs.python.org/2/library/urlparse.html
8 | import os # For isdir and exists
9 |
10 | # Neither of these are used in the Gateway which could be extended
11 | #from Transport import Transport
12 | #from Dweb import Dweb
13 | from .Errors import TransportFileNotFound
14 | from .Multihash import Multihash
15 | from .miscutils import loads, dumps
16 | from .Transport import Transport
17 |
18 |
19 | class TransportLocal(Transport):
20 | """
21 | Subclass of Transport.
22 | Implements the raw primitives as reads and writes of file system.
23 | """
24 |
25 | # urlschemes = ['http'] - subclasses as can handle all
26 |
27 |
28 | def __init__(self, options, verbose):
29 | """
30 | Create a transport object (use "setup" instead)
31 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. cant make directory)
32 |
33 | :param dir:
34 | :param options:
35 | """
36 | subdirs = "list", "reverse", "block"
37 | dir = options["local"]["dir"]
38 | if not os.path.isdir(dir):
39 | os.mkdir(dir)
40 | self.dir = dir
41 | for table in subdirs:
42 | dirname = "%s/%s" % (self.dir, table)
43 | if not os.path.isdir(dirname):
44 | os.mkdir(dirname)
45 | self.options = options
46 |
47 | def __repr__(self):
48 | return self.__class__.__name__ + " " + dumps(self.options)
49 |
50 | @classmethod
51 | def OBSsetup(cls, options, verbose): #TODO-LOCAL maybe not needed
52 | """
53 | Setup local transport to use dir
54 | Exceptions: TransportFileNotFound if dir invalid
55 |
56 | :param dir: Directory to use for storage
57 | :param options: Unused currently
58 | """
59 | t = cls(options, verbose)
60 | Dweb.transports["local"] = t
61 | Dweb.transportpriority.append(t)
62 | return t
63 |
64 | #see other !ADD-TRANSPORT-COMMAND - add a function copying the format below
65 |
66 | def supports(self, url, func):
67 | return True # Local can handle any kind of URL, since cached.
68 |
69 | #TODO-LOCAL - feed this back into ServerGateway.info
70 | def info(self, **options):
71 | return { "type": "local", "options": self.options }
72 |
73 | def _filename(self, subdir, multihash=None, verbose=False, **options):
74 | # Utility function to get filename to use for storage
75 | return "%s/%s/%s" % (self.dir, subdir, multihash.multihash58)
76 |
77 | def _tablefilename(self, database, table, subdir="table", createdatabase=False):
78 | # Utility function to get filename to use for storage
79 | dir = "{}/{}/{}".format(self.dir, subdir, database)
80 | if createdatabase and not os.path.isdir(dir):
81 | os.mkdir(dir)
82 | return "{}/{}".format(dir, table)
83 |
84 | def url(self, data=None, multihash=None):
85 | """
86 | Return an identifier for the data without storing
87 |
88 | :param data string|Buffer data arbitrary data
89 | :param multihash string of form Q...
90 | :return string valid id to retrieve data via rawfetch
91 | """
92 |
93 | if data:
94 | multihash = Multihash(data=data, code=Multihash.SHA2_256)
95 | return "local:/rawfetch/{0}".format(multihash.multihash58 if isinstance(multihash, Multihash) else multihash)
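    # e.g. url(data=b"hello") -> "local:/rawfetch/<base58 SHA2-256 multihash of the data>"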
96 |
97 |     def rawfetch(self, url=None, multihash=None, verbose=False, **options):
98 | """
99 | Fetch a block from the local file system
100 | Exception: TransportFileNotFound if file doesnt exist
101 | #TODO-STREAM make return stream to HTTP and so on
102 |
103 | :param url: Of form somescheme:/something/hash
104 | :param multihash: a Multihash structure
105 | :param options:
106 | :return:
107 | """
108 | multihash = multihash or Multihash(url=url)
109 | filename = self._filename("block", multihash)
110 | try:
111 | if verbose: logging.debug("Opening {0}".format(filename))
112 | with open(filename, 'rb') as file:
113 | content = file.read()
114 | if verbose: logging.debug("Opened")
115 | return content
116 | except (IOError, FileNotFoundError) as e:
117 | logging.debug("TransportLocal.rawfetch err={}".format(e))
118 | raise TransportFileNotFound(file=filename)
119 |
120 | def _rawlistreverse(self, filename=None, verbose=False, **options):
121 | """
122 | Retrieve record(s) matching a url (usually the url of a key), in this case from a local directory
123 | Exception: IOError if file doesnt exist
124 |
125 | :param url: Hash in table to be retrieved or url ending in that hash
126 | :return: list of dictionaries for each item retrieved
127 | """
128 | try:
129 | f = open(filename, 'rb')
130 | s = [ loads(s) for s in f.readlines() ]
131 | f.close()
132 | return s
133 | except IOError as e:
134 | return []
135 | #Trying commenting out error, and returning empty array
136 | #raise TransportFileNotFound(file=filename)
137 |
138 | def rawlist(self, url, verbose=False, **options):
139 | """
140 | Retrieve record(s) matching a url (usually the url of a key), in this case from a local directory
141 | Exception: IOError if file doesnt exist
142 |
143 | :param url: URL to be retrieved
144 | :return: list of dictionaries for each item retrieved
145 | """
146 | if verbose: logging.debug("TransportLocal:rawlist {0}".format(url))
147 | filename = self._filename("list", multihash= Multihash(url=url), verbose=verbose, **options)
148 | return self._rawlistreverse(filename=filename, verbose=False, **options)
149 |
150 |
151 | def rawreverse(self, url, verbose=False, **options):
152 |
153 | """
154 | Retrieve record(s) matching a url (usually the url of a key), in this case from a local directory
155 | Exception: IOError if file doesnt exist
156 |
157 | :param url: Hash in table to be retrieved or url ending in hash
158 | :return: list of dictionaries for each item retrieved
159 | """
160 | filename = self._filename("reverse", multihash= Multihash(url=url), verbose=verbose, **options)
161 | return self._rawlistreverse(filename=filename, verbose=False, **options)
162 |
163 | def rawstore(self, data=None, verbose=False, returns=None, **options):
164 | """
165 | Store the data locally
166 | Exception: TransportFileNotFound if file doesnt exist
167 |
168 | :param data: opaque data to store (currently must be bytes, not str)
169 | :param returns: Comma separated string if want result as a dict, support "url","contenthash"
170 | :return: url of data
171 | """
172 | assert data is not None # Its meaningless (or at least I think so) to store None (empty string is meaningful) #TODO-LOCAL move assert to CodingException
173 | contenthash=Multihash(data=data, code=Multihash.SHA2_256)
174 | filename = self._filename("block", multihash=contenthash, verbose=verbose, **options)
175 | try:
176 | f = open(filename, 'wb')
177 | f.write(data)
178 | f.close()
179 | except IOError as e:
180 | raise TransportFileNotFound(file=filename)
181 | url = self.url(multihash=contenthash)
182 | if returns:
183 | returns = returns.split(',')
184 | return { k: url if k=="url" else contenthash if k=="contenthash" else "ERROR" for k in returns }
185 | else:
186 | return url
187 |
188 |
189 | def _rawadd(self, filename, value):
190 | try:
191 | with open(filename, 'ab') as f:
192 | f.write(value)
193 | except IOError as e:
194 | raise TransportFileNotFound(file=filename)
195 |
196 | def rawadd(self, url, sig, verbose=False, subdir=None, **options):
197 | """
198 | Store a signature in a pair of DHTs
199 | Exception: IOError if file doesnt exist
200 |
201 | :param url: List to store on
202 | :param Signature sig: including { date, signature, signedby, urls}
203 | :param subdir: Can select list or reverse to store only one or both halfs of the list. This is used in TransportDistPeer as the two halfs are stored in diffrent parts of the DHT
204 | :param verbose:
205 | :param options:
206 | :return:
207 | """
208 | subdir = subdir or ("list","reverse") # By default store forward and backwards
209 | if verbose: logging.debug("TransportLocal.rawadd {0} {1} subdir={2} options={3}"
210 | .format(url, sig, subdir, options))
211 | value = dumps(sig) + "\n" #Note this is a compact dump
212 | value = value.encode('utf-8')
213 | if "list" in subdir:
214 | self._rawadd(
215 | self._filename("list", multihash= Multihash(url=url), verbose=verbose, **options), # List of things signedby
216 | value)
217 | """
218 | # Reverse removed for now, as not used, and causes revision issues with Multi.
219 | if "reverse" in subdir:
220 | if not isinstance(urls, (list, tuple, set)):
221 | urls = [urls]
222 | for u in urls:
223 | self._rawadd(
224 | self._filename("reverse", multihash= Multihash(url=u), verbose=verbose, **options), # Lists that this object is on
225 | value)
226 | """
227 |
228 | def set(self, url=None, database=None, table=None, keyvaluelist=None, keyvalues=None, value=None, verbose=False):
229 |         # Add keyvalues to a table; note it doesn't delete existing keys and values, it just appends to the end
230 |         # Each line is a separate keyvalue pair, since each needs to be signed so that recipients who might only read one key can verify it
231 | # Note url & keyvalues or keyvalues|value are not supported yet
232 | filename = self._tablefilename(database, table, createdatabase=True)
233 | #TODO-KEYVALUE check and store sig which has to be on each keyvalue, not on entire set
234 | #TODO-KEYVALUE encode string in value for storing in quoted string
235 |         appendable = "".join([ dumps(kv)+"\n" for kv in keyvaluelist ]).encode('utf-8') # Essentially JSON for an array but without the enclosing [ ]
236 | self._rawadd(filename, appendable)
237 |
238 | def get(self, url=None, database=None, table=None, keys=None, verbose=False):
239 |         # Get the values for a set of keys from a table; the append-only log is replayed so later values for a key override earlier ones
240 | filename = self._tablefilename(database, table, createdatabase=True)
241 |         resarr = self._rawlistreverse(filename=filename, verbose=verbose) # [ {key:k1, value:v1}, {key:k2, value:v2}, {key:k1, value:v3}]
242 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set
243 | resdict = { kv["key"]: kv.get("value") for kv in resarr if kv["key"] in keys } # {k1:v3, k2:v2} - replaces earlier with later values for same key
244 | return resdict
245 |
246 | def delete(self, url=None, database=None, table=None, keys=None, verbose=False):
247 |         # Delete keys from a table by appending tombstone records ({"key": k} with no value); earlier entries are not removed, just overridden
248 | filename = self._tablefilename(database, table)
249 | # TODO-KEYVALUE check and store sig which has to be on each keyvalue, not on entire set
250 | appendable = ("\n".join([dumps({"key": key}) for key in keys]) + "\n").encode('utf-8')
251 | self._rawadd(filename, appendable)
252 |
253 | def keys(self, url=None, database=None, table=None, verbose=False):
254 |         # List the keys present in a table by replaying the append-only log
255 | filename = self._tablefilename(database, table, createdatabase=True)
256 |         resarr = self._rawlistreverse(filename=filename, verbose=verbose) # [ {key:k1, value:v1}, {key:k2, value:v2}, {key:k1, value:v3}]
257 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set
258 | resdict = { kv["key"]: kv.get("value") for kv in resarr } # {k1:v3, k2:v2} - replaces earlier with later values for same key
259 | return list(resdict.keys()) # keys() returns a dict_keys object, want to return a list
260 |
261 | def getall(self, url=None, database=None, table=None, verbose=False):
262 |         # Get all key/value pairs from a table by replaying the append-only log
263 | filename = self._tablefilename(database, table, createdatabase=True)
265 |         resarr = self._rawlistreverse(filename=filename, verbose=verbose) # [ {key:k1, value:v1}, {key:k2, value:v2}, {key:k1, value:v3}]
266 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set
267 | resdict = { kv["key"]: kv.get("value") for kv in resarr } # {k1:v3, k2:v2} - replaces earlier with later values for same key
268 | return resdict
269 |
270 |
271 |
--------------------------------------------------------------------------------
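The key-value methods above (set, get, delete, keys, getall) never rewrite the table file: writers only append, and readers replay the log. A minimal standalone sketch of those replay semantics (the data here is illustrative, not from the repo):

```python
# Replay semantics used by get()/keys()/getall(): a table is an append-only
# log of {"key": ..., "value": ...} lines, and later entries override earlier
# ones. delete() appends a tombstone, an entry with a key but no value, so
# subsequent reads return None for that key.
log = [
    {"key": "aaa", "value": "AAA"},
    {"key": "bbb", "value": "BBB"},
    {"key": "aaa"},  # tombstone appended by delete()
]
resdict = {kv["key"]: kv.get("value") for kv in log}
assert resdict == {"aaa": None, "bbb": "BBB"}
assert list(resdict.keys()) == ["aaa", "bbb"]
```

This matches what test_keyvaluetable in python/test/test_local.py expects after deleting "aaa".
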
/python/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/python/__init__.py
--------------------------------------------------------------------------------
/python/config.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import socket
3 | import logging
4 | import urllib.parse
5 |
6 |
7 | config = {
8 | "archive": {
9 | "url_download": "https://archive.org/download/", # TODO-PERMS - usage checked
10 | "url_servicesimg": "https://archive.org/services/img/",
11 | "url_metadata": "https://archive.org/metadata/",
12 | "url_btihsearch": 'https://archive.org/advancedsearch.php?fl=identifier,btih&output=json&rows=1&q=btih:',
13 | "url_sha1search": "http://archive.org/services/dwhf.php?key=sha1&val=",
14 | },
15 | "ipfs": {
16 |         "url_add_data": "http://localhost:5001/api/v0/add", # For use on the gateway, or if running "ipfs daemon" on a test machine
17 |         # "url_add_data": "https://ipfs.dweb.me/api/v0/add", # note Kyle was using localhost:5001/api/v0/add which won't resolve externally.
18 |         # "url_add_url": "http://localhost:5001/api/v0/add", #TODO-IPFS move uses of url_add_data to urladd when it's working
19 | "url_urlstore": "http://localhost:5001/api/v0/urlstore/add", # Should have "ipfs daemon" running locally
20 | "url_dht_provide": "http://localhost:5001/api/v0/dht/provide",
21 | },
22 | "gateway": {
23 |         "url_metadata": "https://dweb.me/arc/archive.org/metadata/",
24 | "url_download": "https://dweb.me/arc/archive.org/download/", # TODO-PERMS usage checked
25 | "url_servicesimg": "https://dweb.me/arc/archive.org/thumbnail/",
26 | "url_torrent": "https://dweb.me/arc/archive.org/torrent/", #TODO-PERMS CHECK USAGE
27 | },
28 | "httpserver": { # Configuration used by generic HTTP server
29 | "favicon_url": "https://dweb.me/favicon.ico",
30 | "root_path": "info",
31 | },
32 | "domains": {
33 |         # This is also the name of the directory in /usr/local/dweb-gateway/.cache/table; if this is changed, that directory can safely be renamed to the new name to retain saved metadata
34 | "metadataverifykey": 'NACL VERIFY:h9MB6YOnYEgby-ZRkFKzY3rPDGzzGZ8piGNwi9ltBf0=',
35 | "metadatapassphrase": "Replace this with something secret/arc/archive.org/metadata", # TODO - change for something secret!
36 |         "directory": '/usr/local/dweb-gateway/.cache/table/', # Used by maintenance; note this is overridden below for mitraglass (Mitra's laptop)
37 | },
38 | "directories": {
39 |         "bootloader": "/usr/local/dweb-archive/dist/bootloader.html", # Location of the bootloader file; note this is overridden below for mitraglass (Mitra's laptop)
40 | },
41 | "logging": {
42 | "level": logging.DEBUG,
43 | # "filename": '/var/log/dweb/dweb-gateway', # Use stdout for logging and redirect in supervisorctl
44 | },
45 | "ignoreurls": [ # Ignore these, they are hacks or similar
46 | urllib.parse.unquote("%E2%80%9D"),
47 | ".well-known",
48 | "clientaccesspolicy.xml",
49 | "db",
50 | "index.php",
51 | "mysqladmin",
52 | "login.cgi",
53 |         "robots.txt", # Not a hack, but we don't have one TODO
54 | "phpmyadmin",
55 | "phpMyAdminold",
56 | "phpMyAdmin.old",
57 | "phpmyadmin-old",
58 | "phpMyadmin_bak",
59 | "phpMyAdmin",
60 |         "phpma",
61 | "phpmyadmin0",
62 | "phpmyadmin1",
63 | "phpmyadmin2",
64 | "pma",
65 | "PMA",
66 | "scripts",
67 | "setup.php",
68 | "sitemap.xml",
69 | "sqladmin",
70 | "tools",
71 | "typo3",
72 | "web",
73 | "www",
74 | "xampp",
75 | ],
76 |     "torrent_reject_list": [ # Baked into the torrentmaker in petabox/sw/bin/ia_make_torrent.py # See Archive/inTorrent()
77 |         "_archive.torrent", # The torrent file isn't in itself!
78 | "_files.xml",
79 | "_reviews.xml",
80 | "_all.torrent", # aborted abuie torrent-izing
81 | "_64kb_mp3.zip", # old packaged streamable mp3s for etree
82 | "_256kb_mp3.zip",
83 | "_vbr_mp3.zip",
84 | "_meta.txt", # s3 upload turds
85 | "_raw_jp2.zip", # scribe nodes
86 | "_orig_cr2.tar",
87 | "_orig_jp2.tar",
88 | "_raw_jpg.tar", # could exclude scandata.zip too maybe...
89 |         "_meta.xml" # Always written after the torrent so can't be in it
90 | ],
91 | "torrent_reject_collections": [ # See Archive/inTorrent()
92 | "loggedin",
93 | "georestricted"
94 | ],
95 | "have_no_sha1_list": [
96 | "_files.xml"
97 | ]
98 | }
99 | if socket.gethostname() in ["wwwb-dev0.fnf.archive.org"]:
100 | pass
101 | elif socket.gethostname().startswith('mitraglass'):
102 | config["directories"]["bootloader"] = "/Users/mitra/git/dweb-archive/bootloader.html"
103 | config["domains"]["directory"] = "/Users/mitra/git/dweb-gateway/.cache/table/"
104 | else:
105 | # Probably on docker
106 | pass
107 |
108 |
--------------------------------------------------------------------------------
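For orientation, a minimal sketch of how other modules consume this config, mirroring the pattern already used in maintenance.py and rungate.py (the item identifier here is just an example):

```python
# Illustrative only: config is a plain dict that callers import and index.
import logging
from python.config import config

logging.basicConfig(**config["logging"])  # same pattern as the servers use
metadata_url = config["archive"]["url_metadata"] + "commute"  # example item
logging.debug("Would fetch metadata from %s", metadata_url)
```
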
/python/elastic_schema.json:
--------------------------------------------------------------------------------
1 | {
2 | "mappings": {
3 | "work": {
4 | "_all": { "enabled": true },
5 | "properties": {
6 | "doi": { "type": "keyword" },
7 | "title": { "type": "text", "boost": 3.0 },
8 | "authors": { "type": "text", "boost": 2.0 },
9 | "journal": { "type": "text" },
10 | "date": { "type": "date" },
11 | "publisher":{ "type": "text", "include_in_all": false },
12 | "topic": { "type": "text", "include_in_all": false },
13 | "media": { "type": "keyword", "include_in_all": false }
14 | }
15 | }
16 | }
17 | }
18 |
--------------------------------------------------------------------------------
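A hypothetical sketch of loading this mapping when creating an index; the endpoint (localhost:9200) and index name ("papers") are assumptions, not taken from this repo, and the "_all"/"include_in_all" settings imply an Elasticsearch 5.x-era server:

```python
# Hypothetical: create an index whose mapping comes from elastic_schema.json.
import json
import requests

with open("python/elastic_schema.json") as f:
    schema = json.load(f)

r = requests.put("http://localhost:9200/papers", json=schema)  # names assumed
r.raise_for_status()
print(r.json())
```
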
/python/maintenance.py:
--------------------------------------------------------------------------------
1 | import logging
2 | # This is run every 10 minutes by Cron (10 * 58 = 580 ~ 10 hours)
3 | from python.config import config
4 | import redis
5 | import base58
6 | from .HashStore import StateService
7 | from .TransportIPFS import TransportIPFS
8 |
9 | logging.basicConfig(**config["logging"]) # For server
10 |
11 | def resetipfs(removeipfs=False, reseedipfs=False, removemagnet=False, announcedht=False, verbose=False, fixbadurls=False):
12 | """
13 | Loop over and "reset" ipfs
14 | :param removeipfs: If set will remove all cached pointers to IPFS - note this is part of a three stage process see notes in cleanipfs.sh
15 | :param reseedipfs: If set we will ping the ipfs.io gateway to make sure it knows about our files, this isn't used any more
16 |     :param removemagnet: Remove all cached magnet links (e.g. to add a new default tracker)
17 | :param announcedht: Announce our files to the DHT - currently run by cron regularly
18 | :param verbose: Generate verbose debugging - the code below could use more of this
19 |     :param fixbadurls: Removes some historically bad URLs; this was done once so it isn't needed again - just left as a modifiable stub.
20 | :return:
21 | """
22 | knownbadhashes = [
23 | "zb2rhhEncXjn7PnqJ16mzfeug1bqWuupQ3PnkhnWLpAaDatiZ", # audio
24 | "zb2rhiSEszTZ4YuY7GJScy6jKZTJuR97MLs7KSe2nKLHwb4A7", # texts
25 | "zb2rhk2FYVEy5VRHmaEzor7NuA936E8GGaokZFurKmUE959zx", # movies
26 | ]
27 | r = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True)
28 | reseeded = 0
29 | removed = 0
30 | magremoved = 0
31 | total = 0
32 | withipfs = 0
33 | withmagnet = 0
34 | announceddht = 0
35 | if announcedht:
36 | dhtround = ((int(((StateService.get("LastDHTround", verbose)) or 0)) + 1) % 58)
37 | StateService.set("LastDHTround", dhtround, verbose)
38 | dhtroundletter = base58.b58encode_int(dhtround)
39 | logging.debug("DHT round: {}".format(dhtroundletter))
40 | for i in r.scan_iter():
41 | total = total+1
42 | if fixbadurls:
43 | url = r.hget(i, "url")
44 |             if url.startswith("ipfs:"):
45 | logging.debug("Would delete {} .url= {}".format(i,url))
46 | #r.hdel(i, "url")
47 | for k in ["magnetlink"]:
48 | magnetlink = r.hget(i, k)
49 | if magnetlink:
50 | withmagnet = withmagnet + 1
51 | if removemagnet:
52 | r.hdel(i, k)
53 | magremoved = magremoved + 1
54 |
55 | for k in [ "ipldhash", "thumbnailipfs" ]:
56 | ipfs = r.hget(i, k)
57 | #print(i, ipfs)
58 | if ipfs:
59 | withipfs = withipfs + 1
60 | ipfs = ipfs.replace("ipfs:/ipfs/", "") # The hash
61 | if removeipfs or (ipfs in knownbadhashes):
62 | r.hdel(i, k)
63 | removed = removed + 1
64 | if reseedipfs:
65 | #logging.debug("Reseeding {} {}".format(i, ipfs)) # Logged in TransportIPFS
66 | TransportIPFS().pinggateway(ipfs)
67 | reseeded = reseeded + 1
68 | if announcedht:
69 | #print("Testing ipfs {} .. {} from {}".format(ipfs[6],dhtroundletter,ipfs))
70 | if dhtroundletter == ipfs[6]: # Compare far enough into string to be random
71 | # logging.debug("Announcing {} {}".format(i, ipfs)) # Logged in TransportIPFS
72 | TransportIPFS().announcedht(ipfs)
73 | announceddht = announceddht + 1
74 | logging.debug("Scanned {}, withipfs {}, deleted {}, reseeded {}, announced {}, magremoved {}".format(total, withipfs, removed, reseeded, announceddht, magremoved))
75 |
76 | # To announce DHT under cron
77 | #logging.basicConfig(**config["logging"]) # For server
78 | #resetipfs(announcedht=True)
79 |
80 | # To fully reset IPFS need to also ...
81 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where leafs are stored - these refer to IPFS hashes for metadata
82 | # Clean out the repo (Arkadiy to provide info)
83 |
--------------------------------------------------------------------------------
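The announcedht path above spreads announces across cron runs: the round counter advances mod 58, is base58-encoded to a single character, and only hashes whose 7th character matches are announced, so the full space is covered in roughly 58 runs (about 10 hours at 10-minute intervals). A standalone sketch of that partitioning, with an illustrative hash list:

```python
# Round-robin partitioning as in resetipfs(announcedht=True) above.
import base58

def letter_for_round(dhtround):
    enc = base58.b58encode_int(dhtround % 58)
    # b58encode_int returns bytes in newer versions of the base58 package
    return enc.decode() if isinstance(enc, bytes) else enc

hashes = ["zb2rhhEncXjn7PnqJ16mzfeug1bqWuupQ3PnkhnWLpAaDatiZ"]  # example from above
letter = letter_for_round(7)  # e.g. the 7th run
to_announce = [h for h in hashes if h[6] == letter]  # 7th char is effectively random
```
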
/python/miscutils.py:
--------------------------------------------------------------------------------
1 | """
2 | This is a place to put miscellaneous utilities, not specific to this project
3 | """
4 | import json # Note: don't "from json import dumps" as it clashes with the dumps redefined below
5 | from datetime import datetime
6 | import requests
7 | import logging
8 | from magneturi import bencode
9 | import base64
10 | import hashlib
11 | import urllib.parse
12 | from .Errors import TransportURLNotFound, ForbiddenException
13 | from .config import config
14 |
15 |
16 |
17 | def mergeoptions(a, b):
18 | """
19 | Deep merge options dictionaries
20 | - note this might not (yet) handle Arrays correctly but handles nested dictionaries
21 |
22 | :param a,b: Dictionaries
23 | :returns: Deep copied merge of the dictionaries
24 | """
25 | c = a.copy()
26 | for key in b:
27 | val = b[key]
28 | if isinstance(val, dict) and a.get(key, None):
29 | c[key] = mergeoptions(a[key], b[key])
30 | else:
31 | c[key] = b[key]
32 | return c
33 |
34 | def dumps(obj): #TODO-BACKPORT FROM GATEWAY TO DWEB - moved from Transport to miscutils
35 | """
36 | Convert arbitrary data into a JSON string that can be deterministically hashed or compared.
37 | Must be valid for loading with json.loads (unless change all calls to that).
38 | Exception: UnicodeDecodeError if data is binary
39 |
40 | :param obj: Any
41 | :return: JSON string that can be deterministically hashed or compared
42 | """
43 |     # ensure_ascii = False was set because otherwise, reading binary content and embedding it as "data" in a StructuredBlock complains
44 |     # that it can't convert it to UTF8. (The Wrenchicon was an example of this), but loads couldn't handle the result anyway.
45 | # sort_keys = True so that dict always returned same way so can be hashed
46 | # separators = (,:) gets the most compact representation
47 | return json.dumps(obj, sort_keys=True, separators=(',', ':'), default=json_default)
48 |
49 | def loads(s):
50 | """
51 |
52 | :param s: JSON string to convert
53 | :return: Python dictionary, array, string etc depending on s
54 | :raises: json.decoder.JSONDecodeError if not json
55 | """
56 | if isinstance(s, bytes): #TODO can remove once python upgraded to 3.6.2
57 | s = s.decode('utf-8')
58 | return json.loads(s) # Will fail if s empty, or not json
59 |
60 | def json_default(obj): #TODO-BACKPORT FROM GATEWAY TO DWEB - moved from Transport to miscutils
61 | """
62 | Default JSON serialiser especially for handling datetime, can add handling for other special cases here
63 |
64 | :param obj: Anything json dumps can't serialize
65 | :return: string for extended types
66 | """
67 | if isinstance(obj, datetime): # Using isinstance rather than hasattr because __getattr__ always returns True
68 | #if hasattr(obj,"isoformat"): # Especially for datetime
69 | return obj.isoformat()
70 | try:
71 | return obj.dumps() # See if the object has its own dumps
72 | except Exception as e:
73 | raise TypeError("Type {0} not serializable".format(obj.__class__.__name__)) from e
74 |
75 |
76 | def httpget(url, wantmime=False, range=None):
77 | # Returns the content - i.e. bytes
78 |     # Raises TransportFileNotFound or HTTPError TODO latter error should be caught
79 | #TODO-STREAMS future work to return a stream
80 | #TODO-PERMS should ideally check perms here, or pass flag to make it check or similar
81 | #TODO-PERMS should also pass the X-ORIGINATING-IP (?) header, but need to figure out how to get that.
82 | r = None # So that if exception in get, r is still defined and can be tested for None
83 | try:
84 | logging.debug("GET {} {}".format(url, range if range else ""))
85 | headers = { "Connection": "keep-alive"}
86 | if range: headers["range"] = range
87 | r = requests.get(url, headers=headers)
88 | r.raise_for_status()
89 |         if not r.encoding or ("application/pdf" in r.headers.get('content-type', '')) or ("image/" in r.headers.get('content-type', '')):  # default '' guards a missing content-type header
90 | data = r.content # Should work for PDF or other binary types
91 | else:
92 | data = r.text
93 | if wantmime:
94 | return data, r.headers.get('content-type')
95 | else:
96 | return data
97 | #TODO-STREAM support streams in future
98 |
99 | except (requests.exceptions.RequestException, requests.exceptions.HTTPError, requests.exceptions.InvalidSchema) as e:
100 | if r is not None and (r.status_code == 404):
101 | raise TransportURLNotFound(url=url)
102 | elif r is not None and (r.status_code == 403):
103 | raise ForbiddenException(what=e)
104 | else:
105 | logging.error("HTTP request failed err={}".format(e))
106 | raise e
107 | except requests.exceptions.MissingSchema as e:
108 | logging.error("HTTP request failed", exc_info=True)
109 | raise e # For now just raise it
110 |
111 |
--------------------------------------------------------------------------------
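A quick standalone demonstration (not from the repo) of why dumps() above is safe to hash or compare: sorted keys plus compact separators give a canonical string, and json_default extends that to datetimes:

```python
from datetime import datetime
from python.miscutils import dumps, loads

# Key order in the input does not affect the output string.
assert dumps({"b": 2, "a": 1}) == dumps({"a": 1, "b": 2}) == '{"a":1,"b":2}'
assert loads('{"a":1,"b":2}') == {"a": 1, "b": 2}
# datetimes are serialized via json_default -> isoformat()
assert dumps({"when": datetime(2018, 1, 1)}) == '{"when":"2018-01-01T00:00:00"}'
```
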
/python/requirements.txt:
--------------------------------------------------------------------------------
1 | sha3
2 | redis
3 | requests
4 | #multihash - no longer used and its buggy anyway
5 | py-dateutil
6 | base58
7 | pynacl # pynacl is needed, but requires sudo - if installation fails here, run it once as sudo
8 | pyblake2
9 | # For Brians search, maybe not in production
10 | flask
11 | #binascii - built in
12 | #hashlib - built in I think, and fails to install with pip3
13 | #struct - built in
14 | magneturi # To decode magnet files
15 | bencode # To decode Bittorrents binary encoding
16 |
17 |
--------------------------------------------------------------------------------
/python/test/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/python/test/__init__.py
--------------------------------------------------------------------------------
/python/test/_utils.py:
--------------------------------------------------------------------------------
1 | from python.ServerGateway import DwebGatewayHTTPRequestHandler
2 |
3 |
4 | def _processurl(url, verbose=False, headers={}, **kwargs):
5 |     # Simulates the HTTP Server process - won't work for all methods
6 | args = url.split('/')
7 | method = args.pop(0)
8 |     DwebGatewayHTTPRequestHandler.headers = headers # This is a kludge: put headers on the class because the method expects an instance.
9 | f = getattr(DwebGatewayHTTPRequestHandler, method)
10 | assert f
11 | namespace = args.pop(0)
12 | if verbose: kwargs["verbose"] = True
13 | res = f(DwebGatewayHTTPRequestHandler, namespace, *args, **kwargs)
14 | return res
15 |
--------------------------------------------------------------------------------
/python/test/test_LocationService.py:
--------------------------------------------------------------------------------
1 | from python.HashStore import HashStore, LocationService
2 |
3 | MULTIHASH = "testmultihash"
4 | FIELD = "testfield"
5 | VALUE = "testvalue"
6 |
7 |
8 | def test_hash_store():
9 | HashStore.hash_set(MULTIHASH, FIELD, VALUE)
10 | assert HashStore.hash_get(MULTIHASH, FIELD) == VALUE
11 |
12 | def test_location_service():
13 | LocationService.set(MULTIHASH, VALUE)
14 | LocationService.get(MULTIHASH)
15 |
--------------------------------------------------------------------------------
/python/test/test_archive.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from datetime import datetime
3 | from ._utils import _processurl
4 | from python.miscutils import dumps, loads
5 | from python.Archive import ArchiveItemNotFound
6 | from python.config import config
7 |
8 | logging.basicConfig(level=logging.DEBUG) # Log to stderr
9 |
10 | def test_archiveid():
11 | verbose=False
12 | if verbose: logging.debug("Starting test_archiveid")
13 | itemid = "commute"
14 | btih='XCMYARDAKNWYBERJHUSQR5RJG63JX46B'
15 | magnetlink='magnet:?xt=urn:btih:XCMYARDAKNWYBERJHUSQR5RJG63JX46B&tr=http%3A%2F%2Fbt1.archive.org%3A6969%2Fannounce&tr=http%3A%2F%2Fbt2.archive.org%3A6969%2Fannounce&tr=wss%3A%2F%2Ftracker.btorrent.xyz&tr=wss%3A%2F%2Ftracker.openwebtorrent.com&tr=wss%3A%2F%2Ftracker.fastcast.nz&ws=https%3A%2F%2Fdweb.me%2Farc%2Farchive.org%2Fdownload%2F&xs=https%3A%2F%2Fdweb.me%2Farc%2Farchive.org%2Ftorrent%2Fcommute'
16 | res = _processurl("arc/archive.org/metadata/{}".format(itemid), verbose) # Simulate what the server would do with the URL
17 |
18 | if verbose: logging.debug("test_archiveid metadata returned {0}".format(res))
19 | assert res["data"]["metadata"]["identifier"] == itemid
20 | assert res["data"]["metadata"]["magnetlink"] == magnetlink
21 | assert "ipfs:/ipfs" in res["data"]["metadata"]["thumbnaillinks"][0]
22 | assert itemid in res["data"]["metadata"]["thumbnaillinks"][1]
23 | if verbose: logging.debug("test_archiveid complete")
24 | res = _processurl("magnetlink/btih/{}".format(btih), verbose)
25 | if verbose: logging.debug("test_archiveid magnetlink returned {0}".format(res))
26 | assert res["data"] == magnetlink
27 |
28 | def test_collectionsortorder():
29 | verbose=True
30 | itemid="prelinger"
31 | collectionurl = "arc/archive.org/metadata/{}"
32 | res = _processurl(collectionurl.format(itemid), verbose) # Simulate what the server would do with the URL
33 | assert res["data"]["collection_sort_order"] == "-downloads"
34 |
35 | def test_leaf():
36 | verbose=False
37 | if verbose: logging.debug("Starting test_leaf")
38 | # Test it can respond to leaf requests
39 | item = "commute"
40 | # leafurl="leaf/archiveid" OLD FORM
41 | leafurl="arc/archive.org/leaf"
42 | res = _processurl(leafurl, verbose=verbose, key=item) # Simulate what the server would do with the URL
43 | if verbose: logging.debug("{} returned {}".format(leafurl, res))
44 | leafurl="get/table/{}/domain".format(config["domains"]["metadataverifykey"]) #TODO-ARC
45 | res = _processurl(leafurl, verbose=verbose, key=item) # Should get value cached above
46 | if verbose: logging.debug("{} returned {}".format(leafurl, res))
47 |
48 | def test_archiveerrs():
49 | verbose=True
50 |     if verbose: logging.debug("Starting test_archiveerrs")
51 | itemid = "nosuchitematall"
52 | try:
53 | res = _processurl("arc/archive.org/metadata/{}".format(itemid), verbose) # Simulate what the server would do with the URL
54 | except ArchiveItemNotFound as e:
55 | pass # Expecting an error
56 |
57 | def test_search():
58 | verbose=True
59 | kwargs1={ # Taken from example home page
60 | 'output': "json",
61 | 'q': "mediatype:collection AND NOT noindex:true AND NOT collection:web AND NOT identifier:fav-* AND NOT identifier:what_cd AND NOT identifier:cd AND NOT identifier:vinyl AND NOT identifier:librarygenesis AND NOT identifier:bibalex AND NOT identifier:movies AND NOT identifier:audio AND NOT identifier:texts AND NOT identifier:software AND NOT identifier:image AND NOT identifier:data AND NOT identifier:web AND NOT identifier:additional_collections AND NOT identifier:animationandcartoons AND NOT identifier:artsandmusicvideos AND NOT identifier:audio_bookspoetry AND NOT identifier:audio_foreign AND NOT identifier:audio_music AND NOT identifier:audio_news AND NOT identifier:audio_podcast AND NOT identifier:audio_religion AND NOT identifier:audio_tech AND NOT identifier:computersandtechvideos AND NOT identifier:coverartarchive AND NOT identifier:culturalandacademicfilms AND NOT identifier:ephemera AND NOT identifier:gamevideos AND NOT identifier:inlibrary AND NOT identifier:moviesandfilms AND NOT identifier:newsandpublicaffairs AND NOT identifier:ourmedia AND NOT identifier:radioprograms AND NOT identifier:samples_only AND NOT identifier:spiritualityandreligion AND NOT identifier:stream_only AND NOT identifier:television AND NOT identifier:test_collection AND NOT identifier:usgovfilms AND NOT identifier:vlogs AND NOT identifier:youth_media",
62 | 'rows': "75",
63 | 'sort[]': "-downloads",
64 | 'and[]': ""
65 | }
66 |     kwargs2={ # Taken from an example search
67 | 'output': "json",
68 | 'q': "prelinger",
69 | 'rows': "75",
70 | 'sort[]': "",
71 | 'and[]': ""
72 | }
73 | #res = _processurl("metadata/advancedsearch", verbose, **kwargs2) # Simulate what the server would do with the URL
74 | res = _processurl("arc/archive.org/advancedsearch", verbose, **kwargs2) # Simulate what the server would do with the URL
76 | logging.debug(res)
77 |
--------------------------------------------------------------------------------
/python/test/test_doi.py:
--------------------------------------------------------------------------------
1 | from python.Multihash import Multihash
2 | import logging
3 | from ._utils import _processurl
4 |
5 | DOIURL = "metadata/doi/10.1001/jama.2009.1064"
6 | CONTENTMULTIHASH = "5dqpnTaoMSJPpsHna58ZJHcrcJeAjW"
7 | PDF_SHA1HEX="02efe2abec13a309916c6860de5ad8a8a096fe5d"
8 | #CONTENTHASHURL = "content/contenthash/" + CONTENTMULTIHASH # OLD STYLE
9 | CONTENTHASHURL = "contenthash/" + CONTENTMULTIHASH
10 | #SHA1HEXMETADATAURL = "metadata/sha1hex/"+PDF_SHA1HEX # OLD STYLE
11 | #SHA1HEXCONTENTURL = "content/sha1hex/"+PDF_SHA1HEX # OLD STYLE
12 | SHA1HEXCONTENTURL = "sha1hex/" + PDF_SHA1HEX
13 | CONTENTSIZE = 262438
14 | QBF="The Quick Brown Fox"
15 | BASESTRING="A quick brown fox"
16 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww" # Sha1 of above
17 |
18 |
19 | logging.basicConfig(level=logging.DEBUG) # Log to stderr
20 |
21 | def test_doi_resolve():
22 | verbose=False # True to debug
23 | res = _processurl(DOIURL, verbose)
24 | assert res["Content-type"] == "application/json"
25 | #assert res["data"]["files"][0]["sha1hex"] == PDF_SHA1HEX, "Would check sha1hex, but not returning now do multihash58"
26 | assert res["data"]["files"][0]["multihash58"] == CONTENTMULTIHASH
27 |
28 |
29 | def test_contenthash_resolve():
30 | verbose=False # True to debug
31 | res = _processurl(CONTENTHASHURL, verbose) # Simulate what the server would do with the URL
32 | assert res["Content-type"] == "application/pdf", "Check retrieved content of expected type"
33 | assert len(res["data"]) == CONTENTSIZE, "Check retrieved content of expected length"
34 | multihash = Multihash(data=res["data"], code=Multihash.SHA1)
35 | assert multihash.multihash58 == CONTENTMULTIHASH, "Check retrieved content has same multihash58_sha1 as we expect"
36 | assert multihash.sha1hex == PDF_SHA1HEX, "Check retrieved content has same hex sha1 as we expect"
37 |
38 | def test_sha1hexcontent_resolve():
39 | verbose = False # True to debug
40 | res = _processurl(SHA1HEXCONTENTURL, verbose) # Simulate what the server would do with the URL
41 | assert res["Content-type"] == "application/pdf", "Check retrieved content of expected type"
42 | assert len(res["data"]) == CONTENTSIZE, "Check retrieved content of expected length"
43 | multihash = Multihash(data=res["data"], code=Multihash.SHA1)
44 | assert multihash.multihash58 == CONTENTMULTIHASH, "Check retrieved content has same multihash58_sha1 as we expect"
45 | assert multihash.sha1hex == PDF_SHA1HEX, "Check retrieved content has same hex sha1 as we expect"
46 |
47 | def test_sha1hexmetadata_resolve():
48 | verbose = False # True to debug
49 | res = _processurl("sha1hex/"+PDF_SHA1HEX, verbose, output="metadata") # Simulate what the server would do with the URL
50 | if verbose: logging.debug("test_sha1hexmetadata_resolve {0}".format(res))
51 | assert res["Content-type"] == "application/json", "Check retrieved content of expected type"
52 | assert res["data"]["metadata"]["size_bytes"] == CONTENTSIZE
53 | assert res["data"]["metadata"]["multihash58"] == CONTENTMULTIHASH, "Expecting multihash58 of sha1"
54 |
55 |
--------------------------------------------------------------------------------
/python/test/test_local.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from datetime import datetime
3 | from ._utils import _processurl
4 | from python.miscutils import dumps, loads
5 |
6 | logging.basicConfig(level=logging.DEBUG) # Log to stderr
7 |
8 | CONTENTMULTIHASH = "5dqpnTaoMSJPpsHna58ZJHcrcJeAjW"
9 | BASESTRING="A quick brown fox"
10 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww" # Sha1 of above
11 |
12 |
13 | def test_local():
14 | verbose=True
15 | res = _processurl("contenthash/rawstore", verbose, data=BASESTRING.encode('utf-8')) # Simulate what the server would do with the URL #TODO-ARC
16 | if verbose: logging.debug("test_local store returned {0}".format(res))
17 | contenthash = res["data"]
18 | res = _processurl("content/rawfetch/{0}".format(contenthash), verbose) # Simulate what the server would do with the URL #TODO-ARC
19 | if verbose: logging.debug("test_local content/rawfetch/{0} returned {1}".format(contenthash, res))
20 | assert res["data"].decode('utf-8') == BASESTRING
21 | #res = _processurl("content/contenthash/{0}".format(contenthash), verbose) # OLD STYLE
22 | res = _processurl("contenthash/{0}".format(contenthash), verbose)
23 | if verbose: logging.debug("test_local content/contenthash/{0} returned {1}".format(contenthash, res))
24 |
25 | def test_list():
26 | verbose = True
27 | date = datetime.utcnow().isoformat()
28 | adddict = { "urls": [ CONTENTMULTIHASH ], "date": date, "signature": "XXYYYZZZ", "signedby": [ SHA1BASESTRING ], "verbose": verbose }
29 | res = _processurl("void/rawadd/"+SHA1BASESTRING, verbose, data=dumps(adddict)) #TODO-ARC
30 | if verbose: logging.debug("test_list {0}".format(res))
31 | res = _processurl("metadata/rawlist/{0}".format(SHA1BASESTRING), verbose, data=dumps(adddict)) #TODO-ARC
32 | if verbose: logging.debug("rawlist returned {0}".format(res))
33 | assert res["data"][-1]["date"] == date
34 |
35 | def test_keyvaluetable(): #TODO-ARC
36 | verbose=True
37 | database = "Q123456789"
38 | table = "mytesttable"
39 | res = _processurl("set/table/{}/{}".format(database, table), data=dumps([{"key": "aaa", "value": "AAA"}, {"key": "bbb", "value": "BBB"}]), verbose=verbose)
40 | res = _processurl("get/table/{}/{}".format(database, table), key="aaa", verbose=verbose)
41 | assert res["data"]["aaa"] == "AAA"
42 | res = _processurl("get/table/{}/{}".format(database, table), key=["aaa","bbb"], verbose=verbose)
43 | assert res["data"]["aaa"] == "AAA" and res["data"]["bbb"] == "BBB"
44 | res = _processurl("delete/table/{}/{}".format(database, table), key="aaa", verbose=verbose)
45 | res = _processurl("get/table/{}/{}".format(database, table), key="aaa", verbose=verbose)
46 | assert res["data"]["aaa"] is None
47 | res = _processurl("keys/table/{}/{}".format(database, table), verbose=verbose)
48 | assert len(res["data"]) == 2
49 | res = _processurl("getall/table/{}/{}".format(database, table), verbose=verbose)
50 | assert res["data"]["aaa"] is None and res["data"]["bbb"] == "BBB"
51 |
--------------------------------------------------------------------------------
/python/test/test_multihash.py:
--------------------------------------------------------------------------------
1 | from python.Multihash import Multihash
2 |
3 | BASESTRING="A quick brown fox"
4 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww"
5 |
6 | PDF_SHA1HEX="02efe2abec13a309916c6860de5ad8a8a096fe5d"
7 | PDF_MULTIHASHSHA1_58="5dqpnTaoMSJPpsHna58ZJHcrcJeAjW"
8 |
9 | def test_sha1():
10 | assert Multihash(data=BASESTRING.encode('utf-8'), code=Multihash.SHA1).multihash58 == SHA1BASESTRING, "Check expected sha1 from encoding basestring"
11 | assert Multihash(sha1hex=PDF_SHA1HEX).multihash58 == PDF_MULTIHASHSHA1_58
12 | assert Multihash(multihash58=PDF_MULTIHASHSHA1_58).sha1hex == PDF_SHA1HEX
13 |
--------------------------------------------------------------------------------
/rungate.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from python.ServerGateway import DwebGatewayHTTPRequestHandler
3 | # This is just used for running tests
4 | from python.config import config
5 | logging.basicConfig(**config["logging"])
6 | DwebGatewayHTTPRequestHandler.DwebGatewayHTTPServeForever({'ipandport': ('localhost',4244)}) # Run local gateway
--------------------------------------------------------------------------------
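Once rungate.py is running, the gateway answers on localhost:4244; a hypothetical smoke test (the /info path comes from config["httpserver"]["root_path"], and is the same path tests.sh curls on the deployed gateway):

```python
# Hypothetical smoke test against the local gateway started by rungate.py.
import requests

r = requests.get("http://localhost:4244/info")
r.raise_for_status()
print(r.text)  # info response; format not assumed here
```
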
/scripts/install.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | set -x
4 | ARG=$1
5 | GITNAME=dweb-gateway
6 | GITDIR=/usr/local/${GITNAME}
7 | SERVICENAME="dweb:dweb-gateway"
8 |
9 | cd $GITDIR
10 | #pip install --disable-pip-version-check -U $PIPS
11 | pip3 -q install --disable-pip-version-check -U -r python/requirements.txt
12 | [ -d data ] || mkdir data
13 | # First push whatever branch we are on
14 | git status | grep 'nothing to commit' || git commit -a -m "Changes made on server"
15 | git status | grep 'git push' && git push
16 |
17 | # Now switch to deployed branch - we'll probably be on it already
18 | git checkout deployed # Will run server branch
19 | git pull
20 |
21 | # Now merge the origin of deployable
22 | git merge origin/deployable
23 |
24 | # And push the merge
25 | git status | grep 'nothing to commit' || git commit -a -m "Merged deployable into deployed on server"
26 | git status | grep 'git push' && git push
27 |
28 | if [ ! -f data/idents_files_urls.sqlite ]
29 | then
30 | curl -L -o data/idents_files_urls.sqlite.gz https://archive.org/download/ia_papers_manifest_20170919/index/idents_files_urls.sqlite.gz
31 | gunzip data/idents_files_urls.sqlite.gz
32 | fi
33 |
34 | diff -r nginx /etc/nginx/sites-enabled
35 | if [ "$ARG" == "NGINX" ]
36 | then
37 | sudo cp nginx/* /etc/nginx/sites-available
38 | if sudo service nginx reload
39 | then
40 | echo "NGINX restarted"
41 | else
42 | systemctl status nginx.service
43 | fi
44 | fi
45 | diff etc_supervisor_conf.d_dweb.conf /etc/supervisor/conf.d/dweb.conf
46 |
47 | sudo supervisorctl restart $SERVICENAME
48 |
49 | if [ "$ARG" == "TORRENT" ]
50 | then
51 | echo "TODO run some kind of installer from dweb-transport"
52 | fi
--------------------------------------------------------------------------------
/scripts/reset_ipfs.sh:
--------------------------------------------------------------------------------
1 | # Removes references to IPFS from the server, and cleans up ipfs.
2 | # It might be worth running `sudo ipfs sh; cd /home/ipfs; gzip -c -r .ipfs >ipfsrepo.20180915.prerestore.zip` before
3 | # And will need to run the preseeder in /usr/local/dweb-mirror afterwards to get the popular collections back in
4 | cd /usr/local/dweb-gateway
5 |
6 | python3 -c '
7 | import logging
8 | import os
9 | from python.config import config
10 | from python.maintenance import resetipfs
11 |
12 | logging.basicConfig(**config["logging"]) # For server
13 | cachetabledomain=config["domains"]["directory"]+config["domains"]["metadataverifykey"]+"/domain"
14 | cachetable=config["domains"]["directory"]+config["domains"]["metadataverifykey"]
15 |
16 | print("Step 1: removing", cachetable, "which is where leafs are stored - these refer to IPFS hashes for metadata")
17 | try:
18 | os.remove(cachetabledomain)
19 | except FileNotFoundError: # Might already have been deleted
20 | pass
21 | try:
22 | os.rmdir(cachetable)
23 | except FileNotFoundError: # Might already have been deleted
24 | pass
25 |
26 | print("Step 2: Remove all REDIS links to IPFS hashes")
27 | resetipfs(removeipfs=True)
28 |
29 | print("Step 3: Clearing out IPFS repo")
30 |
31 | '
32 | # The sudo stuff below here isn't tested - all these commands need running as ipfs
33 | #sudo -u ipfs ipfs pin ls --type recursive -q | sudo -u ipfs xargs ipfs pin rm
34 | #sudo -u ipfs ipfs repo gc
35 |
36 |
--------------------------------------------------------------------------------
/scripts/temp.sh:
--------------------------------------------------------------------------------
1 | cd /usr/local/dweb-gateway
2 |
3 | python3 -c '
4 | import logging
5 | import os
6 | from python.config import config
7 | from python.maintenance import resetipfs
8 |
9 | logging.basicConfig(**config["logging"]) # For server
10 | cachetabledomain=config["domains"]["directory"]+config["domains"]["metadataverifykey"]+"/domain"
11 | cachetable=config["domains"]["directory"]+config["domains"]["metadataverifykey"]
12 |
13 | print("Step 2: Remove all REDIS links to MAGNETLINKS hashes")
14 | resetipfs(removemagnet=True)
15 |
16 | '
17 |
18 |
--------------------------------------------------------------------------------
/scripts/tests.sh:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 |
3 | # This is just a quick test set until proper Python tests are built
4 |
5 | #python -m python.ServerGateway &
6 |
7 | set -x
8 | curl https://gateway.dweb.me/info
9 | echo; echo # Terminate response and blank line
10 | curl https://gateway.dweb.me/metadata/doi/10.1001/jama.2009.1064?verbose=True
11 | echo; echo # Terminate response and blank line
12 |
13 | # Fetch the sha1 multihash from above
14 | curl -D- -o /dev/null https://dweb.me/contenthash/5dqpnTaoMSJPpsHna58ZJHcrcJeAjW?verbose=True
15 | echo; echo # Terminate response and blank line
16 |
17 | echo "Now trying errors"
18 | #curl https://gateway.dweb.me/INVALIDCOMMAND
19 | #curl https://gateway.dweb.me/content/doi/10.INVALIDPUB/jama.2009.1064?verbose=True
20 | #curl https://gateway.dweb.me/content/doi/10.1001/INVALIDDOC.2009.1064?verbose=True
21 |
--------------------------------------------------------------------------------
/temp.py:
--------------------------------------------------------------------------------
1 | import logging
2 | # This is run every 10 minutes by Cron (10 * 58 = 580 ~ 10 hours)
3 | from python.config import config
4 | import redis
5 | import base58
6 | from python.HashStore import StateService
7 | from python.TransportIPFS import TransportIPFS
8 | from python.maintenance import resetipfs
9 |
10 | logging.basicConfig(**config["logging"]) # For server
11 | resetipfs() # Empty - should just count, and delete known bad hashes
12 |
13 | # To fully reset IPFS need to also ...
14 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where leafs are stored - these refer to IPFS hashes for metadata
15 |
--------------------------------------------------------------------------------