├── .gitignore
├── .travis.yml
├── Academic Docs IPFS gateway.isf
├── Academic Docs IPFS gateway.pdf
├── Classes.md
├── Extending.md
├── HTTPAPI.md
├── LICENSE
├── Metadata.md
├── README.md
├── Usecases.md
├── cron_ipfs.py
├── etc_ferm_input_nginx
├── etc_supervisor_conf.d_dweb.conf
├── load_ipfs.py
├── nginx
│   ├── README.md
│   ├── dweb.archive.org
│   ├── dweb.me
│   ├── gateway.dweb.me
│   ├── ipfs.dweb.me
│   ├── ipfsconvert.dweb.me
│   └── www.dweb.me
├── python
│   ├── Archive.py
│   ├── Btih.py
│   ├── ContentStore.py
│   ├── DOI.py
│   ├── Errors.py
│   ├── HashResolvers.py
│   ├── HashStore.py
│   ├── KeyPair.py
│   ├── LocalResolver.py
│   ├── Multihash.py
│   ├── NameResolver.py
│   ├── OutputFormat.py
│   ├── ServerBase.py
│   ├── ServerGateway.py
│   ├── SmartDict.py
│   ├── Transport.py
│   ├── TransportHTTP.py
│   ├── TransportIPFS.py
│   ├── TransportLocal.py
│   ├── __init__.py
│   ├── config.py
│   ├── elastic_schema.json
│   ├── maintenance.py
│   ├── miscutils.py
│   ├── requirements.txt
│   └── test
│       ├── __init__.py
│       ├── _utils.py
│       ├── test_LocationService.py
│       ├── test_archive.py
│       ├── test_doi.py
│       ├── test_local.py
│       └── test_multihash.py
├── rungate.py
├── scripts
│   ├── install.sh
│   ├── reset_ipfs.sh
│   ├── temp.sh
│   └── tests.sh
└── temp.py

/.gitignore:
--------------------------------------------------------------------------------
1 | idents_files_urls.sqlite
2 | *.pyc
3 | # pycharm
4 | .idea
5 | .cache
6 | python/.cache
7 | idents_files_urls_sqlite
8 | dweb-gateway.log
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: python
2 | 
3 | python:
4 |   - "2.7"
5 |   - "3.6"
6 | 
7 | before_install: cd python
8 | 
9 | install:
10 |   - pip install -r requirements.txt
11 | 
12 | services:
13 |   - redis-server
14 | 
15 | script:
16 |   - python -m pytest test/
--------------------------------------------------------------------------------
/Academic Docs IPFS gateway.isf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/Academic Docs IPFS gateway.isf
--------------------------------------------------------------------------------
/Academic Docs IPFS gateway.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/Academic Docs IPFS gateway.pdf
--------------------------------------------------------------------------------
/Classes.md:
--------------------------------------------------------------------------------
1 | # dweb-gateway - Classes
2 | A decentralized web gateway for open academic papers on the Internet Archive
3 | 
4 | ## Important editing notes
5 | * Names might not be consistent below as it gets edited and code built.
6 | * Please edit to match names in the code as you notice conflicts.
7 | * A lot of this file will be moved into actual code as the skeleton gets built, just leaving summaries here.
8 | 
9 | ## Other Info Links
10 | 
11 | * [Main README](./README.md)
12 | * [Use Cases](./Usecases.md)
13 | * [Classes](./Classes.md) << You are here
14 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
15 | * [Proposal for meta data](./Metadata.md) - first draft - deleted :-(
16 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
17 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
18 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
19 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
20 | So for example: curl https://gateway.dweb.me/info
21 | 
22 | ## Overview
23 | 
24 | This gateway sits between a decentralized web server running locally
25 | (in this case a Go-IPFS server) and the Archive.
26 | It will expose a set of services to the server.
27 | 
28 | The data is stored in a sqlite database that matches DOI's to hashes of the files we know of,
29 | and the URLs to retrieve them.
30 | 
31 | Note it's multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
32 | various forms and formats (e.g. PDF, Doc; Final; Preprint).
33 | 
34 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
35 | 
36 | Especially see the main [README](./README.md) and [Use Cases](./Usecases.md)
37 | 
38 | ## Structure high level
39 | 
40 | Those services will be built from a set of microservices which may or may not be exposed.
41 | 
42 | All calls to the gateway will come through a server that routes to individual services.
43 | 
44 | Server URLs have a consistent form
45 |     /outputformat/namespace/namespace-dependent-string
46 | 
47 | Where:
48 | * outputformat: Extensible format wanted e.g. [IPLD](#IPLD) or [nameresolution](#nameresolution)
49 | * namespace: an extensible descriptor for name spaces e.g. "doi"
50 | * namespace-dependent-string: a string that may contain additional "/" dependent on the namespace.
51 | 
52 | This is implemented as a pair of steps
53 | - first the name is passed to a class representing the name space,
54 | then the object is passed to a class for the outputformat that can interpret it,
55 | and then a "content" method is called to output something for the client.
56 | 
57 | See [HTTPServer](#http-server) for how this is processed in an extensible form.
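
To make that pair of steps concrete, here is a minimal, runnable sketch of the dispatch. All names below (StubResolver, StubOutput, the lookup tables) are illustrative stand-ins, not the real classes; the actual routing lives in [ServerGateway.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ServerGateway.py).

```python
# Illustrative sketch only -- stand-ins for real classes such as DOIResolver and IPLDfiles.

class StubResolver:
    """Stand-in for a NameResolver subclass, e.g. DOIResolver."""
    @classmethod
    def new(cls, namespace, *args, **kwargs):
        obj = cls()
        obj.name = "/".join(args)   # e.g. "10.1234/abc-def"
        return obj

class StubOutput:
    """Stand-in for a GatewayOutput subclass, e.g. IPLDfiles."""
    contenttype = "application/json"
    def __init__(self, resolved):
        self.resolved = resolved
    def content(self):
        return '{"name": "%s"}' % self.resolved.name

NAMESPACES = {"doi": StubResolver}          # namespace -> NameResolver subclass
OUTPUTFORMATS = {"metadata": StubOutput}    # outputformat -> GatewayOutput subclass

def handle(path):
    outputformat, namespace, rest = path.lstrip("/").split("/", 2)
    obj = NAMESPACES[namespace].new(namespace, *rest.split("/"))   # step 1: resolve the name
    out = OUTPUTFORMATS[outputformat](obj)                         # step 2: wrap for the output format
    return out.content(), out.contenttype                          # body + Content-Type for the client

print(handle("/metadata/doi/10.1234/abc-def"))
# -> ('{"name": "10.1234/abc-def"}', 'application/json')
```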
58 | 
59 | ## Microservices
60 | 
61 | ### Summary
62 | 
63 | * HTTP Server: Routes queries to handlers based on the first part of the URL, pretty generic (code done, needs pushing)
64 | * Name Resolvers: A group of classes that recognize names and connect to internal resources
65 |   * NameResolver: Superclass of each Name resolution class
66 |   * NameResolverItem: Superclass to represent a file in a NameResolver
67 |   * NameResolverShard: Superclass to represent a shard of a NameResolverItem
68 |   * DOIResolver: A NameResolver that knows about DOI's
69 |   * DOIResolverFile: A NameResolverItem that knows about files in a DOI
70 |   * ContentHash: A NameResolverItem that knows about hashes of content
71 | * GatewayOutput: A group of classes handling converting data for output
72 |   * IPLD: Superclass for different ways to build IPFS specific IPLDs
73 |   * IPLDfiles: Handles IPLDs that contain a list of other files
74 |   * IPLDshards: Handles IPLDs that contain a list of shards of a single file
75 | * Storage and Retrieval: A group of classes and services that store in DB or disk.
76 |   * Hashstore: Superclass for a generic hash store to be built on top of REDIS
77 |   * LocationService: A hashstore mapping multihash => location
78 |   * ContentStore: Maps multihash <=> bytes. Might be on disk or REDIS
79 | * Services Misc
80 |   * Multihash58: Convert hashes to base58 multihash, and SHA content.
81 | 
82 | 
83 | ### HTTP Server
84 | Routes queries to handlers based on the first part of the URL (the output format).
85 | That routine will create an object by calling the constructor for the Namespace, and
86 | then do whatever is needed to generate the output format (typically calling a method on the created
87 | object, or invoking a constructor for that format).
88 | 
89 | `GET '/outputformat/namespace/namespace_dependent_string?aaa=bbb,ccc=ddd'`
90 | 
91 | Details moved to [ServerGateway.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ServerGateway.py)
92 | 
93 | ## Name Resolvers
94 | The NameResolver group of classes manages recognizing a name, and connecting it to resources
95 | we have at the Archive.
96 | 
97 | ### NameResolver superclass
98 | 
99 | The NameResolver class is the superclass of each name resolver;
100 | it specifies a set of methods we expect to be able to do on a subclass,
101 | and may have default code based on assumptions about the data structure of subclasses.
102 | 
103 | Logically it can represent one or multiple files.
104 | 
105 | Details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
106 | 
107 | ### NameResolverDir superclass
108 | 
109 | * Superclass for items that represent multiple files,
110 |   e.g. a directory, or the files that contain a DOI
111 | * Its files() method iterates over them, returning NameResolverFile
112 | * Details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
113 | 
114 | ### NameResolverFile superclass
115 | 
116 | * Superclass for items in a NameResolverDir,
117 |   for example a subclass would be specific PDFs containing a DOI.
118 | * It contains enough information to allow for retrieval of the file e.g. an HTTP URL, or server and path. It can also have a byterange,
119 |   and meta-data such as date and size
120 | * Its shards() method iterates over the shards stored.
121 | * Details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
122 | 
123 | ### NameResolverShard superclass
124 | 
125 | * Superclass for references to Shards in a NameResolverItem
126 | * Returned by the shards() iterator in NameResolverItem
127 | * Details moved to [NameResolver.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/NameResolver.py)
128 | 
129 | ### DOI Resolver
130 | 
131 | Implements name resolution of the DOI namespace, via a sqlite database (provided by Brian)
132 | 
133 | * URL: `/xxx/doi/10.pub_id/pub_specific_id` (forwarded here by HTTPServer)
134 |   Resolves a DOI specific name such as 10.nnn/zzzz.
135 | 
136 | * Details moved to [DOI.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/DOI.py)
137 | * Future project: preload the different stores from the sqlite.
138 | 
139 | ### DOIResolverFile
140 | * Subclass of NameResolverFile that holds meta-data from the sqlite database
141 | * Details moved to [DOI.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/DOI.py)
142 | 
143 | ### ContentHash
144 | Subclass of NameResolverItem.
145 | Looks up the multihash in the Location Service to find where it can be retrieved from.
146 | * Details moved to [ContentHash.py](https://github.com/internetarchive/dweb-gateway/blob/master/python/ContentHash.py)
147 | 
148 | ## Gateway Outputs
149 | The Gateway Output group of classes manages producing derived content for sending back to requesters.
150 | 
151 | ### GatewayOutput superclass
152 | Superclass of IPLD. Nothing useful defined here currently, but might be!
153 | 
154 | Each subclass must implement:
155 | * content(): Return content suitable for returning via the HTTP Server
156 | * contenttype: String suitable for the Content-Type HTTP field, e.g. "application/json"
157 | 
158 | ##### Future Work
159 | Support streams as a return content type, both here and in the server base class.
160 | 
161 | ### IPLD superclass
162 | Subclass of GatewayOutput; superclass for IPLDfiles and IPLDshards
163 | 
164 | This is a format specified by IPFS,
165 | see the [IPLD spec](https://github.com/ipld/specs/tree/master/ipld)
166 | 
167 | Note there are two principal variants of IPLD from our perspective:
168 | one provides a list of files (like a directory listing),
169 | the other provides a list of shards that together create the desired file.
170 | 
171 | Note that structurally this is similar, but not identical, to the data held in the DOIResolver format.
172 | There will be a mapping required, especially as the IPLD spec is incomplete and subject to a new version
173 | which is overdue and probably (Kyle to confirm) doesn't accurately match how IPLDs exist in current
174 | IPFS code (based on the various variations I see).
175 | 
176 | Potential mapping:
177 | 
178 | * Convert Python dates or ISOStrings into the (undefined) format in IPFS (it's unclear
179 |   why a standard like ISO wasn't used by IPFS). See [IPLD#46](https://github.com/ipld/specs/issues/46)
180 | * Possibly replacing links - it's unclear in the spec if IPLD wants a string like /ipfs/Q... or just the multihash.
181 | * Possibly removing fields we don't want to expose (e.g. the URL)
182 | 
183 | Methods:
184 | * multihash58(): returns the hash of the results of content() using the multihash58 service.
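
To make the two variants concrete, here is a hedged sketch of the shapes such objects might serialize to; the field names and the link format (a full /ipfs/Q... path vs a bare multihash) are assumptions, since as noted above the spec leaves both open.

```python
# Hedged sketch only -- field names and link format are assumptions, not the spec.
ipld_files = {                        # variant 1: a directory-style list of files
    "name": "10.1234/abc-def",
    "date": "2017-09-19T00:00:00Z",   # ISO string; IPLD's own date format is undefined (see IPLD#46)
    "links": [
        {"name": "paper.pdf", "multihash": "QmAAA...", "size": 123456},
        {"name": "preprint.doc", "multihash": "QmBBB...", "size": 98765},
    ],
}
ipld_shards = {                       # variant 2: shards that concatenate to the whole file
    "contenthash": "QmCCC...",        # multihash of the complete file
    "shards": [
        {"multihash": "QmDDD...", "size": 262144},   # one entry per byterange chunk
        {"multihash": "QmEEE...", "size": 262144},
    ],
}
```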
185 | 
186 | 
187 | 
188 | ### IPLDfiles
189 | Subclass of IPLD where we want to return a directory, or a list of choices
190 | - for example each of the PDFs & other files available for a specific DOI
191 | * IPLDfiles(NameResolver) *{Expand}* load IPLD with meta-data from NR and iterate through it loading own list.
192 | * content() - *{Expand}* Return internal data structure as JSON
193 | 
194 | ### IPLDshards
195 | Subclass of IPLD where we want to return a list of subfiles, that are the shards that together
196 | make up the result.
197 | * IPLDshards(NameResolverItem) *{Expand}* load IPLD with meta-data from NR and iterate through it loading own list.
198 | * content() - *{Expand}* Return internal data structure as JSON
199 | 
200 | The constructor is potentially complex:
201 | * Read metadata from the NR and store it in an appropriate format (esp. the format of RawHash, not yet defined)
202 | * Loop through an iterator on the NR which will return shards.
203 | * Write each of them to the IPLDshards' data structure.
204 | * Then write the result of content() to the Content_store, getting back a multihash (called ipldhash)
205 | * Store ipldhash to location_store with a pointer to the Content_store.
206 | * And locationstore_push(contenthash, { ipld: ipldhash }) so that contenthash maps to ipldhash
207 | 
208 | 
209 | ## Storage and Retrieval
210 | Services for writing and reading to disk or database.
211 | They are intentionally separated out so that in future the location of storage could change,
212 | for example metadata could be stored with Archive:item or with WayBack:CDX
213 | 
214 | Preferably these will be implemented as classes, and the interface doc below changed a little.
215 | 
216 | ### Hashstore
217 | Stores and retrieves meta data for hashes. NOTE the hash is NOT the hash of the metadata, and the metadata is mutable.
218 | The fields allow essentially for independent indexes.
219 | It should be a simple shim around REDIS. Note we will have to combine multihash and field to get a Redis "key": if we
220 | used the multihash alone as the key, with field as one field of a Redis dict type, then we wouldn't be able to "push" to it.
221 | Note we can assume that this is used in a consistent fashion, e.g. we won't do hash_store then hash_push, which would be invalid.
222 | 
223 | ### Location Service
224 | Maps hashes to locations.
225 | The multihash represents a file or a part of a file. Built upon Hashstore.
226 | It is split out because this could be a useful service on its own.
227 | 
228 | ### Content Store
229 | Store and retrieve content by its hash.
230 | 
231 | ## Services
232 | Any services not bound to any class or group of classes
233 | 
234 | ### Multihash58
235 | Convert a file or hash to a multihash in base58
236 | 
--------------------------------------------------------------------------------
/Extending.md:
--------------------------------------------------------------------------------
1 | # Dweb Gateway - Extending
2 | 
3 | This document is a work in progress on how to extend the Dweb gateway
4 | 
5 | ## Adding a new data type / resolver
6 | 
7 | * Create a new file (or sometimes a class in an existing file)
8 | * Create a class in that file
9 |   * If the class conceptually holds multiple objects (like a directory or collection) subclass NameResolverDir
10 |   * If just one file, subclass NameResolverFile
11 | * See SEE-OTHERNAMESPACE in Python (and potentially in clients) for places to hook in.
12 | * Add required / optional methods
13 |   * new(cls, namespace, *args, **kwargs) - which is passed everything from the HTTPS request except the outputtype
14 | 
15 | 
16 | 
17 | ### Add the following methods
18 | 
19 | __init__(cls, namespace, *args, **kwargs)
20 | * Create and initialize an object; often the superclass's method is used and the work is done in "new"
21 | 
22 | @property mimetype
23 | * The mimetype string for the content
24 | 
25 | @classmethod new(cls, namespace, *args, **kwargs)
26 | * Create a new object, initialize from args & kwargs; often does a metadata fetch etc.
27 | 
28 | retrieve(verbose)
29 | * returns: "content" of an object, typically a binary such as a PDF
30 | 
31 | content(verbose)
32 | * returns: "content" encapsulated for return to the server
33 | * default: encapsulates retrieve() with mimetype
34 | 
35 | 
--------------------------------------------------------------------------------
/HTTPAPI.md:
--------------------------------------------------------------------------------
1 | # DWEB HTTPS API
2 | 
3 | This doc describes the API exposed by the Internet Archive's DWEB server.
4 | 
5 | Note this server is experimental; it could change without notice.
6 | Before using it for anything critical please contact mitra@archive.org.
7 | 
8 | ## Overview
9 | 
10 | The URLs have a consistent structure, except for a few odd cases. See ____
11 | ```
12 | https://dweb.me/outputtype/itemtype/itempath
13 | ```
14 | Where:
15 | 
16 | * dweb.me is the HTTPS server. Any other server running this code should give the same output.
17 | * outputtype: is the type of result requested e.g. metadata or content
18 | * itemtype: is the kind of item being inquired about e.g. doi, contenthash
19 | 
20 | The outputtype and itemtype are in most cases orthogonal, i.e. any outputtype SHOULD work with any itemtype.
21 | In practice some combinations don't make any sense.
22 | 
23 | ## Output Types
24 | 
25 | * content: Return the file itself
26 | * contenthash: Return the hash of the content, suitable for a contenthash/xyz request
27 | * contenturl: Return a URL that could be used to retrieve the content
28 | * metadata: Return a JSON with metadata about the file - its format is type dependent
29 | * void: Return emptiness
30 | 
31 | ## Item Types
32 | 
33 | * contenthash: The hash of the content, in base58 multihash form, typically Q123abc or z123abc depending on which hash is used;
34 |   it returns files from Archive.org and will be expanded to cover more collections over time.
35 | * doi: Digital Object Identifier e.g. 10.1234/abc-def. The standard identifier for academic papers.
36 | * sha1hex: SHA-1 expressed as a hex string e.g. a1b2c3
37 | * rawstore: The data provided with a POST is to be stored
38 | * rawfetch: Equivalent to contenthash except it only retrieves from a local data store (so is faster)
39 | * rawadd: Adds a JSON data structure to a named list e.g. rawadd/Q123
40 | * rawlist: Returns an array of data structures added to a list with rawadd
41 | * archiveid: An item (a collection of related files) represented by an Archive.org itemid.
42 | * advancedsearch: A collection of items returned by a search on archive.org
43 | 
44 | (Note this set is mapped in ServerGateway.py to the classes that serve them)
45 | 
46 | 
47 | ## Odd cases
48 | 
49 | * info - returns a JSON describing the server - the format will change, except that it always contains { type: "gateway" }
50 | 
--------------------------------------------------------------------------------
/Metadata.md:
--------------------------------------------------------------------------------
1 | # Dweb Gateway - Metadata
2 | 
3 | 
4 | Metadata changes - in brief ….
5 | 
6 | See [https://dweb.me/arc/archive.org/metadata/commute](https://dweb.me/arc/archive.org/metadata/commute) for example
7 | ```
8 | {
9 |   collection_titles: {
10 |     artsandmusicvideos: "Arts & Music"   # Maps collection name to the title in the UI
11 |   }
12 |   files: [
13 |     {
14 |       contenthash: "contenthash:/contenthash/"
15 |       magnetlink: "magnet …. /"
16 |     }
17 |   ]
18 |   metadata: {
19 |     magnetlink: "magnet …."
20 |     thumbnaillinks: [
21 |       "ipfs:/ipfs/",   # IPFS link (lazily added if not already in Redis)
22 |       "http://dweb.me/arc/archive.org/thumbnail/commute",   # Direct http link
23 |     ]
24 |   }
25 | }
26 | ```
27 | 
28 | [https://dweb.me/arc/archive.org/metadata/commute/commute.avi](https://dweb.me/arc/archive.org/metadata/commute/commute.avi)
29 | expands on the files metadata to add
30 | ```
31 | {
32 |   contenthash: "contenthash:/contenthash/"
33 |   ipfs: "ipfs:/ipfs/"   # Adds an IPFS link after lazily adding the file to IPFS (only done at this point because of the speed of adding)
34 |   magnetlink: "magnet …. /"
35 | }
36 | ```
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # dweb-gateway
3 | A decentralized web gateway for open academic papers on the Internet Archive
4 | 
5 | ## NOTE THIS REPO IS NO LONGER MAINTAINED, IT'S ALL MOVED TO ia-dweb and www-dweb on an internal system
6 | ## which just calls the dweb-archivecontroller routing.js
7 | 
8 | ## Important editing notes
9 | * Names might not be consistent below as it gets edited and code built.
10 | * Please edit to match names in the code as you notice conflicts.
11 | * A lot of this file will be moved into actual code as the skeleton gets built, just leaving summaries here.
12 | 
13 | ## Other Info Links
14 | 
15 | * [Main README](./README.md) << You are here
16 | * [Use Cases](./Usecases.md)
17 | * [Classes](./Classes.md)
18 | * [HTTP API](./HTTPAPI.md)
19 | * [Extending](./Extending.md)
20 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
21 | * [Proposal for meta data](./Metadata.md) - first draft - looks like it got deleted :-(
22 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
23 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
24 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
25 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
26 | So for example: curl https://gateway.dweb.me/info
27 | 
28 | ## Overview
29 | 
30 | This gateway sits between a decentralized web server running locally
31 | (in this case a Go-IPFS server) and the Archive.
32 | It will expose a set of services to the server.
33 | 
34 | The data is stored in a sqlite database that matches DOI's to hashes of the files we know of,
35 | and the URLs to retrieve them.
36 | 
37 | Note it's multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
38 | various forms and formats (e.g. PDF, Doc; Final; Preprint).
39 | 
40 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
41 | 
42 | ## Structure high level
43 | 
44 | Those services will be built from a set of microservices which may or may not be exposed.
45 | 
46 | All calls to the gateway will come through a server that routes to individual services.
47 | 
48 | Server URLs have a consistent form
49 |     /outputformat/namespace/namespace-dependent-string
50 | 
51 | Where:
52 | * outputformat: Extensible format wanted e.g. [IPLD](#IPLD) or [nameresolution](#nameresolution)
53 | * namespace: an extensible descriptor for name spaces e.g. "doi"
54 | * namespace-dependent-string: a string that may contain additional "/" dependent on the namespace.
55 | 
56 | This is implemented as a pair of steps
57 | - first the name is passed to a class representing the name space,
58 | then the object is passed to a class for the outputformat that can interpret it,
59 | and then a "content" method is called to output something for the client.
60 | 
61 | See [HTTPServer](./Classes.md#http-server) for how this is processed in an extensible form.
62 | 
63 | See [UseCases](./Usecases.md) and [Classes](./Classes.md) for expansion of this.
64 | 
65 | See [HTTP API](./HTTPAPI.md) for the API exposed by the URLs.
66 | 
67 | ## Installation
68 | 
69 | This should work; someone please confirm on a clean(er) machine and remove this comment.
70 | 
71 | You'll first need REDIS & Supervisor to be installed.
72 | ### On a Mac
73 | ```bash
74 | brew install redis
75 | brew services start redis
76 | brew install supervisor
77 | ```
78 | 
79 | ### On Linux
80 | 
81 | Supervisor install details are in [https://pastebin.com/ctEKvcZt] and [http://supervisord.org/installing.html]
82 | 
83 | It's unclear to me how to install REDIS; it's been on every machine I've used.
84 | 
85 | ### Python gateway
86 | #### Installation
87 | ```bash
88 | # Note it uses the #deployable branch; #master may have more experimental features.
89 | cd /usr/local # On our servers it's always in /usr/local/dweb-gateway and there may be dependencies on this
90 | git clone http://github.com/internetarchive/dweb-gateway.git
91 | 
92 | ```
93 | Run this complex install script; if it fails, check the configuration at the top and rerun. It will:
94 | 
95 | * Do the pip install (it's all Python 3)
96 | * Update from the repo #deployable branch (and push back any locally made changes) to #deployed
97 | * Pull a sqlite file that isn't actually used any more (it was for academic docs in the first iteration of the gateway)
98 | * Check the NGINX files match what I expect and (if run as `install.sh NGINX`) copy them over if you have permissions
99 | * Restart the service via supervisorctl; it does NOT set up supervisor
100 | 
101 | There are zero guarantees that changing the config will not cause it to fail!
102 | ```bash
103 | cd dweb-gateway
104 | scripts/install.sh
105 | ```
106 | In addition:
107 | * Check and copy etc_supervisor_conf.d_dweb.conf to /etc/supervisor/conf.d/dweb.conf or a server-specific location
108 | * Check and copy etc_ferm_input_nginx to /etc/ferm/input/nginx or a server-specific location
109 | 
110 | #### Update
111 | `cd /usr/local/dweb-gateway; scripts/install.sh`
112 | should update from the repo and restart.
113 | 
114 | #### Restart
115 | `supervisorctl restart dweb:dweb-gateway`
116 | 
117 | ### Gun, Webtorrent Seeder, Webtorrent Tracker
118 | #### Installation
119 | They are all in the dweb-transport repo, so ...
120 | ```bash
121 | cd /usr/local # There are probably dependencies on this location
122 | git clone http://github.com/internetarchive/dweb-transport.git
123 | npm install
124 | # Supervisorctl, nginx and ferm should have been set up above.
125 | supervisorctl start dweb:dweb-gun
126 | supervisorctl start dweb:dweb-seeder
127 | supervisorctl start dweb:dweb-tracker
128 | ```
129 | #### Update
130 | ```bash
131 | cd /usr/local/dweb-transport
132 | git pull
133 | npm update
134 | supervisorctl restart dweb:*
135 | sleep 10 # Give it time to start and not quickly exit
136 | supervisorctl status
137 | ```
138 | 
139 | #### Restart
140 | `supervisorctl restart dweb:*` will restart these, and the python gateway and IPFS;
141 | or restart `dweb:dweb-gun`, `dweb:dweb-seeder` or `dweb:dweb-tracker` individually.
142 | 
143 | ### IPFS
144 | #### Installation
145 | Installation was done by Protocol Labs, and I'm not 100% sure of the full set of things done to set up the repo in a slightly non-standard way.
146 | 
147 | In particular I know there is a command that has to be run once to enable the 'urlstore' functionality.
148 | 
149 | And there may be something needed to enable WebSockets connections (they are enabled in the gateway's nginx files).
150 | 
151 | There is a cron task running every 10 minutes that calls one of the scripts and works around an IPFS problem that should be fixed at some point, but not necessarily soon.
152 | ```
153 | 3,13,23,33,43,53 * * * * python3 /usr/local/dweb-gateway/cron_ipfs.py
154 | ```
155 | 
156 | #### Update
157 | ```bash
158 | ipfs update install latest
159 | supervisorctl restart dweb:dweb-ipfs
160 | ```
161 | This should work, but there have been issues with IPFS's update process in the past with non-automatic revisions of the IPFS repo.
162 | 
163 | #### Restart
164 | ```
165 | supervisorctl restart dweb:dweb-ipfs
166 | ```
167 | 
168 | ### dweb.archive.org UI
169 | ```bash
170 | cd /usr/local && git clone http://github.com/internetarchive/dweb-archive.git
171 | cd /usr/local/dweb-archive && npm install
172 | ```
--------------------------------------------------------------------------------
/Usecases.md:
--------------------------------------------------------------------------------
1 | # dweb-gateway - Use Cases
2 | A decentralized web gateway for open academic papers on the Internet Archive
3 | 
4 | An outline of Use Cases for the gateway
5 | 
6 | ## Important editing notes
7 | * Names might not be consistent below as it gets edited and code built.
8 | * Please edit to match names in the code as you notice conflicts.
9 | 
10 | ## Other Info Links
11 | 
12 | * [Main README](./README.md)
13 | * [Use Cases](./Usecases.md) << You are here
14 | * [Classes](./Classes.md)
15 | * [Data for the project - sqlite etc](https://archive.org/download/ia_papers_manifest_20170919)
16 | * [Proposal for meta data](./Metadata.md) - first draft - deleted, needs recreating
17 | * [google doc with IPFS integration comments](https://docs.google.com/document/d/1kqETK1kmvbdgApCMQEfmajBdHzqiNTB-TSbJDePj0hM/edit#heading=h.roqqzmshx7ww) #TODO: Needs revision to match this.
18 | * [google doc with top level overview of Dweb project](https://docs.google.com/document/d/1-lI352gV_ma5ObAO02XwwyQHhqbC8GnAaysuxgR2dQo/edit) - best place for links to other resources & docs.
19 | * [gateway.dweb.me](https://gateway.dweb.me) points at the server - which should be running the "deployed" branch.
20 | * [Gitter chat area](https://gitter.im/ArchiveExperiments/Lobby)
21 | So for example: curl https://gateway.dweb.me/info
22 | 
23 | ## Overview
24 | 
25 | This gateway sits between a decentralized web server running locally
26 | (in this case a Go-IPFS server) and the Archive.
27 | It will expose a set of services to the server.
28 | 
29 | The data is stored in a sqlite database that matches DOI's to hashes of the files we know of,
30 | and the URLs to retrieve them.
31 | 
32 | Note the data is multivalued, i.e. a DOI represents an academic paper, which may be present in the archive in
33 | various forms and formats (e.g. PDF, Doc; Final; Preprint).
34 | 
35 | See [Information flow diagram](./Academic Docs IPFS gateway.pdf)
36 | 
37 | Please see the main [README](./README.md) for the overall structure and [Classes](./Classes.md) for the class overview.
38 | 
39 | ## Use Case examples
40 | 
41 | ### Retrieving a document starting with DOI
42 | 
43 | 
44 | #TODO: Copy the use case from the [google doc with the previous architecture version](https://docs.google.com/document/d/1FO6Tdjz7A1yi4ABcd8vDz4vofRDUOrKapi3sESavIcc/edit#),
45 | with edits to match current names etc. in Microservices below. Below is a draft.
46 | 
47 | ##### Retrieval of file by content hash
48 | * IPFS Gateway
49 |   * Receives a request by contenthash
50 |   * Requests GET //gateway.dweb.me/ipldfile/contenthash/Qm.....
51 | * Gateway Server/Service gateway.dweb.me
52 |   * Calls ContentHash(Qm...)
53 | * ContentHash(Qm...)
54 |   * (ContentHash is a subclass of NameResolverFile)
55 |   * Locates the file in the sqlite
56 |   * Loads meta-data for that file
57 | * Gateway Server
58 |   * Passes the ContentHash object to ipldfile(CH)
59 |   * IPLDfile calls CH.shards() as an iterator on CH to read each shard
60 | * ContentHash.shards()
61 |   * Is an iterator that iterates over the shards (or chunks) of the file. For each shard:
62 |     * It reads a chunk of bytes from the file (using a byterange in an HTTP call)
63 |     * It hashes those bytes
64 |     * Stores the hash and the URL + Byterange in the location service
65 |     * Returns the metadata & hash to IPLDfile
66 | * IPLDfile
67 |   * Combines the return into the IPLD variant for shards,
68 |     and adds metadata, especially the contenthash
69 |   * returns to NameServer
70 | * Gateway Server > IPFS > client
71 |   * Calls IPLDfile.content() to get the ipld file to return to the IPFS Gateway
72 | * IPFS Gateway
73 |   * Pins the hash of the IPLD and each of the shards, and returns to the client
74 | 
75 | ##### File retrieval
76 | * IPFS Client
77 |   * Having retrieved the IPLDfile, iterates over the shards
78 |   * For each shard it tries to retrieve the hash
79 | * IPFS Gateway node
80 |   * Recognizes the shard, and calls gateway.dweb.me/content/multihash/Q...
81 | * Gateway server
82 |   * Routes to multihash("multihash", Q...)
83 | * Multihash("multihash", Q...)
84 |   * Looks up the multihash in the location service
85 |   * Discovers the location is a URL + Byterange
86 | * Gateway server
87 |   * Calls the content method on multihash
88 | * Multihash.content()
89 |   * Retrieves the bytes (from elsewhere in the Archive) and returns to the Gateway Server
90 | * Gateway Server > IPFS Gateway > Client
91 | 
--------------------------------------------------------------------------------
/cron_ipfs.py:
--------------------------------------------------------------------------------
1 | import logging
2 | # This is run every 10 minutes by Cron (10 * 58 = 580 ~ 10 hours)
3 | from python.config import config
4 | import redis
5 | import base58
6 | from python.HashStore import StateService
7 | from python.TransportIPFS import TransportIPFS
8 | from python.maintenance import resetipfs
9 | 
10 | logging.basicConfig(**config["logging"])  # For server
11 | resetipfs(announcedht=True)
12 | 
13 | # To fully reset IPFS need to also ...
14 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where leafs are stored - these refer to IPFS hashes for metadata
15 | 
--------------------------------------------------------------------------------
/etc_ferm_input_nginx:
--------------------------------------------------------------------------------
1 | proto tcp dport 80 ACCEPT;
2 | proto tcp dport 443 ACCEPT;
3 | proto tcp dport 4001 ACCEPT;
4 | proto tcp dport 4245 ACCEPT;
5 | proto tcp dport 4246 ACCEPT;
6 | proto tcp dport 8080 ACCEPT;
7 | proto tcp dport 6881 ACCEPT;
8 | proto tcp dport 6969 ACCEPT;
9 | proto udp dport 6969 ACCEPT;
--------------------------------------------------------------------------------
/etc_supervisor_conf.d_dweb.conf:
--------------------------------------------------------------------------------
1 | [group:dweb]
2 | programs=dweb-gateway,dweb-ipfs,dweb-gun,dweb-tracker,dweb-seeder
3 | 
4 | [program:dweb-gateway]
5 | command=/usr/bin/python3 -m python.ServerGateway
6 | directory = /usr/local/dweb-gateway
7 | user = mitra
8 | stdout_logfile = /var/log/dweb/dweb-gateway
9 | stdout_logfile_maxbytes=500MB
10 | redirect_stderr = True
11 | autostart = True
12 | autorestart = True
13 | environment=USER=mitra,PYTHONUNBUFFERED=TRUE
14 | exitcodes=0
15 | 
16 | [program:dweb-ipfs]
17 | command=/usr/local/bin/ipfs daemon --enable-gc --migrate=true
18 | directory = /usr/local/dweb-gateway
19 | user = ipfs
20 | stdout_logfile = /var/log/dweb/dweb-ipfs
21 | stdout_logfile_maxbytes=500MB
22 | redirect_stderr = True
23 | autostart = True
24 | autorestart = True
25 | environment=USER=ipfs
26 | exitcodes=0
27 | 
28 | [program:dweb-gun]
29 | command=node ./gun_https_archive.js 4246
30 | directory = /usr/local/dweb-transport/gun
31 | user = gun
32 | stdout_logfile = /var/log/dweb/dweb-gun
33 | stdout_logfile_maxbytes=500MB
34 | redirect_stderr = True
35 | autostart = True
36 | autorestart = True
37 | environment=GUN_ENV=false
38 | exitcodes=0
39 | 
40 | [program:dweb-tracker]
41 | command=node index.js
42 | directory = /usr/local/dweb-transport/tracker
43 | user = mitra
44 | stdout_logfile = /var/log/dweb/dweb-tracker
45 | stdout_logfile_maxbytes=500MB
46 | redirect_stderr = True
47 | autostart = True
48 | autorestart = True
49 | exitcodes=0
50 | 
51 | [program:dweb-seeder]
52 | command=node index.js
53 | directory = /usr/local/dweb-transport/seeder
54 | user = mitra
55 | stdout_logfile = /var/log/dweb/dweb-seeder
56 | stdout_logfile_maxbytes=500MB
57 | redirect_stderr = True
58 | autostart = True
59 | autorestart = True
60 | environment=DEBUG=*
61 | exitcodes=0
62 | 
--------------------------------------------------------------------------------
/load_ipfs.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | 
3 | import logging
4 | import sys
5 | import redis
6 | 
7 | from python.config import config
8 | from python.Archive import ArchiveItem, ArchiveFile
9 | 
10 | logging.basicConfig(**config["logging"])  # On server logs to /var/log/dweb/dweb-gateway
11 | 
12 | # print(config)
13 | logging.debug("load_ipfs args={}".format(sys.argv))  # sys.argv[1] is first arg (0 is this script)
14 | if (len(sys.argv) > 1) and ("/" in sys.argv[1]):
15 |     args = sys.argv[1].split('/')
16 | else:
17 |     args = sys.argv[1:]
18 | 
19 | # Can override args while testing
20 | # args = ["commute"]
21 | # args = ["commute", "commute.avi"]
22 | # args = ["commute", "closeup.gif"]
23 | 
24 | # Set one of these to True based on whether you want to use IPFS add or IPFS urlstore
25 | forceadd = True
26 | forceurlstore = False
27 | # Set to True if you want each ipfs hash added to the DHT via DHT provide
28 | announcedht = False
29 | 
30 | obj = ArchiveItem.new("archiveid", *args, wanttorrent=False)
31 | print('"URL","Add/Urlstore","Hash","Size","Announced"')
32 | if isinstance(obj, ArchiveFile):  # TODO-PERMS-OK cache_ipfs should be checking perms
33 |     obj.cache_ipfs(url=obj.archive_url, forceadd=forceadd, forceurlstore=forceurlstore, verbose=False, printlog=True, announcedht=announcedht, size=int(obj._metadata["size"]))
34 | else:
35 |     obj.cache_ipfs(forceurlstore=forceurlstore, forceadd=forceadd, verbose=False, announcedht=announcedht, printlog=True)  # Will loop through all files in the Item
36 | 
37 | # print("---FINISHED ---")
38 | 
--------------------------------------------------------------------------------
/nginx/README.md:
--------------------------------------------------------------------------------
1 | # NGINX configuration
2 | 
3 | These files are to support tracking changes to nginx.
4 | 
5 | For now they are just being tracked manually, i.e. editing here won't change anything, but you can add any changes back to git by ...
6 | ```bash
7 | cd /usr/local/dweb-gateway/nginx
8 | cp /etc/nginx/sites-enabled/* .
9 | git commit -a
10 | git push
11 | ```
12 | 
13 | TODO - this could use some support in the install scripts etc.
14 | 
15 | 
16 | ## SUMMARY
17 | * https://dweb.me (secure server) proxypass http://dweb.me (research server)
18 | * http://dweb.me/ -> https://dweb.me/
19 | * http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
20 | * https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
21 | * https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
22 | * https://{gateway.dweb.me, dweb.me, dweb.archive.org}/examples -> file
23 | * http://dweb.archive.org/{details,search} -> bootloader
24 | * https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
25 | * https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
26 | ##
27 | 
28 | The main differences between the domains are ...
29 | 
30 | * dweb.archive.org answers on http, because it's the end-point of a proxypass from https://dweb.archive.org (another machine)
31 | * dweb.me forces https by redirecting http to https
32 | * gateway.dweb.me provides access to the python server for any URL at http://gateway.dweb.me:80
33 | * dweb.me and gateway.dweb.me forward the exact root URL '/' to https://dweb.archive.org/
34 | * dweb.archive.org forwards /{details,search}; dweb.me & gateway.dweb.me forward /arc/archive.org/{details,search} to bootstrap.html
35 | 
36 | 
37 | ## URLS of interest
38 | dweb.archive.org|dweb.me or gateway.dweb.me|Action
39 | ----------------|--------------------------|------
40 | /|/|Archive home page via bootloader
41 | /search?q=xyz|/arc/archive.org/search?q=xyz|search page via bootloader
42 | /details/foo|/arc/archive.org/details/foo|details page via bootloader
43 | /ipfs/Q1234|/ipfs/Q1234|IPFS result
44 | /metadata/foo|/arc/archive.org/metadata/foo|cache and return metadata JSON
45 | /leaf/foo|/arc/archive.org/leaf/foo|cache and return leaf record JSON (for naming)
46 | /download/foo/bar|/arc/archive.org/download/foo/bar|return file bar from item foo
47 | n/a|/add,list,store etc|access gateway functionality
--------------------------------------------------------------------------------
/nginx/dweb.archive.org:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
25 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
26 | ##
27 | 
28 | ####
29 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
30 | ####
31 | 
32 | server {
33 |     # dweb.me and gateway.dweb.me are on 443
34 |     # dweb.archive.org answers on 80 because it's the end-point of the port 443 proxypass on the real dweb.archive.org
35 |     listen 80;
36 | 
37 |     root /var/www/html;
38 | 
39 |     # Add index.php to the list if you are using PHP
40 |     index index.html index.htm index.nginx-debian.html;
41 | 
42 |     server_name dweb.archive.org;
43 | 
44 |     # Forward details and search -> bootloader
45 |     # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}, on dweb.archive.org it's at /{details,search}
46 |     # Load bootloader which will examine the URL
47 |     location ~ ^/$ {
48 |         add_header Access-Control-Allow-Origin *;
49 |         try_files /archive/bootloader.html =404;
50 |     }
51 | 
52 |     # Catch /download/foo - displayed on the details page; rather than /download/foo/bar which goes to the gateway
53 |     location ~ ^/download/[^/]*$ {
54 |         rewrite ^/download/([^/]*)$ /details/$1&download=1 redirect;
55 |     }
56 | 
57 |     location /details {
58 |         add_header Access-Control-Allow-Origin *;
59 |         try_files /archive/bootloader.html =404;
60 |     }
61 |     location /search {
62 |         add_header Access-Control-Allow-Origin *;
63 |         try_files /archive/bootloader.html =404;
64 |     }
65 | 
66 |     # Handle archive.org urls that can't currently be done with Dweb
67 |     location ~ ^/(about|bookmarks|donate|projects) {
68 |         return 302 https://archive.org$request_uri;
69 |     }
70 | 
71 |     # Not yet working forward, unclear if it should be /ws or /wss
72 |     location /ws {
73 |         proxy_pass http://localhost:4002;
74 |         proxy_http_version 1.1;
75 |         proxy_set_header Upgrade $http_upgrade;
76 |         proxy_set_header Connection "upgrade";
77 |     }
78 | 
79 |     location /wss {
80 |         proxy_pass http://localhost:4002;
81 |         proxy_http_version 1.1;
82 |         proxy_set_header Upgrade $http_upgrade;
83 |         proxy_set_header Connection "upgrade";
84 |     }
85 | 
86 |     location /favicon.ico {
87 |         add_header Access-Control-Allow-Origin *;
88 |         # First attempt to serve request as file, then
89 |         # as directory, then fall back to displaying a 404.
90 |         try_files $uri $uri/ =404;
91 |     }
92 | 
93 |     location /examples {
94 |         add_header Access-Control-Allow-Origin *;
95 |         # First attempt to serve request as file, then
96 |         # as directory, then fall back to displaying a 404.
97 |         try_files $uri $uri/ =404;
98 |     }
99 | 
100 |     location /archive {
101 |         add_header Access-Control-Allow-Origin *;
102 |         # First attempt to serve request as file, then
103 |         # as directory, then fall back to displaying a 404.
104 |         try_files $uri $uri/ =404;
105 |     }
106 | 
107 |     location /dweb-serviceworker-bundle.js {
108 |         add_header Access-Control-Allow-Origin *;
109 |         # First attempt to serve request as file, then
110 |         # as directory, then fall back to displaying a 404.
111 |         try_files /examples/$uri =404;
112 |     }
113 | 
114 |     location /ipfs {
115 |         rewrite ^/ipfs/(.*) /ipfs/$1 break;
116 |         proxy_set_header Host $host;
117 |         proxy_set_header X-Real-IP $remote_addr;
118 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
119 |         proxy_set_header X-Forwarded-Proto $scheme;
120 |         proxy_pass http://localhost:8080/;
121 |         proxy_read_timeout 600;
122 |     }
123 | 
124 |     # # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so the client uses https; most of these are going to be crawlers/hackers
125 |     # location ~ ^/$ {
126 |     #     return 301 https://dweb.archive.org/;
127 |     # }
128 | 
129 |     location ~ ^.+/dweb-(transports|objects)-bundle.js { # contains dweb-*-bundles, but doesn't start with it - probably a relative url in bootstrap.html
130 |         rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
131 |     }
132 | 
133 |     location / {
134 |         add_header Access-Control-Allow-Origin *;
135 |         try_files $uri /archive$uri @gateway;
136 |     }
137 | 
138 |     # On dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, which assumes /arc/archive.org
139 |     location @gateway {
140 |         rewrite (.*) /arc/archive.org$1 break; # On dweb.archive.org rewrite it to /arc/archive.org/...
141 |         proxy_set_header Host $host;
142 |         proxy_set_header X-Real-IP $remote_addr;
143 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
144 |         proxy_set_header X-Forwarded-Proto $scheme;
145 |         proxy_pass http://localhost:4244;
146 |         proxy_read_timeout 600;
147 |     }
148 | 
149 | }
150 | 
--------------------------------------------------------------------------------
/nginx/dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
25 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
26 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
27 | ##
28 | 
29 | ####
30 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
31 | ####
32 | 
33 | server {
34 |     listen 0.0.0.0:80;
35 |     server_name dweb.me;
36 |     rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
37 | }
38 | 
39 | # Default server configuration
40 | #
41 | server {
42 |     listen 443 http2;
43 | 
44 |     # SSL configuration
45 |     #
46 |     # listen 443 ssl default_server;
47 |     # listen [::]:443 ssl default_server;
48 |     #
49 |     # Note: You should disable gzip for SSL traffic.
50 |     # See: https://bugs.debian.org/773332
51 |     #
52 |     # Read up on ssl_ciphers to ensure a secure configuration.
53 |     # See: https://bugs.debian.org/765782
54 |     #
55 |     # Self signed certs generated by the ssl-cert package
56 |     # Don't use them in a production server!
57 |     #
58 |     # include snippets/snakeoil.conf;
59 | 
60 |     root /var/www/html;
61 | 
62 |     # Add index.php to the list if you are using PHP
63 |     index index.html index.htm index.nginx-debian.html;
64 | 
65 |     server_name dweb.me;
66 |     ssl on;
67 |     ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
68 |     ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
69 | 
70 |     # Catch /download/foo - displayed on the details page; rather than /download/foo/bar which goes to the gateway
71 |     location ~ ^/arc/archive.org/download/[^/]*$ {
72 |         rewrite ^/arc/archive.org/download/([^/]*)$ /arc/archive.org/details/$1&download=1 redirect;
73 |     }
74 | 
75 |     # Forward details and search -> bootloader
76 |     # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}, on dweb.archive.org it's at /{details,search}
77 |     location /arc/archive.org/details {
78 |         add_header Access-Control-Allow-Origin *;
79 |         try_files /archive/bootloader.html =404;
80 |     }
81 |     location /arc/archive.org/search {
82 |         add_header Access-Control-Allow-Origin *;
83 |         try_files /archive/bootloader.html =404;
84 |     }
85 | 
86 |     # Handle archive.org urls that can't currently be done with Dweb
87 |     location /arc/archive.org/about {
88 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
89 |     }
90 |     location /arc/archive.org/donate {
91 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
92 |     }
93 |     location /arc/archive.org/projects {
94 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
95 |     }
96 | 
97 |     # Not yet working forward, unclear if it should be /ws or /wss
98 |     location /ws {
99 |         proxy_pass http://localhost:4002;
100 |         proxy_http_version 1.1;
101 |         proxy_set_header Upgrade $http_upgrade;
102 |         proxy_set_header Connection "upgrade";
103 |     }
104 | 
105 |     location /wss {
106 |         proxy_pass http://localhost:4002;
107 |         proxy_http_version 1.1;
108 |         proxy_set_header Upgrade $http_upgrade;
109 |         proxy_set_header Connection "upgrade";
110 |     }
111 | 
112 |     location /favicon.ico {
113 |         add_header Access-Control-Allow-Origin *;
114 |         # First attempt to serve request as file, then
115 |         # as directory, then fall back to displaying a 404.
116 |         try_files $uri $uri/ =404;
117 |     }
118 | 
119 |     location /examples {
120 |         add_header Access-Control-Allow-Origin *;
121 |         # First attempt to serve request as file, then
122 |         # as directory, then fall back to displaying a 404.
123 |         try_files $uri $uri/ =404;
124 |     }
125 | 
126 |     location /archive {
127 |         add_header Access-Control-Allow-Origin *;
128 |         # First attempt to serve request as file, then
129 |         # as directory, then fall back to displaying a 404.
130 |         try_files $uri $uri/ =404;
131 |     }
132 | 
133 |     location /dweb-serviceworker-bundle.js {
134 |         add_header Access-Control-Allow-Origin *;
135 |         # First attempt to serve request as file, then
136 |         # as directory, then fall back to displaying a 404.
137 |         try_files /examples/$uri =404;
138 |     }
139 | 
140 |     location /ipfs {
141 |         rewrite ^/ipfs/(.*) /ipfs/$1 break;
142 |         proxy_set_header Host $host;
143 |         proxy_set_header X-Real-IP $remote_addr;
144 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
145 |         proxy_set_header X-Forwarded-Proto $scheme;
146 |         proxy_pass http://localhost:8080/;
147 |         proxy_read_timeout 600;
148 |     }
149 | 
150 |     # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so the client uses https; most of these are going to be crawlers/hackers
151 |     location ~ ^/$ {
152 |         return 301 https://dweb.archive.org/;
153 |     }
154 |     location ~ ^.+/dweb-(transports|objects)-bundle.js { # contains dweb-*-bundles, but doesn't start with it - probably a relative url in bootstrap.html
155 |         rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
156 |     }
157 | 
158 |     location / {
159 |         add_header Access-Control-Allow-Origin *;
160 |         try_files $uri /archive$uri @gateway;
161 |     }
162 | 
163 |     # On dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, which assumes /arc/archive.org
164 |     location @gateway {
165 |         proxy_set_header Host $host;
166 |         proxy_set_header X-Real-IP $remote_addr;
167 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
168 |         proxy_set_header X-Forwarded-Proto $scheme;
169 |         proxy_pass http://localhost:4244;
170 |         proxy_read_timeout 600;
171 |     }
172 | 
173 | }
174 | 
175 | server {
176 |     listen 4245;
177 |     root /var/www/html;
178 |     server_name dweb.me;
179 |     ssl on;
180 |     ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
181 |     ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
182 | 
183 |     # Tunnel websocket
184 |     location / {
185 |         proxy_pass http://localhost:4002;
186 |         proxy_http_version 1.1;
187 |         proxy_set_header Upgrade $http_upgrade;
188 |         proxy_set_header Connection "upgrade";
189 |     }
190 | }
191 | 
--------------------------------------------------------------------------------
/nginx/gateway.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | #
13 | # SUMMARY
14 | # https://dweb.me (secure server) proxypass http://dweb.me (research server)
15 | # http://dweb.me/ -> https://dweb.me/
16 | # http://dweb.archive.org/aaa/xxx -> gateway /arc/archive.org/aaa/xxx
17 | # https://{gateway.dweb.me, dweb.me}/ -> https://dweb.archive.org - exact URL only
18 | # https://{gateway.dweb.me, dweb.me}/ proxypass localhost:4244 (gateway python)
19 | # https://{gateway.dweb.me, dweb.me, dweb.archive.org}/{archive,examples} -> file
20 | # http://dweb.archive.org/{details,search} -> bootloader
21 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{details,search} -> bootloader
22 | # https://{dweb.me, gateway.dweb.me, dweb.archive.org}/{ws,wss} proxypass localhost:4002 (websockets for IPFS) - not yet working
23 | # https://dweb.archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
24 | # https://{gateway.dweb.me, dweb.me}/arc/archive.org/{about, donate, projects} -> https://archive.org/{about,donate,projects}
25 | # https://{dweb.me, gateway}/arc/archive.org/download/xxx -> /arc/archive.org/details/xxx?download=1
26 | # https://{dweb.archive.org}/download/xxx -> /details/xxx?download=1
27 | ##
28 | 
29 | ####
30 | #### THE SITES gateway.dweb.me & dweb.me ARE ALMOST IDENTICAL, IT'S HIGHLY LIKELY ANY CHANGES HERE NEED TO BE MADE ON THE OTHER SITE
31 | ####
32 | 
33 | # TODO gateway.dweb.me handles / as if it was on port 4244; dweb.me instead passes all to https - it's not clear this difference is needed (dweb.me's forward to https preferred)
34 | 
35 | server {
36 |     # dweb.archive.org answers on 80, dweb.me and gateway.dweb.me on 443
37 | 
38 |     listen 0.0.0.0:80;
39 |     root /var/www/html;
40 |     server_name gateway.dweb.me;
41 | 
42 |     #rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
43 | 
44 |     # Exact root URL - forward the root URL to dweb.archive.org; shouldn't be hitting it here - probably a crawler
45 |     location ~ ^/$ {
46 |         return 301 https://dweb.archive.org/;
47 |     }
48 | 
49 |     location / {
50 |         proxy_set_header Host $host;
51 |         proxy_set_header X-Real-IP $remote_addr;
52 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
53 |         proxy_set_header X-Forwarded-Proto $scheme;
54 | 
55 |         proxy_pass http://localhost:4244/;
56 |         proxy_read_timeout 600;
57 |     }
58 | }
59 | server {
60 |     listen 0.0.0.0:80;
61 |     server_name dweb.me;
62 |     rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
63 | }
64 | # Default server configuration
65 | #
66 | server {
67 |     listen 443 http2;
68 | 
69 |     # SSL configuration
70 |     #
71 |     # listen 443 ssl default_server;
72 |     # listen [::]:443 ssl default_server;
73 |     #
74 |     # Note: You should disable gzip for SSL traffic.
75 |     # See: https://bugs.debian.org/773332
76 |     #
77 |     # Read up on ssl_ciphers to ensure a secure configuration.
78 |     # See: https://bugs.debian.org/765782
79 |     #
80 |     # Self signed certs generated by the ssl-cert package
81 |     # Don't use them in a production server!
82 |     #
83 |     # include snippets/snakeoil.conf;
84 | 
85 |     root /var/www/html;
86 | 
87 |     # Add index.php to the list if you are using PHP
88 |     index index.html index.htm index.nginx-debian.html;
89 | 
90 |     server_name gateway.dweb.me;
91 |     ssl on;
92 |     ssl_certificate /etc/letsencrypt/live/gateway.dweb.me/fullchain.pem;
93 |     ssl_certificate_key /etc/letsencrypt/live/gateway.dweb.me/privkey.pem;
94 | 
95 |     # Catch /download/foo - displayed on details page; rather than /download/foo/bar which goes to gateway
96 |     location ~ ^/arc/archive.org/download/[^/]*$ {
97 |         rewrite ^/arc/archive.org/download/([^/]*)$ /arc/archive.org/details/$1&download=1 redirect;
98 |     }
99 | 
100 |     # Forward details and search -> bootloader
101 |     # On dweb.me & gateway.dweb.me this is at /arc/archive.org/{details,search}; on dweb.archive.org it's at /{details,search}
102 |     location /arc/archive.org/details {
103 |         add_header Access-Control-Allow-Origin *;
104 |         try_files /archive/bootloader.html =404;
105 |     }
106 |     location /arc/archive.org/search {
107 |         add_header Access-Control-Allow-Origin *;
108 |         try_files /archive/bootloader.html =404;
109 |     }
110 | 
111 |     # Handle archive.org urls that can't currently be served via Dweb
112 |     location /arc/archive.org/about {
113 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
114 |     }
115 |     location /arc/archive.org/donate {
116 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
117 |     }
118 |     location /arc/archive.org/projects {
119 |         rewrite /arc/archive.org/(.*) https://archive.org/$1;
120 |     }
121 | 
122 |     # Websocket forwarding - not yet working; unclear whether it should be /ws or /wss
123 |     location /ws {
124 |         proxy_pass http://localhost:4002;
125 |         proxy_http_version 1.1;
126 |         proxy_set_header Upgrade $http_upgrade;
127 |         proxy_set_header Connection "upgrade";
128 |     }
129 | 
130 |     location /wss {
131 |         proxy_pass http://localhost:4002;
132 |         proxy_http_version 1.1;
133 |         proxy_set_header Upgrade $http_upgrade;
134 |         proxy_set_header Connection "upgrade";
135 |     }
136 | 
137 |     location /favicon.ico {
138 |         add_header Access-Control-Allow-Origin *;
139 |         # First attempt to serve request as file, then
140 |         # as directory, then fall back to displaying a 404.
141 |         try_files $uri $uri/ =404;
142 |     }
143 | 
144 |     location /examples {
145 |         add_header Access-Control-Allow-Origin *;
146 |         # First attempt to serve request as file, then
147 |         # as directory, then fall back to displaying a 404.
148 |         try_files $uri $uri/ =404;
149 |     }
150 | 
151 |     location /archive {
152 |         add_header Access-Control-Allow-Origin *;
153 |         # First attempt to serve request as file, then
154 |         # as directory, then fall back to displaying a 404.
155 |         try_files $uri $uri/ =404;
156 |     }
157 | 
158 |     location /dweb-serviceworker-bundle.js {
159 |         add_header Access-Control-Allow-Origin *;
160 |         # Serve the service-worker bundle from the /examples directory
161 |         # under the web root, falling back to a 404 if it is missing.
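        # (e.g. a request for /dweb-serviceworker-bundle.js resolves to
        #  /var/www/html/examples/dweb-serviceworker-bundle.js, given root above)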
162 |         try_files /examples/$uri =404;
163 |     }
164 | 
165 |     location /ipfs {
166 |         rewrite ^/ipfs/(.*) /ipfs/$1 break;  # no-op rewrite; "break" just stops rewrite processing and keeps the URI for proxy_pass
167 |         proxy_set_header Host $host;
168 |         proxy_set_header X-Real-IP $remote_addr;
169 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
170 |         proxy_set_header X-Forwarded-Proto $scheme;
171 |         proxy_pass http://localhost:8080/;
172 |         proxy_read_timeout 600;
173 |     }
174 | 
175 |     # Exact root URL - on dweb.me or gateway.dweb.me redirect to https://dweb.archive.org so the client uses https; most of these are going to be crawlers/hackers
176 |     location ~ ^/$ {
177 |         return 301 https://dweb.archive.org/;
178 |     }
179 |     location ~ ^.+/dweb-(transports|objects)-bundle.js {  # contains dweb-*-bundle but doesn't start with it - probably a relative URL in bootstrap.html
180 |         rewrite ^.*/dweb-(transports|objects)-bundle.js.*$ /dweb-$1-bundle.js redirect;
181 |     }
182 | 
183 |     location / {
184 |         try_files $uri /archive$uri @gateway;
185 |     }
186 | 
187 |     # On dweb.me or gateway.dweb.me forward everything else to the gateway on port 4244; not on dweb.archive.org, as that assumes /arc/archive.org
188 |     location @gateway {
189 |         proxy_set_header Host $host;
190 |         proxy_set_header X-Real-IP $remote_addr;
191 |         proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
192 |         proxy_set_header X-Forwarded-Proto $scheme;
193 |         proxy_pass http://localhost:4244;
194 |         proxy_read_timeout 600;
195 |     }
196 | 
197 | }
198 | 
--------------------------------------------------------------------------------
/nginx/ipfs.dweb.me:
--------------------------------------------------------------------------------
1 | ##
2 | # You should look at the following URL's in order to grasp a solid understanding
3 | # of Nginx configuration files in order to fully unleash the power of Nginx.
4 | # http://wiki.nginx.org/Pitfalls
5 | # http://wiki.nginx.org/QuickStart
6 | # http://wiki.nginx.org/Configuration
7 | #
8 | # Generally, you will want to move this file somewhere, and start with a clean
9 | # file but keep this around for reference. Or just disable in sites-enabled.
10 | #
11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples.
12 | ##
13 | server {
14 |     listen 0.0.0.0:80;
15 |     server_name ipfs.dweb.me;
16 |     rewrite ^ https://$http_host$request_uri? permanent; # force redirect http to https
17 | }
18 | 
19 | 
20 | # Default server configuration
21 | #
22 | server {
23 |     listen 443;
24 | 
25 |     # SSL configuration
26 |     #
27 |     # listen 443 ssl default_server;
28 |     # listen [::]:443 ssl default_server;
29 |     #
30 |     # Note: You should disable gzip for SSL traffic.
31 |     # See: https://bugs.debian.org/773332
32 |     #
33 |     # Read up on ssl_ciphers to ensure a secure configuration.
34 |     # See: https://bugs.debian.org/765782
35 |     #
36 |     # Self signed certs generated by the ssl-cert package
37 |     # Don't use them in a production server!
38 |     #
39 |     # include snippets/snakeoil.conf;
40 | 
41 |     root /var/www/html;
42 | 
43 |     # Add index.php to the list if you are using PHP
44 |     index index.html index.htm index.nginx-debian.html;
45 | 
46 |     server_name ipfs.dweb.me;
47 |     ssl on;
48 |     ssl_certificate /etc/letsencrypt/live/dweb.me/fullchain.pem;
49 |     ssl_certificate_key /etc/letsencrypt/live/dweb.me/privkey.pem;
50 | 
51 | 
52 |     location / {
53 |         # Proxy all requests straight through to the local IPFS
54 |         # gateway daemon listening on localhost:8080.
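        # The proxy_set_header lines below preserve the original client address
        # and scheme for the IPFS daemon's logs; Host is forwarded so that
        # name-based routing on the backend keeps working.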
55 | proxy_set_header Host $host; 56 | proxy_set_header X-Real-IP $remote_addr; 57 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 58 | proxy_set_header X-Forwarded-Proto $scheme; 59 | 60 | proxy_pass http://localhost:8080/; 61 | proxy_read_timeout 600; 62 | } 63 | 64 | # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000 65 | # 66 | #location ~ \.php$ { 67 | # include snippets/fastcgi-php.conf; 68 | # 69 | # # With php7.0-cgi alone: 70 | # fastcgi_pass 127.0.0.1:9000; 71 | # # With php7.0-fpm: 72 | # fastcgi_pass unix:/run/php/php7.0-fpm.sock; 73 | #} 74 | 75 | # deny access to .htaccess files, if Apache's document root 76 | # concurs with nginx's one 77 | # 78 | #location ~ /\.ht { 79 | # deny all; 80 | #} 81 | } 82 | 83 | 84 | # Virtual Host configuration for example.com 85 | # 86 | # You can move that to a different file under sites-available/ and symlink that 87 | # to sites-enabled/ to enable it. 88 | # 89 | #server { 90 | # listen 80; 91 | # listen [::]:80; 92 | # 93 | # server_name example.com; 94 | # 95 | # root /var/www/example.com; 96 | # index index.html; 97 | # 98 | # location / { 99 | # try_files $uri $uri/ =404; 100 | # } 101 | #} 102 | -------------------------------------------------------------------------------- /nginx/ipfsconvert.dweb.me: -------------------------------------------------------------------------------- 1 | ## 2 | # You should look at the following URL's in order to grasp a solid understanding 3 | # of Nginx configuration files in order to fully unleash the power of Nginx. 4 | # http://wiki.nginx.org/Pitfalls 5 | # http://wiki.nginx.org/QuickStart 6 | # http://wiki.nginx.org/Configuration 7 | # 8 | # Generally, you will want to move this file somewhere, and start with a clean 9 | # file but keep this around for reference. Or just disable in sites-enabled. 10 | # 11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples. 12 | ## 13 | server { 14 | listen 0.0.0.0:80; 15 | server_name ipfsconvert.dweb.me; 16 | 17 | # SSL configuration 18 | # 19 | # listen 443 ssl default_server; 20 | # listen [::]:443 ssl default_server; 21 | # 22 | # Note: You should disable gzip for SSL traffic. 23 | # See: https://bugs.debian.org/773332 24 | # 25 | # Read up on ssl_ciphers to ensure a secure configuration. 26 | # See: https://bugs.debian.org/765782 27 | # 28 | # Self signed certs generated by the ssl-cert package 29 | # Don't use them in a production server! 30 | # 31 | # include snippets/snakeoil.conf; 32 | 33 | root /var/www/html; 34 | 35 | # Add index.php to the list if you are using PHP 36 | index index.html index.htm index.nginx-debian.html; 37 | 38 | # ssl on; 39 | # ssl_certificate /etc/letsencrypt/live/ipfsconvert.dweb.me/fullchain.pem; 40 | # ssl_certificate_key /etc/letsencrypt/live/ipfsconvert.dweb.me/privkey.pem; 41 | 42 | 43 | location / { 44 | # First attempt to serve request as file, then 45 | # as directory, then fall back to displaying a 404. 
46 | proxy_set_header Host $host; 47 | proxy_set_header X-Real-IP $remote_addr; 48 | proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; 49 | proxy_set_header X-Forwarded-Proto $scheme; 50 | 51 | proxy_pass http://localhost:4245/; 52 | proxy_read_timeout 600; 53 | } 54 | 55 | # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000 56 | # 57 | #location ~ \.php$ { 58 | # include snippets/fastcgi-php.conf; 59 | # 60 | # # With php7.0-cgi alone: 61 | # fastcgi_pass 127.0.0.1:9000; 62 | # # With php7.0-fpm: 63 | # fastcgi_pass unix:/run/php/php7.0-fpm.sock; 64 | #} 65 | 66 | # deny access to .htaccess files, if Apache's document root 67 | # concurs with nginx's one 68 | # 69 | #location ~ /\.ht { 70 | # deny all; 71 | #} 72 | } 73 | 74 | 75 | # Virtual Host configuration for example.com 76 | # 77 | # You can move that to a different file under sites-available/ and symlink that 78 | # to sites-enabled/ to enable it. 79 | # 80 | #server { 81 | # listen 80; 82 | # listen [::]:80; 83 | # 84 | # server_name example.com; 85 | # 86 | # root /var/www/example.com; 87 | # index index.html; 88 | # 89 | # location / { 90 | # try_files $uri $uri/ =404; 91 | # } 92 | #} 93 | -------------------------------------------------------------------------------- /nginx/www.dweb.me: -------------------------------------------------------------------------------- 1 | ## 2 | # You should look at the following URL's in order to grasp a solid understanding 3 | # of Nginx configuration files in order to fully unleash the power of Nginx. 4 | # http://wiki.nginx.org/Pitfalls 5 | # http://wiki.nginx.org/QuickStart 6 | # http://wiki.nginx.org/Configuration 7 | # 8 | # Generally, you will want to move this file somewhere, and start with a clean 9 | # file but keep this around for reference. Or just disable in sites-enabled. 10 | # 11 | # Please see /usr/share/doc/nginx-doc/examples/ for more detailed examples. 12 | ## 13 | server { 14 | server_name www.dweb.me; 15 | return 301 $scheme://dweb.me$request_uri; 16 | } 17 | -------------------------------------------------------------------------------- /python/Btih.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from .NameResolver import NameResolverDir 3 | from .Errors import CodingException, ToBeImplementedException, NoContentException 4 | from .HashStore import MagnetLinkService 5 | from .miscutils import httpget, loads 6 | from .config import config 7 | from .Archive import ArchiveItem 8 | from magneturi import bencode 9 | 10 | 11 | class BtihResolver(NameResolverDir): 12 | """ 13 | Resolve BitTorrent Hashes 14 | Fields: 15 | btih # BitTorrent hash - in ascii B32 format 16 | 17 | This could also easily be extended to 18 | Support "magnetlink" as the thing being looked for (and return btih and other outputs) 19 | Support outputs of itemid, metadata (of item) 20 | 21 | """ 22 | namespace="btih" # Defined in subclasses 23 | 24 | def __init__(self, namespace, hash, **kwargs): 25 | """ 26 | Creates the object 27 | 28 | :param namespace: "btih" 29 | :param hash: Hash representing the object - format is specified by namespace 30 | :param kwargs: Any other args to the URL, ignored for now. 
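
        Example (hypothetical base32 hash, for illustration only):
            r = BtihResolver.new("btih", "2HKDGS6NOSFMSFMJ53KQKPDGQ4ZH3GPM")
            r.magnetlink(headers=False)   # -> cached magnet link string, or ""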
31 |         """
32 |         verbose = kwargs.get("verbose")
33 |         if verbose:
34 |             logging.debug("{0}.__init__({1}, {2}, {3})".format(self.__class__.__name__, namespace, hash, kwargs))
35 |         if namespace != self.namespace:  # Checked though should be determined by ServerGateway mapping
36 |             raise CodingException(message="namespace != " + self.namespace)
37 |         super(BtihResolver, self).__init__(namespace, hash, **kwargs)  # NameResolver.__init__ ignores these args
38 |         self.btih = hash  # Ascii B32 version of hash
39 | 
40 |     @classmethod
41 |     def new(cls, namespace, hash, *args, **kwargs):
42 |         """
43 |         Called by ServerGateway to handle a URL - passed the parts of the remainder of the URL after the requested format,
44 | 
45 |         :param namespace:
46 |         :param args:
47 |         :param kwargs:
48 |         :return:
49 |         :raise NoContentException: if can't find content directly or via other classes (like DOIfile)
50 |         """
51 |         verbose = kwargs.get("verbose")
52 |         ch = super(BtihResolver, cls).new(namespace, hash, *args, **kwargs)  # By default (on NameResolver) calls cls() which goes to __init__
53 |         return ch
54 | 
55 |     def itemid(self, verbose=False, **kwargs):
56 |         searchurl = config["archive"]["url_btihsearch"] + self.btih
57 |         searchres = loads(httpget(searchurl))
58 |         if not searchres["response"]["numFound"]:
59 |             return None
60 |         return searchres["response"]["docs"][0]["identifier"]
61 | 
62 |     def retrieve(self, verbose=False, **kwargs):
63 |         """
64 |         Fetch the content, don't pass to caller (typically called by NameResolver.content())
65 |         TODO - if needed can retrieve the torrent file here - look at HashStore for example of getting from self.url
66 | 
67 |         :returns: content - i.e. bytes
68 |         """
69 |         raise ToBeImplementedException(name="btih retrieve")
70 | 
71 |     def content(self, verbose=False, **kwargs):
72 |         """
73 |         :returns: content - i.e. bytes
74 |         """
75 |         data = self.retrieve()
76 |         if verbose: logging.debug("Retrieved doc size={}".format(len(data)))
77 |         return {'Content-type': self.mimetype,
78 |                 'data': data,
79 |                 }
80 | 
81 |     def metadata(self, headers=True, verbose=False, **kwargs):
82 |         """
83 |         :param verbose:
84 |         :return:
85 |         """
86 |         raise ToBeImplementedException(name="btih.metadata()")
87 | 
88 |     def magnetlink(self, verbose=False, headers=False, **kwargs):
89 |         magnetlink = MagnetLinkService.btihget(self.btih)
90 |         data = magnetlink or ""  # Current paths mean we should have it, but if not we'll return "" as we have no way of doing that lookup
91 |         return {"Content-type": "text/plain", "data": data} if headers else data
92 | 
93 |     def torrenturl(self, verbose=False):  # TODO-PERMS only used in torrent() below, which doesn't use the result, so this routine could be deleted
94 |         itemid = self.itemid(verbose=verbose)
95 |         if not itemid:
96 |             raise NoContentException()
97 |         return "https://archive.org/download/{}/{}_archive.torrent".format(itemid, itemid)
98 | 
99 |     def torrent(self, verbose=False, headers=False, **kwargs):
100 |         torrenturl = self.torrenturl(verbose=verbose)  # NoContentException if not found # TODO-PERMS unused, can delete this line?
101 |         data = bencode.bencode(ArchiveItem.modifiedtorrent(self.itemid(), wantmodified=True, verbose=verbose))
102 |         mimetype = "application/x-bittorrent"
103 |         return {"Content-type": mimetype, "data": data} if headers else data
104 | 
105 | 
--------------------------------------------------------------------------------
/python/ContentStore.py:
--------------------------------------------------------------------------------
1 | from .HashStore import HashStore
2 | 
3 | 
4 | class ContentStore(HashStore):
5 |     """
6 |     Store and retrieve content by its hash.
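    Intended usage (sketch only - neither method is implemented yet):
        mh = ContentStore.rawstore(b"some bytes")    # -> base58 multihash string
        data = ContentStore.rawfetch(mh)             # -> b"some bytes"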
7 |     Could use REDIS or just store in a file - see rawstore and rawfetch in https://github.com/mitra42/dweb/blob/master/dweb/TransportLocal.py for an example
8 |     * rawstore(bytes) => multihash
9 |     * rawfetch(multihash) => bytes
10 |     * Consumes: multihash; hashstore
11 | 
12 |     Notes: The names are for compatibility with a separate client library project.
13 |     For now this could use the hashstore or it could use the file system (have code for this)
14 |     """
15 | 
--------------------------------------------------------------------------------
/python/Errors.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | 
3 | class MyBaseException(Exception):
4 |     """
5 |     Base class for Exceptions
6 | 
7 |     Create subclasses with parameters in their msg e.g. {message} or {name}
8 |     and call as in: raise NewException(name="Foo");
9 | 
10 |     msgargs Arguments that slot into msg
11 |     __str__ Returns msg expanded with msgargs
12 |     """
13 |     errno = 0
14 |     httperror = 500  # See BaseHTTPRequestHandler for list of errors
15 |     msg = "Generic Model Exception"  #: Parameterised string for message
16 |     def __init__(self, **kwargs):
17 |         self.msgargs = kwargs  # Store arbitrary dict of message args (can be used to output msg from template)
18 | 
19 |     def __str__(self):
20 |         try:
21 |             return self.msg.format(**self.msgargs)
22 |         except:
23 |             return self.msg + " UNFORMATTABLE ARGS:" + repr(self.msgargs)
24 | 
25 | class ToBeImplementedException(MyBaseException):
26 |     """
27 |     Raised when some code has not been implemented yet
28 |     """
29 |     httperror = 501
30 |     msg = "{name} needs implementing"
31 | 
32 | # Note TransportError is in Transport.py
33 | 
34 | class IPFSException(MyBaseException):
35 |     httperror = 500
36 |     msg = "IPFS Error: {message}"
37 | 
38 | class CodingException(MyBaseException):
39 |     httperror = 501
40 |     msg = "Coding Error: {message}"
41 | 
42 | class SignatureException(MyBaseException):
43 |     httperror = 501
44 |     msg = "Signature Verification Error: {message}"
45 | 
46 | class EncryptionException(MyBaseException):
47 |     httperror = 500  # Failure in the encryption code other than lack of authentication
48 |     msg = "Encryption error: {message}"
49 | 
50 | class ForbiddenException(MyBaseException):
51 |     httperror = 403  # Forbidden - WWW Authentication won't help; there is no real HTTP error for "authentication (other than HTTP authentication) failed"
52 |     msg = "Not allowed: {what}"
53 | 
54 | class AuthenticationException(MyBaseException):
55 |     """
56 |     Raised when authentication fails
57 |     """
58 |     httperror = 403  # Forbidden - this should be 401 except that requires extra headers (see RFC2616)
59 |     msg = "Authentication Exception: {message}"
60 | 
61 | class IntentionallyUnimplementedException(MyBaseException):
62 |     """
63 |     Raised when some functionality is intentionally left unimplemented
64 |     """
65 |     httperror = 501
66 |     msg = "Intentionally not implemented: {message}"
67 | 
68 | class DecryptionFailException(MyBaseException):
69 |     """
70 |     Raised if decryption failed - this could be because it's the wrong (e.g.
old) key 71 | """ 72 | httperror = 500 73 | msg = "Decryption fail" 74 | 75 | class SecurityWarning(MyBaseException): 76 | msg = "Security warning: {message}" 77 | 78 | 79 | class AssertionFail(MyBaseException): #TODO-BACKPORT - console.assert on JS should throw this 80 | """ 81 | Raised when something that should be True isn't - usually a coding failure or some change not propogated fully 82 | """ 83 | httperror = 500 84 | msg = "{message}" 85 | 86 | class TransportURLNotFound(MyBaseException): 87 | httperror = 404 88 | msg = "{url} not found" 89 | 90 | class NoContentException(MyBaseException): 91 | httperror = 404 92 | msg = "No content found" 93 | 94 | class MultihashError(MyBaseException): 95 | httperror = 500 96 | msg = "Multihash error {message}" 97 | 98 | class SearchException(MyBaseException): 99 | httperror = 404 100 | msg = "{search} not found" 101 | 102 | class TransportFileNotFound(MyBaseException): 103 | httperror = 404 104 | msg = "file {file} not found" 105 | 106 | """ 107 | 108 | # Following are currently obsolete - not being used in Python or JS 109 | 110 | class PrivateKeyException(MyBaseException): 111 | #Raised when some code has not been implemented yet 112 | httperror = 500 113 | msg = "Operation requires Private Key, but only Public available." 114 | 115 | """ 116 | -------------------------------------------------------------------------------- /python/HashResolvers.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from .NameResolver import NameResolverFile 3 | from .miscutils import loads, dumps, httpget 4 | from .Errors import CodingException, NoContentException, ForbiddenException 5 | from .HashStore import LocationService, MimetypeService 6 | from .LocalResolver import LocalResolverFetch 7 | from .Multihash import Multihash 8 | from .DOI import DOIfile 9 | from .Archive import ArchiveItem, ArchiveFile 10 | from .config import config 11 | 12 | 13 | class HashResolver(NameResolverFile): 14 | """ 15 | Base class for Sha1Hex and ContentHash - used where we are instantiating something of unknown type from a hash of some form. 16 | 17 | Sha1Hex & ContentHash are classes for retrieval by a hash 18 | typically of form sha1hex/1a2b3c for SHA1 19 | 20 | Implements name resolution of the ContentHash namespace, via a local store and any other internal archive method 21 | 22 | Future Work 23 | * Build way to preload the hashstore with the hashes and URLs from various parts of the Archive 24 | """ 25 | namespace = None # Defined in subclasses 26 | multihashfield = None # Defined in subclasses 27 | archivefilemetadatafield = None # Defined in subclasses 28 | 29 | def __init__(self, namespace, hash, **kwargs): 30 | """ 31 | Creates the object 32 | 33 | :param namespace: "contenthash" 34 | :param hash: Hash representing the object - format is specified by namespace 35 | :param kwargs: Any other args to the URL, ignored for now. 36 | """ 37 | """ 38 | Pseudo-code 39 | Looks up the multihash in Location Service to find where can be retrieved from, does not retrieve it. 
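
        Example (hypothetical sha1; both lookups hit redis, not the network):
            f = Sha1Hex("sha1hex", "88d4b0d91acd3c25139804afbf4aef4e675bef63")
            f.url        # location recorded by LocationService, or None
            f.mimetype   # mimetype recorded by MimetypeService, or None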
40 | """ 41 | verbose = kwargs.get("verbose") 42 | if verbose: 43 | logging.debug("{0}.__init__({1}, {2}, {3})".format(self.__class__.__name__, namespace, hash, kwargs)) 44 | if namespace != self.namespace: # Defined in subclasses 45 | raise CodingException(message="namespace != "+self.namespace) 46 | super(HashResolver, self).__init__(self, namespace, hash, **kwargs) # Note ignores the name 47 | self.multihash = Multihash(**{self.multihashfield: hash}) 48 | self.url = LocationService.get(self.multihash.multihash58, verbose) #TODO-FUTURE recognize different types of location, currently assumes URL 49 | #logging.debug("XXX@HashResolver.__init__ setting {} .url = {}".format(self.multihash.multihash58, self.url)) 50 | self.mimetype = MimetypeService.get(self.multihash.multihash58, verbose) # Should be after DOIfile resolution, which will set mimetype in MimetypeService 51 | self._metadata = None # Not resolved yet 52 | self._doifile = None # Not resolved yet 53 | 54 | # noinspection PyMethodOverriding 55 | @classmethod 56 | def new(cls, namespace, hash, *args, **kwargs): 57 | """ 58 | Called by ServerGateway to handle a URL - passed the parts of the remainder of the URL after the requested format, 59 | 60 | :param namespace: 61 | :param hash: hash or next part of name within namespace 62 | :param args: rest of path 63 | :param kwargs: 64 | :return: 65 | :raise NoContentException: if cant find content directly or via other classes (like DOIfile) 66 | """ 67 | verbose = kwargs.get("verbose") 68 | if hash == HashFileEmpty.emptymeta[cls.archivefilemetadatafield]: 69 | return HashFileEmpty(verbose) # Empty file 70 | ch = super(HashResolver, cls).new(namespace, hash, *args, **kwargs) # By default (on NameResolver) calls cls() which goes to __init__ 71 | if not ch.url: 72 | if verbose: logging.debug("No URL, looking on archive for {0}.{1}".format(namespace, hash)) 73 | try: 74 | #!SEE-OTHERHASHES -this is where we look things up in the DOI.sql etc essentially cycle through some other classes, asking if they know the URL 75 | # ch = DOIfile(multihash=ch.multihash).url # Will fill in url if known. Note will now return a DOIfile, not a Sha1Hex 76 | return ch.searcharchivefor(verbose=verbose) # Will now be a ArchiveFile 77 | except NoContentException as e: 78 | pass; # If doesnt or cant find on archive, we can still check locally 79 | if not kwargs.get("nolocal") and (not ch.url or ch.url.startswith("local:")): 80 | ch = LocalResolverFetch.new("rawfetch", hash, **kwargs) 81 | if not (ch and ch.url): 82 | raise NoContentException() 83 | return ch 84 | 85 | def push(self, obj): 86 | """ 87 | Add a Shard to a ContentHash - 88 | :return: 89 | """ 90 | pass # Note could probably be defined on NameResolverFile class 91 | 92 | def retrieve(self, verbose=False, **kwargsx): 93 | """ 94 | Fetch the content, dont pass to caller (typically called by NameResolver.content() 95 | 96 | :returns: content - i.e. 
bytes 97 | :raise: TransportFileNotFound, ForbiddenException, HTTPError if cant find url 98 | """ 99 | # TODO-STREAMS future work to return a stream 100 | if not self.url: 101 | raise NoContentException() 102 | if self.url.startswith("local:"): 103 | raise CodingException(message="Shouldnt get here, should convert to LocalResolver in HashResolver.new: {0}".format(self.url)) 104 | """ 105 | u = self.url.split('/') 106 | if u[1] == "rawfetch": 107 | assert(False) # hook to LocalResolver/rawfetch if need this 108 | else: 109 | raise CodingException(message="unsupported for local: {0}".format(self.url)) 110 | """ 111 | else: 112 | return httpget(self.url) # Err TransportFileNotFound or HTTPError 113 | 114 | def searcharchivefor(self, multihash=None, verbose=False, **kwargs): 115 | # Note this only works on certain machines 116 | # And will return a ArchiveFile 117 | # Multihash must already be sha1, or convertable into it. 118 | mh = multihash or self.multihash 119 | if mh.code != mh.SHA1: #Can only search on sha1 currently 120 | raise NoContentException() 121 | searchurl = config["archive"]["url_sha1search"] + (multihash or self.multihash).sha1hex 122 | res = loads(httpget(searchurl)) 123 | #logging.info("XXX@searcharchivefor res={}".format(res)) 124 | if res.get("error"): 125 | # {"error": "internal use only"} 126 | raise ForbiddenException(what="SHA1 search from machine unless its whitelisted by Aaron ip=:"+res.get("error")) 127 | if not res["hits"]["total"]: 128 | # {"key":"sha1","val":"88d4b0d91acd3c25139804afbf4aef4e675bef63","hits":{"total":0,"matches":[]}} 129 | raise NoContentException() 130 | # {"key": "sha1", "val": "88...2", "hits": {"total": 1, "matches": [{"identifier": [""],"name": [""]}]}} 131 | firstmatch = res["hits"]["matches"][0] 132 | logging.info("ArchiveFile.new({},{},{}".format("archiveid", firstmatch["identifier"][0], firstmatch["name"][0])) 133 | return ArchiveItem.new("archiveid", firstmatch["identifier"][0], firstmatch["name"][0], verbose=True) # Note uses ArchiveItem because need to retrieve item level metadata as well 134 | 135 | def content(self, verbose=False, **kwargs): 136 | """ 137 | :returns: content - i.e. 
bytes 138 | """ 139 | data = self.retrieve() 140 | if verbose: logging.debug("Retrieved doc size={}".format(len(data))) 141 | return {'Content-type': self.mimetype, 142 | 'data': data, 143 | } 144 | 145 | def metadata(self, headers=True, verbose=False, **kwargs): 146 | """ 147 | :param verbose: 148 | :param headers: True if caller wants HTTP response headers 149 | :return: 150 | """ 151 | #logging.info("XXX@HR.metadata m={}, u={}".format(self._metadata, self.url)) 152 | if not self._metadata: 153 | try: 154 | if not self._doifile: 155 | self._doifile = DOIfile(multihash=self.multihash, verbose=verbose) # If not found, dont set url/metadata etc raises NoContentException 156 | self._metadata = self._metadata or ( 157 | self._doifile and self._doifile.metadata(headers=False, verbose=verbose)) 158 | except NoContentException as e: 159 | pass # Ignore absence of DOI file, try next 160 | if not self._metadata and self.url and self.url.startswith(config["archive"]["url_download"]): 161 | u = self.url[len(config["archive"]["url_download"]):].split('/') # [ itemid, filename ] 162 | self._metadata = ArchiveItem.new("archiveid", *u).metadata(headers=False) # Note will retun an ArchiveFile since passing the filename 163 | mimetype = 'application/json' # Note this is the mimetype of the response, not the mimetype of the file 164 | return {"Content-type": mimetype, "data": self._metadata} if headers else self._metadata 165 | 166 | # def canonical - not needed as already in a canonical form 167 | 168 | 169 | class Sha1Hex(HashResolver): 170 | """ 171 | URL: `/xxx/sha1hex/Q...` (forwarded here by ServerGateway methods) 172 | """ 173 | namespace = "sha1hex" 174 | multihashfield = "sha1hex" # Field to Multihash.init 175 | archivefilemetadatafield = "sha1" 176 | 177 | 178 | class ContentHash(HashResolver): 179 | """ 180 | URL: `/xxx/contenthash/Q...` (forwarded here by ServerGateway methods) 181 | """ 182 | namespace = "contenthash" 183 | multihashfield = "multihash58" # Field to Multihash.init 184 | archivefilemetadatafield = "multihash58" # Not quite true, really combined into URL in "contenthash" but this is for detecting emptyhash 185 | 186 | class HashFileEmpty(HashResolver): 187 | # Catch special case of an empty file and deliver an empty file 188 | emptymeta = { # Example from archive.org/metadata/AboutBan1935/AboutBan1935.asr.srt 189 | "name": "emptyfile.txt", 190 | "source": "original", 191 | "format": "Unknown", 192 | "size": "0", 193 | "md5": "d41d8cd98f00b204e9800998ecf8427e", 194 | "crc32": "00000000", 195 | "sha1": "da39a3ee5e6b4b0d3255bfef95601890afd80709", 196 | "contenthash": "contenthash:/contenthash/5dtpkBuw5TeS42SJSTt33HCE3ht4rC", 197 | "multihash58": "5dtpkBuw5TeS42SJSTt33HCE3ht4rC", 198 | } 199 | 200 | # noinspection PyMissingConstructor 201 | def __init__(self, verbose=False): 202 | # Intentionally not calling superclass's init. 
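        # (da39a3ee5e6b4b0d3255bfef95601890afd80709 in emptymeta above is the
        # well-known SHA-1 of zero bytes, and d41d8cd9... is likewise the MD5 of
        # zero bytes, so requests for the empty file need no lookup or retrieval.)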
203 | self.mimetype = "application/octet-stream" 204 | 205 | def retrieve(self, _headers=None, verbose=False, **kwargs): 206 | # Return a empty file 207 | return '' 208 | 209 | def metadata(self, headers=None, verbose=False, **kwargs): 210 | mimetype = 'application/json' # Note this is the mimetype of the response, not the mimetype of the file 211 | return {"Content-type": mimetype, "data": self.emptymeta } if headers else self.emptymeta 212 | -------------------------------------------------------------------------------- /python/HashStore.py: -------------------------------------------------------------------------------- 1 | """ 2 | Hash Store set of classes for storage and retrieval 3 | """ 4 | import redis 5 | import logging 6 | from .Errors import CodingException 7 | from .TransportIPFS import TransportIPFS 8 | from .miscutils import loads, dumps 9 | 10 | class HashStore(object): 11 | """ 12 | Superclass for key value storage, a shim around REDIS intended to be subclassed (see LocationService for example) 13 | 14 | Will tie to a REDIS database initially. 15 | 16 | Class Fields: 17 | _redis: redis object Redis Connection object once connection to redis once established, 18 | 19 | Fields: 20 | redisfield: string name of field in redis store being used. 21 | 22 | Class methods: 23 | redis() Initiate connection to redis or return already open one. 24 | 25 | Instance methods: 26 | hash_set(multihash, field, value, verbose=False) Set Redis.multihash.field to value 27 | hash_get(multihash, field, verbose=False) Retrieve value of Redis.multihash.field 28 | set(multihash, value, verbose=False) Set Redis.multihash. = value 29 | get(multihash, value, verbose=False) Retrieve Redis.multihash. 30 | 31 | Delete and Push are not supported but could be if required. 32 | 33 | Subclasses map 34 | 35 | Note Contenthash = multihash base58 of content (typically SHA1 on IA at present) 36 | itemid = archive's item id, e.g. "commute" 37 | 38 | Class StoredAt Maps To 39 | StateService __STATE__. field arbitraryvalue For global state 40 | StateService __STATE__.LastDHTround number? Used by cron_ipfs.py to track whats up next 41 | LocationService .location url As returned by rawstore or url of content on IA 42 | MimetypeService .mimetype mimetype 43 | IPLDService Not used currently 44 | IPLDHashService .ipld IPFS hash e.g. Q123 or z123 (hash of the IPLD) 45 | ThumbnailIPFSfromItemIdService .thumbnailipfs ipfsurl e.g. 
ipfs:/ipfs/Q1… 46 | MagnetLinkService bits:.magnetlink magnetlink 47 | MagnetLinkService archived:.magnetlink magnetlink 48 | TitleService archived:.title title Used to map collection item’s to their titles (cache search query) 49 | """ 50 | 51 | _redis = None # Will be connected to a redis instance by redis() 52 | redisfield = None # Subclasses define this, and use set & get 53 | 54 | @classmethod 55 | def redis(cls): 56 | if not HashStore._redis: 57 | logging.debug("HashStore connecting to Redis") 58 | HashStore._redis = redis.StrictRedis( # Note uses HashStore cos this connection is shared across subclasses 59 | host="localhost", 60 | port=6379, 61 | db=0, 62 | decode_responses=True 63 | ) 64 | return HashStore._redis 65 | 66 | def __init__(self): 67 | raise CodingException(message="It is meaningless to instantiate an instance of HashStore, its all class methods") 68 | 69 | @classmethod 70 | def hash_set(cls, multihash, field, value, verbose=False): 71 | """ 72 | :param multihash: 73 | :param field: 74 | :param value: 75 | :return: 76 | """ 77 | if verbose: logging.debug("Hash set: {0} {1}={2}".format(multihash, field, value)) 78 | cls.redis().hset(multihash, field, value) 79 | 80 | @classmethod 81 | def hash_get(cls, multihash, field, verbose=False): 82 | """ 83 | 84 | :param multihash: 85 | :param field: 86 | :return: 87 | """ 88 | res = cls.redis().hget(multihash, field) 89 | if verbose: logging.debug("Hash found: {0} {1}={2}".format(multihash, field, res)) 90 | return res 91 | 92 | @classmethod 93 | def set(cls, multihash, value, verbose=False): 94 | """ 95 | 96 | :param multihash: 97 | :param value: What we want to store in the redisfield 98 | :return: 99 | """ 100 | return cls.hash_set(multihash, cls.redisfield, value, verbose) 101 | 102 | @classmethod 103 | def get(cls, multihash, verbose=False): 104 | """ 105 | 106 | :param multihash: 107 | :return: string stored in Redis 108 | """ 109 | return cls.hash_get(multihash, cls.redisfield, verbose) 110 | 111 | 112 | @classmethod 113 | def archiveidget(cls, itemid, verbose=False): 114 | return cls.get("archiveid:"+itemid) 115 | 116 | @classmethod 117 | def archiveidset(cls, itemid, value, verbose=False): 118 | return cls.set("archiveid:" + itemid, value) 119 | 120 | @classmethod 121 | def btihget(cls, btihhash, verbose=False): 122 | return cls.get("btih:"+btihhash) 123 | 124 | @classmethod 125 | def btihset(cls, btihhash, value, verbose=False): 126 | return cls.set("btih:"+btihhash, value) 127 | 128 | class StateService(HashStore): 129 | """ 130 | Store some global state for the server 131 | 132 | Field Value Means 133 | LastDHTround ?? 
Used by cron_ipfs.py to record which part of hash table it last worked on 134 | """ 135 | 136 | @classmethod 137 | def set(cls, field, value, verbose=False): 138 | """ 139 | Store to global state 140 | field: Name of field to store 141 | value: Content to store 142 | """ 143 | return cls.hash_set("__STATE__", field, dumps(value), verbose) 144 | 145 | @classmethod 146 | def get(cls, field, verbose=False): 147 | """ 148 | Store to global state saving 149 | :param field: 150 | :return: string stored in Redis 151 | """ 152 | res = cls.hash_get("__STATE__", field, verbose) 153 | if res is None: 154 | return None 155 | else: 156 | return loads(res) 157 | 158 | class LocationService(HashStore): 159 | """ 160 | OLD NOTES 161 | Maps hashes to locations 162 | * set(multihash, location) 163 | * get(multihash) => url (currently) 164 | * Consumes: Hashstore 165 | * ConsumedBy: DOI Name Resolver 166 | 167 | The multihash represents a file or a part of a file. Build upon hashstore. 168 | It is split out because this could be a useful service on its own. 169 | """ 170 | redisfield = "location" 171 | 172 | 173 | class MimetypeService(HashStore): 174 | # Maps contenthash58 to mimetype 175 | redisfield = "mimetype" 176 | 177 | 178 | class IPLDService(HashStore): 179 | # TODO-IPFS may need to move this to ContentStore (which needs implementing) 180 | # Note this doesnt appear to be used except by IPLDFile/IPLDdir which themselves arent used 181 | redisfield = "ipld" 182 | 183 | 184 | class IPLDHashService(HashStore): 185 | # Maps contenthash58 to IPLD's multihash CIDv0 or CIDv1 186 | redisfield = "ipldhash" 187 | 188 | class ThumbnailIPFSfromItemIdService(HashStore): 189 | # Maps itemid to IPFS URL (e.g. ipfs:/ipfs/Q123...) 190 | redisfield = "thumbnailipfs" 191 | 192 | class MagnetLinkService(HashStore): 193 | # uses archiveidset/get 194 | redisfield = "magnetlink" 195 | 196 | class TitleService(HashStore): 197 | # Cache collection names, they dont change often enough to worry 198 | # uses archiveidset/get 199 | # TODO-REDIS note this is caching for ever, which is generally a bad idea ! 
Should figure out how to make Redis expire this cache every few days 200 | redisfield = "title" 201 | -------------------------------------------------------------------------------- /python/LocalResolver.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from .TransportLocal import TransportLocal 3 | from .NameResolver import NameResolverFile 4 | from .HashStore import LocationService 5 | from .Multihash import Multihash 6 | from .miscutils import loads, dumps 7 | from .Errors import TransportFileNotFound 8 | 9 | #TODO add caching to headers returned so not repeatedly pinged for same file 10 | 11 | class LocalResolver(NameResolverFile): 12 | """ 13 | Subclass of NameResolverFile to resolve hashes locally 14 | 15 | Attributes: 16 | _contenthash Multihash of content 17 | 18 | Supports 19 | contenthash via NameResolver default 20 | """ 21 | 22 | @classmethod 23 | def new(cls, namespace, *args, **kwargs): # Used by Gateway 24 | if kwargs.get("verbose"): 25 | logging.debug("{0}.new namespace={1} args={2} kwargs={3}" 26 | .format(cls.__name__, namespace, args, kwargs)) 27 | return super(LocalResolver, cls).new(namespace, *args, **kwargs) # Calls __init__() by default 28 | 29 | @staticmethod 30 | def transport(verbose=False): 31 | return TransportLocal(options={"local": {"dir": ".cache"}}, 32 | verbose=verbose) # TODO-LOCAL move to options at higher level 33 | 34 | class LocalResolverStore(LocalResolver): 35 | 36 | @classmethod 37 | def new(cls, namespace, *args, **kwargs): # Used by Gateway 38 | verbose = kwargs.get("verbose") 39 | obj = super(LocalResolverStore, cls).new(namespace, *args, **kwargs) # Calls __init__() by default 40 | res = cls.transport(verbose=verbose).rawstore(data=kwargs["data"], returns="contenthash,url") 41 | obj._contenthash = res["contenthash"] # Returned via contenthash() in NameResolveer 42 | obj.url = res["url"] #TODO-LOCAL this is going to be wrong its currently local:/rawfetch/Q... 43 | LocationService.set(obj._contenthash.multihash58, obj.url, verbose=verbose) # Let LocationService know we have it locally 44 | return obj 45 | 46 | 47 | 48 | class LocalResolverFetch(LocalResolver): 49 | @classmethod 50 | def new(cls, namespace, *args, **kwargs): # Used by Gateway 51 | verbose = kwargs.get("verbose") 52 | obj = super(LocalResolverFetch, cls).new(namespace, *args, **kwargs) # Calls __init__() by default 53 | obj._contenthash = Multihash(multihash58=args[0]) 54 | # Not looking up URL in LocationService yet, will look up if needed 55 | # Not fetching data, will be retrieved by content() method etc 56 | obj.url = cls.transport(verbose).url(multihash=obj._contenthash) 57 | return obj 58 | 59 | @property 60 | def mimetype(self): 61 | return "application/octet-stream" # By default we don't know what it is #TODO-LOCAL look up in MimetypeService just in case ... 
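
    # retrieve() below tries the local block store first; if the multihash is
    # not there it falls back to ContentHash resolution (which may find the same
    # content on archive.org), re-raising the original error if that also fails.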
62 | 63 | def retrieve(self, verbose=False, **kwargs): 64 | try: 65 | return self.transport(verbose=verbose).rawfetch(multihash=self._contenthash) 66 | except TransportFileNotFound as e1: # Not found in block store, lets try contenthash 67 | logging.debug("LocalResolverFetch.retrieve: err={}".format(e1)) 68 | try: 69 | from .HashResolvers import ContentHash # Avoid a circular reference 70 | contenthash = self._contenthash.multihash58 71 | logging.debug("LocalResolverFetch.retrieve falling back to contenthash: {}".format(contenthash)) 72 | return ContentHash.new("contenthash", contenthash, verbose=verbose, nolocal=True).retrieve(verbose=verbose) 73 | except Exception as e: 74 | logging.debug("Fallback failed, raising original error") 75 | raise e1 76 | 77 | class LocalResolverAdd(LocalResolver): 78 | 79 | @classmethod 80 | def new(cls, namespace, url, *args, data=None, **kwargs): # Used by Gateway 81 | verbose = kwargs.get("verbose") 82 | obj = super(LocalResolverAdd, cls).new(namespace, *args, **kwargs) # Calls __init__() by default 83 | if isinstance(data, (str, bytes)): # Assume its JSON 84 | data = loads(data) # HTTP just delivers bytes 85 | cls.transport(verbose=verbose).rawadd(url, data) 86 | return obj 87 | 88 | class LocalResolverList(LocalResolver): 89 | 90 | @classmethod 91 | def new(cls, namespace, hash, *args, data=None, **kwargs): # Used by Gateway 92 | verbose = kwargs.get("verbose") 93 | obj = super(LocalResolverList, cls).new(namespace, hash, *args, **kwargs) # Calls __init__() by default 94 | obj._contenthash = Multihash(multihash58=hash) 95 | return obj 96 | 97 | def metadata(self, headers=True, verbose=False, **kwargs): 98 | data = self.transport(verbose=verbose).rawlist(self._contenthash.multihash58, verbose=verbose) 99 | mimetype = 'application/json'; 100 | return {"Content-type": mimetype, "data": data} if headers else data 101 | 102 | class KeyValueTable(LocalResolver): 103 | @classmethod 104 | def new(cls, namespace, database, table, *args, data=None, **kwargs): # Used by Gateway 105 | verbose = kwargs.get("verbose") 106 | obj = super(KeyValueTable, cls).new(namespace, database, table, *args, **kwargs) # Calls __init__() by default 107 | obj.database = database 108 | obj.table = table 109 | #obj.data = data # For use by command 110 | #obj.args = args 111 | #obj.kwargs = kwargs # Esp key=a or key=[a,b,c] 112 | return obj 113 | 114 | def set(self, verbose=False, headers=False, data=None, **kwargs): # set/table/ 115 | #TODO check pubkey or have transport do it - and save with it 116 | if isinstance(data, (str, bytes)): # Assume its JSON 117 | data = loads(data) # HTTP just delivers bytes 118 | self.transport(verbose=verbose).set(database=self.database, table=self.table, keyvaluelist=data, value=None, verbose=verbose) 119 | 120 | def get(self, verbose=False, headers=False, **kwargs): # set/table/ 121 | # TODO check pubkey or have transport do it - and save with it 122 | res = self.transport(verbose=verbose).get(database=self.database, table=self.table, keys = kwargs["key"] if isinstance(kwargs["key"], list) else [kwargs["key"]], verbose=verbose) 123 | return { "Content-type": "application/json", "data": res} if headers else res 124 | 125 | 126 | def delete(self, verbose=False, headers=False, **kwargs): # set/table/ 127 | # TODO check pubkey or have transport do it - and save with it 128 | self.transport(verbose=verbose).delete(database=self.database, table=self.table, keys = kwargs["key"] if isinstance(kwargs["key"], list) else [kwargs["key"]], verbose=verbose) 129 | 130 
| def keys(self, verbose=False, headers=False, **kwargs): # set/table/ 131 | # TODO check pubkey or have transport do it - and save with it 132 | res = self.transport(verbose=verbose).keys(database=self.database, table=self.table, verbose=verbose) 133 | return { "Content-type": "application/json", "data": res} if headers else res 134 | 135 | def getall(self, verbose=False, headers=False, **kwargs): # set/table/ 136 | # TODO check pubkey or have transport do it - and save with it 137 | res = self.transport(verbose=verbose).getall(database=self.database, table=self.table, verbose=verbose) 138 | return { "Content-type": "application/json", "data": res} if headers else res 139 | 140 | 141 | -------------------------------------------------------------------------------- /python/Multihash.py: -------------------------------------------------------------------------------- 1 | """ 2 | A set of classes to hold different kinds of hashes etc and convert between them, 3 | 4 | Much of this was adapted from https://github.com/tehmaze/python-multihash, 5 | which seems to have evolved from the pip3 multihash, which is seriously broken. 6 | """ 7 | 8 | import hashlib 9 | import struct 10 | import sha3 11 | import pyblake2 12 | import base58 13 | import binascii 14 | import logging 15 | 16 | from sys import version as python_version 17 | if python_version.startswith('3'): 18 | from urllib.parse import urlparse 19 | else: 20 | from urlparse import urlparse # See https://docs.python.org/2/library/urlparse.html 21 | from .Errors import MultihashError 22 | 23 | class Multihash(object): 24 | """ 25 | Superclass for all kinds of hashes, this is for convenience in passing things around between some places that want binary, or 26 | multihash or hex. 27 | 28 | core storage is as a multihash_binary i.e. [ code, length, digest...] 29 | 30 | Each instance: 31 | code = SHA1, SHA256 etc (uses integer conventions from multihash 32 | """ 33 | 34 | # Constants 35 | # 0x01..0x0F are app specific (unused) 36 | SHA1 = 0x11 37 | SHA2_256 = 0x12 38 | SHA2_512 = 0x13 39 | SHA3 = 0x14 40 | BLAKE2B = 0x40 41 | BLAKE2S = 0x41 42 | 43 | FUNCS = { 44 | SHA1: hashlib.sha1, 45 | SHA2_256: hashlib.sha256, 46 | # Alternative use nacl.hash.sha256(data, encoder=nacl.encoding.RawEncoder) which has different footprint 47 | SHA2_512: hashlib.sha512, 48 | SHA3: lambda: hashlib.new('sha3_512'), 49 | BLAKE2B: lambda: pyblake2.blake2b(), 50 | BLAKE2S: lambda: pyblake2.blake2s(), 51 | } 52 | LENGTHS = { 53 | SHA1: 20, 54 | SHA2_256: 32, 55 | SHA2_512: 64, 56 | SHA3: 64, 57 | BLAKE2B: 64, 58 | BLAKE2S: 32, 59 | } 60 | 61 | def assertions(self, code=None): 62 | if code and code != self.code: 63 | raise MultihashError(message="Expecting code {}, got {}".format(code, self.code)) 64 | if self.code not in self.FUNCS: 65 | raise MultihashError(message="Unsupported Hash type {}".format(self.code)) 66 | if (self.digestlength != len(self.digest)) or (self.digestlength != self.LENGTHS[self.code]): 67 | raise MultihashError(message="Invalid lengths: expect {}, byte {}, len {}" 68 | .format(self.LENGTHS[self.code], self.digestlength, len(self.digest))) 69 | 70 | def __init__(self, multihash58=None, sha1hex=None, data=None, code=None, url=None): 71 | """ 72 | Accept variety of parameters, 73 | 74 | :param multihash_58: 75 | """ 76 | digest = None 77 | 78 | if url: # Assume its of the form somescheme:/somescheme/Q... 79 | logging.debug("url={} {}".format(url.__class__.__name__,url)) 80 | if isinstance(url, str) and "/" in url: # https://.../Q... 
81 | url = urlparse(url) 82 | if not isinstance(url, str): 83 | multihash58 = url.path.split('/')[-1] 84 | else: 85 | multihash58 = url 86 | if multihash58[0] not in ('5','Q'): # Simplistic check that it looks ok-ish 87 | raise MultihashError(message="Invalid hash portion of URL {}".format(multihash58)) 88 | if multihash58: 89 | self._multihash_binary = base58.b58decode(multihash58) 90 | if sha1hex: 91 | if python_version.startswith('2'): 92 | digest = sha1hex.decode('hex') # Python2 93 | else: 94 | digest = bytes.fromhex(sha1hex) # Python3 95 | code = self.SHA1 96 | if data and code: 97 | digest = self._hash(code, data) 98 | if digest and code: 99 | self._multihash_binary = bytearray([code, len(digest)]) 100 | self._multihash_binary.extend(digest) 101 | self.assertions() # Check consistency 102 | 103 | def _hash(self, code, data): 104 | if not code in self.FUNCS: 105 | raise MultihashError(message="Cant encode hash code={}".format(code)) 106 | hashfn = self.FUNCS.get(code)() # Note it calls the function in that strange way hashes work! 107 | if isinstance(data, bytes): 108 | hashfn.update(data) 109 | elif isinstance(data, str): 110 | # In Python 3 this is ok, would be better if we were sure it was utf8 111 | # raise MultihashError(message="Should be passing bytes, not strings as could encode multiple ways") # TODO can remove this if really need to handle UTF8 strings, but better to push conversion upstream 112 | hashfn.update(data.encode('utf-8')) 113 | return hashfn.digest() 114 | 115 | def check(self, data): 116 | assert self.digest == self._hash(self.code, data), "Hash doesnt match expected" 117 | 118 | @property 119 | def code(self): 120 | return self._multihash_binary[0] 121 | 122 | @property 123 | def digestlength(self): 124 | return self._multihash_binary[1] 125 | 126 | @property 127 | def digest(self): 128 | """ 129 | :return: bytes, the digest part of any multihash 130 | """ 131 | return self._multihash_binary[2:] 132 | 133 | @property 134 | def sha1hex(self): 135 | """ 136 | :return: The hex of the sha1 (as used in DOI sqlite tables) 137 | """ 138 | self.assertions(self.SHA1) 139 | return binascii.hexlify(self.digest).decode('utf-8') # The decode is turn bytes b'a1b2' to str 'a1b2' 140 | 141 | @property 142 | def multihash58(self): 143 | foo = base58.b58encode(bytes(self._multihash_binary)) # Documentation says returns bytes, Mac returns string, want string 144 | if isinstance(foo,bytes): 145 | return foo.decode('ascii') 146 | else: 147 | return foo -------------------------------------------------------------------------------- /python/NameResolver.py: -------------------------------------------------------------------------------- 1 | import logging 2 | import requests 3 | from urllib.parse import urlparse 4 | from .Errors import ToBeImplementedException, NoContentException, IPFSException 5 | from .Multihash import Multihash 6 | from .HashStore import LocationService, MimetypeService, IPLDHashService 7 | from .config import config 8 | from .miscutils import httpget 9 | from .TransportIPFS import TransportIPFS 10 | 11 | 12 | 13 | class NameResolver(object): 14 | """ 15 | The NameResolver group of classes manage recognizing a name, and connecting it to resources 16 | we have at the Archive. 17 | 18 | These are base classes for specific name resolvers like DOI 19 | 20 | it specifies a set of methods we expect to be able to do on a subclass, 21 | and may have default code for some of them based on assumptions about the data structure of subclasses. 
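
    For example, the intended calling pattern is roughly (names are
    illustrative; "doi" is one concrete namespace):
        obj = SomeResolver.new("doi", "10.1234/example", verbose=False)
        obj.content()    # -> {"Content-type": <mimetype>, "data": <bytes>}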
22 | 23 | Each subclass of NameResolver must support: 24 | content() Generate an output to return to a browser. (Can be a dict, array or string, later will add Streams) (?? Not sure if should implement content for dirs) 25 | 26 | Each subclass of NameResolver can provide, but can also use default: 27 | contenthash() The hash of the content 28 | 29 | Logically it can represent one or multiple files depending on subclass 30 | 31 | Attributes reqd: 32 | name: Name of the object being retrieved (short string) 33 | namespace: Store the namespace here. 34 | 35 | A subclass can have any meta-data fields, recommended ones include. 36 | contentSize: The size of the content in brief (compatible with Schema.org, not compatible with standard Archive metadata) 37 | contentType: The mime-type of the content, (TODO check against schema.org), not compatible with standard Archive metadata which uses three letter types like PNG 38 | """ 39 | 40 | def __init__(self, namespace, *args, **kwargs): # Careful if change, note its the default __init__ for NameResolverDir, NameResolverFile, NameResolverSearch etc 41 | self._list = [] 42 | 43 | @classmethod 44 | def new(cls, namespace, *args, **kwargs): 45 | """ 46 | Default creation of new obj, returns None if not found (to allow multiple attempts to instantiate) 47 | 48 | :param namespace: 49 | :param args: 50 | :param kwargs: 51 | :return: 52 | """ 53 | try: 54 | return cls(namespace, *args, **kwargs) 55 | except NoContentException: 56 | return None 57 | 58 | def retrieve(self, _headers=None, verbose=False, **kwargs): 59 | """ 60 | 61 | :return: 62 | """ 63 | raise ToBeImplementedException(name=self.__class__.__name__+".retrieve()") 64 | 65 | def content(self, _headers=None, verbose=False, **kwargs): 66 | """ 67 | Return the content, by default its just the result of self.retrieve() which must be defined in superclass 68 | Requires mimetype to be set in subclass 69 | 70 | :param verbose: 71 | :return: 72 | """ 73 | return {"Content-type": self.mimetype, "data": self.retrieve(_headers=_headers)} 74 | 75 | def metadata(self, verbose=False, **kwargs): 76 | """ 77 | 78 | :return: 79 | """ 80 | raise ToBeImplementedException(name=self.__class__.__name__+".metadata()") 81 | 82 | def contenthash(self, verbose=False): 83 | """ 84 | By default contenthash is the hash of the content. 85 | 86 | :return: 87 | """ 88 | if not self._contenthash: 89 | self._contenthash = Multihash(data=self.content(), code=Multihash.SHA2_256) 90 | return {'Content-type': 'text/plain', 91 | 'data': self._contenthash.multihash58 92 | } 93 | 94 | def contenturl(self, verbose=False): 95 | """ 96 | By default contenthash is the hash of the content. 97 | 98 | :return: 99 | """ 100 | if not self._contenthash: 101 | self._contenthash = Multihash(data=self.content(), code=Multihash.SHA2_256) 102 | return {'Content-type': 'text/plain', 103 | 'data': "https://dweb.me/contenthash/"+self._contenthash.multihash58, # TODO parameterise server name, maybe store from incoming URL 104 | } 105 | 106 | def push(self, obj): 107 | """ 108 | Add a NameResolverShard to a NameResolverFile or a NameResolverFile to a NameResolverDir - in both cases on _list field 109 | Doesnt check class of object added to allow for variety of nested constructs. 
110 | 
111 |         :param obj: NameResolverShard, NameResolverFile, or NameResolverDir
112 |         :return:
113 |         """
114 |         self._list.append(obj)
115 | 
116 |     @classmethod
117 |     def canonical(cls, namespace, *args, **kwargs):
118 |         """
119 |         If this method isn't subclassed, then it's already a canonical form, so return with slashes
120 | 
121 |         :param cls:
122 |         :param namespace:
123 |         :param [args]: List of arguments to URL
124 |         :return: Concatenated args with / by default (subclasses will override)
125 |         """
126 |         return namespace, '/'.join(args)  # By default reconcatenate args
127 | 
128 | 
129 | class NameResolverDir(NameResolver):
130 | 
131 |     """
132 |     Represents a set of files,
133 | 
134 |     Attributes:
135 |         _list: Hold data for a list of files (NameResolverFile) in the directory.
136 |         files(): An iterator over _list - returns NameResolverFile
137 |         name: Name of the directory
138 |     """
139 |     def files(self):
140 |         return self._list
141 | 
142 | 
143 | class NameResolverFile(NameResolver):
144 |     """
145 |     Represents a single file, and its shards,
146 |     It contains enough info for retrieval of the file e.g. HTTP URL, or server and path. Also can have byterange,
147 | 
148 |     Attributes:
149 |         _list: Hold data for a list of shards in this file.
150 |         shards(): An iterator over _list
151 |         See NameResolver for other metadata fields
152 | 
153 |     TODO - define fields for location & byterange
154 | 
155 |     Any other field can be used as namespace specific metadata
156 |     """
157 |     shardsize = 256000  # A default for shard size, TODO-IPLD determine best size, subclasses can overwrite, or ignore for things like video.
158 | 
159 |     def shards(self):
160 |         """
161 |         Return an iterator that returns each of the NameResolverShard in the file's _list attribute.
162 |         Each time called, it should:
163 |         * read next `shardsize` bytes from content (either from a specific byterange, or by reading from an open stream)
164 |         * Pass that through multihash58 service to get a base58 multihash
165 |         * Return that multihash, plus metadata (size may be all required)
166 |         * Store the mapping between that multihash, and location (inc byterange) in locationstore
167 |         * May need to cache the structure, but since the IPLD that calls this will be cached, that might not be needed.
168 |         """
169 |         raise ToBeImplementedException(name="NameResolverFile.shards")
170 | 
171 | 
172 |     def cache_ipfs(self, url=None, data=None, forceurlstore=False, forceadd=False, printlog=False, announcedht=False, size=None, verbose=False):
173 |         """
174 |         Cache in IPFS; will automatically select no action, urlstore or add unless constrained by forcexxx
175 |         Before doing this, should have checked if IPLDHashService can return the hash already
176 | 
177 |         :param url:           # If present is the url of the file
178 |         :param data:          # If present is the data for the file
179 |         :param forceurlstore: # Override default and use urlstore
180 |         :param forceadd:      # Override default and use add
181 |         :raises:              # IPFSException if IPFS is failing
182 |         :return:              # IPLDhash
183 | 
184 |         Logical combinations of arguments attempt to get the "right" result.
185 | forceurlstore && url => urlstore
186 | forceurlstore && !url => error
187 | forceadd && data => add
188 | forceadd && !data && url => fetch data then add
189 | url && data && !forceurlstore && !forceadd => default to urlstore (ignore data)
190 | """
191 | # TODO-PERMS, this can't be caching to IPFS if we don't have permission
192 | if not config["ipfs"].get("url_urlstore"): # If not running on a machine with urlstore
193 | forceadd = True
194 | if url and forceadd: # To "add" from an URL we need to retrieve the data and then add it
195 | (data, self.mimetype) = httpget(url, wantmime=True)
196 | if not self.multihash: # Since we've got the data, we can compute SHA1 from it
197 | if verbose: logging.debug("Computing SHA1 hash of url {}".format(url))
198 | self.multihash = Multihash(data=data, code=Multihash.SHA1)
199 | # Since we retrieved the mimetype we can save it, since it's not set in metadata
200 | MimetypeService.set(self.multihash.multihash58, self.mimetype, verbose=verbose)
201 | if (url and not forceadd):
202 | did = "urlstore"
203 | ipldurl = TransportIPFS().store(urlfrom=url, pinggateway=False, verbose=verbose) # Can throw IPFSException
204 | elif data: # Either provided or fetched from URL
205 | did = "add"
206 | ipldurl = TransportIPFS().store(data=data, pinggateway=False, mimetype=self.mimetype, verbose=verbose)
207 | else:
208 | raise errors.CodingException(message="Invalid options to cache_ipfs forceurlstore={} forceadd={} url={} data len={}"\
209 | .format(forceurlstore, forceadd, url, len(data) if data else 0))
210 | # Each of the successful routes through the above leaves us with ipldurl
211 | ipldhash = urlparse(ipldurl).path.split('/')[2]
212 | if announcedht:
213 | TransportIPFS().announcedht(ipldhash) # Let the DHT know - don't wait up to 10 hours for the next cycle
214 | IPLDHashService.set(self.multihash.multihash58, ipldhash)
215 | # ("URL", "Add/Urlstore", "Hash", "Size", "Announced")
216 | if size and data and (len(data) != size):
217 | size = "{}!={}".format(size, len(data))
218 | print('"{}","{}","{}","{}","{}"'.format(url, did, ipldhash, size, announcedht))
219 | return ipldhash
220 | 
221 | 
222 | def cache_content(self, url, wantipfs=False, verbose=False):
223 | """
224 | Retrieve content from a URL, cache it in various places (especially IPFS), and set tables so it can be retrieved by contenthash
225 | 
226 | Requires multihash to be set prior to this; if required it could be set from the retrieved data
227 | Call path is ArchiveFile.metadata > ArchiveFile.cache_content > NameResolverFile.cache_content
228 | 
229 | :param url: URL - typically inside archive.org - of the contents
230 | :param wantipfs: True if the content should also be cached in IPFS
231 | (only the IPFS case is handled by this flag)
232 | :param verbose:
233 | :return:
234 | """
235 | ipldhash = self.multihash and IPLDHashService.get(self.multihash.multihash58) # May be None - we don't know it
236 | if ipldhash:
237 | self.mimetype = MimetypeService.get(self.multihash.multihash58, verbose=verbose)
238 | ipldhash = IPLDHashService.get(self.multihash.multihash58, verbose=verbose)
239 | else:
240 | if wantipfs:
241 | #TODO could check sha1 here, but would be slow
242 | #TODO-URLSTORE delete old cache
243 | #TODO-URLSTORE - check don't need mimetype
244 | if not self.multihash:
245 | (data, self.mimetype) = httpget(url, wantmime=True) # SLOW - retrieval
246 | if verbose: logging.debug("Computing SHA1 hash of url {}".format(url))
247 | self.multihash = Multihash(data=data, code=Multihash.SHA1)
248 | 
ipldhash = self.multihash and IPLDHashService.get(self.multihash.multihash58) # Try again now have hash 249 | MimetypeService.set(self.multihash.multihash58, self.mimetype, verbose=verbose) 250 | if not ipldhash: # We might have got it now especially for _files.xml if unchanged- 251 | # Can throw IPFSException - ignore it 252 | try: 253 | # TODO-PERMS cache_ipfs should be checking permissions 254 | ipldhash = self.cache_ipfs(url=url, verbose=verbose, announcedht=False) # Not announcing to DHT here, its too slow (16+ seconds) better to let first client fail, try gateway, fail again, and subsequent work. 255 | if verbose: logging.debug("ipfs pushed to: {}".format(ipldhash)) 256 | except IPFSException as e: 257 | pass # Ignore it - wont have an ipldhash, but usually dont care 258 | if self.multihash: 259 | LocationService.set(self.multihash.multihash58, url, verbose=verbose) 260 | return {"ipldhash": ipldhash} 261 | 262 | 263 | class NameResolverShard(NameResolver): 264 | """ 265 | Represents a single shard returned by a NameResolverFile.shards() iterator 266 | Holds enough info to do a byte-range retrieval of just those bytes from a server, 267 | And a multihash that could be retrieved by IPFS for just this shard. 268 | """ 269 | pass 270 | 271 | 272 | class NameResolverSearchItem(NameResolver): 273 | """ 274 | Represents each element in a search 275 | """ 276 | pass 277 | 278 | 279 | class NameResolverSearch(NameResolver): 280 | """ 281 | Represents the results of a search 282 | """ 283 | pass 284 | -------------------------------------------------------------------------------- /python/OutputFormat.py: -------------------------------------------------------------------------------- 1 | class OutputFormat(object): 2 | pass 3 | 4 | -------------------------------------------------------------------------------- /python/ServerBase.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import logging 3 | from .miscutils import dumps # Use our own version of dumps - more compact and handles datetime etc 4 | from json import loads # Not our own loads since dumps is JSON compliant 5 | from sys import version as python_version 6 | from cgi import parse_header, parse_multipart 7 | #from Dweb import Dweb # Import Dweb library (wont use for Academic project 8 | #TODO-API needs writing up 9 | import html 10 | from http import HTTPStatus 11 | from .config import config 12 | 13 | """ 14 | This file is intended to be Application independent , i.e. 
not dependent on Dweb Library 15 | """ 16 | 17 | if python_version.startswith('3'): 18 | from urllib.parse import parse_qs, parse_qsl, urlparse, unquote 19 | from http.server import BaseHTTPRequestHandler, HTTPServer 20 | from socketserver import ThreadingMixIn 21 | else: # Python 2 22 | from urlparse import parse_qs, parse_qsl, urlparse # See https://docs.python.org/2/library/urlparse.html 23 | from urllib import unquote 24 | from SocketServer import ThreadingMixIn 25 | import threading 26 | from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer 27 | # See https://docs.python.org/2/library/basehttpserver.html for docs on how servers work 28 | # also /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/BaseHTTPServer.py for good error code list 29 | 30 | import traceback 31 | 32 | from .Errors import MyBaseException, ToBeImplementedException, TransportFileNotFound 33 | #from Transport import TransportBlockNotFound, TransportFileNotFound 34 | #from TransportHTTP import TransportHTTP 35 | 36 | class HTTPdispatcherException(MyBaseException): 37 | httperror = 400 # Unimplemented 38 | msg = "HTTP request {req} not recognized" 39 | 40 | class HTTPargrequiredException(MyBaseException): 41 | httperror = 400 # UnimplementedAccess 42 | msg = "HTTP request {req} requires {arg}" 43 | 44 | class DWEBMalformedURLException(MyBaseException): 45 | httperror = 400 46 | msg = "Malformed URL {path}" 47 | 48 | class ThreadedHTTPServer(ThreadingMixIn, HTTPServer): 49 | """Handle requests in a separate thread.""" 50 | 51 | class MyHTTPRequestHandler(BaseHTTPRequestHandler): 52 | """ 53 | Generic HTTPRequestHandler, extends BaseHTTPRequestHandler, to make it easier to use 54 | """ 55 | # Carefull - do not define __init__ as it is run for each incoming request. 56 | # TODO-STREAMS add support for longer (streamed) files on both upload and download, allow a stream to be passed back from the subclasses routines. 57 | 58 | """ 59 | Simple (standard) HTTPdispatcher, 60 | Subclasses should define "exposed" as a list of exposed methods 61 | """ 62 | exposed = [] 63 | protocol_version = "HTTP/1.1" 64 | onlyexposed = False # Dont Limit to @exposed functions (override in subclass if using @exposed) 65 | defaultipandport = { "ipandport": ('localhost', 8080) } 66 | expectedExceptions = () # List any exceptions that you "expect" (and dont want stacktraces for) 67 | 68 | @classmethod 69 | def serve_forever(cls, ipandport=None, verbose=False, **options): 70 | """ 71 | Start a server, 72 | ERR: socket.error if address(port) in use. 73 | 74 | :param ipandport: Ip and port to listen on, else use defaultipandport 75 | :param verbose: If want debugging 76 | :param options: Stored on class for access by handlers 77 | :return: Never returns 78 | """ 79 | cls.ipandport = ipandport or cls.defaultipandport 80 | cls.verbose = verbose 81 | cls.options = options 82 | #HTTPServer(cls.ipandport, cls).serve_forever() # Start http server 83 | logging.info("Server starting on {0}:{1}:{2}".format(cls.ipandport[0], cls.ipandport[1], cls.options or "")) 84 | ThreadedHTTPServer(cls.ipandport, cls).serve_forever() # OR Start http server 85 | logging.error("Server exited") # It never should 86 | 87 | def _dispatch(self, **postvars): 88 | """ 89 | HTTP dispatcher (replaced a more complex version Sept 2017 90 | URLS of form GET /foo/bar/baz?a=b,c=d 91 | Are passed to foo(bar,baz,a=b,c=d) which mirrors Python argument conventions i.e. 
if def foo(bar,baz,**kwargs) then foo(aaa,bbb) == foo(baz=bbb, bar=aaa) 92 | POST will pass a dictionary, if its just a body text or json it will be passed with a single value { date: content data } 93 | In case of conflict, postvars overwrite args in the query string, but you shouldn't be getting both in most cases. 94 | 95 | :param vars: 96 | :return: 97 | """ 98 | # In documentation, assuming call with /foo/aaa/bbb?x=ccc,y=ddd 99 | try: 100 | # TODO-PERMS make sure we are getting X-ORIGINATING-IP or similar here then make sure passed all way thru to httpget callls 101 | logging.info("dispatcher: {0}".format(self.path)) # Always log URLs in 102 | o = urlparse(self.path) # Parsed URL {path:"/foo/aaa/bbb", query: "bbb?x=ccc,y=ddd"} 103 | 104 | # Get url args, remove HTTP quote (e.g. %20=' '), ignore leading / and anything before it. Will always be at least one item (empty after /) 105 | args = [ unquote(u) for u in o.path.split('/')][1:] 106 | cmd = args.pop(0) # foo 107 | #kwargs = dict(parse_qsl(o.query)) # { baz: bbb, bar: aaa } 108 | kwargs = {} 109 | for (k,b) in parse_qsl(o.query): 110 | a = kwargs.get(k) 111 | kwargs[k] = b if (a is None) else a+[b] if (isinstance(a,list)) else [a,b] 112 | if cmd == "": 113 | cmd = config["httpserver"]["root_path"]; 114 | # Drop through and parse that command 115 | if cmd == "favicon.ico": # May general case this for a set of top level links e.g. robots.txt 116 | self.send_response(301) 117 | self.send_header('Location',config["httpserver"]["favicon_url"]) 118 | self.end_headers() 119 | elif cmd in config["ignoreurls"]: # Looks like hacking or ignorable e.g. robots.txt, note this just ignores /arc/archive.org/xyz 120 | raise TransportFileNotFound(file=o.path) 121 | else: 122 | kwargs.update(postvars) 123 | 124 | cmds = [self.command + "_" + cmd, cmd, self.command + "_" + cmd.replace(".","_"), cmd.replace(".","_")] 125 | try: 126 | func = next(getattr(self, c, None) for c in cmds if getattr(self, c, None)) 127 | except StopIteration: 128 | func = None 129 | #func = getattr(self, self.command + "_" + cmd, None) or getattr(self, cmd, None) # self.POST_foo or self.foo (should be a method) 130 | if not func or (self.onlyexposed and not func.exposed): 131 | raise HTTPdispatcherException(req=cmd) # Will be caught in except 132 | res = func(*args, **kwargs) 133 | # Function should return 134 | 135 | # Send the content-type 136 | self.send_response(200) # Send an ok response 137 | contenttype = res.get("Content-type","application/octet-stream") 138 | self.send_header('Content-type', contenttype) 139 | if self.headers.get('Origin'): # Handle CORS (Cross-Origin) 140 | self.send_header('Access-Control-Allow-Origin', '*') 141 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didnt work 142 | data = res.get("data","") 143 | if data or isinstance(data, (list, tuple, dict)): # Allow empty arrays toreturn as [] or empty dict as {} 144 | if isinstance(data, (dict, list, tuple)): # Turn it into JSON 145 | data = dumps(data) # Does our own version to handle classes like datetime 146 | #elif hasattr(data, "dumps"): # Unclear if this is used except maybe in TransportDist_Peer 147 | # raise ToBeImplementedException(message="Just checking if this is used anywhere, dont think so") 148 | # data = dumps(data) # And maype this should be data.dumps() 149 | if isinstance(data, str): 150 | #logging.debug("converting to utf-8") 151 | if python_version.startswith('2'): # Python3 should be unicode, need to be careful if convert 152 | if 
contenttype.startswith('text') or contenttype in ('application/json',): # Only convert types we know are strings that could be unicode
153 | data = data.encode("utf-8") # Needed to make sure any unicode in data is converted to utf-8, BUT won't work for intended binary -- it's still a string
154 | if python_version.startswith('3'):
155 | data = bytes(data,"utf-8") # In Python3 requests won't work on strings, have to convert to bytes explicitly
156 | if not isinstance(data, (bytes, str)):
157 | #logging.debug(data)
158 | # Raise an exception - will not honor the status already sent, but this shouldn't happen, as it's a coding
159 | # error in the dispatched function if it returns anything else
160 | raise ToBeImplementedException(name=self.__class__.__name__+"._dispatch for return data "+data.__class__.__name__)
161 | self.send_header('content-length', str(len(data)) if data else '0')
162 | self.end_headers()
163 | if data:
164 | self.wfile.write(data) # Write content of result if applicable
165 | # Throws BrokenPipeError if browser has gone away
166 | #self.wfile.close()
167 | except BrokenPipeError as e:
168 | logging.error("Broken Pipe Error (browser probably gave up waiting) url={}".format(self.path))
169 | # Don't send error as the browser has gone away
170 | except Exception as e: # Gentle errors, an entry in the log is sufficient (note the line is app-specific)
171 | # TypeError message will be like "sandbox() takes exactly 3 arguments (2 given)" or whatever exception is returned by the function
172 | httperror = e.httperror if hasattr(e, "httperror") else 500
173 | if not (self.expectedExceptions and isinstance(e, self.expectedExceptions)): # Unexpected error
174 | logging.error("Sending Unexpected Error {0}:".format(httperror), exc_info=True)
175 | else:
176 | logging.info("Sending Error {0}:{1}".format(httperror, str(e)))
177 | #if self.headers.get('Origin'): # Handle CORS (Cross-Origin)
178 | #self.send_header('Access-Control-Allow-Origin', '*') # '*' didn't work
179 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didn't work
180 | self.send_error(httperror, str(e)) # Send an error response
181 | 
182 | 
183 | def do_GET(self):
184 | #logging.debug(self.headers)
185 | self._dispatch()
186 | 
187 | def do_OPTIONS(self):
188 | #logging.info("Options request")
189 | self.send_response(200)
190 | self.send_header('Access-Control-Allow-Methods', "POST,GET,OPTIONS")
191 | self.send_header('Access-Control-Allow-Headers', self.headers['Access-Control-Request-Headers']) # Allow anything, but '*' doesn't work
192 | self.send_header('content-length','0')
193 | self.send_header('Content-Type','text/plain')
194 | if self.headers.get('Origin'):
195 | self.send_header('Access-Control-Allow-Origin', '*') # '*' didn't work
196 | # self.send_header('Access-Control-Allow-Origin', self.headers['Origin']) # '*' didn't work
197 | self.end_headers()
198 | 
199 | def do_POST(self):
200 | """
201 | Handle an HTTP POST - reads data in a variety of common formats and passes it to _dispatch
202 | 
203 | :return:
204 | """
205 | try:
206 | #logging.debug(self.headers)
207 | ctype, pdict = parse_header(self.headers['content-type'])
208 | #logging.debug("Contenttype={0}, dict={1}".format(ctype, pdict))
209 | if ctype == 'multipart/form-data':
210 | postvars = parse_multipart(self.rfile, pdict)
211 | elif ctype == 'application/x-www-form-urlencoded':
212 | # This route is taken by browsers using jquery as there is no easy way to upload with octet-stream
213 | # If it's just singular like data="foo" then return single values, else (unusual)
lists 214 | length = int(self.headers['content-length']) 215 | postvars = { p: (q[0] if (isinstance(q, list) and len(q)==1) else q) for p,q in parse_qs( 216 | self.rfile.read(length), 217 | keep_blank_values=1).items() } # In Python2 this was iteritems, I think items will work in both cases. 218 | elif ctype in ('application/octet-stream', 'text/plain'): # Block sends this 219 | length = int(self.headers['content-length']) 220 | postvars = {"data": self.rfile.read(length)} 221 | elif ctype == 'application/json': 222 | length = int(self.headers['content-length']) 223 | postvars = {"data": loads(self.rfile.read(length))} 224 | else: 225 | postvars = {} 226 | self._dispatch(**postvars) 227 | except Exception as e: 228 | #except ZeroDivisionError as e: # Uncomment this to actually throw exception (since it wont be caught here) 229 | # Return error to user, should have been logged already 230 | httperror = e.httperror if hasattr(e, "httperror") else 500 231 | self.send_error(httperror, str(e)) # Send an error response 232 | 233 | def send_error(self, code, message=None, explain=None): 234 | """ 235 | THIS IS A COPY OF superclass's send_error with cors header added 236 | """ 237 | """Send and log an error reply. 238 | 239 | Arguments are 240 | * code: an HTTP error code 241 | 3 digits 242 | * message: a simple optional 1 line reason phrase. 243 | *( HTAB / SP / VCHAR / %x80-FF ) 244 | defaults to short entry matching the response code 245 | * explain: a detailed message defaults to the long entry 246 | matching the response code. 247 | 248 | This sends an error response (so it must be called before any 249 | output has been generated), logs the error, and finally sends 250 | a piece of HTML explaining the error to the user. 251 | 252 | """ 253 | 254 | try: 255 | shortmsg, longmsg = self.responses[code] 256 | except KeyError: 257 | shortmsg, longmsg = '???', '???' 258 | if message is None: 259 | message = shortmsg 260 | if explain is None: 261 | explain = longmsg 262 | self.log_error("code %d, message %s", code, message) 263 | self.send_response(code, message) 264 | self.send_header('Connection', 'close') 265 | 266 | # Message body is omitted for cases described in: 267 | # - RFC7230: 3.3. 1xx, 204(No Content), 304(Not Modified) 268 | # - RFC7231: 6.3.6. 
205(Reset Content) 269 | body = None 270 | if (code >= 200 and 271 | code not in (HTTPStatus.NO_CONTENT, 272 | HTTPStatus.RESET_CONTENT, 273 | HTTPStatus.NOT_MODIFIED)): 274 | # HTML encode to prevent Cross Site Scripting attacks 275 | # (see bug #1100201) 276 | content = (self.error_message_format % { 277 | 'code': code, 278 | 'message': html.escape(message, quote=False), 279 | 'explain': html.escape(explain, quote=False) 280 | }) 281 | body = content.encode('UTF-8', 'replace') 282 | self.send_header("Content-Type", self.error_content_type) 283 | self.send_header('Content-Length', int(len(body))) 284 | self.send_header('Access-Control-Allow-Origin', '*') 285 | self.end_headers() 286 | 287 | if self.command != 'HEAD' and body: 288 | self.wfile.write(body) 289 | 290 | 291 | def exposed(func): 292 | def wrapped(*args, **kwargs): 293 | result = func(*args, **kwargs) 294 | return result 295 | 296 | wrapped.exposed = True 297 | return wrapped 298 | -------------------------------------------------------------------------------- /python/SmartDict.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import dateutil.parser # pip py-dateutil 3 | from json import dumps, loads 4 | from .Errors import ToBeImplementedException, EncryptionException 5 | """ 6 | from Dweb import Dweb 7 | from Transportable import Transportable 8 | """ 9 | 10 | # THIS FILE IS COPIED FROM THE OLD DWEB repo IT IS NOT TESTED FULLY, SO BIG CHUNKS ARE COMMENTED OUT 11 | # - ONLY PARTS NEEDED FOR KEYPAIR ARE BACKPORTED FROM JS AND UNCOMMENTED 12 | 13 | class SmartDict(object): 14 | 15 | """ 16 | Stores a data structure, usually a single layer Javascript dictionary object. 17 | SmartDict is intended to support the mechanics of storage and retrieval while being subclassed to implement functionality 18 | that understands what the data means. 19 | 20 | By default any fields not starting with “_” will be stored, and any object will be converted into its url. 21 | 22 | The hooks for encrypting and decrypting data are at this level, depending on the _acl field, but are implemented by code in KeyPair. 23 | 24 | _acl If set (on master) defines storage as encrypted 25 | """ 26 | table = "sd" 27 | 28 | def __init__(self, data=None, verbose=False, **options): 29 | """ 30 | Creates and initialize a new SmartDict. 31 | 32 | :param data: String|Object, If a string (typically JSON), then parse first. 33 | A object with attributes to set on SmartDict via _setdata 34 | :param options: Passed to _setproperties, by default overrides attributes set by data 35 | """ 36 | # COPIED BACK FROM JS 2018-07-02 37 | self._urls = [] # Empty URLs - will be loaded by SmartDict.p_fetch if loading from an URL 38 | self._setdata(data) # The data being stored - note _setdata usually subclassed does not store or set _url 39 | self._setproperties(options) # Note this will override any properties set with data #TODO-SMARTDICT need this 40 | 41 | def __str__(self): 42 | return self.__class__.__name__+"("+str(self.__dict__)+")" 43 | 44 | def __repr__(self): 45 | return repr(self.__dict__) 46 | 47 | # Allow access to arbitrary attributes, allows chaining e.g. 
xx.data.len = foo 48 | def __setattr__(self, name, value): 49 | # THis code was running self.dirty() - problem is that it clears url during loading from the dWeb 50 | if name[0] != "_": 51 | if "date" in name and isinstance(value,basestring): 52 | value = dateutil.parser.parse(value) 53 | return super(SmartDict, self).__setattr__(name, value) # Calls any property esp _data 54 | 55 | def _setproperties(self, options): # Call chain is ... onloaded or constructor > _setdata > _setproperties > __setattr__ 56 | # Checked against JS 20180703 57 | for k in options: 58 | self.__setattr__(k, options[k]) 59 | 60 | def __getattr__(self, name): # Need this in Python while JS supports foo._url 61 | return self.__dict__.get(name) 62 | 63 | """ 64 | 65 | def preflight(self, dd): 66 | "-"-" 67 | Default handler for preflight, strips attributes starting “_” and stores and converts objects to urls. 68 | Subclassed in AccessControlList and KeyPair to avoid storing private keys. 69 | :param dd: dictionary to convert.. 70 | :return: converted dictionary 71 | "-"-" 72 | res = { 73 | k: dd[k].store()._url if isinstance(dd[k], Transportable) else dd[k] 74 | for k in dd 75 | if k[0] != '_' 76 | } 77 | res["table"] = res.get("table",self.table) # Assumes if used table as a field, that not relying on it being the table for loading 78 | assert res["table"] 79 | return res 80 | 81 | def _getdata(self): 82 | "-"-" 83 | Prepares data for sending. Retrieves attributes, runs through preflight. 84 | If there is an _acl field then it passes data through it for encrypting (see AccessControl library) 85 | Exception: UnicodeDecodeError - if its binary 86 | :return: String suitable for rawstore 87 | "-"-" 88 | try: 89 | res = self.transport().dumps(self.preflight(self.__dict__.copy())) # Should call self.dumps below { k:self.__dict__[k] for k in self.__dict__ if k[0]!="_" }) 90 | except UnicodeDecodeError as e: 91 | print "Unicode error in StructuredBlock" 92 | print self.__dict__ 93 | raise e 94 | if self._acl: # Need to encrypt 95 | encdata = self._acl.encrypt(res, b64=True) 96 | dic = {"encrypted": encdata, "acl": self._acl._publicurl, "table": self.table} 97 | res = self.transport().dumps(dic) 98 | return res 99 | 100 | ABOVE HERE NOT BACKPORTED FROM JS 101 | """ 102 | 103 | def _setdata(self, value): 104 | """ 105 | Stores data, subclass this if the data should be interpreted as its stored. 106 | value Object, or JSON string to load into object. 107 | """ 108 | # Note SmartDict expects value to be a dictionary, which should be the case since the HTTP requester interprets as JSON 109 | # Call chain is ... or constructor > _setdata > _setproperties > __setattr__ 110 | # COPIED BACK FROM JS 2018-07-02 111 | value = loads(value) if isinstance(value, str) else value # Will throw exception if it isn't JSON 112 | if value and ("encrypted" in value): 113 | raise EncryptionException("Should have been decrypted in fetch") 114 | self._setproperties(value); 115 | 116 | """ 117 | BELOW HERE NOT BACKPORTED FROM JS 118 | 119 | def _match(self, key, val): 120 | if key[0] == '.': 121 | return (key == '.instance' and isinstance(self, val)) 122 | else: 123 | return (val == self.__dict__[key]) 124 | 125 | def match(self, dict): 126 | "-"-" 127 | Checks if a object matches for each key:value pair in the dictionary. 128 | Any key starting with "." is treated specially esp: 129 | .instanceof: class: Checks if this is a instance of the class 130 | other fields will be supported here, any unsupported field results in a false. 
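For example (a sketch), obj.match({"table": "sd"}) is true exactly when obj.table == "sd"; see _match above for how each pair is tested.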
131 | 132 | :returns: boolean, true if matches 133 | "-"-" 134 | return all([self._match(k, dict[k]) for k in dict]) 135 | 136 | 137 | @classmethod 138 | def fetch(cls, url, verbose): 139 | "-"-" 140 | Fetches the object from Dweb, passes to decrypt in case it needs decrypting, 141 | and creates an object of the appropriate class and passes data to _setdata 142 | This should not need subclassing, (subclass _setdata or decrypt instead). 143 | 144 | :return: New object - e.g. StructuredBlock or MutableBlock 145 | :catch: TransportError - can probably, or should throw TransportError if transport fails 146 | :throws: TransportError if url invalid, Authentication Error 147 | "-"-" 148 | from letter2class.py import LetterToClass 149 | if verbose: print "SmartDict.fetch", url; 150 | data = super(SmartDict, cls).fetch(url, verbose) #Fetch the data Throws TransportError immediately if url invalid, expect it to catch if Transport fails 151 | data = Dweb.transport(url).loads(data) # Parse JSON //TODO-REL3 maybe function in Transportable 152 | table = data.table # Find the class it belongs to 153 | cls = LetterToClass[table] # Gets class name, then looks up in Dweb - avoids dependency 154 | if not cls: 155 | raise ToBeImplementedException("SmartDict.fetch: "+table+" isnt implemented in table2class") 156 | if not isinstance(Dweb.table2class[table], cls): 157 | raise ForbiddenException("Avoiding data driven hacks to other classes - seeing "+table); 158 | data = cls.decrypt(data, verbose) # decrypt - may return string or obj , note it can be suclassed for different encryption 159 | data["_url"] = url; # Save where we got it - preempts a store - must do this afer decrypt 160 | return cls(data) 161 | 162 | @classmethod 163 | def decrypt(data, verbose): 164 | "-"-" 165 | This is a hook to an upper layer for decrypting data, if the layer isn't there then the data wont be decrypted. 
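(Per _getdata above, an encrypted object is stored as {"encrypted": ..., "acl": <public url of the ACL>, "table": ...}; this hook is what is expected to reverse that.)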
166 | Chain is SD.fetch > SD.decryptdata > ACL|KC.decrypt, then SD.setdata 167 | 168 | :param data: possibly encrypted object produced from json stored on Dweb 169 | :return: same object if not encrypted, or decrypted version 170 | "-"-" 171 | return AccessControlList.decryptdata(data, verbose) 172 | 173 | def dumps(self): # Called by json_default, but preflight() is used in most scenarios rather than this 174 | 1/0 # DOnt believe this is used 175 | return {k: self.__dict__[k] for k in self.__dict__ if k[0] != "_"} # Serialize the dict, excluding _xyz 176 | 177 | def copy(self): 178 | return self.__class__(self.__dict__.copy()) 179 | """ -------------------------------------------------------------------------------- /python/Transport.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | from datetime import datetime 3 | import logging 4 | 5 | from .miscutils import dumps, loads 6 | from urllib.parse import urlparse 7 | from .Errors import ToBeImplementedException, MyBaseException, IntentionallyUnimplementedException 8 | 9 | class TransportBlockNotFound(MyBaseException): 10 | httperror = 404 11 | msg = "{url} not found" 12 | 13 | class TransportURLNotFound(MyBaseException): 14 | httperror = 404 15 | msg = "{url}, {options} not found" 16 | 17 | class TransportFileNotFound(MyBaseException): 18 | httperror = 404 19 | msg = "{file} not found" 20 | 21 | class TransportPathNotFound(MyBaseException): 22 | httperror = 404 23 | msg = "{path} not found for obj {url}" 24 | 25 | class TransportUnrecognizedCommand(MyBaseException): 26 | httperror = 500 27 | msg = "Class {classname} doesnt have a command {command}" 28 | 29 | 30 | class Transport(object): 31 | """ 32 | Setup the resource and open any P2P connections etc required to be done just once. 33 | In almost all cases this will call the constructor of the subclass 34 | Should return a new Promise that resolves to a instance of the subclass 35 | 36 | :param obj transportoptions: Data structure required by underlying transport layer (format determined by that layer) 37 | :param boolean verbose: True for debugging output 38 | :param options: Data structure stored on the .options field of the instance returned. 39 | :resolve Transport: Instance of subclass of Transport 40 | """ 41 | 42 | def __init__(self, options, verbose): 43 | """ 44 | :param options: 45 | """ 46 | raise ToBeImplementedException(name=cls.__name__+".__init__") 47 | 48 | @classmethod 49 | def setup(cls, options, verbose): 50 | """ 51 | Called to deliver a transport instance of a particular class 52 | 53 | :param options: Options to subclasses init method 54 | :return: None 55 | """ 56 | raise ToBeImplementedException(name=cls.__name__+".setup") 57 | 58 | 59 | def _lettertoclass(self, abbrev): 60 | #TODO-BACKPORTING - check if really needed after finish port (was needed on server) 61 | from letter2class import LetterToClass 62 | return LetterToClass.get(abbrev, None) 63 | 64 | def supports(self, url, func=None): #TODO-API 65 | """ 66 | Determine if this transport supports a certain set of URLs 67 | 68 | :param url: String or parsed URL 69 | :param func: Function being attempted on url 70 | :return: True if this protocol supports these URLs 71 | """ 72 | if not url: return True # Can handle default URLs 73 | if isinstance(url, basestring): 74 | url = urlparse(url) # For efficiency, only parse once. 
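        # Illustrative example (not from this repo): urlparse("ipfs:/ipfs/Qm...").scheme == "ipfs",
        # so a transport whose urlschemes contains "ipfs" - and whose supportFunctions contains
        # func, if one was given - reports support via the scheme test below.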
75 | if not url.scheme: raise CodingException(message="url passed with no scheme (part before :): "+url) 76 | return (url.scheme in self.urlschemes) and (not func or func in self.supportFunctions) #Lower case, NO trailing : (unlike JS) 77 | 78 | 79 | def url(self, data): 80 | """ 81 | Return an identifier for the data without storing 82 | 83 | :param string|Buffer data arbitrary data 84 | :return string valid id to retrieve data via rawfetch 85 | """ 86 | raise ToBeImplementedException(name=cls.__name__+".url") 87 | 88 | def info(self, **options): #TODO-API 89 | raise ToBeImplementedException(name=cls.__name__+".info") 90 | 91 | def rawstore(self, data=None, verbose=False, **options): 92 | raise ToBeImplementedException(name=cls.__name__+".rawstore") 93 | 94 | def store(self, command=None, cls=None, url=None, path=None, data=None, verbose=False, **options): 95 | raise ToBeImplementedException(message="Backporting - unsure if needed - match JS Dweb"); # TODO-BACKPORTING 96 | #store(command, cls, url, path, data, options) = fetch(cls, url, path, options).command(data|data._data, options) 97 | #store(url, data) 98 | if not isinstance(data, basestring): 99 | data = data._getdata() 100 | if command: 101 | # TODO not so sure about this production, document any uses here if there are any 102 | obj = self.fetch(command=None, cls=None, url=url, path=path, verbose=verbose, **options) 103 | return obj.command(data=data, verbose=False, **options) 104 | else: 105 | return self.rawstore(data=data, verbose=verbose, **options) 106 | 107 | def rawfetch(self, url=None, verbose=False, **options): 108 | """ 109 | Fetch data from a url and return as a (binary) string 110 | 111 | :param url: 112 | :param options: { ignorecache if shouldnt use any cached value (mostly in testing); 113 | :return: str 114 | """ 115 | raise ToBeImplementedException(name=cls.__name__+".rawfetch") 116 | 117 | def fetch(self, command=None, cls=None, url=None, path=None, verbose=False, **options): 118 | """ 119 | More comprehensive fetch function, can be sublassed either by the objects being fetched or the transport. 120 | Exceptions: TransportPathNotFound, TransportUnrecognizedCommand 121 | 122 | :param command: Command to be performed on the retrieved data (e.g. 
content, or size) 123 | :param cls: Class of object being returned, if None will return a str 124 | :param url: Hash of object to retrieve 125 | :param path: Path within object represented by url 126 | :param verbose: 127 | :param options: Passed to command, NOT passed to subcalls as for example mucks up sb.__init__ by dirtying - this might be reconsidered 128 | :return: 129 | """ 130 | if verbose: logging.debug("Transport.fetch command={0} cls={1} url={2} path={3} options={4}".format(command, cls, url, path, options)) 131 | #TODO-BACKPORTING see if needed after full port - hint it was used in ServerHTTP but not on client side 132 | if cls: 133 | if isinstance(cls, basestring): # Handle abbreviations for cls 134 | cls = self._lettertoclass(cls) 135 | obj = cls(url=url, verbose=verbose).fetch(verbose=verbose) 136 | # Can't pass **options to cls as disrupt sb.__init__ by causing dirty 137 | # Not passing **options to fetch, but probably could 138 | else: 139 | obj = self.rawfetch(url, verbose=verbose) # Not passing **options, probably could but not used 140 | #if verbose: logging.debug("Transport.fetch obj={0}".format(obj)) 141 | if path: 142 | obj = obj.path(path, verbose=verbose) # Not passing **options as ignored, but probably could 143 | #TODO handle not found exception 144 | if not obj: 145 | raise TransportPathNotFound(path=path, url=url) 146 | if not command: 147 | return obj 148 | else: 149 | if not cls: 150 | raise TransportUnrecognizedCommand(command=command, classname="None") 151 | func = getattr(obj, command, None) 152 | if not func: 153 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__) 154 | return func(verbose=verbose, **options) 155 | 156 | def rawadd(self, url, sig, verbose=False, subdir=None, **options): 157 | raise ToBeImplementedException(name=cls.__name__+".rawadd") 158 | 159 | def add(self, urls=None, date=None, signature=None, signedby=None, verbose=False, obj=None, **options ): 160 | #TODO-BACKPORTING check if still needed after Backport - not used in JS 161 | #add(dataurl, sig, date, keyurl) 162 | if (obj and not url): 163 | url = obj._url 164 | return self.rawadd(urls=urls, date=date, signature=signature, signedby=signedby, verbose=verbose, **options) # TODO would be better to store object 165 | 166 | def rawlist(self, url=None, verbose=False, **options): 167 | raise ToBeImplementedException(name=cls.__name__+".rawlist") 168 | 169 | def list(self, command=None, cls=None, url=None, path=None, verbose=False, **options): 170 | """ 171 | 172 | :param command: if found: list.commnd(list(cls, url, path) 173 | :param cls: if found (cls(l) for l in list(url) 174 | :param url: Hash of list to look up - usually url of private key of signer 175 | :param path: Ignored for now, unclear how applies 176 | :param verbose: 177 | :param options: 178 | :return: 179 | """ 180 | raise ToBeImplementedException("Backporting - unsure if needed - match JS Dweb"); #TODO-BACKPORTING 181 | 182 | res = rawlist(url, verbose=verbose, **options) 183 | if cls: 184 | if isinstance(cls, basestring): # Handle abbreviations for cls 185 | cls = self._lettertoclass(cls) 186 | res = [ cls(l) for l in res ] 187 | if command: 188 | func = getattr(CommonList, command, None) #TODO May not work, might have to turn res into CommonList first 189 | if not func: 190 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__) 191 | res = func(res, verbose=verbose, **options) 192 | return res 193 | 194 | def rawreverse(self, url=None, verbose=False, **options): 195 | 
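        """
        Reverse counterpart of rawlist: retrieve record(s) whose lists reference this url (usually the url of signed data).
        Subclasses with a reverse index override this; the default just raises (a summary inferred from reverse() below).
        """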
raise ToBeImplementedException(name=cls.__name__+".rawreverse") 196 | 197 | 198 | def reverse(self, command=None, cls=None, url=None, path=None, verbose=False, **options): 199 | """ 200 | 201 | :param command: if found: reverse.commnd(list(cls, url, path) 202 | :param cls: if found (cls(l) for l in reverse(url) 203 | :param url: Hash of reverse to look up - usually url of data signed 204 | :param path: Ignored for now, unclear how applies 205 | :param verbose: 206 | :param options: 207 | :return: 208 | """ 209 | raise ToBeImplementedException(message="Backporting - unsure if needed - match JS Dweb"); #TODO-BACKPORTING 210 | 211 | res = rawreverse(url, verbose=verbose, **options) 212 | if cls: 213 | if isinstance(cls, basestring): # Handle abbreviations for cls 214 | cls = self._lettertoclass(cls) 215 | res = [ cls(l) for l in res ] 216 | if command: 217 | func = getattr(self, command, None) 218 | if not func: 219 | raise TransportUnrecognizedCommand(command=command, classname=cls.__name__) 220 | res = func(res, verbose=verbose, **options) 221 | return res 222 | 223 | #TODO-BACKPORT add listmonitor 224 | -------------------------------------------------------------------------------- /python/TransportHTTP.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import logging 3 | from .Transport import Transport 4 | 5 | class TransportHTTP(Transport): 6 | """ 7 | Subclass of Transport. 8 | Implements the raw primitives via an http API to a local IPFS instance 9 | Only partially complete TODO - get from old library 10 | """ 11 | 12 | # urlschemes = ['http','https'] - subclasses as can handle all 13 | supportFunctions = ['set'] 14 | 15 | def __init__(self, options=None, verbose=False): 16 | """ 17 | Create a transport object (use "setup" instead) 18 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. cant make directory) 19 | 20 | :param dir: 21 | :param options: 22 | """ 23 | self.options = options or {} 24 | pass 25 | 26 | def __repr__(self): 27 | return self.__class__.__name__ + " " + dumps(self.options) 28 | 29 | def supports(self, url, func): 30 | return (func in supportfunctions) and (url.startswith('https:') or url.startswith('http')) # Local can handle any kind of URL, since cached. 31 | 32 | def set(self, url, keyvalues, value, verbose): 33 | pass # TODO-DOMAIN complete 34 | -------------------------------------------------------------------------------- /python/TransportIPFS.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import json 3 | import logging 4 | from .miscutils import loads, dumps 5 | from .Transport import Transport 6 | from .config import config 7 | import requests # HTTP requests 8 | from .miscutils import httpget 9 | from .Errors import IPFSException 10 | 11 | 12 | class TransportIPFS(Transport): 13 | """ 14 | Subclass of Transport. 15 | Implements the raw primitives via an http API to a local IPFS instance 16 | Only partially complete 17 | """ 18 | 19 | # urlschemes = ['ipfs'] - subclasses as can handle all 20 | supportFunctions = ['store','fetch'] 21 | 22 | def __init__(self, options=None, verbose=False): 23 | """ 24 | Create a transport object (use "setup" instead) 25 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. 
cant make directory) 26 | 27 | :param dir: 28 | :param options: 29 | """ 30 | self.options = options or {} 31 | pass 32 | 33 | def __repr__(self): 34 | return self.__class__.__name__ + " " + dumps(self.options) 35 | 36 | def supports(self, url, func): 37 | return url.startswith('ipfs:') # Local can handle any kind of URL, since cached. 38 | 39 | #TODO-LOCAL - feed this back into ServerGateway.info 40 | def info(self, **options): 41 | return { "type": "ipfs", "options": self.options } 42 | 43 | def rawfetch(self, url=None, verbose=False, **options): 44 | """ 45 | Fetch a block from IPFS 46 | Exception: TransportFileNotFound if file doesnt exist 47 | #TODO-STREAM make return stream to HTTP and so on 48 | 49 | :param url: 50 | :param multihash: a Multihash structure 51 | :param options: 52 | :return: 53 | """ 54 | raise ToBeImplementedException(name="TransportIPFS.rawfetch") 55 | 56 | def pinggateway(self, ipldhash): 57 | """ 58 | Pin to gateway or JS clients wont see it TODO remove this when client relay working (waiting on IPFS) 59 | This next line is to get around bug in IPFS propogation 60 | See https://github.com/ipfs/js-ipfs/issues/1156 61 | Feb2018: Note this is waiting on a workaround by IPFS (David > Kyle > Lars ) 62 | : param ipldhash Hash of form z... or Q.... or array of ipldhash 63 | """ 64 | if isinstance(ipldhash, (list,tuple,set)): 65 | for i in ipldhash: 66 | self.pinggateway(i) 67 | headers = { "Connection": "keep-alive"} 68 | ipfsgatewayurl = "https://ipfs.io/ipfs/{}".format(ipldhash) 69 | res = requests.head(ipfsgatewayurl, headers=headers); # Going to ignore the result 70 | logging.debug("Transportipfs.pinggateway workaround for JS-IPFS issue #1156 - pin gateway for {}".format(ipfsgatewayurl)) 71 | 72 | def announcedht(self, ipldhash): 73 | """ 74 | Periodically tell URLstore to announce blocks or JS clients wont see it 75 | This next line is to get around bug in IPFS propogation 76 | : param ipldhash Hash of form z... or Q.... or array of ipldhash 77 | """ 78 | if isinstance(ipldhash, (list,tuple,set)): 79 | for i in ipldhash: 80 | self.announcedht(i) 81 | headers = { "Connection": "keep-alive"} 82 | ipfsurl = config["ipfs"]["url_dht_provide"] 83 | res = requests.get(ipfsurl, headers=headers, params={'arg': ipldhash}) # Ignoring result 84 | logging.debug("Transportipfs.announcedht for {}?arg={}".format(ipfsurl, ipldhash)) # Log whether verbose or not 85 | 86 | def rawstore(self, data=None, verbose=False, returns=None, pinggateway=True, mimetype=None, **options): 87 | """ 88 | Store the data on IPFS 89 | Exception: TransportFileNotFound if file doesnt exist 90 | 91 | :param data: opaque data to store (currently must be bytes, not str) 92 | :param returns: Comma separated string if want result as a dict, support "url","contenthash" 93 | :raises: IPFSException if cant reach server 94 | :return: url of data e.g. 
ipfs:/ipfs/Qm123abc 95 | """ 96 | assert (not returns), 'Not supporting "returns" parameter to TransportIPFS.store at this point' 97 | ipfsurl = config["ipfs"]["url_add_data"] 98 | if verbose: logging.debug("Posting IPFS to {0}".format(ipfsurl)) 99 | headers = { "Connection": "keep-alive"} 100 | try: 101 | res = requests.post(ipfsurl, headers=headers, params={ 'trickle': 'true', 'pin': 'true'}, files={'file': ('', data, mimetype)}).json() 102 | #except ConnectionError as e: # TODO - for some reason this never catches even though it reports "ConnectionError" as the class 103 | except requests.exceptions.ConnectionError as e: # Alternative - too broad a catch but not expecting other errors 104 | pass 105 | raise IPFSException(message="Unable to post to local IPFS at {} it is probably not running or wedged".format(ipfsurl)) 106 | logging.debug("IPFS result={}".format(res)) 107 | ipldhash = res['Hash'] 108 | if pinggateway: 109 | self.pinggateway(ipldhash) 110 | return "ipfs:/ipfs/{}".format(ipldhash) 111 | 112 | def store(self, data=None, urlfrom=None, verbose=False, mimetype=None, pinggateway=True, returns=None, **options): 113 | """ 114 | Higher level store semantics 115 | 116 | :param data: 117 | :param urlfrom: URL to fetch from for storage, allows optimisation (e.g. pass it a stream) or mapping in transport 118 | :param verbose: 119 | :param pinggateway: True (default) to ping ipfs.io so that it knows where to find, (alternative is to allow browser to ping it on failure to retrieve) 120 | :param mimetype: 121 | :param options: 122 | :raises: IPFSException if cant reach server or doesnt return JSON 123 | :return: 124 | """ 125 | assert (not returns), 'Not supporting "returns" parameter to TransportIPFS.store at this point' 126 | try: 127 | headers = { "Connection": "keep-alive"} 128 | if urlfrom and config["ipfs"].get("url_urlstore"): # On a machine with urlstore and passed a url 129 | ipfsurl = config["ipfs"]["url_urlstore"] 130 | res = requests.get(ipfsurl, headers=headers, params={'arg': urlfrom, 'trickle': 'true', 'nocopy': 'true', 'cid-version':"1"}).json() 131 | ipldhash = res['Key'] 132 | # Now pin to gateway or JS clients wont see it TODO remove this when client relay working (waiting on IPFS) 133 | # This next line is to get around bug in IPFS propogation 134 | # See https://github.com/ipfs/js-ipfs/issues/1156 135 | if pinggateway: 136 | self.pinggateway(ipldhash) 137 | url = "ipfs:/ipfs/{}".format(ipldhash) 138 | else: # Need to store via "add" 139 | if not data or not mimetype and urlfrom: 140 | (data, mimetype) = httpget(urlfrom, wantmime=True) # This is a fetch from somewhere else before putting to gateway 141 | if not isinstance(data, (str,bytes)): # We've got data, but if its an object turn into JSON, (example is name/archiveid which passes metadata) 142 | data = dumps(data) 143 | url = self.rawstore(data=data, verbose=verbose, returns=returns, mimetype=mimetype, pinggateway=pinggateway, **options) # IPFSException if down 144 | return url 145 | except (KeyError) as e: 146 | raise IPFSException(message="Bad format back from IPFS - no key field" + json.dumps(res)) 147 | except (json.decoder.JSONDecodeError) as e: 148 | raise IPFSException(message="Bad format back from IPFS - not JSON;"+str(e)) 149 | except (requests.exceptions.ConnectionError) as e: 150 | raise IPFSException(message="IPFS refused connection;"+str(e)) 151 | 152 | -------------------------------------------------------------------------------- /python/TransportLocal.py: 
-------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import logging 3 | #from sys import version as python_version 4 | #if python_version.startswith('3'): 5 | # from urllib.parse import urlparse 6 | #else: 7 | # from urlparse import urlparse # See https://docs.python.org/2/library/urlparse.html 8 | import os # For isdir and exists 9 | 10 | # Neither of these are used in the Gateway which could be extended 11 | #from Transport import Transport 12 | #from Dweb import Dweb 13 | from .Errors import TransportFileNotFound 14 | from .Multihash import Multihash 15 | from .miscutils import loads, dumps 16 | from .Transport import Transport 17 | 18 | 19 | class TransportLocal(Transport): 20 | """ 21 | Subclass of Transport. 22 | Implements the raw primitives as reads and writes of file system. 23 | """ 24 | 25 | # urlschemes = ['http'] - subclasses as can handle all 26 | 27 | 28 | def __init__(self, options, verbose): 29 | """ 30 | Create a transport object (use "setup" instead) 31 | |Exceptions: TransportFileNotFound if dir invalid, IOError other OS error (e.g. cant make directory) 32 | 33 | :param dir: 34 | :param options: 35 | """ 36 | subdirs = "list", "reverse", "block" 37 | dir = options["local"]["dir"] 38 | if not os.path.isdir(dir): 39 | os.mkdir(dir) 40 | self.dir = dir 41 | for table in subdirs: 42 | dirname = "%s/%s" % (self.dir, table) 43 | if not os.path.isdir(dirname): 44 | os.mkdir(dirname) 45 | self.options = options 46 | 47 | def __repr__(self): 48 | return self.__class__.__name__ + " " + dumps(self.options) 49 | 50 | @classmethod 51 | def OBSsetup(cls, options, verbose): #TODO-LOCAL maybe not needed 52 | """ 53 | Setup local transport to use dir 54 | Exceptions: TransportFileNotFound if dir invalid 55 | 56 | :param dir: Directory to use for storage 57 | :param options: Unused currently 58 | """ 59 | t = cls(options, verbose) 60 | Dweb.transports["local"] = t 61 | Dweb.transportpriority.append(t) 62 | return t 63 | 64 | #see other !ADD-TRANSPORT-COMMAND - add a function copying the format below 65 | 66 | def supports(self, url, func): 67 | return True # Local can handle any kind of URL, since cached. 68 | 69 | #TODO-LOCAL - feed this back into ServerGateway.info 70 | def info(self, **options): 71 | return { "type": "local", "options": self.options } 72 | 73 | def _filename(self, subdir, multihash=None, verbose=False, **options): 74 | # Utility function to get filename to use for storage 75 | return "%s/%s/%s" % (self.dir, subdir, multihash.multihash58) 76 | 77 | def _tablefilename(self, database, table, subdir="table", createdatabase=False): 78 | # Utility function to get filename to use for storage 79 | dir = "{}/{}/{}".format(self.dir, subdir, database) 80 | if createdatabase and not os.path.isdir(dir): 81 | os.mkdir(dir) 82 | return "{}/{}".format(dir, table) 83 | 84 | def url(self, data=None, multihash=None): 85 | """ 86 | Return an identifier for the data without storing 87 | 88 | :param data string|Buffer data arbitrary data 89 | :param multihash string of form Q... 
90 | :return string valid id to retrieve data via rawfetch
91 | """
92 | 
93 | if data:
94 | multihash = Multihash(data=data, code=Multihash.SHA2_256)
95 | return "local:/rawfetch/{0}".format(multihash.multihash58 if isinstance(multihash, Multihash) else multihash)
96 | 
97 | def rawfetch(self, url=None, multihash=None, verbose=False, **options):
98 | """
99 | Fetch a block from the local file system
100 | Exception: TransportFileNotFound if file doesn't exist
101 | #TODO-STREAM make return stream to HTTP and so on
102 | 
103 | :param url: Of form somescheme:/something/hash
104 | :param multihash: a Multihash structure
105 | :param options:
106 | :return:
107 | """
108 | multihash = multihash or Multihash(url=url)
109 | filename = self._filename("block", multihash)
110 | try:
111 | if verbose: logging.debug("Opening {0}".format(filename))
112 | with open(filename, 'rb') as file:
113 | content = file.read()
114 | if verbose: logging.debug("Opened")
115 | return content
116 | except (IOError, FileNotFoundError) as e:
117 | logging.debug("TransportLocal.rawfetch err={}".format(e))
118 | raise TransportFileNotFound(file=filename)
119 | 
120 | def _rawlistreverse(self, filename=None, verbose=False, **options):
121 | """
122 | Retrieve record(s) matching a url (usually the url of a key), in this case from a local directory
123 | Exception: IOError if file doesn't exist
124 | 
125 | :param filename: File holding the list or reverse table (one JSON record per line)
126 | :return: list of dictionaries for each item retrieved
127 | """
128 | try:
129 | f = open(filename, 'rb')
130 | s = [ loads(line) for line in f.readlines() ]
131 | f.close()
132 | return s
133 | except IOError as e:
134 | return []
135 | # Trying commenting out the error, and returning an empty array
136 | #raise TransportFileNotFound(file=filename)
137 | 
138 | def rawlist(self, url, verbose=False, **options):
139 | """
140 | Retrieve record(s) matching a url (usually the url of a key), in this case from a local directory
141 | Exception: IOError if file doesn't exist
142 | 
143 | :param url: URL to be retrieved
144 | :return: list of dictionaries for each item retrieved
145 | """
146 | if verbose: logging.debug("TransportLocal:rawlist {0}".format(url))
147 | filename = self._filename("list", multihash= Multihash(url=url), verbose=verbose, **options)
148 | return self._rawlistreverse(filename=filename, verbose=False, **options)
149 | 
150 | 
151 | def rawreverse(self, url, verbose=False, **options):
152 | 
153 | """
154 | Retrieve record(s) matching a url (usually the url of the data signed), in this case from a local directory
155 | Exception: IOError if file doesn't exist
156 | 
157 | :param url: Hash in table to be retrieved, or url ending in that hash
158 | :return: list of dictionaries for each item retrieved
159 | """
160 | filename = self._filename("reverse", multihash= Multihash(url=url), verbose=verbose, **options)
161 | return self._rawlistreverse(filename=filename, verbose=False, **options)
162 | 
163 | def rawstore(self, data=None, verbose=False, returns=None, **options):
164 | """
165 | Store the data locally
166 | Exception: TransportFileNotFound if file doesn't exist
167 | 
168 | :param data: opaque data to store (currently must be bytes, not str)
169 | :param returns: Comma-separated string if the result is wanted as a dict; supports "url","contenthash"
170 | :return: url of data
171 | """
172 | assert data is not None # It's meaningless (or at least I think so) to store None (an empty string is meaningful) #TODO-LOCAL move assert to CodingException
173 | contenthash=Multihash(data=data,
code=Multihash.SHA2_256) 174 | filename = self._filename("block", multihash=contenthash, verbose=verbose, **options) 175 | try: 176 | f = open(filename, 'wb') 177 | f.write(data) 178 | f.close() 179 | except IOError as e: 180 | raise TransportFileNotFound(file=filename) 181 | url = self.url(multihash=contenthash) 182 | if returns: 183 | returns = returns.split(',') 184 | return { k: url if k=="url" else contenthash if k=="contenthash" else "ERROR" for k in returns } 185 | else: 186 | return url 187 | 188 | 189 | def _rawadd(self, filename, value): 190 | try: 191 | with open(filename, 'ab') as f: 192 | f.write(value) 193 | except IOError as e: 194 | raise TransportFileNotFound(file=filename) 195 | 196 | def rawadd(self, url, sig, verbose=False, subdir=None, **options): 197 | """ 198 | Store a signature in a pair of DHTs 199 | Exception: IOError if file doesnt exist 200 | 201 | :param url: List to store on 202 | :param Signature sig: including { date, signature, signedby, urls} 203 | :param subdir: Can select list or reverse to store only one or both halfs of the list. This is used in TransportDistPeer as the two halfs are stored in diffrent parts of the DHT 204 | :param verbose: 205 | :param options: 206 | :return: 207 | """ 208 | subdir = subdir or ("list","reverse") # By default store forward and backwards 209 | if verbose: logging.debug("TransportLocal.rawadd {0} {1} subdir={2} options={3}" 210 | .format(url, sig, subdir, options)) 211 | value = dumps(sig) + "\n" #Note this is a compact dump 212 | value = value.encode('utf-8') 213 | if "list" in subdir: 214 | self._rawadd( 215 | self._filename("list", multihash= Multihash(url=url), verbose=verbose, **options), # List of things signedby 216 | value) 217 | """ 218 | # Reverse removed for now, as not used, and causes revision issues with Multi. 
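        # (When enabled, this appended the same JSON line under reverse/<multihash58(u)> for each url u,
        #  giving the back-pointer half of the list/reverse pair described in the rawadd docstring.)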
219 | if "reverse" in subdir: 220 | if not isinstance(urls, (list, tuple, set)): 221 | urls = [urls] 222 | for u in urls: 223 | self._rawadd( 224 | self._filename("reverse", multihash= Multihash(url=u), verbose=verbose, **options), # Lists that this object is on 225 | value) 226 | """ 227 | 228 | def set(self, url=None, database=None, table=None, keyvaluelist=None, keyvalues=None, value=None, verbose=False): 229 | #Add keyvalues to a table, note it doesnt delete existing keys and values, just writes to end 230 | #Each line is a seperate keyvalue pair since each needs to be signed so that they can be verified by recipients who might only read one key 231 | # Note url & keyvalues or keyvalues|value are not supported yet 232 | filename = self._tablefilename(database, table, createdatabase=True) 233 | #TODO-KEYVALUE check and store sig which has to be on each keyvalue, not on entire set 234 | #TODO-KEYVALUE encode string in value for storing in quoted string 235 | appendable = "".join([ dumps(kv)+"\n" for kv in keyvaluelist ]).encode('utf-8') # Essentially jSON for array but without enclosing [ ] 236 | self._rawadd(filename, appendable) 237 | 238 | def get(self, url=None, database=None, table=None, keys=None, verbose=False): 239 | #Add keyvalues to a table, note it doesnt delete existing keys and values, just writes to end 240 | filename = self._tablefilename(database, table, createdatabase=True) 241 | resarr = self._rawlistreverse(filename=filename, verbose=False) # [ {key:k1, value:v1} {key:k2, value:v2}, {key:k1, value:v3}] 242 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set 243 | resdict = { kv["key"]: kv.get("value") for kv in resarr if kv["key"] in keys } # {k1:v3, k2:v2} - replaces earlier with later values for same key 244 | return resdict 245 | 246 | def delete(self, url=None, database=None, table=None, keys=None, verbose=False): 247 | # Add keyvalues to a table, note it doesnt delete existing keys and values, just writes to end 248 | filename = self._tablefilename(database, table) 249 | # TODO-KEYVALUE check and store sig which has to be on each keyvalue, not on entire set 250 | appendable = ("\n".join([dumps({"key": key}) for key in keys]) + "\n").encode('utf-8') 251 | self._rawadd(filename, appendable) 252 | 253 | def keys(self, url=None, database=None, table=None, verbose=False): 254 | # Add keyvalues to a table, note it doesnt delete existing keys and values, just writes to end 255 | filename = self._tablefilename(database, table, createdatabase=True) 256 | resarr = self._rawlistreverse(filename=filename, verbose=False) # [ {key:k1, value:v1} {key:k2, value:v2}, {key:k1, value:v3}] 257 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set 258 | resdict = { kv["key"]: kv.get("value") for kv in resarr } # {k1:v3, k2:v2} - replaces earlier with later values for same key 259 | return list(resdict.keys()) # keys() returns a dict_keys object, want to return a list 260 | 261 | def getall(self, url=None, database=None, table=None, verbose=False): 262 | # Add keyvalues to a table, note it doesnt delete existing keys and values, just writes to end 263 | filename = self._tablefilename(database, table, createdatabase=True) 264 | #logging.debug("XXX@getal {}".format(filename)) 265 | resarr = self._rawlistreverse(filename=filename, verbose=False) # [ {key:k1, value:v1} {key:k2, value:v2}, {key:k1, value:v3}] 266 | #TODO-KEYVALUE check sig which has to be on each keyvalue, not on entire set 267 | resdict = { kv["key"]: kv.get("value") for 
269 | 270 | 271 | -------------------------------------------------------------------------------- /python/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/python/__init__.py -------------------------------------------------------------------------------- /python/config.py: -------------------------------------------------------------------------------- 1 | # encoding: utf-8 2 | import socket 3 | import logging 4 | import urllib.parse 5 | 6 | 7 | config = { 8 | "archive": { 9 | "url_download": "https://archive.org/download/", # TODO-PERMS - usage checked 10 | "url_servicesimg": "https://archive.org/services/img/", 11 | "url_metadata": "https://archive.org/metadata/", 12 | "url_btihsearch": 'https://archive.org/advancedsearch.php?fl=identifier,btih&output=json&rows=1&q=btih:', 13 | "url_sha1search": "http://archive.org/services/dwhf.php?key=sha1&val=", 14 | }, 15 | "ipfs": { 16 | "url_add_data": "http://localhost:5001/api/v0/add", # For use on the gateway, or if running "ipfs daemon" on a test machine 17 | # "url_add_data": "https://ipfs.dweb.me/api/v0/add", # note Kyle was using localhost:5001/api/v0/add which won't resolve externally. 18 | # "url_add_url": "http://localhost:5001/api/v0/add", #TODO-IPFS move uses of url_add_data to urladd when it's working 19 | "url_urlstore": "http://localhost:5001/api/v0/urlstore/add", # Should have "ipfs daemon" running locally 20 | "url_dht_provide": "http://localhost:5001/api/v0/dht/provide", 21 | }, 22 | "gateway": { 23 | "url_metadata": "https://dweb.me/arc/archive.org/metadata/", 24 | "url_download": "https://dweb.me/arc/archive.org/download/", # TODO-PERMS usage checked 25 | "url_servicesimg": "https://dweb.me/arc/archive.org/thumbnail/", 26 | "url_torrent": "https://dweb.me/arc/archive.org/torrent/", #TODO-PERMS CHECK USAGE 27 | }, 28 | "httpserver": { # Configuration used by generic HTTP server 29 | "favicon_url": "https://dweb.me/favicon.ico", 30 | "root_path": "info", 31 | }, 32 | "domains": { 33 | # This is also the name of a directory in /usr/local/dweb-gateway/.cache/table; if you change this you can safely rename that directory to the new name to retain saved metadata 34 | "metadataverifykey": 'NACL VERIFY:h9MB6YOnYEgby-ZRkFKzY3rPDGzzGZ8piGNwi9ltBf0=', 35 | "metadatapassphrase": "Replace this with something secret/arc/archive.org/metadata", # TODO - change for something secret!
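# (Illustrative note, derived from scripts/reset_ipfs.sh rather than this file:) the leaf cache for this verify key
# lives at config["domains"]["directory"] + config["domains"]["metadataverifykey"], i.e.
#   /usr/local/dweb-gateway/.cache/table/NACL VERIFY:h9MB6YOnYEgby-ZRkFKzY3rPDGzzGZ8piGNwi9ltBf0=/
# with the leafs kept in a "domain" file under that directory.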
36 | "directory": '/usr/local/dweb-gateway/.cache/table/', # Used by maintenance note overridden below for mitraglass (mitra's laptop) 37 | }, 38 | "directories": { 39 | "bootloader": "/usr/local/dweb-archive/dist/bootloader.html", # Location of bootloader file, note overridden below for mitraglass (mitra's laptop) 40 | }, 41 | "logging": { 42 | "level": logging.DEBUG, 43 | # "filename": '/var/log/dweb/dweb-gateway', # Use stdout for logging and redirect in supervisorctl 44 | }, 45 | "ignoreurls": [ # Ignore these, they are hacks or similar 46 | urllib.parse.unquote("%E2%80%9D"), 47 | ".well-known", 48 | "clientaccesspolicy.xml", 49 | "db", 50 | "index.php", 51 | "mysqladmin", 52 | "login.cgi", 53 | "robots.txt", #Not a hack, but we dont have one TODO 54 | "phpmyadmin", 55 | "phpMyAdminold", 56 | "phpMyAdmin.old", 57 | "phpmyadmin-old", 58 | "phpMyadmin_bak", 59 | "phpMyAdmin", 60 | "phpma" 61 | "phpmyadmin0", 62 | "phpmyadmin1", 63 | "phpmyadmin2", 64 | "pma", 65 | "PMA", 66 | "scripts", 67 | "setup.php", 68 | "sitemap.xml", 69 | "sqladmin", 70 | "tools", 71 | "typo3", 72 | "web", 73 | "www", 74 | "xampp", 75 | ], 76 | "torrent_reject_list": [ # Baked into torrentmaker at in petabox/sw/bin/ia_make_torrent.py # See Archive/inTorrent() 77 | "_archive.torrent", # Torrent file isnt in itself ! 78 | "_files.xml", 79 | "_reviews.xml", 80 | "_all.torrent", # aborted abuie torrent-izing 81 | "_64kb_mp3.zip", # old packaged streamable mp3s for etree 82 | "_256kb_mp3.zip", 83 | "_vbr_mp3.zip", 84 | "_meta.txt", # s3 upload turds 85 | "_raw_jp2.zip", # scribe nodes 86 | "_orig_cr2.tar", 87 | "_orig_jp2.tar", 88 | "_raw_jpg.tar", # could exclude scandata.zip too maybe... 89 | "_meta.xml" # Always written after the torrent so cant be in it 90 | ], 91 | "torrent_reject_collections": [ # See Archive/inTorrent() 92 | "loggedin", 93 | "georestricted" 94 | ], 95 | "have_no_sha1_list": [ 96 | "_files.xml" 97 | ] 98 | } 99 | if socket.gethostname() in ["wwwb-dev0.fnf.archive.org"]: 100 | pass 101 | elif socket.gethostname().startswith('mitraglass'): 102 | config["directories"]["bootloader"] = "/Users/mitra/git/dweb-archive/bootloader.html" 103 | config["domains"]["directory"] = "/Users/mitra/git/dweb-gateway/.cache/table/" 104 | else: 105 | # Probably on docker 106 | pass 107 | 108 | -------------------------------------------------------------------------------- /python/elastic_schema.json: -------------------------------------------------------------------------------- 1 | { 2 | "mappings": { 3 | "work": { 4 | "_all": { "enabled": true }, 5 | "properties": { 6 | "doi": { "type": "keyword" }, 7 | "title": { "type": "text", "boost": 3.0 }, 8 | "authors": { "type": "text", "boost": 2.0 }, 9 | "journal": { "type": "text" }, 10 | "date": { "type": "date" }, 11 | "publisher":{ "type": "text", "include_in_all": false }, 12 | "topic": { "type": "text", "include_in_all": false }, 13 | "media": { "type": "keyword", "include_in_all": false } 14 | } 15 | } 16 | } 17 | } 18 | -------------------------------------------------------------------------------- /python/maintenance.py: -------------------------------------------------------------------------------- 1 | import logging 2 | # This is run every 10 minutes by Cron (10 * 58 = 580 ~ 10 hours) 3 | from python.config import config 4 | import redis 5 | import base58 6 | from .HashStore import StateService 7 | from .TransportIPFS import TransportIPFS 8 | 9 | logging.basicConfig(**config["logging"]) # For server 10 | 11 | def resetipfs(removeipfs=False, reseedipfs=False, 
removemagnet=False, announcedht=False, verbose=False, fixbadurls=False): 12 | """ 13 | Loop over and "reset" ipfs 14 | :param removeipfs: If set will remove all cached pointers to IPFS - note this is part of a three stage process, see notes in cleanipfs.sh 15 | :param reseedipfs: If set we will ping the ipfs.io gateway to make sure it knows about our files; this isn't used any more 16 | :param removemagnet: Remove all cached magnet links (e.g. to add a new default tracker) 17 | :param announcedht: Announce our files to the DHT - currently run by cron regularly 18 | :param verbose: Generate verbose debugging - the code below could use more of this 19 | :param fixbadurls: Removes some historically bad URLs; this was done so it isn't needed again - just left as a modifiable stub. 20 | :return: 21 | """ 22 | knownbadhashes = [ 23 | "zb2rhhEncXjn7PnqJ16mzfeug1bqWuupQ3PnkhnWLpAaDatiZ", # audio 24 | "zb2rhiSEszTZ4YuY7GJScy6jKZTJuR97MLs7KSe2nKLHwb4A7", # texts 25 | "zb2rhk2FYVEy5VRHmaEzor7NuA936E8GGaokZFurKmUE959zx", # movies 26 | ] 27 | r = redis.StrictRedis(host="localhost", port=6379, db=0, decode_responses=True) 28 | reseeded = 0 29 | removed = 0 30 | magremoved = 0 31 | total = 0 32 | withipfs = 0 33 | withmagnet = 0 34 | announceddht = 0 35 | if announcedht: 36 | dhtround = ((int(((StateService.get("LastDHTround", verbose)) or 0)) + 1) % 58) 37 | StateService.set("LastDHTround", dhtround, verbose) 38 | dhtroundletter = base58.b58encode_int(dhtround) 39 | logging.debug("DHT round: {}".format(dhtroundletter)) 40 | for i in r.scan_iter(): 41 | total = total+1 42 | if fixbadurls: 43 | url = r.hget(i, "url") 44 | if url and url.startswith("ipfs:"): 45 | logging.debug("Would delete {} .url= {}".format(i,url)) 46 | #r.hdel(i, "url") 47 | for k in ["magnetlink"]: 48 | magnetlink = r.hget(i, k) 49 | if magnetlink: 50 | withmagnet = withmagnet + 1 51 | if removemagnet: 52 | r.hdel(i, k) 53 | magremoved = magremoved + 1 54 | 55 | for k in [ "ipldhash", "thumbnailipfs" ]: 56 | ipfs = r.hget(i, k) 57 | #print(i, ipfs) 58 | if ipfs: 59 | withipfs = withipfs + 1 60 | ipfs = ipfs.replace("ipfs:/ipfs/", "") # The hash 61 | if removeipfs or (ipfs in knownbadhashes): 62 | r.hdel(i, k) 63 | removed = removed + 1 64 | if reseedipfs: 65 | #logging.debug("Reseeding {} {}".format(i, ipfs)) # Logged in TransportIPFS 66 | TransportIPFS().pinggateway(ipfs) 67 | reseeded = reseeded + 1 68 | if announcedht: 69 | #print("Testing ipfs {} .. {} from {}".format(ipfs[6],dhtroundletter,ipfs)) 70 | if dhtroundletter == ipfs[6]: # Announce ~1/58th of the hashes each run - compare the round letter against a char far enough into the hash to be effectively random 71 | # logging.debug("Announcing {} {}".format(i, ipfs)) # Logged in TransportIPFS 72 | TransportIPFS().announcedht(ipfs) 73 | announceddht = announceddht + 1 74 | logging.debug("Scanned {}, withipfs {}, deleted {}, reseeded {}, announced {}, magremoved {}".format(total, withipfs, removed, reseeded, announceddht, magremoved)) 75 | 76 | # To announce DHT under cron 77 | #logging.basicConfig(**config["logging"]) # For server 78 | #resetipfs(announcedht=True)
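# A hedged sketch of the cron wiring implied above (the 10-minute cadence comes from the header comment; the exact
# crontab line is an assumption, not taken from this repo - see also cron_ipfs.py in the repo root):
#   */10 * * * * cd /usr/local/dweb-gateway && python3 -c "from python.maintenance import resetipfs; resetipfs(announcedht=True)"
# Each run announces roughly 1/58th of the IPFS hashes, so a full announce cycle takes ~58 runs (~10 hours).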
79 | 80 | # To fully reset IPFS need to also ... 81 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where the leafs are stored - these refer to IPFS hashes for metadata 82 | # Clean out the repo (Arkadiy to provide info) -------------------------------------------------------------------------------- /python/miscutils.py: -------------------------------------------------------------------------------- 1 | """ 2 | This is a place to put miscellaneous utilities, not specific to this project 3 | """ 4 | import json # Note: don't "from json import dumps" as it clashes with the dumps() defined below 5 | from datetime import datetime 6 | import requests 7 | import logging 8 | from magneturi import bencode 9 | import base64 10 | import hashlib 11 | import urllib.parse 12 | from .Errors import TransportURLNotFound, ForbiddenException 13 | from .config import config 14 | 15 | 16 | 17 | def mergeoptions(a, b): 18 | """ 19 | Deep merge options dictionaries 20 | - note this might not (yet) handle arrays correctly but handles nested dictionaries 21 | 22 | :param a,b: Dictionaries 23 | :returns: Deep copied merge of the dictionaries 24 | """ 25 | c = a.copy() 26 | for key in b: 27 | val = b[key] 28 | if isinstance(val, dict) and a.get(key, None): 29 | c[key] = mergeoptions(a[key], b[key]) 30 | else: 31 | c[key] = b[key] 32 | return c 33 | 34 | def dumps(obj): #TODO-BACKPORT FROM GATEWAY TO DWEB - moved from Transport to miscutils 35 | """ 36 | Convert arbitrary data into a JSON string that can be deterministically hashed or compared. 37 | Must be valid for loading with json.loads (unless all calls to that are changed). 38 | Exception: UnicodeDecodeError if data is binary 39 | 40 | :param obj: Any 41 | :return: JSON string that can be deterministically hashed or compared 42 | """ 43 | # ensure_ascii = False was set as otherwise, when trying to read binary content and embed it as "data" in a StructuredBlock, it complains 44 | # if it can't convert it to UTF8. (This was an example for the Wrenchicon, but loads couldn't handle the result anyway.) 45 | # sort_keys = True so that a dict is always serialised the same way and so can be hashed 46 | # separators = (,:) gets the most compact representation 47 | return json.dumps(obj, sort_keys=True, separators=(',', ':'), default=json_default) 48 | 49 | def loads(s): 50 | """ 51 | 52 | :param s: JSON string to convert 53 | :return: Python dictionary, array, string etc depending on s 54 | :raises: json.decoder.JSONDecodeError if not json 55 | """ 56 | if isinstance(s, bytes): #TODO can remove once python upgraded to 3.6.2 57 | s = s.decode('utf-8') 58 | return json.loads(s) # Will fail if s empty, or not json 59 | 60 | def json_default(obj): #TODO-BACKPORT FROM GATEWAY TO DWEB - moved from Transport to miscutils 61 | """ 62 | Default JSON serialiser, especially for handling datetime; can add handling for other special cases here 63 | 64 | :param obj: Anything json dumps can't serialize 65 | :return: string for extended types 66 | """ 67 | if isinstance(obj, datetime): # Using isinstance rather than hasattr because __getattr__ always returns True 68 | #if hasattr(obj,"isoformat"): # Especially for datetime 69 | return obj.isoformat() 70 | try: 71 | return obj.dumps() # See if the object has its own dumps 72 | except Exception as e: 73 | raise TypeError("Type {0} not serializable".format(obj.__class__.__name__)) from e 74 | 75 |
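# Quick illustrative examples for the helpers above (not in the original file; both follow directly from the definitions):
#   mergeoptions({"a": {"x": 1}}, {"a": {"y": 2}})   # -> {"a": {"x": 1, "y": 2}} - nested dicts are merged rather than overwritten
#   dumps({"b": 1, "a": 2}) == dumps({"a": 2, "b": 1}) == '{"a":2,"b":1}'   # sorted keys + compact separators give deterministic, hashable output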
76 | def httpget(url, wantmime=False, range=None): 77 | # Returns the content - i.e. bytes 78 | # Raises TransportFileNotFound or HTTPError - TODO the latter error should be caught 79 | #TODO-STREAMS future work to return a stream 80 | #TODO-PERMS should ideally check perms here, or pass flag to make it check or similar 81 | #TODO-PERMS should also pass the X-ORIGINATING-IP (?) header, but need to figure out how to get that. 82 | r = None # So that if exception in get, r is still defined and can be tested for None 83 | try: 84 | logging.debug("GET {} {}".format(url, range if range else "")) 85 | headers = { "Connection": "keep-alive"} 86 | if range: headers["range"] = range 87 | r = requests.get(url, headers=headers) 88 | r.raise_for_status() 89 | if not r.encoding or ("application/pdf" in r.headers.get('content-type')) or ("image/" in r.headers.get('content-type')): 90 | data = r.content # Should work for PDF or other binary types 91 | else: 92 | data = r.text 93 | if wantmime: 94 | return data, r.headers.get('content-type') 95 | else: 96 | return data 97 | #TODO-STREAM support streams in future 98 | 99 | except (requests.exceptions.RequestException, requests.exceptions.HTTPError, requests.exceptions.InvalidSchema) as e: 100 | if r is not None and (r.status_code == 404): 101 | raise TransportURLNotFound(url=url) 102 | elif r is not None and (r.status_code == 403): 103 | raise ForbiddenException(what=e) 104 | else: 105 | logging.error("HTTP request failed err={}".format(e)) 106 | raise e 107 | except requests.exceptions.MissingSchema as e: 108 | logging.error("HTTP request failed", exc_info=True) 109 | raise e # For now just raise it 110 | 111 | -------------------------------------------------------------------------------- /python/requirements.txt: -------------------------------------------------------------------------------- 1 | sha3 2 | redis 3 | requests 4 | #multihash - no longer used and it's buggy anyway 5 | py-dateutil 6 | base58 7 | pynacl # - pynacl is needed, but requires sudo - uncomment and run as sudo once. 8 | pyblake2 9 | # For Brian's search, maybe not in production 10 | flask 11 | #binascii - built in 12 | #hashlib - built in I think, and fails to install with pip3 13 | #struct - built in 14 | magneturi # To decode magnet files 15 | bencode # To decode Bittorrent's binary encoding 16 | 17 | -------------------------------------------------------------------------------- /python/test/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/internetarchive/dweb-gateway/a32b67dba701da6cd79204a488b787fe15112974/python/test/__init__.py -------------------------------------------------------------------------------- /python/test/_utils.py: -------------------------------------------------------------------------------- 1 | from python.ServerGateway import DwebGatewayHTTPRequestHandler 2 | 3 | 4 | def _processurl(url, verbose=False, headers={}, **kwargs): 5 | # Simulates the HTTP server's processing - won't work for all methods 6 | args = url.split('/') 7 | method = args.pop(0) 8 | DwebGatewayHTTPRequestHandler.headers = headers # This is a kludge, put headers on class, method expects an instance.
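# e.g. for url="arc/archive.org/metadata/commute" (as exercised in test_archive.py): method="arc", namespace="archive.org",
# and the remaining path segments ["metadata", "commute"] become positional args to the handler method looked up below.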
9 | f = getattr(DwebGatewayHTTPRequestHandler, method) 10 | assert f 11 | namespace = args.pop(0) 12 | if verbose: kwargs["verbose"] = True 13 | res = f(DwebGatewayHTTPRequestHandler, namespace, *args, **kwargs) 14 | return res 15 | -------------------------------------------------------------------------------- /python/test/test_LocationService.py: -------------------------------------------------------------------------------- 1 | from python.HashStore import HashStore, LocationService 2 | 3 | MULTIHASH = "testmultihash" 4 | FIELD = "testfield" 5 | VALUE = "testvalue" 6 | 7 | 8 | def test_hash_store(): 9 | HashStore.hash_set(MULTIHASH, FIELD, VALUE) 10 | assert HashStore.hash_get(MULTIHASH, FIELD) == VALUE 11 | 12 | def test_location_service(): 13 | LocationService.set(MULTIHASH, VALUE) 14 | LocationService.get(MULTIHASH) 15 | -------------------------------------------------------------------------------- /python/test/test_archive.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from datetime import datetime 3 | from ._utils import _processurl 4 | from python.miscutils import dumps, loads 5 | from python.Archive import ArchiveItemNotFound 6 | from python.config import config 7 | 8 | logging.basicConfig(level=logging.DEBUG) # Log to stderr 9 | 10 | def test_archiveid(): 11 | verbose=False 12 | if verbose: logging.debug("Starting test_archiveid") 13 | itemid = "commute" 14 | btih='XCMYARDAKNWYBERJHUSQR5RJG63JX46B' 15 | magnetlink='magnet:?xt=urn:btih:XCMYARDAKNWYBERJHUSQR5RJG63JX46B&tr=http%3A%2F%2Fbt1.archive.org%3A6969%2Fannounce&tr=http%3A%2F%2Fbt2.archive.org%3A6969%2Fannounce&tr=wss%3A%2F%2Ftracker.btorrent.xyz&tr=wss%3A%2F%2Ftracker.openwebtorrent.com&tr=wss%3A%2F%2Ftracker.fastcast.nz&ws=https%3A%2F%2Fdweb.me%2Farc%2Farchive.org%2Fdownload%2F&xs=https%3A%2F%2Fdweb.me%2Farc%2Farchive.org%2Ftorrent%2Fcommute' 16 | res = _processurl("arc/archive.org/metadata/{}".format(itemid), verbose) # Simulate what the server would do with the URL 17 | 18 | if verbose: logging.debug("test_archiveid metadata returned {0}".format(res)) 19 | assert res["data"]["metadata"]["identifier"] == itemid 20 | assert res["data"]["metadata"]["magnetlink"] == magnetlink 21 | assert "ipfs:/ipfs" in res["data"]["metadata"]["thumbnaillinks"][0] 22 | assert itemid in res["data"]["metadata"]["thumbnaillinks"][1] 23 | if verbose: logging.debug("test_archiveid complete") 24 | res = _processurl("magnetlink/btih/{}".format(btih), verbose) 25 | if verbose: logging.debug("test_archiveid magnetlink returned {0}".format(res)) 26 | assert res["data"] == magnetlink 27 | 28 | def test_collectionsortorder(): 29 | verbose=True 30 | itemid="prelinger" 31 | collectionurl = "arc/archive.org/metadata/{}" 32 | res = _processurl(collectionurl.format(itemid), verbose) # Simulate what the server would do with the URL 33 | assert res["data"]["collection_sort_order"] == "-downloads" 34 | 35 | def test_leaf(): 36 | verbose=False 37 | if verbose: logging.debug("Starting test_leaf") 38 | # Test it can respond to leaf requests 39 | item = "commute" 40 | # leafurl="leaf/archiveid" OLD FORM 41 | leafurl="arc/archive.org/leaf" 42 | res = _processurl(leafurl, verbose=verbose, key=item) # Simulate what the server would do with the URL 43 | if verbose: logging.debug("{} returned {}".format(leafurl, res)) 44 | leafurl="get/table/{}/domain".format(config["domains"]["metadataverifykey"]) #TODO-ARC 45 | res = _processurl(leafurl, verbose=verbose, key=item) # Should get value cached above 46 | if 
verbose: logging.debug("{} returned {}".format(leafurl, res)) 47 | 48 | def test_archiveerrs(): 49 | verbose=True 50 | if verbose: logging.debug("Starting test_archiveerrs") 51 | itemid = "nosuchitematall" 52 | try: 53 | res = _processurl("arc/archive.org/metadata/{}".format(itemid), verbose) # Simulate what the server would do with the URL 54 | except ArchiveItemNotFound: 55 | pass # Expecting an error 56 | 57 | def test_search(): 58 | verbose=True 59 | kwargs1={ # Taken from example home page 60 | 'output': "json", 61 | 'q': "mediatype:collection AND NOT noindex:true AND NOT collection:web AND NOT identifier:fav-* AND NOT identifier:what_cd AND NOT identifier:cd AND NOT identifier:vinyl AND NOT identifier:librarygenesis AND NOT identifier:bibalex AND NOT identifier:movies AND NOT identifier:audio AND NOT identifier:texts AND NOT identifier:software AND NOT identifier:image AND NOT identifier:data AND NOT identifier:web AND NOT identifier:additional_collections AND NOT identifier:animationandcartoons AND NOT identifier:artsandmusicvideos AND NOT identifier:audio_bookspoetry AND NOT identifier:audio_foreign AND NOT identifier:audio_music AND NOT identifier:audio_news AND NOT identifier:audio_podcast AND NOT identifier:audio_religion AND NOT identifier:audio_tech AND NOT identifier:computersandtechvideos AND NOT identifier:coverartarchive AND NOT identifier:culturalandacademicfilms AND NOT identifier:ephemera AND NOT identifier:gamevideos AND NOT identifier:inlibrary AND NOT identifier:moviesandfilms AND NOT identifier:newsandpublicaffairs AND NOT identifier:ourmedia AND NOT identifier:radioprograms AND NOT identifier:samples_only AND NOT identifier:spiritualityandreligion AND NOT identifier:stream_only AND NOT identifier:television AND NOT identifier:test_collection AND NOT identifier:usgovfilms AND NOT identifier:vlogs AND NOT identifier:youth_media", 62 | 'rows': "75", 63 | 'sort[]': "-downloads", 64 | 'and[]': "" 65 | } 66 | kwargs2={ # Taken from example search 67 | 'output': "json", 68 | 'q': "prelinger", 69 | 'rows': "75", 70 | 'sort[]': "", 71 | 'and[]': "" 72 | } 73 | #res = _processurl("metadata/advancedsearch", verbose, **kwargs2) # Simulate what the server would do with the URL 74 | res = _processurl("arc/archive.org/advancedsearch", verbose, **kwargs2) # Simulate what the server would do with the URL 75 | #logging.debug("XXX@65") 76 | logging.debug(res) -------------------------------------------------------------------------------- /python/test/test_doi.py: -------------------------------------------------------------------------------- 1 | from python.Multihash import Multihash 2 | import logging 3 | from ._utils import _processurl 4 | 5 | DOIURL = "metadata/doi/10.1001/jama.2009.1064" 6 | CONTENTMULTIHASH = "5dqpnTaoMSJPpsHna58ZJHcrcJeAjW" 7 | PDF_SHA1HEX="02efe2abec13a309916c6860de5ad8a8a096fe5d" 8 | #CONTENTHASHURL = "content/contenthash/" + CONTENTMULTIHASH # OLD STYLE 9 | CONTENTHASHURL = "contenthash/" + CONTENTMULTIHASH 10 | #SHA1HEXMETADATAURL = "metadata/sha1hex/"+PDF_SHA1HEX # OLD STYLE 11 | #SHA1HEXCONTENTURL = "content/sha1hex/"+PDF_SHA1HEX # OLD STYLE 12 | SHA1HEXCONTENTURL = "sha1hex/" + PDF_SHA1HEX 13 | CONTENTSIZE = 262438 14 | QBF="The Quick Brown Fox" 15 | BASESTRING="A quick brown fox" 16 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww" # Sha1 of above 17 | 18 | 19 | logging.basicConfig(level=logging.DEBUG) # Log to stderr 20 | 21 | def test_doi_resolve(): 22 | verbose=False # True to debug 23 | res = _processurl(DOIURL, verbose) 24 | assert
res["Content-type"] == "application/json" 25 | #assert res["data"]["files"][0]["sha1hex"] == PDF_SHA1HEX, "Would check sha1hex, but not returning now do multihash58" 26 | assert res["data"]["files"][0]["multihash58"] == CONTENTMULTIHASH 27 | 28 | 29 | def test_contenthash_resolve(): 30 | verbose=False # True to debug 31 | res = _processurl(CONTENTHASHURL, verbose) # Simulate what the server would do with the URL 32 | assert res["Content-type"] == "application/pdf", "Check retrieved content of expected type" 33 | assert len(res["data"]) == CONTENTSIZE, "Check retrieved content of expected length" 34 | multihash = Multihash(data=res["data"], code=Multihash.SHA1) 35 | assert multihash.multihash58 == CONTENTMULTIHASH, "Check retrieved content has same multihash58_sha1 as we expect" 36 | assert multihash.sha1hex == PDF_SHA1HEX, "Check retrieved content has same hex sha1 as we expect" 37 | 38 | def test_sha1hexcontent_resolve(): 39 | verbose = False # True to debug 40 | res = _processurl(SHA1HEXCONTENTURL, verbose) # Simulate what the server would do with the URL 41 | assert res["Content-type"] == "application/pdf", "Check retrieved content of expected type" 42 | assert len(res["data"]) == CONTENTSIZE, "Check retrieved content of expected length" 43 | multihash = Multihash(data=res["data"], code=Multihash.SHA1) 44 | assert multihash.multihash58 == CONTENTMULTIHASH, "Check retrieved content has same multihash58_sha1 as we expect" 45 | assert multihash.sha1hex == PDF_SHA1HEX, "Check retrieved content has same hex sha1 as we expect" 46 | 47 | def test_sha1hexmetadata_resolve(): 48 | verbose = False # True to debug 49 | res = _processurl("sha1hex/"+PDF_SHA1HEX, verbose, output="metadata") # Simulate what the server would do with the URL 50 | if verbose: logging.debug("test_sha1hexmetadata_resolve {0}".format(res)) 51 | assert res["Content-type"] == "application/json", "Check retrieved content of expected type" 52 | assert res["data"]["metadata"]["size_bytes"] == CONTENTSIZE 53 | assert res["data"]["metadata"]["multihash58"] == CONTENTMULTIHASH, "Expecting multihash58 of sha1" 54 | 55 | -------------------------------------------------------------------------------- /python/test/test_local.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from datetime import datetime 3 | from ._utils import _processurl 4 | from python.miscutils import dumps, loads 5 | 6 | logging.basicConfig(level=logging.DEBUG) # Log to stderr 7 | 8 | CONTENTMULTIHASH = "5dqpnTaoMSJPpsHna58ZJHcrcJeAjW" 9 | BASESTRING="A quick brown fox" 10 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww" # Sha1 of above 11 | 12 | 13 | def test_local(): 14 | verbose=True 15 | res = _processurl("contenthash/rawstore", verbose, data=BASESTRING.encode('utf-8')) # Simulate what the server would do with the URL #TODO-ARC 16 | if verbose: logging.debug("test_local store returned {0}".format(res)) 17 | contenthash = res["data"] 18 | res = _processurl("content/rawfetch/{0}".format(contenthash), verbose) # Simulate what the server would do with the URL #TODO-ARC 19 | if verbose: logging.debug("test_local content/rawfetch/{0} returned {1}".format(contenthash, res)) 20 | assert res["data"].decode('utf-8') == BASESTRING 21 | #res = _processurl("content/contenthash/{0}".format(contenthash), verbose) # OLD STYLE 22 | res = _processurl("contenthash/{0}".format(contenthash), verbose) 23 | if verbose: logging.debug("test_local content/contenthash/{0} returned {1}".format(contenthash, res)) 24 | 25 | def 
test_list(): 26 | verbose = True 27 | date = datetime.utcnow().isoformat() 28 | adddict = { "urls": [ CONTENTMULTIHASH ], "date": date, "signature": "XXYYYZZZ", "signedby": [ SHA1BASESTRING ], "verbose": verbose } 29 | res = _processurl("void/rawadd/"+SHA1BASESTRING, verbose, data=dumps(adddict)) #TODO-ARC 30 | if verbose: logging.debug("test_list {0}".format(res)) 31 | res = _processurl("metadata/rawlist/{0}".format(SHA1BASESTRING), verbose, data=dumps(adddict)) #TODO-ARC 32 | if verbose: logging.debug("rawlist returned {0}".format(res)) 33 | assert res["data"][-1]["date"] == date 34 | 35 | def test_keyvaluetable(): #TODO-ARC 36 | verbose=True 37 | database = "Q123456789" 38 | table = "mytesttable" 39 | res = _processurl("set/table/{}/{}".format(database, table), data=dumps([{"key": "aaa", "value": "AAA"}, {"key": "bbb", "value": "BBB"}]), verbose=verbose) 40 | res = _processurl("get/table/{}/{}".format(database, table), key="aaa", verbose=verbose) 41 | assert res["data"]["aaa"] == "AAA" 42 | res = _processurl("get/table/{}/{}".format(database, table), key=["aaa","bbb"], verbose=verbose) 43 | assert res["data"]["aaa"] == "AAA" and res["data"]["bbb"] == "BBB" 44 | res = _processurl("delete/table/{}/{}".format(database, table), key="aaa", verbose=verbose) 45 | res = _processurl("get/table/{}/{}".format(database, table), key="aaa", verbose=verbose) 46 | assert res["data"]["aaa"] is None 47 | res = _processurl("keys/table/{}/{}".format(database, table), verbose=verbose) 48 | assert len(res["data"]) == 2 49 | res = _processurl("getall/table/{}/{}".format(database, table), verbose=verbose) 50 | assert res["data"]["aaa"] is None and res["data"]["bbb"] == "BBB" 51 | -------------------------------------------------------------------------------- /python/test/test_multihash.py: -------------------------------------------------------------------------------- 1 | from python.Multihash import Multihash 2 | 3 | BASESTRING="A quick brown fox" 4 | SHA1BASESTRING="5drjPwBymU5TC4YNFK5aXXpwpFFbww" 5 | 6 | PDF_SHA1HEX="02efe2abec13a309916c6860de5ad8a8a096fe5d" 7 | PDF_MULTIHASHSHA1_58="5dqpnTaoMSJPpsHna58ZJHcrcJeAjW" 8 | 9 | def test_sha1(): 10 | assert Multihash(data=BASESTRING.encode('utf-8'), code=Multihash.SHA1).multihash58 == SHA1BASESTRING, "Check expected sha1 from encoding basestring" 11 | assert Multihash(sha1hex=PDF_SHA1HEX).multihash58 == PDF_MULTIHASHSHA1_58 12 | assert Multihash(multihash58=PDF_MULTIHASHSHA1_58).sha1hex == PDF_SHA1HEX 13 | -------------------------------------------------------------------------------- /rungate.py: -------------------------------------------------------------------------------- 1 | import logging 2 | from python.ServerGateway import DwebGatewayHTTPRequestHandler 3 | # This is just used for running tests 4 | from python.config import config 5 | logging.basicConfig(**config["logging"]) 6 | DwebGatewayHTTPRequestHandler.DwebGatewayHTTPServeForever({'ipandport': ('localhost',4244)}) # Run local gateway -------------------------------------------------------------------------------- /scripts/install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -x 4 | ARG=$1 5 | GITNAME=dweb-gateway 6 | GITDIR=/usr/local/${GITNAME} 7 | SERVICENAME="dweb:dweb-gateway" 8 | 9 | cd $GITDIR 10 | #pip install --disable-pip-version-check -U $PIPS 11 | pip3 -q install --disable-pip-version-check -U -r python/requirements.txt 12 | [ -d data ] || mkdir data 13 | # First push whatever branch we are on 14 | git status | grep 
'nothing to commit' || git commit -a -m "Changes made on server" 15 | git status | grep 'git push' && git push 16 | 17 | # Now switch to deployed branch - we'll probably be on it already 18 | git checkout deployed # Will run server branch 19 | git pull 20 | 21 | # Now merge the origin of deployable 22 | git merge origin/deployable 23 | 24 | # And commit & push the merge 25 | git status | grep 'nothing to commit' || git commit -a -m "Merged deployable into deployed on server" 26 | git status | grep 'git push' && git push 27 | 28 | if [ ! -f data/idents_files_urls.sqlite ] 29 | then 30 | curl -L -o data/idents_files_urls.sqlite.gz https://archive.org/download/ia_papers_manifest_20170919/index/idents_files_urls.sqlite.gz 31 | gunzip data/idents_files_urls.sqlite.gz 32 | fi 33 | 34 | diff -r nginx /etc/nginx/sites-enabled 35 | if [ "$ARG" == "NGINX" ] 36 | then 37 | sudo cp nginx/* /etc/nginx/sites-available 38 | if sudo service nginx reload 39 | then 40 | echo "NGINX restarted" 41 | else 42 | systemctl status nginx.service 43 | fi 44 | fi 45 | diff etc_supervisor_conf.d_dweb.conf /etc/supervisor/conf.d/dweb.conf 46 | 47 | sudo supervisorctl restart $SERVICENAME 48 | 49 | if [ "$ARG" == "TORRENT" ] 50 | then 51 | echo "TODO run some kind of installer from dweb-transport" 52 | fi -------------------------------------------------------------------------------- /scripts/reset_ipfs.sh: -------------------------------------------------------------------------------- 1 | # Removes references to IPFS from the server, and cleans up ipfs. 2 | # It might be worth running `sudo ipfs sh; cd /home/ipfs; gzip -c -r .ipfs >ipfsrepo.20180915.prerestore.zip` before 3 | # And will need to run the preseeder in /usr/local/dweb-mirror afterwards to get the popular collections back in 4 | cd /usr/local/dweb-gateway 5 | 6 | python3 -c ' 7 | import logging 8 | import os 9 | from python.config import config 10 | from python.maintenance import resetipfs 11 | 12 | logging.basicConfig(**config["logging"]) # For server 13 | cachetabledomain=config["domains"]["directory"]+config["domains"]["metadataverifykey"]+"/domain" 14 | cachetable=config["domains"]["directory"]+config["domains"]["metadataverifykey"] 15 | 16 | print("Step 1: removing", cachetable, "which is where the leafs are stored - these refer to IPFS hashes for metadata") 17 | try: 18 | os.remove(cachetabledomain) 19 | except FileNotFoundError: # Might already have been deleted 20 | pass 21 | try: 22 | os.rmdir(cachetable) 23 | except FileNotFoundError: # Might already have been deleted 24 | pass 25 | 26 | print("Step 2: Remove all REDIS links to IPFS hashes") 27 | resetipfs(removeipfs=True) 28 | 29 | print("Step 3: Clearing out IPFS repo") 30 | 31 | ' 32 | # The sudo stuff below here isn't tested - all these commands need running as ipfs 33 | #sudo -u ipfs ipfs pin ls --type recursive -q | sudo -u ipfs xargs ipfs pin rm 34 | #sudo -u ipfs ipfs repo gc 35 | 36 | -------------------------------------------------------------------------------- /scripts/temp.sh: -------------------------------------------------------------------------------- 1 | cd /usr/local/dweb-gateway 2 | 3 | python3 -c ' 4 | import logging 5 | import os 6 | from python.config import config 7 | from python.maintenance import resetipfs 8 | 9 | logging.basicConfig(**config["logging"]) # For server 10 | cachetabledomain=config["domains"]["directory"]+config["domains"]["metadataverifykey"]+"/domain" 11 | cachetable=config["domains"]["directory"]+config["domains"]["metadataverifykey"] 12 | 13 | print("Step 2: Remove all REDIS
links to magnet links") 14 | resetipfs(removemagnet=True) 15 | 16 | ' 17 | 18 | -------------------------------------------------------------------------------- /scripts/tests.sh: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env bash 2 | 3 | # This is just a quick test set until proper Python tests are built 4 | 5 | #python -m python.ServerGateway & 6 | 7 | set -x 8 | curl https://gateway.dweb.me/info 9 | echo; echo # Terminate response and blank line 10 | curl https://gateway.dweb.me/metadata/doi/10.1001/jama.2009.1064?verbose=True 11 | echo; echo # Terminate response and blank line 12 | 13 | # Fetch the sha1 multihash from above 14 | curl -D- -o /dev/null https://dweb.me/contenthash/5dqpnTaoMSJPpsHna58ZJHcrcJeAjW?verbose=True 15 | echo; echo # Terminate response and blank line 16 | 17 | echo "Now trying errors" 18 | #curl https://gateway.dweb.me/INVALIDCOMMAND 19 | #curl https://gateway.dweb.me/content/doi/10.INVALIDPUB/jama.2009.1064?verbose=True 20 | #curl https://gateway.dweb.me/content/doi/10.1001/INVALIDDOC.2009.1064?verbose=True 21 | -------------------------------------------------------------------------------- /temp.py: -------------------------------------------------------------------------------- 1 | import logging 2 | # This is run every 10 minutes by Cron (10 minutes * 58 DHT rounds = 580 minutes, ~10 hours per full announce cycle) 3 | from python.config import config 4 | import redis 5 | import base58 6 | from python.HashStore import StateService 7 | from python.TransportIPFS import TransportIPFS 8 | from python.maintenance import resetipfs 9 | 10 | logging.basicConfig(**config["logging"]) # For server 11 | resetipfs() # Empty - should just count, and delete known bad hashes 12 | 13 | # To fully reset IPFS need to also ... 14 | # rm /usr/local/dweb-gateway/.cache/table/{config["domains"]["metadataverifykey"]} which is where the leafs are stored - these refer to IPFS hashes for metadata 15 | --------------------------------------------------------------------------------