├── .gitignore ├── LICENSE ├── README.rst ├── cookiecutter.json ├── update-adblock.py └── {{cookiecutter.folder_name}} ├── docker-compose.yml ├── filters ├── easylist.txt ├── easylist_noadult.txt ├── easyprivacy.txt ├── fanboy-annoyance.txt └── fanboy-social.txt ├── haproxy.cfg ├── proxy-profiles ├── default.ini └── tor.ini └── show-stats /.gitignore: -------------------------------------------------------------------------------- 1 | aquarium/ 2 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Hyperion Gray, LLC 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in 13 | all copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 21 | THE SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.rst: -------------------------------------------------------------------------------- 1 | Aquarium 2 | ======== 3 | 4 | Aquarium is a cookiecutter_ template for hassle-free 5 | `Docker Compose`_ + Splash_ setup. Think of it as a Splash instance 6 | with extra features and without common pitfalls. 7 | 8 | .. _cookiecutter: http://cookiecutter.rtfd.org 9 | .. _Splash: https://github.com/scrapinghub/splash 10 | .. _Docker Compose: https://docs.docker.com/compose/ 11 | 12 | Usage 13 | ----- 14 | 15 | First, make sure Docker and Docker Compose are installed. 16 | 17 | Then install cookiecutter:: 18 | 19 | pip install cookiecutter 20 | 21 | or (on OS X + homebrew):: 22 | 23 | brew install cookiecutter 24 | 25 | Then generate a folder with config files:: 26 | 27 | cookiecutter gh:TeamHG-Memex/aquarium 28 | 29 | With all default options it'll create an ``aquarium`` folder in the current 30 | path. Go to this folder and start the Splash cluster:: 31 | 32 | cd ./aquarium 33 | docker-compose up 34 | 35 | Then use http://<docker-host-ip>:8050 as a regular Splash_ instance. On Linux 36 | http://0.0.0.0:8050 should work; on OS X and Windows the IP address depends on 37 | boot2docker or docker-machine. 38 | 39 | Options 40 | ------- 41 | 42 | When generating a config, cookiecutter will ask a bunch of questions. 43 | 44 | * ``folder_name (default is "aquarium")`` - a name of the target folder. 45 | * ``num_splashes (default is "3")`` - a number of Splash instances to create. 46 | To utilize full server capacity it makes sense to create slightly more Splash 47 | instances than CPU cores - e.g. on a 2-core machine 3 instances often 48 | work best. 49 | * ``splash_version (default is "3.0")`` - a version of the scrapinghub/splash 50 | Docker image. 51 | * ``auth_user (default is "user")``, ``auth_password (default is "userpass")`` 52 | - HTTP Basic Auth credentials for Splash. 
53 | * ``splash_verbosity (default is "1")`` - Splash log verbosity, from 0 to 5. 54 | * ``max_timeout (default is "3600")`` - maximum allowed timeout, in seconds. 55 | * ``maxrss_mb (default is "3000")`` - a soft memory limit, in MB. Splash 56 | container will be restarted after some time if it starts to use more memory 57 | than this value. 58 | * ``splash_slots (default is "5")`` - a number of Splash slots to use, i.e. 59 | how many render jobs to run in parallel in a single Splash process. 60 | * ``stats_enabled (default is "1")`` - whether to enable HAProxy stats. 61 | If stats are enabled, visit http://<docker-host-ip>:8036 to see the stats page. 62 | * ``stats_auth (default is "admin:adminpass")`` - HTTP Basic Auth credentials 63 | for HAProxy stats. 64 | * ``tor (default is "1")`` - enter 0 to disable Tor_ support. When Tor support 65 | is enabled, all .onion links are opened using Tor. In addition to 66 | that, there is a ``tor`` `Splash proxy profile`_ which you can use to render 67 | any page using Tor. 68 | * ``adblock (default is "1")`` - enter 0 to disable AdBlock Plus 69 | `request filters`_ (FIXME: this option is not working yet; 70 | filters are always available). By default, the following filters 71 | are available: 72 | 73 | * `easylist`: default set of EasyList_ filters for English; 74 | * `easyprivacy`: EasyPrivacy filters remove tracking scripts; 75 | * `easylist_noadult`: EasyList variant without filters for adult domains; 76 | * `fanboy-social`: removes social media content such as the Facebook like 77 | buttons and other widgets; 78 | * `fanboy-annoyance`: blocks social media content, in-page pop-ups 79 | and other annoyances; use it to decrease loading times and unclutter 80 | pages. `fanboy-social` is already included in this filter. 81 | 82 | .. _Tor: http://torproject.org 83 | .. _Splash proxy profile: http://splash.readthedocs.org/en/latest/api.html#proxy-profiles 84 | .. _request filters: http://splash.readthedocs.org/en/latest/api.html#request-filters 85 | .. 
_EasyList: https://easylist.to/ 86 | 87 | Contributing 88 | ------------ 89 | 90 | * Source code: https://github.com/TeamHG-Memex/aquarium 91 | * Bug tracker: https://github.com/TeamHG-Memex/aquarium/issues 92 | 93 | License is MIT. 94 | 95 | ---- 96 | 97 | .. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg 98 | :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=aquarium 99 | :alt: define hyperiongray 100 | -------------------------------------------------------------------------------- /cookiecutter.json: -------------------------------------------------------------------------------- 1 | { 2 | "folder_name": "aquarium", 3 | "num_splashes": 3, 4 | "splash_version": "3.0", 5 | "auth_user": "user", 6 | "auth_password": "userpass", 7 | "splash_verbosity": 1, 8 | "max_timeout": 3600, 9 | "maxrss_mb": 3000, 10 | "splash_slots": 5, 11 | "stats_enabled": 1, 12 | "stats_auth": "admin:adminpass", 13 | "tor": 1, 14 | 15 | "_copy_without_render": [ 16 | "*filters/*.txt" 17 | ] 18 | } 19 | -------------------------------------------------------------------------------- /update-adblock.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | import os 4 | import requests 5 | 6 | 7 | FILTERS = { 8 | "easylist": "https://easylist.to/easylist/easylist.txt", 9 | "easyprivacy": "https://easylist.to/easylist/easyprivacy.txt", 10 | # "easyprivacy_nointernational": "https://easylist-downloads.adblockplus.org/easyprivacy_nointernational.txt", 11 | "easylist_noadult": "https://easylist-downloads.adblockplus.org/easylist_noadult.txt", 12 | # "antiadblockfilters": "https://easylist-downloads.adblockplus.org/antiadblockfilters.txt", 13 | "fanboy-annoyance": "https://easylist.to/easylist/fanboy-annoyance.txt", 14 | "fanboy-social": "https://easylist.to/easylist/fanboy-social.txt", 15 | # "easylistgermany": "https://easylist.to/easylistgermany/easylistgermany.txt", 16 | # 
"easylistitaly": "https://easylist-downloads.adblockplus.org/easylistitaly.txt", 17 | # "easylistdutch": "https://easylist-downloads.adblockplus.org/easylistdutch.txt", 18 | # "liste_fr": "https://easylist-downloads.adblockplus.org/liste_fr.txt", 19 | # "easylistchina": "https://easylist-downloads.adblockplus.org/easylistchina.txt", 20 | # "adblock_bg": "http://stanev.org/abp/adblock_bg.txt", 21 | # "abpindo": "https://indonesianadblockrules.googlecode.com/hg/subscriptions/abpindo.txt", 22 | # "liste_ar": "https://liste-ar-adblock.googlecode.com/hg/Liste_AR.txt", 23 | # "adblock_cz": "https://adblock-czechoslovaklist.googlecode.com/svn/filters.txt", 24 | # "latvian_list": "https://gitorious.org/adblock-latvian/adblock-latvian/raw/master:lists/latvian-list.txt", 25 | # "EasyListHebrew": "https://raw.github.com/AdBlockPlusIsrael/EasyListHebrew/master/EasyListHebrew.txt", 26 | # "easylistlithuania": "http://margevicius.lt/easylistlithuania.txt", 27 | } 28 | 29 | 30 | def update(path): 31 | if not os.path.exists(path): 32 | os.mkdir(path) 33 | 34 | for idx, (name, url) in enumerate(FILTERS.items(), start=1): 35 | print("[%d/%d] [%s] downloading %s" % (idx, len(FILTERS), name, url)) 36 | try: 37 | resp = requests.get(url, timeout=10) 38 | # treat HTTP error statuses as failures so that error pages are not saved as filters 39 | resp.raise_for_status() 40 | except requests.RequestException as e: 41 | print(e) 42 | else: 43 | fn = os.path.join(path, name+".txt") 44 | with open(fn, 'wb') as f: 45 | f.write(resp.content) 46 | 47 | 48 | if __name__ == '__main__': 49 | update(os.path.join("{{cookiecutter.folder_name}}", "filters")) 50 | -------------------------------------------------------------------------------- /{{cookiecutter.folder_name}}/docker-compose.yml: -------------------------------------------------------------------------------- 1 | {% set num_splashes = cookiecutter.num_splashes|int %} 2 | {% set splash_slots = cookiecutter.splash_slots|int %} 3 | {% set max_timeout = cookiecutter.max_timeout|int %} 4 | {% set maxrss = cookiecutter.maxrss_mb | int %} 5 | {% set mem_limit = "%dm" | 
format(maxrss * 1.4) %} 6 | {% set memswap_limit = "%dm" | format(maxrss * 1.8) %} 7 | {% set tor = cookiecutter.tor | int %} 8 | {% set verbosity = cookiecutter.splash_verbosity %} 9 | 10 | version: '2' 11 | 12 | services: 13 | haproxy: 14 | image: haproxy:1.7 15 | ports: 16 | # stats 17 | - "8036:8036" 18 | 19 | # splash 20 | - "8050:8050" 21 | links: 22 | {%- for i in range(num_splashes) %} 23 | - splash{{i}} 24 | {%- endfor %} 25 | volumes: 26 | - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro 27 | 28 | {%- for i in range(num_splashes) %} 29 | 30 | splash{{i}}: 31 | image: scrapinghub/splash:{{ cookiecutter.splash_version }} 32 | command: --max-timeout {{ max_timeout }} --slots {{ splash_slots }} --maxrss {{ maxrss }} --verbosity {{ verbosity }} 33 | expose: 34 | - 8050 35 | mem_limit: {{ mem_limit }} 36 | memswap_limit: {{ memswap_limit }} 37 | restart: always 38 | 39 | {%- if tor %} 40 | links: 41 | - tor 42 | {%- endif %} 43 | volumes: 44 | {%- if tor %} 45 | - ./proxy-profiles:/etc/splash/proxy-profiles:ro 46 | {%- endif %} 47 | - ./filters:/etc/splash/filters:ro 48 | 49 | {%- endfor %} 50 | 51 | 52 | {% if tor %} 53 | tor: 54 | image: jess/tor-proxy 55 | expose: 56 | - 9050 57 | logging: 58 | driver: "none" 59 | restart: always 60 | 61 | {% endif %} 62 | -------------------------------------------------------------------------------- /{{cookiecutter.folder_name}}/haproxy.cfg: -------------------------------------------------------------------------------- 1 | {% set max_timeout = cookiecutter.max_timeout|int %} 2 | {% set splash_slots = cookiecutter.splash_slots|int %} 3 | {% set num_splashes = cookiecutter.num_splashes|int %} 4 | 5 | # HAProxy 1.7 config for Splash. It assumes Splash instances are executed 6 | # on the same machine and connected to HAProxy using Docker links. 
7 | global 8 | # raise it if necessary 9 | maxconn 512 10 | # required for stats page 11 | stats socket /tmp/haproxy 12 | 13 | userlist users 14 | user {{ cookiecutter.auth_user }} insecure-password {{ cookiecutter.auth_password }} 15 | 16 | defaults 17 | log global 18 | mode http 19 | 20 | # remove requests from a queue when clients disconnect; 21 | # see https://cbonte.github.io/haproxy-dconv/1.7/configuration.html#4.2-option%20abortonclose 22 | option abortonclose 23 | 24 | # gzip can save quite a lot of traffic with json, html or base64 data 25 | compression algo gzip 26 | compression type text/html text/plain application/json 27 | 28 | # increase these values if you want to 29 | # allow longer request queues in HAProxy 30 | timeout connect {{ max_timeout }}s 31 | timeout client {{ max_timeout }}s 32 | timeout server {{ max_timeout }}s 33 | 34 | {% if cookiecutter.stats_enabled|int %} 35 | # visit 0.0.0.0:8036 to see HAProxy stats page 36 | listen stats 37 | bind *:8036 38 | mode http 39 | stats enable 40 | stats hide-version 41 | stats show-legends 42 | stats show-desc Splash Cluster 43 | stats uri / 44 | stats refresh 10s 45 | stats realm Haproxy\ Statistics 46 | stats auth {{ cookiecutter.stats_auth }} 47 | {% endif %} 48 | 49 | # Splash Cluster configuration 50 | frontend http-in 51 | bind *:8050 52 | 53 | # http basic auth 54 | acl auth_ok http_auth(users) 55 | http-request auth realm Splash if !auth_ok 56 | http-request allow if auth_ok 57 | http-request deny 58 | 59 | # don't apply the same limits for non-render endpoints 60 | acl staticfiles path_beg /_harviewer/ 61 | acl misc path / /info /_debug /debug 62 | 63 | use_backend splash-cluster if auth_ok !staticfiles !misc 64 | use_backend splash-misc if auth_ok staticfiles 65 | use_backend splash-misc if auth_ok misc 66 | 67 | backend splash-cluster 68 | option httpchk GET / 69 | balance leastconn 70 | 71 | # try another instance when connection is dropped 72 | retries 2 73 | option redispatch 74 | 75 | {%- 
for i in range(num_splashes) %} 76 | server splash-{{i}} splash{{i}}:8050 check maxconn {{ splash_slots }} inter 2s fall 10 observe layer4 77 | {%- endfor %} 78 | 79 | backend splash-misc 80 | balance roundrobin 81 | 82 | {%- for i in range(num_splashes) %} 83 | server splash-{{i}} splash{{i}}:8050 check fall 15 84 | {%- endfor %} 85 | -------------------------------------------------------------------------------- /{{cookiecutter.folder_name}}/proxy-profiles/default.ini: -------------------------------------------------------------------------------- 1 | ; enable tor for .onion links 2 | [proxy] 3 | host = tor 4 | ;host = 192.168.99.100 5 | port = 9050 6 | type = socks5 7 | 8 | [rules] 9 | whitelist= 10 | .*\.onion$ 11 | .*\.onion[/\?].* 12 | -------------------------------------------------------------------------------- /{{cookiecutter.folder_name}}/proxy-profiles/tor.ini: -------------------------------------------------------------------------------- 1 | ; enable tor for all requests 2 | [proxy] 3 | host = tor 4 | ;host = 192.168.99.100 5 | port = 9050 6 | type = socks5 7 | -------------------------------------------------------------------------------- /{{cookiecutter.folder_name}}/show-stats: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | docker-compose ps -q | xargs docker stats 3 | --------------------------------------------------------------------------------
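Note on the memory limits in docker-compose.yml: the template derives mem_limit and memswap_limit from maxrss_mb using the factors 1.4 and 1.8. The sketch below reproduces that arithmetic in plain Python, assuming Jinja's format filter behaves like Python %-formatting. The function name compose_limits is hypothetical and is not part of the repository.

```python
def compose_limits(maxrss_mb):
    """Return (mem_limit, memswap_limit) as the template would render them.

    Mirrors the docker-compose.yml expressions
    `"%dm" | format(maxrss * 1.4)` and `"%dm" | format(maxrss * 1.8)`;
    %d truncates the floating-point product to an integer.
    """
    maxrss = int(maxrss_mb)  # stands in for the `| int` filter on integer-like input
    mem_limit = "%dm" % (maxrss * 1.4)
    memswap_limit = "%dm" % (maxrss * 1.8)
    return mem_limit, memswap_limit


if __name__ == "__main__":
    # "3000" is the default maxrss_mb from cookiecutter.json
    print(compose_limits("3000"))
```

With the default maxrss_mb of 3000 this gives a container limit of 4200m and a memory-plus-swap limit of 5400m, so the hard Docker limits sit above the soft limit that Splash enforces on itself through --maxrss.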