├── .gitignore
├── Dockerfile
├── README.md
├── build.R
└── spider.config

/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rproj
.Rhistory
.RData
*.Rproj
license\.txt
licence\.txt
crawls\*
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
# modified ubuntu https://github.com/phusion/baseimage-docker
FROM phusion/baseimage:master
# baseimage init system (overridden by the ENTRYPOINT/CMD at the end)
CMD ["/sbin/my_init"]

# wget plus the runtime dependencies of the Screaming Frog .deb package
RUN apt-get update && apt-get install -y \
    wget \
    xdg-utils \
    zenity \
    ttf-mscorefonts-installer \
    fonts-wqy-zenhei \
    libgconf-2-4

# Download and install Screaming Frog SEO Spider, then pull in any missing dependencies
RUN wget --no-verbose https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_16.5_all.deb && \
    dpkg -i /screamingfrogseospider_16.5_all.deb && \
    apt-get install -f -y

# Pre-accepted EULA settings and the licence details
COPY spider.config /root/.ScreamingFrogSEOSpider/spider.config
COPY licence.txt /root/.ScreamingFrogSEOSpider/licence.txt

# Folder to mount a host volume on for crawl output
RUN mkdir /home/crawls

ENTRYPOINT ["/usr/bin/screamingfrogseospider"]

# Default to printing the CLI help
CMD ["--help"]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ScreamingFrog Docker

Provides headless screaming frogs.

Helped by [`databulle`](https://www.databulle.com/blog/seo/screaming-frog-headless.html) - thank you!

Contains a Docker installation of ScreamingFrog SEO Spider v16.5 on Ubuntu, intended to be used via its [Command Line Interface](https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#command-line).

## Installation

1. Clone the repo.
2. Add a `licence.txt` file with your username on the first line and your licence key on the second (see the example below).
3. Run:

`docker build -t screamingfrog .`
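
For reference, `licence.txt` is just a two-line plain text file. The values below are placeholders - use the username and licence key from your own Screaming Frog account:

```
your-screamingfrog-username
XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX
```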

Or submit it via a Google Cloud Build trigger, which will build and host the image for you privately at a URL like
`gcr.io/your-project/screamingfrog-docker:a2ffbd174483aaa27473ef6e0eee404f19058b1a` - for use in Kubernetes and the like.

## Usage

Once the image is built it can be run with `docker run screamingfrog`. By default it will show the `--help` output:

```
> docker run screamingfrog

usage: ScreamingFrogSEOSpider [crawl-file|options]

Positional arguments:
  crawl-file
    Specify a crawl to load. This argument will be ignored if there
    are any other options specified

Options:
  --crawl
    Start crawling the supplied URL

  --crawl-list
    Start crawling the specified URLs in list mode

  --config
    Supply a config file for the spider to use

  --use-majestic
    Use Majestic API during crawl

  --use-mozscape
    Use Mozscape API during crawl

  --use-ahrefs
    Use Ahrefs API during crawl

  --use-google-analytics
    Use Google Analytics API during crawl

  --use-google-search-console
    Use Google Search Console API during crawl

  --headless
    Run in silent mode without a user interface

  --output-folder
    Where to store saved files. Default: current working directory

  --export-format
    Supply a format to be used for all exports

  --overwrite
    Overwrite files in output directory

  --timestamped-output
    Create a timestamped folder in the output directory, and store
    all output there

  --save-crawl
    Save the completed crawl

  --export-tabs
    Supply a comma separated list of tabs to export. You need to
    specify the tab name and the filter name separated by a colon

  --bulk-export <[submenu:]export,...>
    Supply a comma separated list of bulk exports to perform. The
    export names are the same as in the Bulk Export menu in the UI.
    To access exports in a submenu, use

  --save-report <[submenu:]report,...>
    Supply a comma separated list of reports to save. The report
    names are the same as in the Report menu in the UI. To access
    reports in a submenu, use

  --create-sitemap
    Creates a sitemap from the completed crawl

  --create-images-sitemap
    Creates an images sitemap from the completed crawl

  -h, --help
    Print this message and exit
```

## Crawling

Crawl a website via the example below. You need to mount a local volume if you want to save the results to your laptop. A `/home/crawls/` folder is available in the Docker image that you can save crawl results to.

The example below starts a headless crawl of `http://iihnordic.com` and saves the crawl and a bulk export of "All Outlinks" to a local folder that is mounted onto the `/home/crawls` folder within the container. (The log shown was captured with an older version of the spider, hence the v10.0 in the output.)

```
> docker run -v /Users/mark/screamingfrog-docker/crawls:/home/crawls screamingfrog --crawl http://iihnordic.com --headless --save-crawl --output-folder /home/crawls --timestamped-output --bulk-export 'All Outlinks'

2018-09-20 12:51:11,640 [main] INFO - Persistent config file does not exist, /root/.ScreamingFrogSEOSpider/spider.config
2018-09-20 12:51:11,827 [8] [main] INFO - Application Started
2018-09-20 12:51:11,836 [8] [main] INFO - Running: Screaming Frog SEO Spider 10.0
2018-09-20 12:51:11,837 [8] [main] INFO - Build: 5784af3aa002681ab5f8e98aee1f43c1be2944af
2018-09-20 12:51:11,838 [8] [main] INFO - Platform Info: Name 'Linux' Version '4.9.93-linuxkit-aufs' Arch 'amd64'
2018-09-20 12:51:11,838 [8] [main] INFO - Java Info: Vendor 'Oracle Corporation' URL 'http://java.oracle.com/' Version '1.8.0_161' Home '/usr/share/screamingfrogseospider/jre'
2018-09-20 12:51:11,838 [8] [main] INFO - VM args: -Xmx2g, -XX:+UseG1GC, -XX:+UseStringDeduplication, -enableassertions, -XX:ErrorFile=/root/.ScreamingFrogSEOSpider/hs_err_pid%p.log, -Djava.ext.dirs=/usr/share/screamingfrogseospider/jre/lib/ext
2018-09-20 12:51:11,839 [8] [main] INFO - Log File: /root/.ScreamingFrogSEOSpider/trace.txt
2018-09-20 12:51:11,839 [8] [main] INFO - Fatal Log File: /root/.ScreamingFrogSEOSpider/crash.txt
2018-09-20 12:51:11,840 [8] [main] INFO - Logging Status: OK
2018-09-20 12:51:11,840 [8] [main] INFO - Memory: Physical=2.0GB, Used=12MB, Free=19MB, Total=32MB, Max=2048MB, Using 0%
2018-09-20 12:51:11,841 [8] [main] INFO - Licence File: /root/.ScreamingFrogSEOSpider/licence.txt
2018-09-20 12:51:11,841 [8] [main] INFO - Licence Status: invalid
....
....
....
2018-09-20 13:52:14,682 [8] [SaveFileWriter 1] INFO - SpiderTaskUpdate [mCompleted=0, mTotal=0]
2018-09-20 13:52:14,688 [8] [SaveFileWriter 1] INFO - Crawl saved in: 0 hrs 0 mins 0 secs (154)
2018-09-20 13:52:14,690 [8] [SpiderMain 1] INFO - Spider changing state from: SpiderWritingToDiskState to: SpiderCrawlIdleState
2018-09-20 13:52:14,695 [8] [main] INFO - Exporting All Outlinks
2018-09-20 13:52:14,695 [8] [main] INFO - Saving All Outlinks
2018-09-20 13:52:14,700 [8] [ReportManager 1] INFO - Writing report All Outlinks to /home/crawls/2018.09.20.13.51.43/all_outlinks.csv
2018-09-20 13:52:14,871 [8] [ReportManager 1] INFO - Completed writing All Outlinks in 0 hrs 0 mins 0 secs (172)
2018-09-20 13:52:14,872 [8] [exitlogger] INFO - Application Exited
```
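
The other CLI flags combine in the same way. As a further sketch (untested here; the `Internal:All` tab filter and `Crawl Overview` report name are assumptions - use the exact names shown in the UI menus), the run below would export the Internal tab, save a Crawl Overview report, and generate an XML sitemap:

```
> docker run -v /Users/mark/screamingfrog-docker/crawls:/home/crawls screamingfrog \
    --crawl http://iihnordic.com --headless \
    --output-folder /home/crawls --timestamped-output \
    --export-tabs 'Internal:All' \
    --save-report 'Crawl Overview' \
    --create-sitemap
```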

--------------------------------------------------------------------------------
/build.R:
--------------------------------------------------------------------------------
library(googleCloudRunner)

# Build the Dockerfile in this folder via Cloud Build and push the image
# to the project's Container Registry
cr_deploy_docker(".", image_name = "screaming-frog-iih")
--------------------------------------------------------------------------------
/spider.config:
--------------------------------------------------------------------------------
storage.db_dir=
eula.accepted=11
--------------------------------------------------------------------------------
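
A usage note for `build.R` above: googleCloudRunner needs to know which GCP project, region and Cloud Build bucket to use before `cr_deploy_docker()` will run. A minimal sketch, assuming you have already authenticated (for example with `cr_setup()`); the project, region and bucket names below are placeholders:

```
# Sketch only: one-off configuration before sourcing build.R.
# Replace the placeholder project, region and bucket with your own values.
library(googleCloudRunner)

cr_project_set("my-gcp-project")
cr_region_set("europe-west1")
cr_bucket_set("my-gcp-project-cloudbuild")

# Same call as build.R: builds the Dockerfile via Cloud Build and pushes
# gcr.io/my-gcp-project/screaming-frog-iih
cr_deploy_docker(".", image_name = "screaming-frog-iih")
```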