├── .gitignore
├── Dockerfile
├── README.md
├── build.R
└── spider.config

/.gitignore:
--------------------------------------------------------------------------------
.Rproj.user
.Rproj
.Rhistory
.RData
*.Rproj
license\.txt
licence\.txt
crawls\*
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
# modified ubuntu https://github.com/phusion/baseimage-docker
FROM phusion/baseimage:master
# baseimage init system (overridden by the ENTRYPOINT/CMD at the end)
CMD ["/sbin/my_init"]

# wget plus the runtime dependencies of the Screaming Frog .deb package
RUN apt-get update && apt-get install -y \
    wget \
    xdg-utils \
    zenity \
    ttf-mscorefonts-installer \
    fonts-wqy-zenhei \
    libgconf-2-4

# Download and install Screaming Frog SEO Spider, then pull in any missing dependencies
RUN wget --no-verbose https://download.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_16.5_all.deb && \
    dpkg -i /screamingfrogseospider_16.5_all.deb && \
    apt-get install -f -y

# Pre-accepted EULA settings and the licence details
COPY spider.config /root/.ScreamingFrogSEOSpider/spider.config
COPY licence.txt /root/.ScreamingFrogSEOSpider/licence.txt

# Folder to mount a host volume on for crawl output
RUN mkdir /home/crawls

ENTRYPOINT ["/usr/bin/screamingfrogseospider"]

# Default to printing the CLI help
CMD ["--help"]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# ScreamingFrog Docker

Provides headless screaming frogs.

Helped by [`databulle`](https://www.databulle.com/blog/seo/screaming-frog-headless.html) - thank you!

Contains a Docker installation of ScreamingFrog SEO Spider v16.5 on Ubuntu, intended to be used via its [Command Line Interface](https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#command-line).

## Installation

1. Clone the repo.
2. Add a `licence.txt` file with your username on the first line and your licence key on the second (see the example below).
3. Run:

`docker build -t screamingfrog .`
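
For reference, `licence.txt` is just a two-line plain text file. The values below are placeholders - use the username and licence key from your own Screaming Frog account:

```
your-screamingfrog-username
XXXXXXXX-XXXXXXXX-XXXXXXXX-XXXXXXXX
```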

Or submit it via a Google Cloud Build trigger, which will build and host the image for you privately at a URL like
`gcr.io/your-project/screamingfrog-docker:a2ffbd174483aaa27473ef6e0eee404f19058b1a` - for use in Kubernetes and the like.

## Usage

Once the image is built it can be run with `docker run screamingfrog`. By default it will show the `--help` output:

```
> docker run screamingfrog

usage: ScreamingFrogSEOSpider [crawl-file|options]

Positional arguments:
  crawl-file
    Specify a crawl to load. This argument will be ignored if there
    are any other options specified

Options:
  --crawl
    Start crawling the supplied URL

  --crawl-list
    Start crawling the specified URLs in list mode

  --config
    Supply a config file for the spider to use

  --use-majestic
    Use Majestic API during crawl

  --use-mozscape
    Use Mozscape API during crawl

  --use-ahrefs
    Use Ahrefs API during crawl

  --use-google-analytics
    Use Google Analytics API during crawl

  --use-google-search-console
    Use Google Search Console API during crawl

  --headless
    Run in silent mode without a user interface

  --output-folder
    Where to store saved files. Default: current working directory

  --export-format
    Supply a format to be used for all exports

  --overwrite
    Overwrite files in output directory

  --timestamped-output
    Create a timestamped folder in the output directory, and store
    all output there

  --save-crawl
    Save the completed crawl

  --export-tabs
    Supply a comma separated list of tabs to export. You need to
    specify the tab name and the filter name separated by a colon

  --bulk-export <[submenu:]export,...>
    Supply a comma separated list of bulk exports to perform. The
    export names are the same as in the Bulk Export menu in the UI.
    To access exports in a submenu, use

  --save-report <[submenu:]report,...>
    Supply a comma separated list of reports to save. The report
    names are the same as in the Report menu in the UI. To access
    reports in a submenu, use

  --create-sitemap
    Creates a sitemap from the completed crawl

  --create-images-sitemap
    Creates an images sitemap from the completed crawl

  -h, --help
    Print this message and exit
```

## Crawling

Crawl a website via the example below. You need to mount a local volume if you want to save the results to your laptop. A `/home/crawls/` folder is available in the Docker image that you can save crawl results to.

The example below starts a headless crawl of `http://iihnordic.com` and saves the crawl and a bulk export of "All Outlinks" to a local folder that is mounted onto the `/home/crawls` folder within the container. (The log shown was captured with an older version of the spider, hence the v10.0 in the output.)

```
> docker run -v /Users/mark/screamingfrog-docker/crawls:/home/crawls screamingfrog --crawl http://iihnordic.com --headless --save-crawl --output-folder /home/crawls --timestamped-output --bulk-export 'All Outlinks'

2018-09-20 12:51:11,640 [main] INFO - Persistent config file does not exist, /root/.ScreamingFrogSEOSpider/spider.config
2018-09-20 12:51:11,827 [8] [main] INFO - Application Started
2018-09-20 12:51:11,836 [8] [main] INFO - Running: Screaming Frog SEO Spider 10.0
2018-09-20 12:51:11,837 [8] [main] INFO - Build: 5784af3aa002681ab5f8e98aee1f43c1be2944af
2018-09-20 12:51:11,838 [8] [main] INFO - Platform Info: Name 'Linux' Version '4.9.93-linuxkit-aufs' Arch 'amd64'
2018-09-20 12:51:11,838 [8] [main] INFO - Java Info: Vendor 'Oracle Corporation' URL 'http://java.oracle.com/' Version '1.8.0_161' Home '/usr/share/screamingfrogseospider/jre'
2018-09-20 12:51:11,838 [8] [main] INFO - VM args: -Xmx2g, -XX:+UseG1GC, -XX:+UseStringDeduplication, -enableassertions, -XX:ErrorFile=/root/.ScreamingFrogSEOSpider/hs_err_pid%p.log, -Djava.ext.dirs=/usr/share/screamingfrogseospider/jre/lib/ext
2018-09-20 12:51:11,839 [8] [main] INFO - Log File: /root/.ScreamingFrogSEOSpider/trace.txt
2018-09-20 12:51:11,839 [8] [main] INFO - Fatal Log File: /root/.ScreamingFrogSEOSpider/crash.txt
2018-09-20 12:51:11,840 [8] [main] INFO - Logging Status: OK
2018-09-20 12:51:11,840 [8] [main] INFO - Memory: Physical=2.0GB, Used=12MB, Free=19MB, Total=32MB, Max=2048MB, Using 0%
2018-09-20 12:51:11,841 [8] [main] INFO - Licence File: /root/.ScreamingFrogSEOSpider/licence.txt
2018-09-20 12:51:11,841 [8] [main] INFO - Licence Status: invalid
....
....
....
2018-09-20 13:52:14,682 [8] [SaveFileWriter 1] INFO - SpiderTaskUpdate [mCompleted=0, mTotal=0]
2018-09-20 13:52:14,688 [8] [SaveFileWriter 1] INFO - Crawl saved in: 0 hrs 0 mins 0 secs (154)
2018-09-20 13:52:14,690 [8] [SpiderMain 1] INFO - Spider changing state from: SpiderWritingToDiskState to: SpiderCrawlIdleState
2018-09-20 13:52:14,695 [8] [main] INFO - Exporting All Outlinks
2018-09-20 13:52:14,695 [8] [main] INFO - Saving All Outlinks
2018-09-20 13:52:14,700 [8] [ReportManager 1] INFO - Writing report All Outlinks to /home/crawls/2018.09.20.13.51.43/all_outlinks.csv
2018-09-20 13:52:14,871 [8] [ReportManager 1] INFO - Completed writing All Outlinks in 0 hrs 0 mins 0 secs (172)
2018-09-20 13:52:14,872 [8] [exitlogger] INFO - Application Exited
```
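
The other CLI flags combine in the same way. As a further sketch (untested here; the `Internal:All` tab filter and `Crawl Overview` report name are assumptions - use the exact names shown in the UI menus), the run below would export the Internal tab, save a Crawl Overview report, and generate an XML sitemap:

```
> docker run -v /Users/mark/screamingfrog-docker/crawls:/home/crawls screamingfrog \
    --crawl http://iihnordic.com --headless \
    --output-folder /home/crawls --timestamped-output \
    --export-tabs 'Internal:All' \
    --save-report 'Crawl Overview' \
    --create-sitemap
```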

--------------------------------------------------------------------------------
/build.R:
--------------------------------------------------------------------------------
library(googleCloudRunner)

# Build the Dockerfile in this folder via Cloud Build and push the image
# to the project's Container Registry
cr_deploy_docker(".", image_name = "screaming-frog-iih")
--------------------------------------------------------------------------------
/spider.config:
--------------------------------------------------------------------------------
storage.db_dir=
eula.accepted=11
--------------------------------------------------------------------------------
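
A usage note for `build.R` above: googleCloudRunner needs to know which GCP project, region and Cloud Build bucket to use before `cr_deploy_docker()` will run. A minimal sketch, assuming you have already authenticated (for example with `cr_setup()`); the project, region and bucket names below are placeholders:

```
# Sketch only: one-off configuration before sourcing build.R.
# Replace the placeholder project, region and bucket with your own values.
library(googleCloudRunner)

cr_project_set("my-gcp-project")
cr_region_set("europe-west1")
cr_bucket_set("my-gcp-project-cloudbuild")

# Same call as build.R: builds the Dockerfile via Cloud Build and pushes
# gcr.io/my-gcp-project/screaming-frog-iih
cr_deploy_docker(".", image_name = "screaming-frog-iih")
```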