├── .gitignore
├── README.md
├── config
│   ├── ivy.xml
│   ├── nutch-site.xml
│   └── regex-urlfilter.txt
├── docker-compose.yml
└── nutch
    ├── Dockerfile
    └── startup.sh
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/crawldata
/data
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Apache Nutch, Elasticsearch, MongoDB

This repo contains 1) a Dockerfile build for Apache Nutch and 2) a docker-compose setup for use with Elasticsearch and MongoDB.

Info: MongoDB is currently not attached or used.

## Apache Nutch Docker Build

The [Dockerfile](./nutch/Dockerfile) provides a Docker build of Apache Nutch, published as [smartive/nutch](https://hub.docker.com/r/smartive/nutch/).
There are two published builds:
- `latest` contains [Apache Nutch v1.13](https://github.com/apache/nutch/tree/release-1.13) for Elasticsearch 2.3.*
- `es-5` contains a [modified version of Apache Nutch v1.13](https://github.com/smartive/nutch/tree/feature/es-5) ready for Elasticsearch 5.4.*

## Apache Nutch docker-compose Setup for Elasticsearch 2.3.* / 5.4.* and MongoDB

[This repo](https://github.com/smartive/docker-nutch-elasticsearch-mongodb) contains a [docker-compose](https://github.com/smartive/docker-nutch-elasticsearch-mongodb/blob/master/docker-compose.yml) configuration for Apache Nutch with Elasticsearch 2.3.* / 5.4.* and MongoDB.

To get started, check out the [repo](https://github.com/smartive/docker-nutch-elasticsearch-mongodb) and run:

```bash
git clone git@github.com:smartive/docker-nutch-elasticsearch-mongodb.git
cd ./docker-nutch-elasticsearch-mongodb && docker-compose up
```

This fires up the nutchserver (REST API) and the webapp. Visit [http://localhost:8080/](http://localhost:8080/).

### Manual Run

```bash
docker-compose run -p 8080:8080 -p 8081:8081 --name=manual_nutch --rm --entrypoint=bash nutch
```

Then, inside the container, create the seed file:

```bash
echo "https://smartive.ch/" > seed.txt
```

Then open `regex-urlfilter.txt` and replace the last line to limit the crawl to the domain `smartive.ch`:

```bash
vi nutch/conf/regex-urlfilter.txt
# Inside regex-urlfilter.txt replace the last line `+.` with:
+^https://smartive\.ch
```

Then start the crawl (`-i` indexes into the configured indexer, `-s` points at the seed file, `crawldata` is the crawl directory, and `2` is the number of crawl rounds):

```bash
nutch/bin/crawl -i -s seed.txt crawldata 2
```

To only build the Elasticsearch index from an existing crawl database (adjust the segment name to your own crawl):

```bash
/root/nutch/bin/nutch index crawldata/crawldb -linkdb crawldata/linkdb crawldata/segments/20170706210640
```
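Once documents are indexed, you can sanity-check the result from the host by querying Elasticsearch directly. A minimal sketch, assuming the defaults from [config/nutch-site.xml](./config/nutch-site.xml): the `indexer-elastic-rest` plugin writes to the `nutch_rest` index on port 9200, which docker-compose.yml publishes:

```bash
# Count the documents Nutch has indexed (index name from elastic.rest.index)
curl -s 'http://localhost:9200/nutch_rest/_count?pretty'

# Simple full-text query over the crawled pages
curl -s 'http://localhost:9200/nutch_rest/_search?q=content:smartive&pretty'
```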
# Credits

This Dockerfile and docker-compose setup is partly based on [tpickett/mongo-elasticsearch-nutch](https://github.com/tpickett/mongo-elasticsearch-nutch).

[Apache Nutch](http://nutch.apache.org/) is a highly extensible and scalable open source web crawler software project. It is a well-matured, production-ready crawler.
--------------------------------------------------------------------------------
/config/ivy.xml:
--------------------------------------------------------------------------------
<!-- Ivy dependency descriptor for the Nutch build. Module description:
     "Nutch is an open source web-search software. It builds on Hadoop, Tika
     and Solr, adding web-specifics, such as a crawler, a link-graph
     database etc."
     (dependency listing omitted) -->
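docker-compose.yml mounts this file over `/root/nutch/ivy/ivy.xml` in the container, but dependency changes only take effect after an Ant/Ivy rebuild. A rough sketch inside the container, assuming the source checkout at `/root/nutch_source` created by the [Dockerfile](./nutch/Dockerfile):

```bash
# The compose file mounts ivy.xml into the runtime directory; copy it into
# the source tree so the Ant/Ivy resolve picks it up, then rebuild
# runtime/local (which /root/nutch symlinks to)
cp /root/nutch/ivy/ivy.xml /root/nutch_source/ivy/ivy.xml
cd /root/nutch_source && ant runtime
```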
--------------------------------------------------------------------------------
/config/nutch-site.xml:
--------------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Test Crawler</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|index-(basic|anchor|metadata)|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Tika parsing -->
  <property>
    <name>tika.uppercase.element.names</name>
    <value>true</value>
    <description>Determines whether TikaParser should uppercase the element name
    while generating the DOM for a page, as done by Neko (used per default by
    parse-html) (see NUTCH-1592).</description>
  </property>

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
    <description>Which text extraction algorithm to use. Valid values are:
    boilerpipe or none.</description>
  </property>

  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
    <description>Which Boilerpipe algorithm to use. Valid values are:
    DefaultExtractor, ArticleExtractor or CanolaExtractor.</description>
  </property>

  <!-- Elasticsearch via TransportClient -->
  <property>
    <name>elastic.host</name>
    <value>elasticsearch</value>
    <description>Comma-separated list of hostnames to send documents to using
    TransportClient. Either host and port must be defined, or cluster.</description>
  </property>

  <property>
    <name>elastic.port</name>
    <value>9300</value>
    <description>The port to connect to using TransportClient.</description>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover. Either host and port must be
    defined, or cluster.</description>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch_test</value>
    <description>Default index to send documents to.</description>
  </property>

  <property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>

  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>

  <property>
    <name>elastic.exponential.backoff.millis</name>
    <value>100</value>
    <description>Initial delay for the BulkProcessor's exponential backoff
    policy.</description>
  </property>

  <property>
    <name>elastic.exponential.backoff.retries</name>
    <value>10</value>
    <description>Number of times the BulkProcessor's exponential backoff policy
    should retry bulk operations.</description>
  </property>

  <property>
    <name>elastic.bulk.close.timeout</name>
    <value>600</value>
    <description>Number of seconds allowed for the BulkProcessor to complete its
    last operation.</description>
  </property>

  <!-- Elasticsearch via REST (indexer-elastic-rest, Jest) -->
  <property>
    <name>elastic.rest.host</name>
    <value>elasticsearch</value>
    <description>The hostname to send documents to using Elasticsearch Jest.
    Both host and port must be defined.</description>
  </property>

  <property>
    <name>elastic.rest.port</name>
    <value>9200</value>
    <description>The port to connect to using Elasticsearch Jest.</description>
  </property>

  <property>
    <name>elastic.rest.index</name>
    <value>nutch_rest</value>
    <description>Default index to send documents to.</description>
  </property>

  <property>
    <name>elastic.rest.type</name>
    <value>doc</value>
    <description>Default type to send documents to.</description>
  </property>

  <property>
    <name>elastic.rest.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>

  <property>
    <name>elastic.rest.max.bulk.size</name>
    <value>26214400</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>

  <property>
    <name>elastic.rest.https</name>
    <value>false</value>
    <description>
    "true" to enable https, "false" to disable https.
    If you've disabled http access (by forcing https), be sure to
    set this to true, otherwise you might get "connection reset by peer".
    </description>
  </property>

  <property>
    <name>elastic.rest.user</name>
    <value></value>
    <description>Username for auth credentials (only used when https is
    enabled).</description>
  </property>

  <property>
    <name>elastic.rest.password</name>
    <value></value>
    <description>Password for auth credentials (only used when https is
    enabled).</description>
  </property>

  <property>
    <name>elastic.rest.trustallhostnames</name>
    <value>false</value>
    <description>
    "true" to trust the elasticsearch server's certificate even if its listed
    domain name does not match the domain it is hosted on;
    "false" to check whether the certificate's listed domain matches the domain
    the server is hosted on, and to fail indexing if it does not
    (only used when https is enabled).
    </description>
  </property>

  <!-- Metatag extraction and indexing -->
  <property>
    <name>metatags.names</name>
    <value>author,description,keywords,image</value>
    <description>Names of the metatags to extract, separated by ','.
    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
    in the parse-metadata. For instance, to index description and keywords,
    you need to activate the plugin index-metadata and set the value of the
    parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.</description>
  </property>

  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords,metatag.author,metatag.image</value>
    <description>
    Comma-separated list of keys to be taken from the parse metadata to generate
    fields. Can be used e.g. for 'description' or 'keywords', provided that these
    values are generated by a parser (see the parse-metatags plugin).
    </description>
  </property>

  <property>
    <name>index.content.md</name>
    <value></value>
    <description>
    Comma-separated list of keys to be taken from the content metadata to
    generate fields.
    </description>
  </property>

  <property>
    <name>index.db.md</name>
    <value></value>
    <description>
    Comma-separated list of keys to be taken from the crawldb metadata to
    generate fields. Can be used to index values propagated from the seeds
    with the plugin urlmeta.
    </description>
  </property>

  <!-- Selenium fetching -->
  <property>
    <name>selenium.driver</name>
    <value>phantomjs</value>
    <description>
    A String value representing the flavour of Selenium
    WebDriver() to use. Currently the following options
    exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs', and 'remote'.
    If 'remote' is used, it is essential to also set correct properties for
    'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host' and
    'selenium.hub.protocol'.
    </description>
  </property>

  <property>
    <name>libselenium.page.load.delay</name>
    <value>10</value>
    <description>
    The delay in seconds to use when loading a page with lib-selenium. This
    setting is used by protocol-selenium and protocol-interactiveselenium
    since they depend on lib-selenium for fetching.
    </description>
  </property>

</configuration>
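With the stack running, the nutchserver exposes its configuration over REST, which is handy for checking which of these settings were actually picked up. A sketch against the REST API on its default port 8081 (published by docker-compose.yml), assuming the standard `/config` endpoints:

```bash
# List the configuration profiles known to the nutchserver
curl -s 'http://localhost:8081/config'

# Dump the effective settings of the default profile and filter for the
# Elasticsearch-related keys defined above
curl -s 'http://localhost:8081/config/default' | tr ',' '\n' | grep -i elastic
```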
--------------------------------------------------------------------------------
/config/regex-urlfilter.txt:
--------------------------------------------------------------------------------
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|svg|SVG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip twitter
-^https://twitter\.com

# skip facebook
-^https://www\.facebook\.com

# accept anything else
+.
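Filter rules are easy to get wrong, so it pays to test them before crawling. Nutch's stock `bin/nutch filterchecker` tool runs URLs from stdin through the active URL filter chain; a sketch inside the container, where accepted URLs should come back prefixed with `+` and rejected ones with `-`:

```bash
# With the smartive.ch rule from the README in place, the first URL should
# be accepted (+) and the twitter URL rejected (-) by the rule above
printf 'https://smartive.ch/\nhttps://twitter.com/smartive\n' \
  | /root/nutch/bin/nutch filterchecker -allCombined
```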
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
nutch:
  image: smartive/nutch
  ports:
    - "8080:8080"
    - "8081:8081"
  links:
    - "mongodb:mongodb"
    - "elasticsearch:elasticsearch"
  volumes:
    - "./config/ivy.xml:/root/nutch/ivy/ivy.xml"
    - "./config/nutch-site.xml:/root/nutch/conf/nutch-site.xml"
    - "./config/regex-urlfilter.txt:/root/nutch/conf/regex-urlfilter.txt"
    - "./crawldata:/root/crawldata"
mongodb:
  image: mongo
  ports:
    - "27020:27017"
  volumes:
    - "./data/mongo:/data/db"
elasticsearch:
  image: elasticsearch:2.3.3
  ports:
    - "9200:9200"
    - "9300:9300"
--------------------------------------------------------------------------------
/nutch/Dockerfile:
--------------------------------------------------------------------------------
# Based on https://raw.githubusercontent.com/apache/nutch/master/docker/Dockerfile

FROM java:8
MAINTAINER smartive AG

ENV NUTCH_HOME /root/nutch
ENV PHANTOM_JS phantomjs-2.1.1-linux-x86_64

WORKDIR /root/

# Update and upgrade the base image packages
RUN apt-get update && \
    apt-get upgrade -y

# Add the repository that we'll pull java down from.
#RUN add-apt-repository -y ppa:webupd8team/java && apt-get update && apt-get upgrade -y

# Get Oracle Java 1.7 installed
#RUN echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections && apt-get install -y oracle-java7-installer oracle-java7-set-default

# Install various dependencies
RUN apt-get install -y \
    ant \
    openssh-server \
    vim \
    telnet \
    git \
    rsync \
    curl \
    build-essential \
    chrpath \
    libssl-dev \
    libxft-dev \
    libfreetype6 \
    libfreetype6-dev \
    libfontconfig1 \
    libfontconfig1-dev

# Set up JAVA_HOME
#RUN echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' >> $HOME/.bashrc

# Install PhantomJS
RUN wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.bz2 && \
    tar xvjf $PHANTOM_JS.tar.bz2 && \
    mv $PHANTOM_JS /usr/local/share && \
    ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin

# Check out and build the nutch trunk
RUN wget https://github.com/apache/nutch/archive/master.zip && unzip master.zip && mv nutch-master nutch_source && cd nutch_source && ant

# Convenience symlink to Nutch runtime local
RUN ln -s nutch_source/runtime/local $NUTCH_HOME

ADD startup.sh /root/startup.sh
RUN chmod +x /root/startup.sh

ENTRYPOINT ["/root/startup.sh"]
--------------------------------------------------------------------------------
/nutch/startup.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Start the nutchserver (REST API) in the background
$NUTCH_HOME/bin/nutch nutchserver > /dev/null &
# Start the nutch web gui in the foreground to keep the container alive
$NUTCH_HOME/bin/nutch webapp
--------------------------------------------------------------------------------
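Because `startup.sh` discards the nutchserver output, a quick probe of the published ports is the easiest way to confirm that both processes came up. A sketch from the host, assuming the default ports mapped in docker-compose.yml (8081 for the nutchserver REST API, 8080 for the web GUI):

```bash
# Nutch REST API; should answer with JSON status info
curl -s 'http://localhost:8081/admin'

# Web GUI; print just the HTTP status code (expect 200)
curl -s -o /dev/null -w '%{http_code}\n' 'http://localhost:8080/'
```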