├── .gitignore
├── README.md
├── config
│   ├── ivy.xml
│   ├── nutch-site.xml
│   └── regex-urlfilter.txt
├── docker-compose.yml
└── nutch
    ├── Dockerfile
    └── startup.sh
/.gitignore:
--------------------------------------------------------------------------------
/crawldata
/data
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Apache Nutch, Elasticsearch, MongoDB
This repo contains 1) a Dockerfile build for Apache Nutch and 2) a docker-compose setup for use with Elasticsearch and MongoDB.

Info: MongoDB is currently not attached or used.

## Apache Nutch Docker Build
The [Dockerfile](./nutch/Dockerfile) provides a Docker build of Apache Nutch, published as [smartive/nutch](https://hub.docker.com/r/smartive/nutch/).
There are two published builds:
- `latest` contains [Apache Nutch v1.13](https://github.com/apache/nutch/tree/release-1.13) for Elasticsearch 2.3.*
- `es-5` contains a [modified version of Apache Nutch v1.13](https://github.com/smartive/nutch/tree/feature/es-5) ready for Elasticsearch 5.4.*

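To run the `es-5` build against an Elasticsearch 5.4.* cluster, the image tags in [docker-compose.yml](./docker-compose.yml) can be swapped. A minimal sketch (the `smartive/nutch:es-5` tag follows the naming above; treat the exact `elasticsearch:5.4` tag as an assumption; MongoDB is omitted since it is unused):

```yaml
nutch:
  image: smartive/nutch:es-5
  ports:
    - "8080:8080"
    - "8081:8081"
  links:
    - "elasticsearch:elasticsearch"
elasticsearch:
  image: elasticsearch:5.4
  ports:
    - "9200:9200"
    - "9300:9300"
```
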
## Apache Nutch docker-compose Setup for Elasticsearch 2.3.* / 5.4.* and MongoDB

This repo, [docker-nutch-elasticsearch-mongodb](https://github.com/smartive/docker-nutch-elasticsearch-mongodb), contains a [docker-compose](https://github.com/smartive/docker-nutch-elasticsearch-mongodb/blob/master/docker-compose.yml) configuration for Apache Nutch with Elasticsearch 2.3.* / 5.4.* and MongoDB.

To get started, check out the [repo](https://github.com/smartive/docker-nutch-elasticsearch-mongodb) and run:

```bash
git clone git@github.com:smartive/docker-nutch-elasticsearch-mongodb.git
cd ./docker-nutch-elasticsearch-mongodb && docker-compose up
```

This fires up the nutchserver (REST API) and the webapp. Visit [http://localhost:8080/](http://localhost:8080/).

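To check that both endpoints came up, they can be probed from the host; a quick smoke test (assuming the nutchserver REST API listens on its default port 8081 and serves the `/admin` status endpoint):

```bash
# Webapp should answer with HTTP 200 on port 8080
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
# Nutchserver status as JSON on port 8081
curl -s http://localhost:8081/admin
```
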
### Manual Run

```bash
docker-compose run -p 8080:8080 -p 8081:8081 --name=manual_nutch --rm --entrypoint=bash nutch
```

Then, inside the container, create the seed file:
```bash
echo "https://smartive.ch/" > seed.txt
```
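
A seed file simply lists one start URL per line, so several seeds can be written at once (the second URL is purely illustrative):
```bash
printf 'https://smartive.ch/\nhttps://example.org/\n' > seed.txt
```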

Then open `regex-urlfilter.txt` and replace the last line to limit the crawl to the domain `smartive.ch`:
```bash
vi nutch/conf/regex-urlfilter.txt
# Inside regex-urlfilter.txt replace the last line `+.` with:
+^https://smartive\.ch
```

Then start the crawl (`-i` indexes into Elasticsearch while crawling, `-s` points to the seed file, `crawldata` is the crawl directory, and `2` is the number of crawl rounds):
```bash
nutch/bin/crawl -i -s seed.txt crawldata 2
```

To only index into Elasticsearch from an existing crawl database (replace the segment timestamp with one from your own crawl):
```bash
/root/nutch/bin/nutch index crawldata/crawldb -linkdb crawldata/linkdb crawldata/segments/20170706210640
```

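Once documents have been indexed, the result can be inspected directly in Elasticsearch; the index name `nutch_rest` is set via `elastic.rest.index` in [nutch-site.xml](./config/nutch-site.xml):

```bash
# Number of documents in the index
curl -s 'http://localhost:9200/nutch_rest/_count?pretty'
# Simple full-text query
curl -s 'http://localhost:9200/nutch_rest/_search?q=smartive&pretty'
```
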
## Credits
This Dockerfile and docker-compose setup are partly based on [tpickett/mongo-elasticsearch-nutch](https://github.com/tpickett/mongo-elasticsearch-nutch).

[Apache Nutch](http://nutch.apache.org/) is a highly extensible and scalable open source web crawler software project: a well-matured, production-ready crawler.
--------------------------------------------------------------------------------
/config/ivy.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="nutch">
    <description>
        Nutch is an open source web-search
        software. It builds on
        Hadoop, Tika and Solr, adding web-specifics,
        such as a crawler, a link-graph
        database etc.
    </description>
  </info>
  <!-- configurations, publications and dependency declarations omitted -->
</ivy-module>
--------------------------------------------------------------------------------
/config/nutch-site.xml:
--------------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>

  <property>
    <name>http.agent.name</name>
    <value>Test Crawler</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-selenium|urlfilter-regex|index-(basic|anchor|metadata)|query-(basic|site|url|lang)|indexer-elastic-rest|parse-(text|html|tika|metatags)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <property>
    <name>tika.uppercase.element.names</name>
    <value>true</value>
    <description>Determines whether TikaParser should uppercase the element name while generating the DOM
    for a page, as done by Neko (used per default by parse-html)(see NUTCH-1592).
    </description>
  </property>

  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
    <description>
    Which text extraction algorithm to use. Valid values are: boilerpipe or none.
    </description>
  </property>

  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
    <description>
    Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
    or CanolaExtractor.
    </description>
  </property>

  <!-- Elasticsearch indexer (TransportClient) -->

  <property>
    <name>elastic.host</name>
    <value>elasticsearch</value>
    <description>Comma-separated list of hostnames to send documents to using
    TransportClient. Either host and port must be defined or cluster.
    </description>
  </property>

  <property>
    <name>elastic.port</name>
    <value>9300</value>
    <description>The port to connect to using TransportClient.</description>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
    <description>The cluster name to discover. Either host and port must be defined
    or cluster.
    </description>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch_test</value>
    <description>Default index to send documents to.</description>
  </property>

  <property>
    <name>elastic.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>

  <property>
    <name>elastic.max.bulk.size</name>
    <value>2500500</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>

  <property>
    <name>elastic.exponential.backoff.millis</name>
    <value>100</value>
    <description>Initial delay for the BulkProcessor's exponential backoff policy.</description>
  </property>

  <property>
    <name>elastic.exponential.backoff.retries</name>
    <value>10</value>
    <description>Number of times the BulkProcessor's exponential backoff policy
    should retry bulk operations.
    </description>
  </property>

  <property>
    <name>elastic.bulk.close.timeout</name>
    <value>600</value>
    <description>Number of seconds allowed for the BulkProcessor to complete its
    last operation.
    </description>
  </property>

  <!-- Elasticsearch REST indexer (Jest) -->

  <property>
    <name>elastic.rest.host</name>
    <value>elasticsearch</value>
    <description>The hostname to send documents to using Elasticsearch Jest. Both host
    and port must be defined.
    </description>
  </property>

  <property>
    <name>elastic.rest.port</name>
    <value>9200</value>
    <description>The port to connect to using Elasticsearch Jest.</description>
  </property>

  <property>
    <name>elastic.rest.index</name>
    <value>nutch_rest</value>
    <description>Default index to send documents to.</description>
  </property>

  <property>
    <name>elastic.rest.type</name>
    <value>doc</value>
    <description>Default type to send documents to.</description>
  </property>

  <property>
    <name>elastic.rest.max.bulk.docs</name>
    <value>250</value>
    <description>Maximum size of the bulk in number of documents.</description>
  </property>

  <property>
    <name>elastic.rest.max.bulk.size</name>
    <value>26214400</value>
    <description>Maximum size of the bulk in bytes.</description>
  </property>

  <property>
    <name>elastic.rest.https</name>
    <value>false</value>
    <description>
    "true" to enable https, "false" to disable https
    If you've disabled http access (by forcing https), be sure to
    set this to true, otherwise you might get "connection reset by peer".
    </description>
  </property>

  <property>
    <name>elastic.rest.user</name>
    <value></value>
    <description>Username for auth credentials (only used when https is enabled)</description>
  </property>

  <property>
    <name>elastic.rest.password</name>
    <value></value>
    <description>Password for auth credentials (only used when https is enabled)</description>
  </property>

  <property>
    <name>elastic.rest.trustallhostnames</name>
    <value>false</value>
    <description>
    "true" to trust elasticsearch server's certificate even if its listed domain name does not
    match the domain they are hosted on
    "false" to check if the elasticsearch server's certificate's listed domain is the same domain
    that it is hosted on, and if it doesn't, then fail to index
    (only used when https is enabled)
    </description>
  </property>

  <!-- Metatag extraction and indexing -->

  <property>
    <name>metatags.names</name>
    <value>author,description,keywords,image,</value>
    <description>Names of the metatags to extract, separated by ','.
    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
    in the parse-metadata. For instance to index description and keywords,
    you need to activate the plugin index-metadata and set the value of the
    parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
    </description>
  </property>

  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords,metatag.author,metatag.image</value>
    <description>
    Comma-separated list of keys to be taken from the parse metadata to generate fields.
    Can be used e.g. for 'description' or 'keywords' provided that these values are generated
    by a parser (see parse-metatags plugin)
    </description>
  </property>

  <property>
    <name>index.content.md</name>
    <value></value>
    <description>
    Comma-separated list of keys to be taken from the content metadata to generate fields.
    </description>
  </property>

  <property>
    <name>index.db.md</name>
    <value></value>
    <description>
    Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
    Can be used to index values propagated from the seeds with the plugin urlmeta
    </description>
  </property>

  <!-- Selenium-based fetching -->

  <property>
    <name>selenium.driver</name>
    <value>phantomjs</value>
    <description>
    A String value representing the flavour of Selenium
    WebDriver() to use. Currently the following options
    exist - 'firefox', 'chrome', 'safari', 'opera', 'phantomjs', and 'remote'.
    If 'remote' is used it is essential to also set correct properties for
    'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host' and
    'selenium.hub.protocol'.
    </description>
  </property>

  <property>
    <name>libselenium.page.load.delay</name>
    <value>10</value>
    <description>
    The delay in seconds to use when loading a page with lib-selenium. This
    setting is used by protocol-selenium and protocol-interactiveselenium
    since they depend on lib-selenium for fetching.
    </description>
  </property>

</configuration>
--------------------------------------------------------------------------------
/config/regex-urlfilter.txt:
--------------------------------------------------------------------------------
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|svg|SVG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# skip twitter
-^https://twitter\.com

# skip facebook
-^https://www\.facebook\.com

# accept anything else
+.
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
nutch:
  image: smartive/nutch
  ports:
    - "8080:8080"
    - "8081:8081"
  links:
    - "mongodb:mongodb"
    - "elasticsearch:elasticsearch"
  volumes:
    - "./config/ivy.xml:/root/nutch/ivy/ivy.xml"
    - "./config/nutch-site.xml:/root/nutch/conf/nutch-site.xml"
    - "./config/regex-urlfilter.txt:/root/nutch/conf/regex-urlfilter.txt"
    - "./crawldata:/root/crawldata"
mongodb:
  image: mongo
  ports:
    - "27020:27017"
  volumes:
    - "./data/mongo:/data/db"
elasticsearch:
  image: elasticsearch:2.3.3
  ports:
    - "9200:9200"
    - "9300:9300"
--------------------------------------------------------------------------------
/nutch/Dockerfile:
--------------------------------------------------------------------------------
# Based on https://raw.githubusercontent.com/apache/nutch/master/docker/Dockerfile

FROM java:8
MAINTAINER smartive AG

ENV NUTCH_HOME /root/nutch
ENV PHANTOM_JS phantomjs-2.1.1-linux-x86_64

WORKDIR /root/

# Update the base image packages
RUN apt-get update && \
    apt-get upgrade -y

# Install build and runtime dependencies
RUN apt-get install -y \
    ant \
    openssh-server \
    vim \
    telnet \
    git \
    rsync \
    curl \
    build-essential \
    chrpath \
    libssl-dev \
    libxft-dev \
    libfreetype6 \
    libfreetype6-dev \
    libfontconfig1 \
    libfontconfig1-dev

# Install PhantomJS
RUN wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.bz2 && \
    tar xvjf $PHANTOM_JS.tar.bz2 && \
    mv $PHANTOM_JS /usr/local/share && \
    ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin

# Download and build the Nutch trunk
RUN wget https://github.com/apache/nutch/archive/master.zip && \
    unzip master.zip && \
    mv nutch-master nutch_source && \
    cd nutch_source && \
    ant

# Convenience symlink to the Nutch runtime local directory
RUN ln -s nutch_source/runtime/local $NUTCH_HOME

ADD startup.sh /root/startup.sh
RUN chmod +x /root/startup.sh

ENTRYPOINT ["/root/startup.sh"]
--------------------------------------------------------------------------------
/nutch/startup.sh:
--------------------------------------------------------------------------------
#!/bin/bash

# Start the Nutch server in the background for control via the REST API
$NUTCH_HOME/bin/nutch nutchserver > /dev/null &
# Start the Nutch web GUI in the foreground (keeps the container running)
$NUTCH_HOME/bin/nutch webapp
--------------------------------------------------------------------------------