├── .gitignore ├── LICENSE.txt ├── README.md ├── lib-selenium ├── build.xml ├── ivy.xml ├── plugin.xml └── src │ ├── java │ └── org │ │ └── apache │ │ └── nutch │ │ └── protocol │ │ └── selenium │ │ └── HttpWebClient.java │ └── pom.xml └── protocol-selenium ├── .idea ├── .name ├── compiler.xml ├── copyright │ └── profiles_settings.xml ├── encodings.xml ├── misc.xml ├── modules.xml ├── scopes │ └── scope_settings.xml ├── vcs.xml └── workspace.xml ├── build.xml ├── ivy.xml ├── plugin.xml └── src ├── java └── org │ └── apache │ └── nutch │ └── protocol │ └── selenium │ ├── Http.java │ ├── HttpResponse.java │ └── package.html ├── pom.xml └── target └── classes └── org └── apache └── nutch └── protocol └── htmlunit └── package.html /.gitignore: -------------------------------------------------------------------------------- 1 | *.iml 2 | *.DS_Store 3 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | 2 | Apache License 3 | Version 2.0, January 2004 4 | http://www.apache.org/licenses/ 5 | 6 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 7 | 8 | 1. Definitions. 9 | 10 | "License" shall mean the terms and conditions for use, reproduction, 11 | and distribution as defined by Sections 1 through 9 of this document. 12 | 13 | "Licensor" shall mean the copyright owner or entity authorized by 14 | the copyright owner that is granting the License. 15 | 16 | "Legal Entity" shall mean the union of the acting entity and all 17 | other entities that control, are controlled by, or are under common 18 | control with that entity. For the purposes of this definition, 19 | "control" means (i) the power, direct or indirect, to cause the 20 | direction or management of such entity, whether by contract or 21 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 22 | outstanding shares, or (iii) beneficial ownership of such entity. 
23 | 24 | "You" (or "Your") shall mean an individual or Legal Entity 25 | exercising permissions granted by this License. 26 | 27 | "Source" form shall mean the preferred form for making modifications, 28 | including but not limited to software source code, documentation 29 | source, and configuration files. 30 | 31 | "Object" form shall mean any form resulting from mechanical 32 | transformation or translation of a Source form, including but 33 | not limited to compiled object code, generated documentation, 34 | and conversions to other media types. 35 | 36 | "Work" shall mean the work of authorship, whether in Source or 37 | Object form, made available under the License, as indicated by a 38 | copyright notice that is included in or attached to the work 39 | (an example is provided in the Appendix below). 40 | 41 | "Derivative Works" shall mean any work, whether in Source or Object 42 | form, that is based on (or derived from) the Work and for which the 43 | editorial revisions, annotations, elaborations, or other modifications 44 | represent, as a whole, an original work of authorship. For the purposes 45 | of this License, Derivative Works shall not include works that remain 46 | separable from, or merely link (or bind by name) to the interfaces of, 47 | the Work and Derivative Works thereof. 48 | 49 | "Contribution" shall mean any work of authorship, including 50 | the original version of the Work and any modifications or additions 51 | to that Work or Derivative Works thereof, that is intentionally 52 | submitted to Licensor for inclusion in the Work by the copyright owner 53 | or by an individual or Legal Entity authorized to submit on behalf of 54 | the copyright owner. 
For the purposes of this definition, "submitted" 55 | means any form of electronic, verbal, or written communication sent 56 | to the Licensor or its representatives, including but not limited to 57 | communication on electronic mailing lists, source code control systems, 58 | and issue tracking systems that are managed by, or on behalf of, the 59 | Licensor for the purpose of discussing and improving the Work, but 60 | excluding communication that is conspicuously marked or otherwise 61 | designated in writing by the copyright owner as "Not a Contribution." 62 | 63 | "Contributor" shall mean Licensor and any individual or Legal Entity 64 | on behalf of whom a Contribution has been received by Licensor and 65 | subsequently incorporated within the Work. 66 | 67 | 2. Grant of Copyright License. Subject to the terms and conditions of 68 | this License, each Contributor hereby grants to You a perpetual, 69 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 70 | copyright license to reproduce, prepare Derivative Works of, 71 | publicly display, publicly perform, sublicense, and distribute the 72 | Work and such Derivative Works in Source or Object form. 73 | 74 | 3. Grant of Patent License. Subject to the terms and conditions of 75 | this License, each Contributor hereby grants to You a perpetual, 76 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 77 | (except as stated in this section) patent license to make, have made, 78 | use, offer to sell, sell, import, and otherwise transfer the Work, 79 | where such license applies only to those patent claims licensable 80 | by such Contributor that are necessarily infringed by their 81 | Contribution(s) alone or by combination of their Contribution(s) 82 | with the Work to which such Contribution(s) was submitted. 
If You 83 | institute patent litigation against any entity (including a 84 | cross-claim or counterclaim in a lawsuit) alleging that the Work 85 | or a Contribution incorporated within the Work constitutes direct 86 | or contributory patent infringement, then any patent licenses 87 | granted to You under this License for that Work shall terminate 88 | as of the date such litigation is filed. 89 | 90 | 4. Redistribution. You may reproduce and distribute copies of the 91 | Work or Derivative Works thereof in any medium, with or without 92 | modifications, and in Source or Object form, provided that You 93 | meet the following conditions: 94 | 95 | (a) You must give any other recipients of the Work or 96 | Derivative Works a copy of this License; and 97 | 98 | (b) You must cause any modified files to carry prominent notices 99 | stating that You changed the files; and 100 | 101 | (c) You must retain, in the Source form of any Derivative Works 102 | that You distribute, all copyright, patent, trademark, and 103 | attribution notices from the Source form of the Work, 104 | excluding those notices that do not pertain to any part of 105 | the Derivative Works; and 106 | 107 | (d) If the Work includes a "NOTICE" text file as part of its 108 | distribution, then any Derivative Works that You distribute must 109 | include a readable copy of the attribution notices contained 110 | within such NOTICE file, excluding those notices that do not 111 | pertain to any part of the Derivative Works, in at least one 112 | of the following places: within a NOTICE text file distributed 113 | as part of the Derivative Works; within the Source form or 114 | documentation, if provided along with the Derivative Works; or, 115 | within a display generated by the Derivative Works, if and 116 | wherever such third-party notices normally appear. The contents 117 | of the NOTICE file are for informational purposes only and 118 | do not modify the License. 
You may add Your own attribution 119 | notices within Derivative Works that You distribute, alongside 120 | or as an addendum to the NOTICE text from the Work, provided 121 | that such additional attribution notices cannot be construed 122 | as modifying the License. 123 | 124 | You may add Your own copyright statement to Your modifications and 125 | may provide additional or different license terms and conditions 126 | for use, reproduction, or distribution of Your modifications, or 127 | for any such Derivative Works as a whole, provided Your use, 128 | reproduction, and distribution of the Work otherwise complies with 129 | the conditions stated in this License. 130 | 131 | 5. Submission of Contributions. Unless You explicitly state otherwise, 132 | any Contribution intentionally submitted for inclusion in the Work 133 | by You to the Licensor shall be under the terms and conditions of 134 | this License, without any additional terms or conditions. 135 | Notwithstanding the above, nothing herein shall supersede or modify 136 | the terms of any separate license agreement you may have executed 137 | with Licensor regarding such Contributions. 138 | 139 | 6. Trademarks. This License does not grant permission to use the trade 140 | names, trademarks, service marks, or product names of the Licensor, 141 | except as required for reasonable and customary use in describing the 142 | origin of the Work and reproducing the content of the NOTICE file. 143 | 144 | 7. Disclaimer of Warranty. Unless required by applicable law or 145 | agreed to in writing, Licensor provides the Work (and each 146 | Contributor provides its Contributions) on an "AS IS" BASIS, 147 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 148 | implied, including, without limitation, any warranties or conditions 149 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 150 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 151 | appropriateness of using or redistributing the Work and assume any 152 | risks associated with Your exercise of permissions under this License. 153 | 154 | 8. Limitation of Liability. In no event and under no legal theory, 155 | whether in tort (including negligence), contract, or otherwise, 156 | unless required by applicable law (such as deliberate and grossly 157 | negligent acts) or agreed to in writing, shall any Contributor be 158 | liable to You for damages, including any direct, indirect, special, 159 | incidental, or consequential damages of any character arising as a 160 | result of this License or out of the use or inability to use the 161 | Work (including but not limited to damages for loss of goodwill, 162 | work stoppage, computer failure or malfunction, or any and all 163 | other commercial damages or losses), even if such Contributor 164 | has been advised of the possibility of such damages. 165 | 166 | 9. Accepting Warranty or Additional Liability. While redistributing 167 | the Work or Derivative Works thereof, You may choose to offer, 168 | and charge a fee for, acceptance of support, warranty, indemnity, 169 | or other liability obligations and/or rights consistent with this 170 | License. However, in accepting such obligations, You may act only 171 | on Your own behalf and on Your sole responsibility, not on behalf 172 | of any other Contributor, and only if You agree to indemnify, 173 | defend, and hold each Contributor harmless for any liability 174 | incurred by, or claims asserted against, such Contributor by reason 175 | of your accepting any such warranty or additional liability. 176 | 177 | END OF TERMS AND CONDITIONS 178 | 179 | APPENDIX: How to apply the Apache License to your work. 180 | 181 | To apply the Apache License to your work, attach the following 182 | boilerplate notice, with the fields enclosed by brackets "[]" 183 | replaced with your own identifying information. 
(Don't include 184 | the brackets!) The text should be enclosed in the appropriate 185 | comment syntax for the file format. We also recommend that a 186 | file or class name and description of purpose be included on the 187 | same "printed page" as the copyright notice for easier 188 | identification within third-party archives. 189 | 190 | Copyright [yyyy] [name of copyright owner] 191 | 192 | Licensed under the Apache License, Version 2.0 (the "License"); 193 | you may not use this file except in compliance with the License. 194 | You may obtain a copy of the License at 195 | 196 | http://www.apache.org/licenses/LICENSE-2.0 197 | 198 | Unless required by applicable law or agreed to in writing, software 199 | distributed under the License is distributed on an "AS IS" BASIS, 200 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 201 | See the License for the specific language governing permissions and 202 | limitations under the License. 203 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Nutch Selenium 2 | ============== 3 | 4 | This plugin allows you to fetch JavaScript pages using Selenium, while relying on the rest of the awesome Nutch stack! This allows you to 5 | 6 | A) Leverage Nutch, a world-class web crawler 7 | 8 | B) Not have to use some paid service just to perform large-scale JavaScript/Ajax-aware web crawls 9 | 10 | C) Not have to wait another 2 years for Nutch to patch in the [Ajax crawler hashbang workaround](https://issues.apache.org/jira/browse/NUTCH-1323), and then not have to patch that again to cover the use case of amending the original URL with the hashbang workaround's content. 11 | 12 | The underlying code is based on the nutch-htmlunit plugin, which was in turn based on nutch-httpclient.
I also have patches to send through on nutch-htmlunit which get it working with Nutch 2.2.1, so stay tuned if you want to use htmlunit for some reason. 13 | 14 | 15 | ## IMPORTANT NOTES: 16 | 17 | ~~This plugin is currently being merged into the Nutch Core - see [issue #1933 on Nutch's JIRA](https://issues.apache.org/jira/browse/NUTCH-1933)~~ 18 | 19 | 1. This plugin is currently in the Nutch core. See [lib-selenium](https://github.com/apache/nutch/tree/master/src/plugin/lib-selenium) and [protocol-selenium](https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium). 20 | 21 | 2. As a result of #1, this plugin is unsupported on GitHub. Please see the [Nutch JIRA](https://issues.apache.org/jira/browse/NUTCH/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel) for issues. 22 | 23 | ## Installation (tested on Ubuntu 14.0x) 24 | 25 | Part 1: Setting up Selenium 26 | 27 | A) Ensure that you have Firefox installed 28 | ``` 29 | # More info about the package @ [launchpad](https://launchpad.net/ubuntu/trusty/+source/firefox) 30 | 31 | sudo apt-get install firefox 32 | ``` 33 | B) Install Xvfb and its associates 34 | ``` 35 | sudo apt-get install xorg synaptic xvfb gtk2-engines-pixbuf xfonts-cyrillic xfonts-100dpi \ 36 | xfonts-75dpi xfonts-base xfonts-scalable freeglut3-dev dbus-x11 openbox x11-xserver-utils \ 37 | libxrender1 cabextract 38 | ``` 39 | C) Set a display for Xvfb, so that Firefox believes a display is connected 40 | ``` 41 | sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 & 42 | export DISPLAY=:11 43 | ``` 44 | Part 2: Installing plugin for Nutch (where NUTCH_HOME is the root of your Nutch install) 45 | 46 | A) Add Selenium to your Nutch dependencies 47 | ``` 48 | 49 | 50 | 51 | 52 | ... 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | ``` 63 | B) Add the required plugins to your `NUTCH_HOME/src/plugin/build.xml` 64 | ``` 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | ... 73 | 74 | 75 | 76 | ... 
77 | 78 | ``` 79 | C) Ensure that the plugin will be used as the fetcher/initial parser in your config 80 | ``` 81 | 82 | 83 | 84 | ... 85 | 86 | plugin.includes 87 | protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic 88 | Regular expression naming plugin directory names to 89 | include. Any plugin not matching this expression is excluded. 90 | In any case you need at least include the nutch-extensionpoints plugin. By 91 | default Nutch includes crawling just HTML and plain text via HTTP, 92 | and basic indexing and search plugins. In order to use HTTPS please enable 93 | protocol-httpclient, but be aware of possible intermittent problems with the 94 | underlying commons-httpclient library. 95 | 96 | 97 | ``` 98 | D) Add the plugin folders to your installation's `NUTCH_HOME/src/plugin` directory 99 | 100 | ![Nutch plugin directory](http://i.imgur.com/CzLqoqO.png) 101 | 102 | E) Compile nutch 103 | ``` 104 | ant runtime 105 | ``` 106 | 107 | F) Start your web crawl (Ensure that you followed the above steps and have started your xvfb display as shown above) 108 | ``` 109 | NUTCH_HOME/runtime/local/bin/crawl /opt/apache-nutch-2.2.1/urls/ webpage $NUTCH_SOLR_SERVER $NUTCH_CRAWL_DEPTH 110 | ``` 111 | 112 | -------------------------------------------------------------------------------- /lib-selenium/build.xml: -------------------------------------------------------------------------------- 1 | 2 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | -------------------------------------------------------------------------------- /lib-selenium/ivy.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | Apache Nutch 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /lib-selenium/plugin.xml: 
-------------------------------------------------------------------------------- 1 | 2 | 18 | 21 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | -------------------------------------------------------------------------------- /lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java: -------------------------------------------------------------------------------- 1 | package org.apache.nutch.protocol.selenium; 2 | 3 | import org.apache.hadoop.conf.Configuration; 4 | import org.slf4j.Logger; 5 | import org.slf4j.LoggerFactory; 6 | 7 | import org.openqa.selenium.By; 8 | import org.openqa.selenium.WebDriver; 9 | import org.openqa.selenium.WebElement; 10 | import org.openqa.selenium.firefox.FirefoxDriver; 11 | import org.openqa.selenium.firefox.FirefoxProfile; 12 | import org.openqa.selenium.firefox.internal.ProfilesIni; 13 | import org.openqa.selenium.support.ui.ExpectedCondition; 14 | import org.openqa.selenium.support.ui.WebDriverWait; 15 | 16 | import java.lang.String; 17 | 18 | public class HttpWebClient { 19 | 20 | private static final Logger LOG = LoggerFactory.getLogger("org.apache.nutch.protocol"); 21 | 22 | public static ThreadLocal<WebDriver> threadWebDriver = new ThreadLocal<WebDriver>() { 23 | 24 | @Override 25 | protected WebDriver initialValue() 26 | { 27 | FirefoxProfile profile = new FirefoxProfile(); 28 | profile.setPreference("permissions.default.stylesheet", 2); 29 | profile.setPreference("permissions.default.image", 2); 30 | profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", "false"); 31 | WebDriver driver = new FirefoxDriver(profile); 32 | return driver; 33 | }; 34 | }; 35 | 36 | public static String getHtmlPage(String url, Configuration conf) { 37 | WebDriver driver = null; 38 | 39 | try { 40 | driver = new FirefoxDriver(); 41 | // } WebDriver driver = threadWebDriver.get(); 42 | // if (driver == null) { 43 | // driver = new FirefoxDriver(); 44 | // } 45 | 46 | driver.get(url); 47 | 48 | // driver.get() above blocks until the page has loaded; the WebDriverWait below is never waited on (no until() call), so it adds no delay
49 | new WebDriverWait(driver, 3); 50 | 51 | String innerHtml = driver.findElement(By.tagName("body")).getAttribute("innerHTML"); 52 | 53 | return innerHtml; 54 | 55 | // I'm sure this catch statement is a code smell ; borrowing it from lib-htmlunit 56 | } catch (Exception e) { 57 | throw new RuntimeException(e); 58 | } finally { 59 | if (driver != null) try { driver.quit(); } catch (Exception e) { LOG.warn("Failed to quit WebDriver", e); /* log instead of throwing, so an exception from the try block is not masked */ } 60 | } 61 | }; 62 | 63 | public static String getHtmlPage(String url) { 64 | return getHtmlPage(url, null); 65 | } 66 | } -------------------------------------------------------------------------------- /lib-selenium/src/pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 5 | 4.0.0 6 | 7 | groupId 8 | lib-selenium 9 | 1.0-SNAPSHOT 10 | 11 | 12 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/.name: -------------------------------------------------------------------------------- 1 | protocol-htmlunit -------------------------------------------------------------------------------- /protocol-selenium/.idea/compiler.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/copyright/profiles_settings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/encodings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/misc.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 11 | 14 | 15 | 16 | 
17 | 18 | 19 | 20 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/modules.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/scopes/scope_settings.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/vcs.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | -------------------------------------------------------------------------------- /protocol-selenium/.idea/workspace.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 13 | 14 | 15 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 98 | 99 | 102 | 103 | 104 | 105 | 108 | 109 | 110 | 111 | 114 | 115 | 118 | 119 | 120 | 121 | 124 | 125 | 128 | 129 | 132 | 133 | 134 | 135 | 138 | 139 | 142 | 143 | 146 | 147 | 150 | 151 | 152 | 153 | 154 | 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 176 | 177 | 178 | 179 | 180 | 181 | 182 | 183 | 184 | 185 | 186 | 187 | 194 | 195 | 196 | 197 | 198 | 199 | 210 | 211 | 212 | 213 | 231 | 238 | 239 | 240 | 241 | 255 | 256 | 257 | 258 | 259 | 260 | 281 | 294 | 295 | 296 | 315 | 316 | 325 | 329 | 330 | 331 | 348 | 349 | 350 | 363 | 364 | 365 | 383 | 384 | 385 | localhost 386 | 5050 387 | 388 | 389 
| 390 | 391 | 392 | 393 | 394 | 395 | 396 | 1402934991106 397 | 1402934991106 398 | 399 | 400 | 401 | 402 | 403 | 405 | 406 | 407 | 408 | 409 | 410 | 411 | 412 | 413 | 414 | 415 | 416 | 417 | 418 | 419 | 420 | 421 | 422 | 423 | 424 | 425 | 426 | 427 | 428 | 429 | 430 | 431 | 432 | 433 | 434 | 435 | 436 | 437 | 438 | 439 | 442 | 445 | 446 | 447 | 449 | 450 | 453 | 454 | 455 | 456 | 457 | 458 | 459 | 460 | 461 | 462 | 463 | 464 | 465 | 466 | 467 | 468 | 469 | 470 | 471 | 472 | 473 | 474 | 475 | 476 | 477 | 478 | 479 | 480 | 481 | 482 | 483 | 488 | 489 | 490 | 491 | 492 | 493 | No facets are configured 494 | 495 | 500 | 501 | 502 | 503 | 504 | 505 | 506 | 511 | 512 | 513 | 514 | 515 | 516 | 1.7 517 | 518 | 523 | 524 | 525 | 526 | 527 | 528 | protocol-htmlunit 529 | 530 | 535 | 536 | 537 | 538 | 539 | 540 | 1.7 541 | 542 | 547 | 548 | 549 | 550 | 551 | 552 | 553 | 558 | 559 | 560 | 561 | 562 | 563 | 564 | 565 | -------------------------------------------------------------------------------- /protocol-selenium/build.xml: -------------------------------------------------------------------------------- 1 | 2 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | -------------------------------------------------------------------------------- /protocol-selenium/ivy.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | Apache Nutch 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | -------------------------------------------------------------------------------- /protocol-selenium/plugin.xml: -------------------------------------------------------------------------------- 1 | 2 | 18 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 39 | 40 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- 
/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java: -------------------------------------------------------------------------------- 1 | package org.apache.nutch.protocol.selenium; 2 | 3 | // JDK imports 4 | import java.io.IOException; 5 | import java.net.URL; 6 | import java.util.Collection; 7 | import java.util.HashSet; 8 | 9 | import org.apache.hadoop.conf.Configuration; 10 | import org.apache.nutch.net.protocols.Response; 11 | import org.apache.nutch.protocol.http.api.HttpBase; 12 | import org.apache.nutch.protocol.ProtocolException; 13 | import org.apache.nutch.util.NutchConfiguration; 14 | import org.apache.nutch.storage.WebPage; 15 | import org.apache.nutch.storage.WebPage.Field; 16 | 17 | import org.apache.nutch.protocol.selenium.HttpResponse; 18 | 19 | import org.slf4j.Logger; 20 | import org.slf4j.LoggerFactory; 21 | 22 | public class Http extends HttpBase { 23 | 24 | public static final Logger LOG = LoggerFactory.getLogger(Http.class); 25 | 26 | private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>(); 27 | 28 | static { 29 | FIELDS.add(WebPage.Field.MODIFIED_TIME); 30 | FIELDS.add(WebPage.Field.HEADERS); 31 | } 32 | 33 | public Http() { 34 | super(LOG); 35 | } 36 | 37 | @Override 38 | public void setConf(Configuration conf) { 39 | super.setConf(conf); 40 | } 41 | 42 | public static void main(String[] args) throws Exception { 43 | Http http = new Http(); 44 | http.setConf(NutchConfiguration.create()); 45 | main(http, args); 46 | } 47 | 48 | @Override 49 | protected Response getResponse(URL url, WebPage page, boolean redirect) 50 | throws ProtocolException, IOException { 51 | return new HttpResponse(this, url, page, getConf()); 52 | } 53 | 54 | @Override 55 | public Collection<WebPage.Field> getFields() { 56 | return FIELDS; 57 | } 58 | } 59 | -------------------------------------------------------------------------------- /protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java: 
-------------------------------------------------------------------------------- 1 | package org.apache.nutch.protocol.selenium; 2 | 3 | // JDK imports 4 | import java.io.BufferedInputStream; 5 | import java.io.EOFException; 6 | import java.io.IOException; 7 | import java.io.OutputStream; 8 | import java.io.PushbackInputStream; 9 | import java.net.InetSocketAddress; 10 | import java.net.Socket; 11 | import java.net.URL; 12 | 13 | import org.apache.commons.lang.StringUtils; 14 | import org.apache.hadoop.conf.Configuration; 15 | // import org.apache.nutch.crawl.CrawlDatum; 16 | import org.apache.nutch.storage.WebPage; 17 | import org.apache.nutch.metadata.Metadata; 18 | import org.apache.nutch.metadata.SpellCheckedMetadata; 19 | import org.apache.nutch.net.protocols.HttpDateFormat; 20 | import org.apache.nutch.net.protocols.Response; 21 | import org.apache.nutch.protocol.ProtocolException; 22 | import org.apache.nutch.protocol.http.api.HttpBase; 23 | import org.apache.nutch.protocol.http.api.HttpException; 24 | 25 | import org.openqa.selenium.By; 26 | import org.openqa.selenium.WebDriver; 27 | import org.openqa.selenium.WebElement; 28 | import org.openqa.selenium.firefox.FirefoxDriver; 29 | import org.openqa.selenium.support.ui.ExpectedCondition; 30 | import org.openqa.selenium.support.ui.WebDriverWait; 31 | 32 | /* Most of this code was borrowed from protocol-htmlunit; which in turn borrowed it from protocol-httpclient */ 33 | 34 | public class HttpResponse implements Response { 35 | 36 | private Http http; 37 | private URL url; 38 | private String orig; 39 | private String base; 40 | private byte[] content; 41 | private int code; 42 | private Metadata headers = new SpellCheckedMetadata(); 43 | 44 | /** The nutch configuration */ 45 | private Configuration conf = null; 46 | 47 | public HttpResponse(Http http, URL url, WebPage page, Configuration conf) throws ProtocolException, IOException { 48 | 49 | this.conf = conf; 50 | this.http = http; 51 | this.url = url; 52 | 
this.orig = url.toString(); 53 | this.base = url.toString(); 54 | 55 | if (!"http".equals(url.getProtocol())) 56 | throw new HttpException("Not an HTTP url:" + url); 57 | 58 | if (Http.LOG.isTraceEnabled()) { 59 | Http.LOG.trace("fetching " + url); 60 | } 61 | 62 | String path = "".equals(url.getFile()) ? "/" : url.getFile(); 63 | 64 | // some servers will redirect a request with a host line like 65 | // "Host: :80" to "http:///"- they 66 | // don't want the :80... 67 | 68 | String host = url.getHost(); 69 | int port; 70 | String portString; 71 | if (url.getPort() == -1) { 72 | port = 80; 73 | portString = ""; 74 | } else { 75 | port = url.getPort(); 76 | portString = ":" + port; 77 | } 78 | Socket socket = null; 79 | 80 | try { 81 | socket = new Socket(); // create the socket 82 | socket.setSoTimeout(http.getTimeout()); 83 | 84 | // connect 85 | String sockHost = http.useProxy() ? http.getProxyHost() : host; 86 | int sockPort = http.useProxy() ? http.getProxyPort() : port; 87 | InetSocketAddress sockAddr = new InetSocketAddress(sockHost, sockPort); 88 | socket.connect(sockAddr, http.getTimeout()); 89 | 90 | // make request 91 | OutputStream req = socket.getOutputStream(); 92 | 93 | StringBuffer reqStr = new StringBuffer("GET "); 94 | if (http.useProxy()) { 95 | reqStr.append(url.getProtocol() + "://" + host + portString + path); 96 | } else { 97 | reqStr.append(path); 98 | } 99 | 100 | reqStr.append(" HTTP/1.0\r\n"); 101 | 102 | reqStr.append("Host: "); 103 | reqStr.append(host); 104 | reqStr.append(portString); 105 | reqStr.append("\r\n"); 106 | 107 | reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n"); 108 | 109 | String userAgent = http.getUserAgent(); 110 | if ((userAgent == null) || (userAgent.length() == 0)) { 111 | if (Http.LOG.isErrorEnabled()) { 112 | Http.LOG.error("User-agent is not set!"); 113 | } 114 | } else { 115 | reqStr.append("User-Agent: "); 116 | reqStr.append(userAgent); 117 | reqStr.append("\r\n"); 118 | } 119 | 120 | 
      reqStr.append("Accept-Language: ");
      reqStr.append(this.http.getAcceptLanguage());
      reqStr.append("\r\n");

      reqStr.append("Accept: ");
      reqStr.append(this.http.getAccept());
      reqStr.append("\r\n");

      if (page.getModifiedTime() > 0) {
        reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(page.getModifiedTime()));
        reqStr.append("\r\n");
      }
      reqStr.append("\r\n");

      byte[] reqBytes = reqStr.toString().getBytes();

      req.write(reqBytes);
      req.flush();

      PushbackInputStream in = // process response
          new PushbackInputStream(new BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE),
              Http.BUFFER_SIZE);

      StringBuffer line = new StringBuffer();

      boolean haveSeenNonContinueStatus = false;
      while (!haveSeenNonContinueStatus) {
        // parse status code line
        this.code = parseStatusLine(in, line);
        // parse headers
        parseHeaders(in, line);
        haveSeenNonContinueStatus = code != 100; // 100 is "Continue"
      }

      //readPlainContent(in);
      readPlainContent(url);

    } finally {
      if (socket != null)
        socket.close();
    }

  }

  /* ------------------------- *
   *                           *
   * ------------------------- */

  public URL getUrl() {
    return url;
  }

  public int getCode() {
    return code;
  }

  public String getHeader(String name) {
    return headers.get(name);
  }

  public Metadata getHeaders() {
    return headers;
  }

  public byte[] getContent() {
    return content;
  }

  /* ------------------------- *
   *                           *
   * ------------------------- */

  private void readPlainContent(URL url) throws IOException {
    String page = HttpWebClient.getHtmlPage(url.toString(), conf);

    content = page.getBytes("UTF-8");
  }

  private int parseStatusLine(PushbackInputStream in, StringBuffer line) throws IOException, HttpException {
    readLine(in, line, false);

    int codeStart = line.indexOf(" ");
    int codeEnd = line.indexOf(" ", codeStart + 1);

    // handle lines with no plaintext result code, ie:
    // "HTTP/1.1 200" vs "HTTP/1.1 200 OK"
    if (codeEnd == -1)
      codeEnd = line.length();

    int code;
    try {
      code = Integer.parseInt(line.substring(codeStart + 1, codeEnd));
    } catch (NumberFormatException e) {
      throw new HttpException("bad status line '" + line + "': " + e.getMessage(), e);
    }

    return code;
  }

  private void processHeaderLine(StringBuffer line) throws IOException, HttpException {

    int colonIndex = line.indexOf(":"); // key is up to colon
    if (colonIndex == -1) {
      int i;
      for (i = 0; i < line.length(); i++)
        if (!Character.isWhitespace(line.charAt(i)))
          break;
      if (i == line.length())
        return;
      throw new HttpException("No colon in header:" + line);
    }
    String key = line.substring(0, colonIndex);

    int valueStart = colonIndex + 1; // skip whitespace
    while (valueStart < line.length()) {
      int c = line.charAt(valueStart);
      if (c != ' ' && c != '\t')
        break;
      valueStart++;
    }
    String value = line.substring(valueStart);
    headers.set(key, value);
  }

  // Adds headers to our headers Metadata
  private void parseHeaders(PushbackInputStream in, StringBuffer line) throws IOException, HttpException {

    while (readLine(in, line, true) != 0) {

      // handle HTTP responses with missing blank line after headers
      int pos;
      if (((pos = line.indexOf(" 0) {
          // at EOL -- check for continued line if the current
          // (possibly continued) line wasn't blank
          if (allowContinuedLine)
            switch (peek(in)) {
            case ' ':
            case '\t': // line is continued
              in.read();
              continue;
            }
        }
        return line.length(); // else complete
      default:
        line.append((char) c);
      }
    }
    throw new EOFException();
  }

  private static int peek(PushbackInputStream in) throws IOException {
    int value = in.read();
    in.unread(value);
    return value;
  }
}

--------------------------------------------------------------------------------
/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html:
--------------------------------------------------------------------------------

Protocol plugin which supports retrieving documents via Selenium.

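--------------------------------------------------------------------------------
/protocol-selenium/src/pom.xml:
--------------------------------------------------------------------------------
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>groupId</groupId>
    <artifactId>protocol-selenium</artifactId>
    <version>1.0-SNAPSHOT</version>


</project>

--------------------------------------------------------------------------------
/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html:
--------------------------------------------------------------------------------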

Protocol plugin which supports retrieving documents via HtmlUnit.

--------------------------------------------------------------------------------
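The status-line handling in HttpResponse.java can be tried in isolation. The following is a minimal sketch, not code from the plugin: `StatusLineSketch` and `parseStatusCode` are hypothetical names, it takes a plain `String` instead of reading from a `PushbackInputStream`, and it throws `IllegalArgumentException` where the plugin throws `HttpException`. It also rejects a line with no space at all up front, which is an added assumption about the desired behavior.

```java
// Hypothetical standalone sketch; the class and method names are illustrative
// and do not exist in the protocol-selenium plugin.
final class StatusLineSketch {

  /** Extracts the numeric status code from a line such as "HTTP/1.1 200 OK". */
  static int parseStatusCode(String line) {
    int codeStart = line.indexOf(' ');
    if (codeStart == -1) {
      // Assumption: treat a line without any space as malformed right away.
      throw new IllegalArgumentException("bad status line '" + line + "'");
    }
    int codeEnd = line.indexOf(' ', codeStart + 1);
    // Some servers omit the reason phrase: "HTTP/1.1 200" vs "HTTP/1.1 200 OK".
    if (codeEnd == -1) {
      codeEnd = line.length();
    }
    try {
      return Integer.parseInt(line.substring(codeStart + 1, codeEnd));
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("bad status line '" + line + "'", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(parseStatusCode("HTTP/1.1 200 OK"));
    System.out.println(parseStatusCode("HTTP/1.1 304"));
  }
}
```

Keeping the parsing in a pure function like this makes the edge cases (missing reason phrase, non-numeric code) easy to check without opening a socket.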