├── TODO.txt
├── .gitignore
├── .travis.yml
├── src
│   ├── test
│   │   ├── resources
│   │   │   └── de
│   │   │       └── jetwick
│   │   │           └── snacktory
│   │   │               ├── golem.html
│   │   │               ├── br-online.html
│   │   │               ├── spiegel.html
│   │   │               ├── starmagazine.html
│   │   │               ├── yomiuri.html
│   │   │               ├── yomiuri2.html
│   │   │               ├── badenc.html
│   │   │               ├── no-hidden.html
│   │   │               ├── artificial
│   │   │               │   └── zero-weight-images.html
│   │   │               ├── no-hidden2.html
│   │   │               ├── i4online.html
│   │   │               ├── grapevinyl.html
│   │   │               ├── cnbc.html
│   │   │               ├── daltoncaldwell.html
│   │   │               ├── universetoday.html
│   │   │               ├── facebook.html
│   │   │               └── facebook2.html
│   │   └── java
│   │       └── de
│   │           └── jetwick
│   │               └── snacktory
│   │                   ├── HtmlFetcherProxyTest.java
│   │                   ├── OutputFormatterTest.java
│   │                   ├── ConverterTest.java
│   │                   ├── HtmlFetcherIntegrationTest.java
│   │                   ├── SHelperTest.java
│   │                   └── ArticleTextExtractorTodoTester.java
│   └── main
│       ├── java
│       │   └── de
│       │       └── jetwick
│       │           └── snacktory
│       │               ├── ImageResult.java
│       │               ├── SCache.java
│       │               ├── MapEntry.java
│       │               ├── JResult.java
│       │               ├── OutputFormatter.java
│       │               ├── Converter.java
│       │               ├── SHelper.java
│       │               └── HtmlFetcher.java
│       └── resources
│           └── log4j.properties
├── CONTRIBUTORS.md
├── pom.xml
├── README.md
└── test_data
    └── 3.html

/TODO.txt:
--------------------------------------------------------------------------------
1 | rewrite with all the learned stuff
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | target/
2 | nb-configuration.xml
3 | nbactions.xml
4 | *~
5 | deploy-*.sh
6 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: java
2 | notifications:
3 |   email:
4 |     - peathal@yahoo.de
--------------------------------------------------------------------------------
/src/test/resources/de/jetwick/snacktory/golem.html:
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/karussell/snacktory/HEAD/src/test/resources/de/jetwick/snacktory/golem.html -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/br-online.html: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/karussell/snacktory/HEAD/src/test/resources/de/jetwick/snacktory/br-online.html -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/spiegel.html: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/karussell/snacktory/HEAD/src/test/resources/de/jetwick/snacktory/spiegel.html -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/starmagazine.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | The content of this website is not available in your area. 
4 | 5 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/yomiuri.html: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/karussell/snacktory/HEAD/src/test/resources/de/jetwick/snacktory/yomiuri.html -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/yomiuri2.html: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/karussell/snacktory/HEAD/src/test/resources/de/jetwick/snacktory/yomiuri2.html -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/badenc.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | True or false: You MUST have a coffee to start your day? | Answerbag 6 | 7 | 8 | 9 | 10 | 11 | -------------------------------------------------------------------------------- /CONTRIBUTORS.md: -------------------------------------------------------------------------------- 1 | [Contributors](https://github.com/karussell/snacktory/contributors) 2 | 3 | * bejean, #22, #23, #31, #32, #33 several improvements e.g. integrate work of chrisalexander 4 | * chrisalexander, #16, #17 multi image support 5 | * dajac, #12, #13, #14, #25, #26, #28 enhance meta data extraction and more 6 | * ifesdjeen, created initial work, we forked 7 | * jloomis, #2 8 | * karussell, lead developer 9 | * kireet, #9, #18 added keywords support 10 | * pyr, #11 11 | * soebbing, #1 Added support for protocol relative URLs 12 | * tjerkw, #27 13 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/no-hidden.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 |
This is the hidden text which shouldn't be shown and it is a bit longer so normally prefered
8 |
This is the text which is shorter but visible
9 | 10 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/artificial/zero-weight-images.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | Zero weight images test 4 | 5 | 6 | 7 | 8 | 9 | 10 |
11 |

Some heading

12 |

Content of the first paragraph and it has to be longer than 50 characters.

13 | 14 |

Another heading

15 | 16 |
17 | 18 | 19 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/no-hidden2.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 |
This is the hidden text which shouldn't be shown and it is a bit longer so normally prefered
8 |
This is the NONE-HIDDEN text which shouldn't be shown and it is a bit longer so normally prefered
9 |
This is the text which is shorter but visible
10 | 11 | -------------------------------------------------------------------------------- /src/main/java/de/jetwick/snacktory/ImageResult.java: -------------------------------------------------------------------------------- 1 | package de.jetwick.snacktory; 2 | 3 | import org.jsoup.nodes.Element; 4 | 5 | /** 6 | * Class which encapsulates the data from an image found under an element 7 | * 8 | * @author Chris Alexander, chris@chris-alexander.co.uk 9 | */ 10 | public class ImageResult { 11 | 12 | public String src; 13 | public Integer weight; 14 | public String title; 15 | public int height; 16 | public int width; 17 | public String alt; 18 | public boolean noFollow; 19 | public Element element; 20 | 21 | public ImageResult(String src, Integer weight, String title, int height, int width, String alt, boolean noFollow) { 22 | this.src = src; 23 | this.weight = weight; 24 | this.title = title; 25 | this.height = height; 26 | this.width = width; 27 | this.alt = alt; 28 | this.noFollow = noFollow; 29 | } 30 | } 31 | -------------------------------------------------------------------------------- /src/main/java/de/jetwick/snacktory/SCache.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Peter Karich 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 
15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | /** 19 | * 20 | * @author Peter Karich 21 | */ 22 | public interface SCache { 23 | 24 | JResult get(String url); 25 | 26 | void put(String url, JResult res); 27 | 28 | int getSize(); 29 | } 30 | -------------------------------------------------------------------------------- /src/main/resources/log4j.properties: -------------------------------------------------------------------------------- 1 | # 2 | # Copyright (C) 2010 Peter Karich <> 3 | # 4 | # Licensed under the Apache License, Version 2.0 (the "License"); 5 | # you may not use this file except in compliance with the License. 6 | # You may obtain a copy of the License at 7 | # 8 | # http://www.apache.org/licenses/LICENSE-2.0 9 | # 10 | # Unless required by applicable law or agreed to in writing, software 11 | # distributed under the License is distributed on an "AS IS" BASIS, 12 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | # See the License for the specific language governing permissions and 14 | # limitations under the License. 
15 | # 16 | 17 | # overwrite this file from command line via: 18 | # -Dlog4j.configuration=file 19 | # print internal debug => -Dlog4j.debug 20 | 21 | log4j.appender.FileApp=org.apache.log4j.FileAppender 22 | log4j.appender.FileApp.File=${user.home}/.jetwick/logging.txt 23 | log4j.appender.FileApp.layout=org.apache.log4j.PatternLayout 24 | log4j.appender.FileApp.layout.ConversionPattern=[%d{DATE} - %-5p] %m%n 25 | 26 | log4j.appender.StdoutApp=org.apache.log4j.ConsoleAppender 27 | log4j.appender.StdoutApp.layout=org.apache.log4j.PatternLayout 28 | log4j.appender.StdoutApp.layout.conversionPattern=%d [%t] %-5p %c - %m%n 29 | 30 | log4j.logger.de.jetwick.snacktory=INFO, StdoutApp -------------------------------------------------------------------------------- /src/test/java/de/jetwick/snacktory/HtmlFetcherProxyTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2015 Peter Karich 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import static org.junit.Assert.assertEquals; 19 | 20 | import java.net.InetSocketAddress; 21 | import java.net.Proxy; 22 | import java.net.Proxy.Type; 23 | 24 | import org.junit.Test; 25 | 26 | /** 27 | * Tests for HtmlFetcher proxy feature. 
28 | */ 29 | public class HtmlFetcherProxyTest { 30 | 31 | public HtmlFetcherProxyTest() { 32 | } 33 | 34 | @Test 35 | public void testSocksProxy() { 36 | HtmlFetcher fetcher = new HtmlFetcher(); 37 | Proxy proxy = new Proxy(Type.valueOf("SOCKS"), new InetSocketAddress("127.0.0.1", 3128)); 38 | fetcher.setProxy(proxy); 39 | 40 | assertEquals("Invalid SOCKS proxy type name", "SOCKS", fetcher.getProxy().type().name()); 41 | } 42 | 43 | @Test 44 | public void testNoProxy() { 45 | HtmlFetcher fetcher = new HtmlFetcher(); 46 | assertEquals("HtmlFetch proxy server was not a NO_PROXY proxy", Proxy.NO_PROXY, fetcher.getProxy()); 47 | } 48 | 49 | } 50 | -------------------------------------------------------------------------------- /src/test/java/de/jetwick/snacktory/OutputFormatterTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2012 Peter Karich info@jetsli.de 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 
15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import java.util.Arrays; 19 | import java.util.List; 20 | 21 | import org.jsoup.Jsoup; 22 | import org.jsoup.nodes.Document; 23 | import org.junit.*; 24 | import static org.junit.Assert.*; 25 | 26 | /** 27 | * 28 | * @author Peter Karich 29 | */ 30 | public class OutputFormatterTest { 31 | 32 | @Test 33 | public void testSkipHidden() { 34 | OutputFormatter formatter = new OutputFormatter(); 35 | Document doc = Jsoup.parse("
xy
test
"); 36 | StringBuilder sb = new StringBuilder(); 37 | formatter.appendTextSkipHidden(doc, sb); 38 | assertEquals("test", sb.toString()); 39 | } 40 | 41 | @Test 42 | public void testTextList() { 43 | OutputFormatter formatter = new OutputFormatter(); 44 | Document doc = Jsoup.parse("

aa

bb

cc

"); 45 | assertEquals(Arrays.asList("aa", "bb", "cc"), formatter.getTextList(doc)); 46 | } 47 | } 48 | -------------------------------------------------------------------------------- /src/main/java/de/jetwick/snacktory/MapEntry.java: -------------------------------------------------------------------------------- 1 | /** 2 | * Copyright (C) 2010 Peter Karich <> 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); you may not 5 | * use this file except in compliance with the License. You may obtain a copy of 6 | * the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT 12 | * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the 13 | * License for the specific language governing permissions and limitations under 14 | * the License. 15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import java.io.Serializable; 19 | import java.util.Map; 20 | 21 | /** 22 | * Simple impl of Map.Entry. So that we can have ordered maps. 
23 |  *
24 |  * @author Peter Karich, peat_hal ‘at’ users ‘dot’ sourceforge ‘dot’ net
25 |  */
26 | public class MapEntry<K, V> implements Map.Entry<K, V>, Serializable {
27 |
28 |     private static final long serialVersionUID = 1L;
29 |     private K key;
30 |     private V value;
31 |
32 |     public MapEntry(K key, V value) {
33 |         this.key = key;
34 |         this.value = value;
35 |     }
36 |
37 |     @Override
38 |     public K getKey() {
39 |         return key;
40 |     }
41 |
42 |     @Override
43 |     public V getValue() {
44 |         return value;
45 |     }
46 |
47 |     @Override
48 |     public V setValue(V value) {
49 |         // Map.Entry contract: return the previous value, not the new one
50 |         V oldValue = this.value;
51 |         this.value = value;
52 |         return oldValue;
53 |     }
54 |
55 |     @Override
56 |     public String toString() {
57 |         return getKey() + ", " + getValue();
58 |     }
59 |
60 |     @Override
61 |     public boolean equals(Object obj) {
62 |         if (obj == null)
63 |             return false;
64 |         if (getClass() != obj.getClass())
65 |             return false;
66 |         final MapEntry<?, ?> other = (MapEntry<?, ?>) obj;
67 |         if (this.key != other.key && (this.key == null || !this.key.equals(other.key)))
68 |             return false;
69 |         if (this.value != other.value && (this.value == null || !this.value.equals(other.value)))
70 |             return false;
71 |         return true;
72 |     }
73 |
74 |     @Override
75 |     public int hashCode() {
76 |         int hash = 7;
77 |         hash = 19 * hash + (this.key != null ? this.key.hashCode() : 0);
78 |         hash = 19 * hash + (this.value != null ? this.value.hashCode() : 0);
79 |         return hash;
80 |     }
81 | }
82 |
--------------------------------------------------------------------------------
/src/test/java/de/jetwick/snacktory/ConverterTest.java:
--------------------------------------------------------------------------------
1 | /*
2 |  * Copyright 2011 Peter Karich
3 |  *
4 |  * Licensed under the Apache License, Version 2.0 (the "License");
5 |  * you may not use this file except in compliance with the License.
6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import junit.framework.TestCase; 19 | import org.jsoup.Jsoup; 20 | 21 | /** 22 | * 23 | * @author Peter Karich 24 | */ 25 | public class ConverterTest extends TestCase { 26 | 27 | public ConverterTest(String testName) { 28 | super(testName); 29 | } 30 | 31 | public void testDetermineEncoding() throws Exception { 32 | Converter d = new Converter(); 33 | d.streamToString(getClass().getResourceAsStream("faz.html")); 34 | assertEquals("utf-8", d.getEncoding()); 35 | 36 | d = new Converter(); 37 | d.streamToString(getClass().getResourceAsStream("yomiuri.html")); 38 | assertEquals("shift_jis", d.getEncoding()); 39 | 40 | d = new Converter(); 41 | d.streamToString(getClass().getResourceAsStream("yomiuri2.html")); 42 | assertEquals("shift_jis", d.getEncoding()); 43 | 44 | d = new Converter(); 45 | d.streamToString(getClass().getResourceAsStream("spiegel.html")); 46 | assertEquals("iso-8859-1", d.getEncoding()); 47 | 48 | d = new Converter(); 49 | d.streamToString(getClass().getResourceAsStream("itunes.html")); 50 | assertEquals("utf-8", d.getEncoding()); 51 | 52 | d = new Converter(); 53 | d.streamToString(getClass().getResourceAsStream("twitter.html")); 54 | assertEquals("utf-8", d.getEncoding()); 55 | 56 | // youtube DOES not specify the encoding AND assumes utf-8 !? 
57 | d = new Converter(); 58 | d.streamToString(getClass().getResourceAsStream("youtube.html")); 59 | assertEquals("utf-8", d.getEncoding()); 60 | 61 | d = new Converter(); 62 | d.streamToString(getClass().getResourceAsStream("nyt.html")); 63 | assertEquals("utf-8", d.getEncoding()); 64 | 65 | d = new Converter(); 66 | d.streamToString(getClass().getResourceAsStream("badenc.html")); 67 | assertEquals("utf-8", d.getEncoding()); 68 | 69 | d = new Converter(); 70 | d.streamToString(getClass().getResourceAsStream("br-online.html")); 71 | assertEquals("iso-8859-15", d.getEncoding()); 72 | } 73 | 74 | public void testMaxBytesExceedingButGetTitleNevertheless() throws Exception { 75 | Converter d = new Converter(); 76 | d.setMaxBytes(10000); 77 | String str = d.streamToString(getClass().getResourceAsStream("faz.html")); 78 | assertEquals("utf-8", d.getEncoding()); 79 | assertEquals("Im Gespräch: Umweltaktivist Stewart Brand: Ihr Deutschen steht allein da " 80 | + "- Atomdebatte - FAZ.NET", Jsoup.parse(str).select("title").text()); 81 | } 82 | } 83 | -------------------------------------------------------------------------------- /pom.xml: -------------------------------------------------------------------------------- 1 | 2 | 4 | 4.0.0 5 | 6 | de.jetwick 7 | snacktory 8 | 1.3-SNAPSHOT 9 | jar 10 | 11 | Snacktory 12 | http://maven.apache.org 13 | 14 | 15 | UTF-8 16 | 1.7.5 17 | 18 | 19 | 20 | 21 | junit 22 | junit 23 | 4.11 24 | test 25 | 26 | 27 | org.jsoup 28 | jsoup 29 | 1.7.2 30 | 31 | 32 | org.slf4j 33 | slf4j-api 34 | ${slf4j.version} 35 | 36 | 37 | 38 | 39 | org.slf4j 40 | slf4j-log4j12 41 | ${slf4j.version} 42 | runtime 43 | 44 | 45 | 48 | 49 | log4j 50 | log4j 51 | 1.2.14 52 | compile 53 | 54 | 55 | 56 | 57 | 58 | true 59 | org.apache.maven.plugins 60 | maven-compiler-plugin 61 | 3.1 62 | 63 | 1.6 64 | 1.6 65 | 66 | 67 | 68 | org.apache.maven.plugins 69 | maven-resources-plugin 70 | 2.6 71 | 72 | UTF-8 73 | 74 | 75 | 76 | 77 | 78 | 79 | karussell_snapshots 80 | 
https://github.com/karussell/mvnrepo/raw/master/snapshots 81 | 82 | 83 | karussell_releases 84 | https://github.com/karussell/mvnrepo/raw/master/releases/ 85 | 86 | 87 | 88 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Future 2 | 3 | Snacktory is no longer actively maintained by @karussell. 4 | 5 | [Crux](https://github.com/chimbori/crux) is a fork under active development and is the recommended alternative. 6 | 7 | - Available under the same permissive Apache 2.0 License. 8 | - Adds several new features, such as Rich Text output (HTML), preserves links, extracts more metadata content, etc. 9 | - Optimized for Android. Decoupled from optional dependencies such as HttpUrlConnection, log4j, etc. 10 | - Actively developed by [Chimbori](https://github.com/chimbori), the developers of [Hermit, a Lite Apps Browser for Android](https://hermit.chimbori.com/). 11 | - Already being used in multiple apps. 12 | - Crux has a different architecture from Snacktory: it is designed as a collection of several separate APIs instead of a single one. Clients can pick and choose which ones they wish to use. 13 | - As a result, Crux is not a drop-in replacement for Snacktory, but fairly easy to migrate to. 14 | 15 | # Snacktory 16 | 17 | This is a small helper utility for people who don't want to write yet another java clone of Readability. 18 | In most cases, this is applied to articles, although it should work for any website to find its major 19 | area, extract its text, keywords, its main picture and more. 20 | 21 | The resulting quality is high, even [paper.li uses](https://twitter.com/timetabling/status/274193754615853056) the core of snacktory. 22 | Also have a look into [this article](http://karussell.wordpress.com/2011/07/12/introducing-jetslide-news-reader/), 23 | it describes a news aggregator service which uses snacktory. 
But jetslide is no longer online.
24 |
25 | Snacktory borrows some ideas and a lot of test cases from [goose](https://github.com/GravityLabs/goose)
26 | and [jreadability](https://github.com/ifesdjeen/jReadability):
27 |
28 |
29 | # License
30 |
31 | The software is released under the Apache 2 License and comes with NO WARRANTY.
32 |
33 | # Features
34 |
35 | * article text detection
36 | * get top image url(s)
37 | * get top video url
38 | * extraction of description, keywords, ...
39 | * good detection for non-English sites (German, Japanese, ...); snacktory does not depend on the word count in its text detection, which lets it support CJK languages
40 | * good charset detection
41 | * possible to do URL resolving, but caching is still possible after resolving
42 | * skipping some known filetypes
43 | * no http GET required to run the core tests
44 |
45 | TODOs
46 |
47 | * only top text supported at the moment
48 |
49 |
50 | # Usage
51 |
52 | Include the repo at: https://github.com/karussell/mvnrepo
53 |
54 | Then add the dependency
55 |
56 | ```xml
57 | <dependency>
58 |    <groupId>de.jetwick</groupId>
59 |    <artifactId>snacktory</artifactId>
60 |    <version>1.1</version>
61 |
62 | </dependency>
63 | ```
64 |
65 | If you need this for Android, be sure to read [this issue](https://github.com/karussell/snacktory/issues/36).
66 |
67 | Or, if you prefer, you can use a build generated by [jitpack.io](https://jitpack.io/#karussell/snacktory).
68 |
69 | Now you can use it as follows:
70 |
71 | ```java
72 | HtmlFetcher fetcher = new HtmlFetcher();
73 | // set cache. e.g. take the map implementation from google collections:
74 | // fetcher.setCache(new MapMaker().concurrencyLevel(20).maximumSize(count).
75 | //     expireAfterWrite(minutes, TimeUnit.MINUTES).makeMap());
76 |
77 | JResult res = fetcher.fetchAndExtract(articleUrl, resolveTimeout, true);
78 | String text = res.getText();
79 | String title = res.getTitle();
80 | String imageUrl = res.getImageUrl();
81 | ```
82 |
83 | # Build
84 |
85 | via Maven.
Maven will automatically resolve dependencies to jsoup, log4j and slf4j-api 86 | -------------------------------------------------------------------------------- /src/test/java/de/jetwick/snacktory/HtmlFetcherIntegrationTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Peter Karich 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import org.junit.Test; 19 | import static org.junit.Assert.*; 20 | 21 | /** 22 | * 23 | * @author Peter Karich 24 | */ 25 | public class HtmlFetcherIntegrationTest { 26 | 27 | public HtmlFetcherIntegrationTest() { 28 | } 29 | 30 | @Test 31 | public void testNoException() throws Exception { 32 | JResult res = new HtmlFetcher().fetchAndExtract("http://www.tumblr.com/xeb22gs619", 10000, true); 33 | // System.out.println("tumblr:" + res.getUrl()); 34 | 35 | // res = new HtmlFetcher().fetchAndExtract("http://www.faz.net/-01s7fc", 10000, true); 36 | // System.out.println("faz:" + res.getUrl()); 37 | 38 | res = new HtmlFetcher().fetchAndExtract("http://www.google.com/url?sa=x&q=http://www.taz.de/1/politik/asien/artikel/1/anti-atomkraft-nein-danke/&ct=ga&cad=caeqargbiaaoataaoabaltmh7qrialaawabibwrllurf&cd=d5glzns5m_4&usg=afqjcnetx___sph8sjwhjwi-_mmdnhilra&utm_source=twitterfeed&utm_medium=twitter", 10000, true); 39 | 
assertEquals("http://www.taz.de/1/politik/asien/artikel/1/anti-atomkraft-nein-danke/", res.getUrl()); 40 | // System.out.println("google redirect:" + res.getUrl()); 41 | 42 | res = new HtmlFetcher().fetchAndExtract("http://bit.ly/gyFxfv", 10000, true); 43 | assertEquals("http://www.obiavi-bg.com/obiava_688245-6|260|262|/%D0%BF%D1%80%D0%BE%D0%BB%D0%B5%D1%82%D0%BD%D0%B0-%D0%BF%D1%80%D0%BE%D0%BC%D0%BE%D1%86%D0%B8%D1%8F-%D0%B4%D0%B0-%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%B8%D1%80%D0%B0%D0%BC%D0%B5-%D1%81-java.html?utm_source=twitterfeed&utm_medium=twitter", 44 | res.getUrl()); 45 | } 46 | 47 | @Test 48 | public void testWithTitle() throws Exception { 49 | JResult res = new HtmlFetcher().fetchAndExtract("http://www.midgetmanofsteel.com/2011/03/its-only-matter-of-time-before-fox-news.html", 10000, true); 50 | assertEquals("It's Only a Matter of Time Before Fox News Takes Out a Restraining Order", res.getTitle()); 51 | assertEquals("2011/03", res.getDate()); 52 | } 53 | 54 | // do not support this uglyness 55 | // @Test 56 | // public void doubleRedirect() throws Exception { 57 | // JResult res = new HtmlFetcher().fetchAndExtract("http://bit.ly/eZPI1c", 10000, true); 58 | // assertEquals("12 Minuten Battlefield 3 Gameplay - ohne Facebook-Bedingungen | Spaß und Spiele", res.getTitle()); 59 | // } 60 | // not available anymore 61 | // @Test 62 | // public void testTwitpicGzipDoesNOTwork() throws Exception { 63 | // JResult res = new HtmlFetcher().fetchAndExtract("http://twitpic.com/4kuem8", 12000, true); 64 | // assertTrue(res.getText(), res.getText().contains("*Not* what you want to see")); 65 | // } 66 | @Test 67 | public void testEncoding() throws Exception { 68 | JResult res = new HtmlFetcher().fetchAndExtract("http://www.yomiuri.co.jp/science/", 10000, true); 69 | assertEquals("科学・ITニュース:読売新聞(YOMIURI ONLINE)", res.getTitle()); 70 | } 71 | 72 | @Test 73 | public void testHashbang() throws Exception { 74 | JResult res = new 
HtmlFetcher().fetchAndExtract("http://www.facebook.com/democracynow", 10000, true); 75 | assertTrue(res.getTitle(), res.getTitle().startsWith("Democracy Now!")); 76 | 77 | // not available anymore 78 | // res = new HtmlFetcher().fetchAndExtract("http://twitter.com/#!/th61/status/57141697720745984", 10000, true); 79 | // assertTrue(res.getTitle(), res.getTitle().startsWith("Twitter / TH61: “@AntiAtomPiraten:")); 80 | } 81 | 82 | public void testImage() throws Exception { 83 | JResult res = new HtmlFetcher().fetchAndExtract("http://grfx.cstv.com/schools/okla/graphics/auto/20110505_schedule.jpg", 10000, true); 84 | assertEquals("http://grfx.cstv.com/schools/okla/graphics/auto/20110505_schedule.jpg", res.getImageUrl()); 85 | assertTrue(res.getTitle().isEmpty()); 86 | assertTrue(res.getText().isEmpty()); 87 | } 88 | 89 | @Test 90 | public void testFurther() throws Exception { 91 | JResult res = new HtmlFetcher().fetchAndExtract("https://linksunten.indymedia.org/de/node/41619?utm_source=twitterfeed&utm_medium=twitter", 10000, true); 92 | assertTrue(res.getText(), res.getText().startsWith("Es gibt kein ruhiges Hinterland! 
Schon wieder den ")); 93 | } 94 | 95 | @Test 96 | public void testDoubleResolve() throws Exception { 97 | JResult res = new HtmlFetcher().fetchAndExtract("http://t.co/eZRKcEYI", 10000, true); 98 | assertTrue(res.getTitle(), res.getTitle().startsWith("GitHub - teleject/Responsive-Web-Design-Artboards")); 99 | } 100 | 101 | @Test 102 | public void testXml() throws Exception { 103 | String str = new HtmlFetcher().fetchAsString("https://karussell.wordpress.com/feed/", 10000); 104 | assertTrue(str, str.startsWith(" textList; 43 | private Collection keywords; 44 | private List images = null; 45 | 46 | public JResult() { 47 | } 48 | 49 | public String getUrl() { 50 | if (url == null) 51 | return ""; 52 | return url; 53 | } 54 | 55 | public JResult setUrl(String url) { 56 | this.url = url; 57 | return this; 58 | } 59 | 60 | public JResult setOriginalUrl(String originalUrl) { 61 | this.originalUrl = originalUrl; 62 | return this; 63 | } 64 | 65 | public String getOriginalUrl() { 66 | return originalUrl; 67 | } 68 | 69 | public JResult setCanonicalUrl(String canonicalUrl) { 70 | this.canonicalUrl = canonicalUrl; 71 | return this; 72 | } 73 | 74 | public String getCanonicalUrl() { 75 | return canonicalUrl; 76 | } 77 | 78 | public String getFaviconUrl() { 79 | if (faviconUrl == null) 80 | return ""; 81 | return faviconUrl; 82 | } 83 | 84 | public JResult setFaviconUrl(String faviconUrl) { 85 | this.faviconUrl = faviconUrl; 86 | return this; 87 | } 88 | 89 | public JResult setRssUrl(String rssUrl) { 90 | this.rssUrl = rssUrl; 91 | return this; 92 | } 93 | 94 | public String getRssUrl() { 95 | if (rssUrl == null) 96 | return ""; 97 | return rssUrl; 98 | } 99 | 100 | public String getDescription() { 101 | if (description == null) 102 | return ""; 103 | return description; 104 | } 105 | 106 | public JResult setDescription(String description) { 107 | this.description = description; 108 | return this; 109 | } 110 | 111 | public String getImageUrl() { 112 | if (imageUrl == null) 113 
| return ""; 114 | return imageUrl; 115 | } 116 | 117 | public JResult setImageUrl(String imageUrl) { 118 | this.imageUrl = imageUrl; 119 | return this; 120 | } 121 | 122 | public String getText() { 123 | if (text == null) 124 | return ""; 125 | 126 | return text; 127 | } 128 | 129 | public JResult setText(String text) { 130 | this.text = text; 131 | return this; 132 | } 133 | 134 | public List getTextList() { 135 | if(this.textList == null) 136 | return new ArrayList(); 137 | return this.textList; 138 | } 139 | 140 | public JResult setTextList(List textList) { 141 | this.textList = textList; 142 | return this; 143 | } 144 | 145 | public String getTitle() { 146 | if (title == null) 147 | return ""; 148 | return title; 149 | } 150 | 151 | public JResult setTitle(String title) { 152 | this.title = title; 153 | return this; 154 | } 155 | 156 | public String getVideoUrl() { 157 | if (videoUrl == null) 158 | return ""; 159 | return videoUrl; 160 | } 161 | 162 | public JResult setVideoUrl(String videoUrl) { 163 | this.videoUrl = videoUrl; 164 | return this; 165 | } 166 | 167 | public JResult setDate(String date) { 168 | this.dateString = date; 169 | return this; 170 | } 171 | 172 | public Collection getKeywords() { 173 | return keywords; 174 | } 175 | 176 | public void setKeywords(Collection keywords) { 177 | this.keywords = keywords; 178 | } 179 | 180 | /** 181 | * @return get date from url or guessed from text 182 | */ 183 | public String getDate() { 184 | return dateString; 185 | } 186 | 187 | /** 188 | * @return images list 189 | */ 190 | public List getImages() { 191 | if (images == null) 192 | return Collections.emptyList(); 193 | return images; 194 | } 195 | 196 | /** 197 | * @return images count 198 | */ 199 | public int getImagesCount() { 200 | if (images == null) 201 | return 0; 202 | return images.size(); 203 | } 204 | 205 | /** 206 | * set images list 207 | */ 208 | public void setImages(List images) { 209 | this.images = images; 210 | } 211 | 212 | @Override 
213 | public String toString() { 214 | return "title:" + getTitle() + " imageUrl:" + getImageUrl() + " text:" + text; 215 | } 216 | } 217 | -------------------------------------------------------------------------------- /src/main/java/de/jetwick/snacktory/OutputFormatter.java: -------------------------------------------------------------------------------- 1 | package de.jetwick.snacktory; 2 | 3 | import org.jsoup.Jsoup; 4 | import org.jsoup.nodes.Element; 5 | import org.jsoup.select.Elements; 6 | 7 | import java.util.ArrayList; 8 | import java.util.Arrays; 9 | import java.util.List; 10 | import java.util.regex.Pattern; 11 | import org.jsoup.nodes.Node; 12 | import org.jsoup.nodes.TextNode; 13 | 14 | /** 15 | * @author goose | jim 16 | * @author karussell 17 | * 18 | * this class will be responsible for taking our top node and stripping out junk 19 | * we don't want and getting it ready for how we want it presented to the user 20 | */ 21 | public class OutputFormatter { 22 | 23 | public static final int MIN_PARAGRAPH_TEXT = 50; 24 | private static final List NODES_TO_REPLACE = Arrays.asList("strong", "b", "i"); 25 | private Pattern unlikelyPattern = Pattern.compile("display\\:none|visibility\\:hidden"); 26 | protected final int minParagraphText; 27 | protected final List nodesToReplace; 28 | protected String nodesToKeepCssSelector = "p"; 29 | 30 | public OutputFormatter() { 31 | this(MIN_PARAGRAPH_TEXT, NODES_TO_REPLACE); 32 | } 33 | 34 | public OutputFormatter(int minParagraphText) { 35 | this(minParagraphText, NODES_TO_REPLACE); 36 | } 37 | 38 | public OutputFormatter(int minParagraphText, List nodesToReplace) { 39 | this.minParagraphText = minParagraphText; 40 | this.nodesToReplace = nodesToReplace; 41 | } 42 | 43 | /** 44 | * set elements to keep in output text 45 | */ 46 | public void setNodesToKeepCssSelector(String nodesToKeepCssSelector) { 47 | this.nodesToKeepCssSelector = nodesToKeepCssSelector; 48 | } 49 | 50 | /** 51 | * takes an element and turns the 
P tags into \n\n 52 | */ 53 | public String getFormattedText(Element topNode) { 54 | removeNodesWithNegativeScores(topNode); 55 | StringBuilder sb = new StringBuilder(); 56 | append(topNode, sb, nodesToKeepCssSelector); 57 | String str = SHelper.innerTrim(sb.toString()); 58 | if (str.length() > 100) 59 | return str; 60 | 61 | // no subelements 62 | if (str.isEmpty() || !topNode.text().isEmpty() && str.length() <= topNode.ownText().length()) 63 | str = topNode.text(); 64 | 65 | // if jsoup failed to parse the whole html now parse this smaller 66 | // snippet again to avoid html tags disturbing our text: 67 | return Jsoup.parse(str).text(); 68 | } 69 | 70 | /** 71 | * Takes an element and returns a list of texts extracted from the P tags 72 | */ 73 | public List getTextList(Element topNode) { 74 | List texts = new ArrayList(); 75 | for(Element element : topNode.select(this.nodesToKeepCssSelector)) { 76 | if(element.hasText()) { 77 | texts.add(element.text()); 78 | } 79 | } 80 | return texts; 81 | } 82 | 83 | /** 84 | * If there are elements inside our top node that have a negative gravity 85 | * score remove them 86 | */ 87 | protected void removeNodesWithNegativeScores(Element topNode) { 88 | Elements gravityItems = topNode.select("*[gravityScore]"); 89 | for (Element item : gravityItems) { 90 | int score = Integer.parseInt(item.attr("gravityScore")); 91 | if (score < 0 || item.text().length() < minParagraphText) 92 | item.remove(); 93 | } 94 | } 95 | 96 | protected void append(Element node, StringBuilder sb, String tagName) { 97 | // is select more costly then getElementsByTag? 
98 | MAIN: 99 | for (Element e : node.select(tagName)) { 100 | Element tmpEl = e; 101 | // check all elements until 'node' 102 | while (tmpEl != null && !tmpEl.equals(node)) { 103 | if (unlikely(tmpEl)) 104 | continue MAIN; 105 | tmpEl = tmpEl.parent(); 106 | } 107 | 108 | String text = node2Text(e); 109 | if (text.isEmpty() || text.length() < minParagraphText || text.length() > SHelper.countLetters(text) * 2) 110 | continue; 111 | 112 | sb.append(text); 113 | sb.append("\n\n"); 114 | } 115 | } 116 | 117 | boolean unlikely(Node e) { 118 | if (e.attr("class") != null && e.attr("class").toLowerCase().contains("caption")) 119 | return true; 120 | 121 | String style = e.attr("style"); 122 | String clazz = e.attr("class"); 123 | if (unlikelyPattern.matcher(style).find() || unlikelyPattern.matcher(clazz).find()) 124 | return true; 125 | return false; 126 | } 127 | 128 | void appendTextSkipHidden(Element e, StringBuilder accum) { 129 | for (Node child : e.childNodes()) { 130 | if (unlikely(child)) 131 | continue; 132 | if (child instanceof TextNode) { 133 | TextNode textNode = (TextNode) child; 134 | String txt = textNode.text(); 135 | accum.append(txt); 136 | } else if (child instanceof Element) { 137 | Element element = (Element) child; 138 | if (accum.length() > 0 && element.isBlock() && !lastCharIsWhitespace(accum)) 139 | accum.append(" "); 140 | else if (element.tagName().equals("br")) 141 | accum.append(" "); 142 | appendTextSkipHidden(element, accum); 143 | } 144 | } 145 | } 146 | 147 | boolean lastCharIsWhitespace(StringBuilder accum) { 148 | if (accum.length() == 0) 149 | return false; 150 | return Character.isWhitespace(accum.charAt(accum.length() - 1)); 151 | } 152 | 153 | protected String node2TextOld(Element el) { 154 | return el.text(); 155 | } 156 | 157 | protected String node2Text(Element el) { 158 | StringBuilder sb = new StringBuilder(200); 159 | appendTextSkipHidden(el, sb); 160 | return sb.toString(); 161 | } 162 | 163 | public OutputFormatter 
setUnlikelyPattern(String unlikelyPattern) { 164 | this.unlikelyPattern = Pattern.compile(unlikelyPattern); 165 | return this; 166 | } 167 | 168 | public OutputFormatter appendUnlikelyPattern(String str) { 169 | return setUnlikelyPattern(unlikelyPattern.toString() + "|" + str); 170 | } 171 | } 172 | -------------------------------------------------------------------------------- /src/test/java/de/jetwick/snacktory/SHelperTest.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Peter Karich 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 
15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import org.junit.Test; 19 | import static org.junit.Assert.*; 20 | 21 | /** 22 | * 23 | * @author Peter Karich 24 | */ 25 | public class SHelperTest { 26 | 27 | public SHelperTest() { 28 | } 29 | 30 | @Test 31 | public void testInnerTrim() { 32 | assertEquals("", SHelper.innerTrim(" ")); 33 | assertEquals("t", SHelper.innerTrim(" t ")); 34 | assertEquals("t t t", SHelper.innerTrim("t t t ")); 35 | assertEquals("t t", SHelper.innerTrim("t \nt ")); 36 | assertEquals("t peter", SHelper.innerTrim("t peter ")); 37 | assertEquals("t t", SHelper.innerTrim("t \n t ")); 38 | } 39 | 40 | @Test 41 | public void testCount() { 42 | assertEquals(1, SHelper.count("hi wie &test; gehts", "&test;")); 43 | assertEquals(1, SHelper.count("&test;", "&test;")); 44 | assertEquals(2, SHelper.count("&test;&test;", "&test;")); 45 | assertEquals(2, SHelper.count("&test; &test;", "&test;")); 46 | assertEquals(3, SHelper.count("&test; test; &test; plu &test;", "&test;")); 47 | } 48 | 49 | @Test 50 | public void longestSubstring() { 51 | // assertEquals(9, ArticleTextExtractor.longestSubstring("hi hello how are you?", "hello how")); 52 | assertEquals("hello how", SHelper.getLongestSubstring("hi hello how are you?", "hello how")); 53 | assertEquals(" people if ", SHelper.getLongestSubstring("x now if people if todo?", "I know people if you")); 54 | assertEquals("", SHelper.getLongestSubstring("?", "people")); 55 | assertEquals("people", SHelper.getLongestSubstring(" people ", "people")); 56 | } 57 | 58 | @Test 59 | public void testHashbang() { 60 | assertEquals("sdfiasduhf+asdsad+sdfsdf#!", SHelper.removeHashbang("sdfiasduhf+asdsad#!+sdfsdf#!")); 61 | assertEquals("sdfiasduhf+asdsad+sdfsdf#!", SHelper.removeHashbang("sdfiasduhf+asdsad#!+sdfsdf#!")); 62 | } 63 | 64 | @Test 65 | public void testIsVideoLink() { 66 | assertTrue(SHelper.isVideoLink("m.vimeo.com")); 67 | assertTrue(SHelper.isVideoLink("m.youtube.com")); 68 | 
assertTrue(SHelper.isVideoLink("www.youtube.com")); 69 | assertTrue(SHelper.isVideoLink("http://youtube.com")); 70 | assertTrue(SHelper.isVideoLink("http://www.youtube.com")); 71 | 72 | assertTrue(SHelper.isVideoLink("https://youtube.com")); 73 | 74 | assertFalse(SHelper.isVideoLink("test.com")); 75 | assertFalse(SHelper.isVideoLink("irgendwas.com/youtube.com")); 76 | } 77 | 78 | @Test 79 | public void testExctractHost() { 80 | assertEquals("techcrunch.com", 81 | SHelper.extractHost("http://techcrunch.com/2010/08/13/gantto-takes-on-microsoft-project-with-web-based-project-management-application/")); 82 | } 83 | 84 | @Test 85 | public void testFavicon() { 86 | assertEquals("http://www.n24.de/news/../../../media/imageimport/images/content/favicon.ico", 87 | SHelper.useDomainOfFirstArg4Second("http://www.n24.de/news/newsitem_6797232.html", "../../../media/imageimport/images/content/favicon.ico")); 88 | SHelper.useDomainOfFirstArg4Second("http://www.n24.de/favicon.ico", "/favicon.ico"); 89 | SHelper.useDomainOfFirstArg4Second("http://www.n24.de/favicon.ico", "favicon.ico"); 90 | } 91 | 92 | @Test 93 | public void testFaviconProtocolRelative() throws Exception { 94 | assertEquals("http://de.wikipedia.org/apple-touch-icon.png", 95 | SHelper.useDomainOfFirstArg4Second("http://de.wikipedia.org/favicon", "//de.wikipedia.org/apple-touch-icon.png")); 96 | } 97 | 98 | @Test 99 | public void testImageProtocolRelative() throws Exception { 100 | assertEquals("http://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Greece.svg/150px-Flag_of_Greece.svg.png", 101 | SHelper.useDomainOfFirstArg4Second("http://de.wikipedia.org/wiki/Griechenland", "//upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Flag_of_Greece.svg/150px-Flag_of_Greece.svg.png")); 102 | } 103 | 104 | @Test 105 | public void testEncodingCleanup() { 106 | assertEquals("utf-8", SHelper.encodingCleanup("utf-8")); 107 | assertEquals("utf-8", SHelper.encodingCleanup("utf-8\"")); 108 | assertEquals("utf-8", 
SHelper.encodingCleanup("utf-8'")); 109 | assertEquals("test-8", SHelper.encodingCleanup(" test-8 &")); 110 | } 111 | 112 | @Test 113 | public void testUglyFacebook() { 114 | assertEquals("http://www.bet.com/collegemarketingreps&h=42263", 115 | SHelper.getUrlFromUglyFacebookRedirect("http://www.facebook.com/l.php?u=http%3A%2F%2Fwww.bet.com%2Fcollegemarketingreps&h=42263")); 116 | } 117 | 118 | @Test 119 | public void testEstimateDate() { 120 | assertNull(SHelper.estimateDate("http://www.facebook.com/l.php?u=http%3A%2F%2Fwww.bet.com%2Fcollegemarketin")); 121 | assertEquals("2010/02/15", SHelper.estimateDate("http://www.vogella.de/blog/2010/02/15/twitter-android/")); 122 | assertEquals("2010/02", SHelper.estimateDate("http://www.vogella.de/blog/2010/02/twitter-android/12")); 123 | assertEquals("2009/11/05", SHelper.estimateDate("http://cagataycivici.wordpress.com/2009/11/05/mobile-twitter-client-with-jsf/")); 124 | assertEquals("2009", SHelper.estimateDate("http://cagataycivici.wordpress.com/2009/sf/12/1/")); 125 | assertEquals("2011/06", SHelper.estimateDate("http://bdoughan.blogspot.com/2011/06/using-jaxbs-xmlaccessortype-to.html")); 126 | assertEquals("2011", SHelper.estimateDate("http://bdoughan.blogspot.com/2011/13/using-jaxbs-xmlaccessortype-to.html")); 127 | } 128 | 129 | @Test 130 | public void testCompleteDate() { 131 | assertNull(SHelper.completeDate(null)); 132 | assertEquals("2001/01/01", SHelper.completeDate("2001")); 133 | assertEquals("2001/11/01", SHelper.completeDate("2001/11")); 134 | assertEquals("2001/11/02", SHelper.completeDate("2001/11/02")); 135 | } 136 | } 137 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/i4online.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | I-4 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 31 | 32 | 37 |
38 |
49 | 99 |
100 | Introduction 101 | 102 |
103 |
104 |

Upcoming events: Forum 79

105 | 106 |

24th to 26th June 2013

107 | 108 |

Just one week to go and everything is set for the summer Forum 2013.  The Forum features an excellent programme, some very special keynote speakers and an extra add on event for the early evening on Wednesday 26th.   Full details are all in the Members section of the website.

109 |
110 |

Upcoming events: July webinar

111 | 112 |

July 2013 - date to be confirmed

113 | 114 |

The July webinar - "Finding the Needle in a Needle Stack: Surveillance Analytics" - will explore the real world of big data analytics for information security.  Details here soon.  The recording and presentation materials for the May webinar are in the Members section of the website.

115 |
116 |

Upcoming events: September Regional Meeting

117 | 118 |

25th September 2013

119 | 120 |

The next Members' one-day Regional Meeting will be held in central London, UK, on 25th September.  Full details will be sent to Members and posted here in mid-June.

121 |
122 | 123 |
124 | 133 |
134 |
135 | 143 | 149 | 150 | -------------------------------------------------------------------------------- /src/main/java/de/jetwick/snacktory/Converter.java: -------------------------------------------------------------------------------- 1 | /* 2 | * Copyright 2011 Peter Karich 3 | * 4 | * Licensed under the Apache License, Version 2.0 (the "License"); 5 | * you may not use this file except in compliance with the License. 6 | * You may obtain a copy of the License at 7 | * 8 | * http://www.apache.org/licenses/LICENSE-2.0 9 | * 10 | * Unless required by applicable law or agreed to in writing, software 11 | * distributed under the License is distributed on an "AS IS" BASIS, 12 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 13 | * See the License for the specific language governing permissions and 14 | * limitations under the License. 15 | */ 16 | package de.jetwick.snacktory; 17 | 18 | import java.io.*; 19 | import java.net.SocketTimeoutException; 20 | import java.nio.charset.Charset; 21 | import org.slf4j.Logger; 22 | import org.slf4j.LoggerFactory; 23 | 24 | /** 25 | * This class is not thread safe. Use one new instance every time due to 26 | * encoding variable. 
27 | * 28 | * @author Peter Karich 29 | */ 30 | public class Converter { 31 | 32 | private static final Logger logger = LoggerFactory.getLogger(Converter.class); 33 | public final static String UTF8 = "UTF-8"; 34 | public final static String ISO = "ISO-8859-1"; 35 | public final static int K2 = 2048; 36 | private int maxBytes = 1000000 / 2; 37 | private String encoding; 38 | private String url; 39 | 40 | public Converter(String urlOnlyHint) { 41 | url = urlOnlyHint; 42 | } 43 | 44 | public Converter() { 45 | } 46 | 47 | public Converter setMaxBytes(int maxBytes) { 48 | this.maxBytes = maxBytes; 49 | return this; 50 | } 51 | 52 | public static String extractEncoding(String contentType) { 53 | String[] values; 54 | if (contentType != null) 55 | values = contentType.split(";"); 56 | else 57 | values = new String[0]; 58 | 59 | String charset = ""; 60 | 61 | for (String value : values) { 62 | value = value.trim().toLowerCase(); 63 | 64 | if (value.startsWith("charset=")) 65 | charset = value.substring("charset=".length()); 66 | } 67 | 68 | // http1.1 says ISO-8859-1 is the default charset 69 | if (charset.length() == 0) 70 | charset = ISO; 71 | 72 | return charset; 73 | } 74 | 75 | public String getEncoding() { 76 | if (encoding == null) 77 | return ""; 78 | return encoding.toLowerCase(); 79 | } 80 | 81 | public String streamToString(InputStream is) { 82 | return streamToString(is, maxBytes, encoding); 83 | } 84 | 85 | public String streamToString(InputStream is, String enc) { 86 | return streamToString(is, maxBytes, enc); 87 | } 88 | 89 | /** 90 | * Reads bytes off the input stream and returns them decoded as a string 91 | * 92 | * @param is the input stream to read from 93 | * @param maxBytes The max bytes that we want to read from the input stream 94 | * @return the decoded content, or an empty string if reading failed 95 | */ 96 | public String streamToString(InputStream is, int maxBytes, String enc) { 97 | encoding = enc; 98 | // Http 1.1.
standard is iso-8859-1 not utf8 :( 99 | // but we force utf-8 as youtube assumes it ;) 100 | if (encoding == null || encoding.isEmpty()) 101 | encoding = UTF8; 102 | 103 | BufferedInputStream in = null; 104 | try { 105 | in = new BufferedInputStream(is, K2); 106 | ByteArrayOutputStream output = new ByteArrayOutputStream(); 107 | 108 | // detect encoding with the help of meta tag 109 | try { 110 | in.mark(K2 * 2); 111 | String tmpEnc = detectCharset("charset=", output, in, encoding); 112 | if (tmpEnc != null) 113 | encoding = tmpEnc; 114 | else { 115 | logger.debug("no charset found in first stage"); 116 | // detect with the help of xml beginning ala encoding="charset" 117 | tmpEnc = detectCharset("encoding=", output, in, encoding); 118 | if (tmpEnc != null) 119 | encoding = tmpEnc; 120 | else 121 | logger.debug("no charset found in second stage"); 122 | } 123 | 124 | if (!Charset.isSupported(encoding)) 125 | throw new UnsupportedEncodingException(encoding); 126 | } catch (UnsupportedEncodingException e) { 127 | logger.warn("Using default encoding:" + UTF8 128 | + " problem:" + e.getMessage() + " encoding:" + encoding + " " + url); 129 | encoding = UTF8; 130 | } 131 | 132 | // SocketException: Connection reset 133 | // IOException: missing CR => problem on server (probably some xml character thing?) 134 | // IOException: Premature EOF => socket unexpectly closed from server 135 | int bytesRead = output.size(); 136 | byte[] arr = new byte[K2]; 137 | while (true) { 138 | if (bytesRead >= maxBytes) { 139 | logger.warn("Maxbyte of " + maxBytes + " exceeded! Maybe html is now broken but try it nevertheless. 
Url: " + url); 140 | break; 141 | } 142 | 143 | int n = in.read(arr); 144 | if (n < 0) 145 | break; 146 | bytesRead += n; 147 | output.write(arr, 0, n); 148 | } 149 | 150 | return output.toString(encoding); 151 | } catch (SocketTimeoutException e) { 152 | logger.info(e.toString() + " url:" + url); 153 | } catch (IOException e) { 154 | logger.warn(e.toString() + " url:" + url); 155 | } finally { 156 | if (in != null) { 157 | try { 158 | in.close(); 159 | } catch (Exception e) { 160 | } 161 | } 162 | } 163 | return ""; 164 | } 165 | 166 | /** 167 | * This method detects the charset even if the first call only returns some 168 | * bytes. It will read until 4K bytes are reached and then try to determine 169 | * the encoding 170 | * 171 | * @throws IOException 172 | */ 173 | protected String detectCharset(String key, ByteArrayOutputStream bos, BufferedInputStream in, 174 | String enc) throws IOException { 175 | 176 | // Grab better encoding from stream 177 | byte[] arr = new byte[K2]; 178 | int nSum = 0; 179 | while (nSum < K2) { 180 | int n = in.read(arr); 181 | if (n < 0) 182 | break; 183 | 184 | nSum += n; 185 | bos.write(arr, 0, n); 186 | } 187 | 188 | String str = bos.toString(enc); 189 | int encIndex = str.indexOf(key); 190 | int clength = key.length(); 191 | if (encIndex > 0) { 192 | char startChar = str.charAt(encIndex + clength); 193 | int lastEncIndex; 194 | if (startChar == '\'') 195 | // if we have charset='something' 196 | lastEncIndex = str.indexOf("'", ++encIndex + clength); 197 | else if (startChar == '\"') 198 | // if we have charset="something" 199 | lastEncIndex = str.indexOf("\"", ++encIndex + clength); 200 | else { 201 | // if we have "text/html; charset=utf-8" 202 | int first = str.indexOf("\"", encIndex + clength); 203 | if (first < 0) 204 | first = Integer.MAX_VALUE; 205 | 206 | // or "text/html; charset=utf-8 " 207 | int sec = str.indexOf(" ", encIndex + clength); 208 | if (sec < 0) 209 | sec = Integer.MAX_VALUE; 210 | lastEncIndex = 
Math.min(first, sec); 211 | 212 | // or "text/html; charset=utf-8 ' 213 | int third = str.indexOf("'", encIndex + clength); 214 | if (third > 0) 215 | lastEncIndex = Math.min(lastEncIndex, third); 216 | } 217 | 218 | // re-read byte array with different encoding 219 | // assume that the encoding string cannot be greater than 40 chars 220 | if (lastEncIndex > encIndex + clength && lastEncIndex < encIndex + clength + 40) { 221 | String tmpEnc = SHelper.encodingCleanup(str.substring(encIndex + clength, lastEncIndex)); 222 | try { 223 | in.reset(); 224 | bos.reset(); 225 | return tmpEnc; 226 | } catch (IOException ex) { 227 | logger.warn("Couldn't reset stream to re-read with new encoding " + tmpEnc + " " 228 | + ex.toString()); 229 | } 230 | } 231 | } 232 | return null; 233 | } 234 | } 235 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/grapevinyl.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | Magnetic Morning - Getting Nowhere (Music Video) @ Grapevinyl 5 | 6 | 7 | 8 | 9 | 10 | 11 | 21 | 22 | 23 | 115 | 128 | 138 | 139 | 140 | 162 | 163 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/cnbc.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | News Headlines 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 |
24 | 25 |
26 |
36 | 37 | 38 | 39 |
40 |
Real-Time Quote 41 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
51 |
52 |
53 | 54 |
55 |
US News
Page 1 of 2 | Next Page
Show Entire Article

Chinese Art Expert 'Skeptical' of Record-Setting Vase
CNBC.com | December 20, 2010 | 02:22 PM EST

56 |

A prominent expert on Chinese works of art has expressed doubts as to the authenticity of an antique Chinese vase sold last month at a London auction for 51.6 million pounds ($83.2 million)—an auction record for Chinese works of art.

“I’ve seen the vase itself. I went out to look at it,” New York gallery owner James Lally of J.J. Lally & Co. told CNBC. "I’m very skeptical of that piece. If you asked me I would say don’t bid on it.”

The porcelain vase is said to be an 18th-century Qing Dynasty piece from the Imperial court of Emperor Qianlong, the category of Chinese porcelain most coveted by Chinese collectors in a market that is soaring ever higher with every international auction.

Found in a London suburb by a woman clearing out her late sister’s house, the 16-inch high vase went on the block at the small West London auction house of Bainbridges, where it was expected to fetch around 1 million pounds. After a bidding war, the vase was scooped up by a Chinese bidder representing an anonymous client.

“There are a number of people who do not find that piece convincing. And I think people who were bidding on it, some of them on the telephone, were taking an enormous risk,” Lally said.

Page 1 of 2 | Next Page
Show Entire Article 57 |
More Top Stories
58 | 59 | 60 | 101 |
102 |
103 |
104 | 105 | 106 | 107 |

108 |
109 | 110 |
111 | 112 |
113 |
Real-Time Quote 114 | 117 | 118 | 119 | 120 | 121 | 122 | 123 |
124 |
125 | 126 |
131 |
132 | 133 |
omniture pixel
134 |
135 | 136 | 137 | 138 | 139 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/daltoncaldwell.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | Dear Mark Zuckerberg by Dalton Caldwell 7 | 8 | 9 | 15 | 16 | 17 | 19 | 20 | 25 | 26 | 27 | 63 | 64 |
65 |
66 | 67 |
68 |
69 |

70 | Dear Mark Zuckerberg 71 |

72 |

Mark,

73 | 74 |

On June 13, 2012, at 4:30 p.m., I attended a meeting at Facebook HQ in Menlo Park, California. In addition to myself, the meeting was attended by executives at Facebook with the following titles: “VP, Engineering & Products”, “VP, Partnerships”, “VP, Corporate & Business Development”, and “Director, Developer Relations/Open Graph”.

75 | 76 |

As I understood at the time, the purpose of the meeting was for me to present/demonstrate a new iOS app & service I have been building on the Facebook Platform. Previously, I had been reassured by Facebook dev-relations employees that the service I was building was an interesting/valuable use of Open Graph & Facebook Platform. I was hoping the outcome of this meeting would be executive-level support for my impending product launch.

77 | 78 |

The meeting took an odd turn when the individuals in the room explained that the product I was building was competitive with your recently-announced Facebook App Center product. Your executives explained to me that they would hate to have to compete with the “interesting product” I had built, and that since I am a “nice guy with a good reputation” that they wanted to acquire my company to help build App Center.

79 | 80 |

I quickly became skeptical and explained that I was not interested in an acqui-hire. I said that if Facebook wanted to have a serious conversation about acquiring my team and product, I would entertain the idea. Otherwise, I had zero interest in seeing my product shut down and joining Facebook. I told your team I would rather reboot my company than go down that route.

81 | 82 |

Strangely, your “platform developer relations” executive made no attempt to defend my position. Rather, he explained that he was recently given ownership of App Center, and that because of new ad units they were building, he was now responsible for over $1B/year in ad revenue. The execs in the room made clear that the success of my product would be an impediment to your ad revenue financial goals, and thus even offering me the chance to be acquired was a noble and kind move on their part.

83 | 84 |

I am not sure if this bubbled up to you, Mark, but after this all happened I directly communicated my feedback regarding just how unhappy I was with this situation to one of your executives. The executive apologized and said he would take my feedback under consideration.

85 | 86 |

Mark, I know for a fact that my experience was not an isolated incident. Several other startup founders & Facebook employees have told me that what I experienced was part of a systematic M&A “formula”. Your team doesn’t seem to understand that being “good negotiators” vs implying that you will destroy someone’s business built on your “open platform” are not the same thing. I know all about intimidation-based negotiation tactics: I experienced them for years while dealing with the music industry. Bad-faith negotiations are inexcusable, and I didn’t want to believe your company would stoop this low. My mistake.

87 | 88 |

In a lot of ways, I got what I deserved. I have come to the conclusion that I took this foolhardy risk because the Twitter “platform” was even more of a joke than the Facebook “platform”. As someone that wants to build quality social software, software that doesn’t force users to re-create their friends list, or not use oAuth, etc., I have to endure huge platform risk. Personally speaking, I am resolved to never write another line of code for rotten-to-the-core “platforms” like Facebook or Twitter. Lesson learned.

89 | 90 |

Mark, I don’t believe that the humans working at Facebook or Twitter want to do the wrong thing. The problem is, employees at Facebook and Twitter are watching your stock price fall, and that is causing them to freak out. Your company, and Twitter, have demonstrably proven that they are willing to screw with users and 3rd-party developer ecosystems, all in the name of ad-revenue. Once you start down the slippery-slope of messing with developers and users, I don’t have any confidence you will stop.

91 | 92 |

I believe that future social platforms will behave more like infrastructure, and less like media companies. I believe that a number of smaller, interoperable social platforms with clear, sustainable business models will usurp you. These future companies will be valued at a small fraction of what Facebook and Twitter currently are. I think that is OK. Platforms are judged by the value generated by their ecosystem, not by the value the platforms directly capture.

93 | 94 |

I don’t think you or your employees are bad people. I just think you constructed a business that has financial motivations that are not in-line with users & developers. Even if my project isn’t the mechanism that instigates this change, the change will happen.

95 | 96 |

Mark, based on everything I know about you, I think you get all of this. It’s why you launched FB platform to begin with. Do you remember how you used to always refer to Facebook as a “social utility”? That is an interesting term to use. I haven’t heard you use that terminology in a while. I can guess why.

97 | 98 |

Anyway, Mark, perhaps the public markets & your employees will give you the time and goodwill to fix the obvious structural flaws in your “platform” business. You are in a very challenging position right now. Good luck.

99 | 100 |

Respectfully,

101 | 102 |

Dalton Caldwell

103 | 104 | 116 |
117 | 118 | 123 |
124 | 125 | 126 | 127 | 128 | -------------------------------------------------------------------------------- /test_data/3.html: -------------------------------------------------------------------------------- 1 | 2 | 7 | Where to See Silicon Valley 8 | 9 |


Where to See Silicon Valley

10 | 17 |
12 | Want to start a startup? Get funded by 13 | Y Combinator. 14 | 15 |
18 |

19 | October 2010

Silicon Valley proper is mostly suburban sprawl. At first glance 20 | it doesn't seem there's anything to see. It's not the sort of place 21 | that has conspicuous monuments. But if you look, there are subtle 22 | signs you're in a place that's different from other places.

1. Stanford 23 | University

Stanford is a strange place. Structurally it is to an ordinary 24 | university what suburbia is to a city. It's enormously spread out, 25 | and feels surprisingly empty much of the time. But notice the 26 | weather. It's probably perfect. And notice the beautiful mountains 27 | to the west. And though you can't see it, cosmopolitan San Francisco 28 | is 40 minutes to the north. That combination is much of the reason 29 | Silicon Valley grew up around this university and not some other 30 | one.

2. University 32 | Ave

A surprising amount of the work of the Valley is done in the cafes 33 | on or just off University Ave in Palo Alto. If you visit on a 34 | weekday between 10 and 5, you'll often see founders pitching 35 | investors. In case you can't tell, the founders are the ones leaning 36 | forward eagerly, and the investors are the ones sitting back with 37 | slightly pained expressions.

3. The Lucky 39 | Office

The office at 165 University Ave was Google's first. Then it was 40 | Paypal's. (Now it's Wepay's.) The interesting thing about it is 41 | the location. It's a smart move to put a startup in a place with 42 | restaurants and people walking around instead of in an office park, 43 | because then the people who work there want to stay there, instead 44 | of fleeing as soon as conventional working hours end. They go out 45 | for dinner together, talk about ideas, and then come back and 46 | implement them.

It's important to realize that Google's current location in an 47 | office park is not where they started; it's just where they were 48 | forced to move when they needed more space. Facebook was till 49 | recently across the street, till they too had to move because they 50 | needed more space.

4. Old Palo Alto

Palo Alto was not originally a suburb. For the first 100 years or so of its existence, it was a college town out in the countryside. Then in the mid 1950s it was engulfed in a wave of suburbia that raced down the peninsula. But Palo Alto north of Oregon Expressway still feels noticeably different from the area around it. It's one of the nicest places in the Valley. The buildings are old (though increasingly they are being torn down and replaced with generic McMansions) and the trees are tall. But houses are very expensive, around $1000 per square foot. This is post-exit Silicon Valley.

5. Sand Hill Road

It's interesting to see the VCs' offices on the north side of Sand Hill Road precisely because they're so boringly uniform. The buildings are all more or less the same, their exteriors express very little, and they are arranged in a confusing maze. (I've been visiting them for years and I still occasionally get lost.) It's not a coincidence. These buildings are a pretty accurate reflection of the VC business.

If you go on a weekday you may see groups of founders there to meet VCs. But mostly you won't see anyone; bustling is the last word you'd use to describe the atmos. Visiting Sand Hill Road reminds you that the opposite of "down and dirty" would be "up and clean."

6. Castro Street

It's a tossup whether Castro Street or University Ave should be considered the heart of the Valley now. University Ave would have been 10 years ago. But Palo Alto is getting expensive. Increasingly startups are located in Mountain View, and Palo Alto is a place they come to meet investors. Palo Alto has a lot of different cafes, but there is one that clearly dominates in Mountain View: Red Rock.

7. Google

Google spread out from its first building here to a lot of the surrounding ones. But because the buildings were built at different times by different people, the place doesn't have the sterile, walled-off feel that a typical large company's headquarters have. It definitely has a flavor of its own though. You sense there is something afoot. The general atmos is vaguely utopian; there are lots of Priuses, and people who look like they drive them.

You can't get into Google unless you know someone there. It's very much worth seeing inside if you can, though. Ditto for Facebook, at the end of California Ave in Palo Alto, though there is nothing to see outside.

8. Skyline Drive

Skyline Drive runs along the crest of the Santa Cruz mountains. On one side is the Valley, and on the other is the sea, which because it's cold and foggy and has few harbors, plays surprisingly little role in the lives of people in the Valley, considering how close it is. Along some parts of Skyline the dominant trees are huge redwoods, and in others they're live oaks. Redwoods mean those are the parts where the fog off the coast comes in at night; redwoods condense rain out of fog. The MROSD manages a collection of great walking trails off Skyline.

9. 280

Silicon Valley has two highways running the length of it: 101, which is pretty ugly, and 280, which is one of the more beautiful highways in the world. I always take 280 when I have a choice. Notice the long narrow lake to the west? That's the San Andreas Fault. It runs along the base of the hills, then heads uphill through Portola Valley. One of the MROSD trails runs right along the fault. A string of rich neighborhoods runs along the foothills to the west of 280: Woodside, Portola Valley, Los Altos Hills, Saratoga, Los Gatos.

SLAC goes right under 280 a little bit south of Sand Hill Road. And a couple miles south of that is the Valley's equivalent of the "Welcome to Las Vegas" sign: The Dish.



Notes

I skipped the Computer History Museum because this is a list of where to see the Valley itself, not where to see artifacts from it. I also skipped San Jose. San Jose calls itself the capital of Silicon Valley, but when people in the Valley use the phrase "the city," they mean San Francisco. San Jose is a dotted line on a map.

Thanks to Sam Altman, Paul Buchheit, Patrick Collison, and Jessica Livingston for reading drafts of this.




--------------------------------------------------------------------------------
/src/main/java/de/jetwick/snacktory/SHelper.java:
--------------------------------------------------------------------------------
/*
 *  Copyright 2011 Peter Karich
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *       http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package de.jetwick.snacktory;

import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.security.SecureRandom;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.text.SimpleDateFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.net.ssl.KeyManager;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import org.jsoup.nodes.Element;

/**
 *
 * @author Peter Karich
 */
public class SHelper {

    public static final String UTF8 = "UTF-8";
    private static final Pattern SPACE = Pattern.compile(" ");

    public static String replaceSpaces(String url) {
        if (!url.isEmpty()) {
            url = url.trim();
            if (url.contains(" ")) {
                Matcher spaces =
                        SPACE.matcher(url);
                url = spaces.replaceAll("%20");
            }
        }
        return url;
    }

    public static int count(String str, String substring) {
        int c = 0;
        int index1 = str.indexOf(substring);
        if (index1 >= 0) {
            c++;
            c += count(str.substring(index1 + substring.length()), substring);
        }
        return c;
    }

    /**
     * Collapses every run of spaces, tabs or newlines into a single space and
     * trims the result.
     */
    public static String innerTrim(String str) {
        if (str.isEmpty())
            return "";

        StringBuilder sb = new StringBuilder();
        boolean previousSpace = false;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c == ' ' || (int) c == 9 || c == '\n') {
                previousSpace = true;
                continue;
            }

            if (previousSpace)
                sb.append(' ');

            previousSpace = false;
            sb.append(c);
        }
        return sb.toString().trim();
    }

    /**
     * Starts reading the encoding from the first valid character until an
     * invalid encoding character occurs.
     */
    public static String encodingCleanup(String str) {
        StringBuilder sb = new StringBuilder();
        boolean startedWithCorrectString = false;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (Character.isDigit(c) || Character.isLetter(c) || c == '-' || c == '_') {
                startedWithCorrectString = true;
                sb.append(c);
                continue;
            }

            if (startedWithCorrectString)
                break;
        }
        return sb.toString().trim();
    }

    /**
     * @return the longest substring as str1.substring(result[0], result[1]);
     */
    public static String getLongestSubstring(String str1, String str2) {
        int res[] = longestSubstring(str1, str2);
        if (res == null || res[0] >= res[1])
            return "";

        return str1.substring(res[0], res[1]);
    }

    public static int[] longestSubstring(String str1, String str2) {
        if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
            return null;

        // dynamic programming => save already identical length into array
        // to understand this algo simply print identical length in every entry of the array
        // i+1, j+1 then reuses information from i,j
        // java initializes them already with 0
        int[][] num = new int[str1.length()][str2.length()];
        int maxlen = 0;
        int lastSubstrBegin = 0;
        int endIndex = 0;
        for (int i = 0; i < str1.length(); i++) {
            for (int j = 0; j < str2.length(); j++) {
                if (str1.charAt(i) == str2.charAt(j)) {
                    if ((i == 0) || (j == 0))
                        num[i][j] = 1;
                    else
                        num[i][j] = 1 + num[i - 1][j - 1];

                    if (num[i][j] > maxlen) {
                        maxlen = num[i][j];
                        // generate substring from str1 => i
                        lastSubstrBegin = i - num[i][j] + 1;
                        endIndex = i + 1;
                    }
                }
            }
        }
        return new int[]{lastSubstrBegin, endIndex};
    }

    public static String getDefaultFavicon(String url)
    {
        return useDomainOfFirstArg4Second(url, "/favicon.ico");
    }

    /**
     * @param urlForDomain extract the domain from this url
     * @param path this url does not have a domain
     * @return
     */
    public static String useDomainOfFirstArg4Second(String urlForDomain, String path) {
        if (path.startsWith("http"))
            return path;

        if ("favicon.ico".equals(path))
            path = "/favicon.ico";

        if (path.startsWith("//")) {
            // wikipedia special case, see tests
            if (urlForDomain.startsWith("https:"))
                return "https:" + path;

            return "http:" + path;
        } else if (path.startsWith("/"))
            return "http://" + extractHost(urlForDomain) + path;
        else if (path.startsWith("../")) {
            int slashIndex = urlForDomain.lastIndexOf("/");
            if (slashIndex > 0 && slashIndex + 1 < urlForDomain.length())
                urlForDomain = urlForDomain.substring(0, slashIndex + 1);

            return urlForDomain + path;
        }
        return path;
    }

    public static String extractHost(String url) {
        return extractDomain(url, false);
    }

    public static String extractDomain(String url, boolean aggressive) {
        if (url.startsWith("http://"))
            url = url.substring("http://".length());
        else if (url.startsWith("https://"))
            url = url.substring("https://".length());

        if (aggressive) {
            if (url.startsWith("www."))
                url = url.substring("www.".length());

            // strip mobile from start
            if (url.startsWith("m."))
                url = url.substring("m.".length());
        }

        int slashIndex = url.indexOf("/");
        if (slashIndex > 0)
            url = url.substring(0, slashIndex);

        return url;
    }

    public static boolean isVideoLink(String url) {
        url = extractDomain(url, true);
        return url.startsWith("youtube.com") || url.startsWith("video.yahoo.com")
                || url.startsWith("vimeo.com") ||
                url.startsWith("blip.tv");
    }

    public static boolean isVideo(String url) {
        return url.endsWith(".mpeg") || url.endsWith(".mpg") || url.endsWith(".avi") || url.endsWith(".mov")
                || url.endsWith(".mpg4") || url.endsWith(".mp4") || url.endsWith(".flv") || url.endsWith(".wmv");
    }

    public static boolean isAudio(String url) {
        return url.endsWith(".mp3") || url.endsWith(".ogg") || url.endsWith(".m3u") || url.endsWith(".wav");
    }

    public static boolean isDoc(String url) {
        return url.endsWith(".pdf") || url.endsWith(".ppt") || url.endsWith(".doc")
                || url.endsWith(".swf") || url.endsWith(".rtf") || url.endsWith(".xls");
    }

    public static boolean isPackage(String url) {
        return url.endsWith(".gz") || url.endsWith(".tgz") || url.endsWith(".zip")
                || url.endsWith(".rar") || url.endsWith(".deb") || url.endsWith(".rpm") || url.endsWith(".7z");
    }

    public static boolean isApp(String url) {
        return url.endsWith(".exe") || url.endsWith(".bin") || url.endsWith(".bat") || url.endsWith(".dmg");
    }

    public static boolean isImage(String url) {
        return url.endsWith(".png") || url.endsWith(".jpeg") || url.endsWith(".gif")
                || url.endsWith(".jpg") || url.endsWith(".bmp") || url.endsWith(".ico") || url.endsWith(".eps");
    }

    /**
     * @see
     * http://blogs.sun.com/CoreJavaTechTips/entry/cookie_handling_in_java_se
     */
    public static void enableCookieMgmt() {
        CookieManager manager = new CookieManager();
        manager.setCookiePolicy(CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
    }

    /**
     * @see
     * http://stackoverflow.com/questions/2529682/setting-user-agent-of-a-java-urlconnection
     */
    public static void enableUserAgentOverwrite() {
        System.setProperty("http.agent", "");
    }

    public static String getUrlFromUglyGoogleRedirect(String
            url) {
        if (url.startsWith("http://www.google.com/url?")) {
            url = url.substring("http://www.google.com/url?".length());
            String arr[] = urlDecode(url).split("\\&");
            if (arr != null)
                for (String str : arr) {
                    if (str.startsWith("q="))
                        return str.substring("q=".length());
                }
        }

        return null;
    }

    public static String getUrlFromUglyFacebookRedirect(String url) {
        if (url.startsWith("http://www.facebook.com/l.php?u=")) {
            url = url.substring("http://www.facebook.com/l.php?u=".length());
            return urlDecode(url);
        }

        return null;
    }

    public static String urlEncode(String str) {
        try {
            return URLEncoder.encode(str, UTF8);
        } catch (UnsupportedEncodingException ex) {
            return str;
        }
    }

    public static String urlDecode(String str) {
        try {
            return URLDecoder.decode(str, UTF8);
        } catch (UnsupportedEncodingException ex) {
            return str;
        }
    }

    /**
     * Popular sites use the #! to indicate the importance of the following
     * chars. Ugly but true. Such as: facebook, twitter, gizmodo, ...
     */
    public static String removeHashbang(String url) {
        return url.replaceFirst("#!", "");
    }

    public static String printNode(Element root) {
        return printNode(root, 0);
    }

    public static String printNode(Element root, int indentation) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < indentation; i++) {
            sb.append(' ');
        }
        sb.append(root.tagName());
        sb.append(":");
        sb.append(root.ownText());
        sb.append("\n");
        for (Element el : root.children()) {
            sb.append(printNode(el, indentation + 1));
            sb.append("\n");
        }
        return sb.toString();
    }

    public static String estimateDate(String url) {
        int index = url.indexOf("://");
        if (index > 0)
            url = url.substring(index + 3);

        int year = -1;
        int yearCounter = -1;
        int month = -1;
        int monthCounter = -1;
        int day = -1;
        String strs[] = url.split("/");
        for (int counter = 0; counter < strs.length; counter++) {
            String str = strs[counter];
            if (str.length() == 4) {
                try {
                    year = Integer.parseInt(str);
                } catch (Exception ex) {
                    continue;
                }
                if (year < 1970 || year > 3000) {
                    year = -1;
                    continue;
                }
                yearCounter = counter;
            } else if (str.length() == 2) {
                if (monthCounter < 0 && counter == yearCounter + 1) {
                    try {
                        month = Integer.parseInt(str);
                    } catch (Exception ex) {
                        continue;
                    }
                    if (month < 1 || month > 12) {
                        month = -1;
                        continue;
                    }
                    monthCounter = counter;
                } else if (counter == monthCounter + 1) {
                    try {
                        day = Integer.parseInt(str);
                    } catch (Exception ex) {
                    }
                    if (day < 1 || day > 31) {
                        day = -1;
                        continue;
                    }
                    break;
                }
            }
        }

        if (year < 0)
            return null;

        StringBuilder str = new StringBuilder();
        str.append(year);
        if (month < 1)
            return str.toString();

        str.append('/');
        if (month < 10)
            str.append('0');
        str.append(month);
        if (day < 1)
            return str.toString();

        str.append('/');
        if (day < 10)
            str.append('0');
        str.append(day);
        return str.toString();
    }

    public static String completeDate(String dateStr) {
        if (dateStr == null)
            return null;

        int index = dateStr.indexOf('/');
        if (index > 0) {
            index = dateStr.indexOf('/', index + 1);
            if (index > 0)
                return dateStr;
            else
                return dateStr + "/01";
        }
        return dateStr + "/01/01";
    }

    /**
     * keep in mind: simpleDateFormatter is not thread safe! call completeDate
     * before applying this formatter.
     */
    public static SimpleDateFormat createDateFormatter() {
        return new SimpleDateFormat("yyyy/MM/dd");
    }

    // with the help of http://stackoverflow.com/questions/1828775/httpclient-and-ssl
    public static void enableAnySSL() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(new KeyManager[0], new TrustManager[]{new DefaultTrustManager()}, new SecureRandom());
            SSLContext.setDefault(ctx);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    // Accepts every certificate without validation, so HTTPS peers are not authenticated.
    private static class DefaultTrustManager implements X509TrustManager {

        @Override
        public void checkClientTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
        }

        @Override
        public void checkServerTrusted(X509Certificate[] arg0, String arg1) throws CertificateException {
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            // the X509TrustManager contract asks for a non-null (possibly empty) array
            return new X509Certificate[0];
        }
    }

    public static int countLetters(String str) {
        int len = str.length();
        int chars = 0;
        for (int i = 0; i < len; i++) {
            if (Character.isLetter(str.charAt(i)))
                chars++;
        }
        return chars;
    }
}

--------------------------------------------------------------------------------
/src/main/java/de/jetwick/snacktory/HtmlFetcher.java:
--------------------------------------------------------------------------------
/*
 *  Copyright 2011 Peter Karich
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *       http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */
package de.jetwick.snacktory;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.Proxy;
import java.net.URL;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Class to fetch articles. This class is thread safe.
 *
 * @author Peter Karich
 */
public class HtmlFetcher {

    static {
        SHelper.enableCookieMgmt();
        SHelper.enableUserAgentOverwrite();
        SHelper.enableAnySSL();
    }
    private static final Logger logger = LoggerFactory.getLogger(HtmlFetcher.class);

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new FileReader("urls.txt"));
        String line = null;
        Set<String> existing = new LinkedHashSet<String>();
        while ((line = reader.readLine()) != null) {
            int index1 = line.indexOf("\"");
            int index2 = line.indexOf("\"", index1 + 1);
            String url = line.substring(index1 + 1, index2);
            String domainStr = SHelper.extractDomain(url, true);
            String counterStr = "";
            // TODO more similarities
            if (existing.contains(domainStr))
                counterStr = "2";
            else
                existing.add(domainStr);

            String html = new HtmlFetcher().fetchAsString(url, 20000);
            String outFile = domainStr + counterStr + ".html";
            BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
            writer.write(html);
            writer.close();
        }
        reader.close();
    }
    private String referrer = "https://github.com/karussell/snacktory";
    private String userAgent = "Mozilla/5.0 (compatible; Snacktory; +" + referrer + ")";
    private String cacheControl = "max-age=0";
    private String language = "en-us";
    private String accept = "application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    private String charset = "UTF-8";
    private SCache cache;
    private Proxy proxy = null;
    private AtomicInteger cacheCounter = new AtomicInteger(0);
    private int maxTextLength = -1;
    private ArticleTextExtractor extractor = new ArticleTextExtractor();
    private Set<String> furtherResolveNecessary = new LinkedHashSet<String>() {
        {
            add("bit.ly");
            add("cli.gs");
            add("deck.ly");
            add("fb.me");
            add("feedproxy.google.com");
            add("flic.kr");
            add("fur.ly");
            add("goo.gl");
            add("is.gd");
            add("ink.co");
            add("j.mp");
            add("lnkd.in");
            add("on.fb.me");
            add("ow.ly");
            add("plurl.us");
            add("sns.mx");
            add("snurl.com");
            add("su.pr");
            add("t.co");
            add("tcrn.ch");
            add("tl.gd");
            add("tiny.cc");
            add("tinyurl.com");
            add("tmi.me");
            add("tr.im");
            add("twurl.nl");
        }
    };

    public HtmlFetcher() {
    }

    public void setExtractor(ArticleTextExtractor extractor) {
        this.extractor = extractor;
    }

    public ArticleTextExtractor getExtractor() {
        return extractor;
    }

    public HtmlFetcher setCache(SCache cache) {
        this.cache = cache;
        return this;
    }

    public SCache getCache() {
        return cache;
    }

    public int getCacheCounter() {
        return cacheCounter.get();
    }

    public HtmlFetcher clearCacheCounter() {
        cacheCounter.set(0);
        return this;
    }

    public HtmlFetcher setMaxTextLength(int maxTextLength) {
        this.maxTextLength = maxTextLength;
        return this;
    }

    public int getMaxTextLength() {
        return maxTextLength;
    }

    public void setAccept(String accept) {
        this.accept = accept;
    }

    public void setCharset(String charset) {
        this.charset = charset;
    }

    public void setCacheControl(String cacheControl) {
        this.cacheControl = cacheControl;
    }

    public String getLanguage() {
        return language;
    }

    public void setLanguage(String language) {
        this.language = language;
    }

    public String getReferrer() {
        return referrer;
    }

    public HtmlFetcher setReferrer(String referrer) {
        this.referrer = referrer;
        return this;
    }

    public String
            getUserAgent() {
        return userAgent;
    }

    public void setUserAgent(String userAgent) {
        this.userAgent = userAgent;
    }

    public String getAccept() {
        return accept;
    }

    public String getCacheControl() {
        return cacheControl;
    }

    public String getCharset() {
        return charset;
    }

    public void setProxy(Proxy proxy) {
        this.proxy = proxy;
    }

    public Proxy getProxy() {
        return (proxy != null ? proxy : Proxy.NO_PROXY);
    }

    public boolean isProxySet() {
        // note: as written this is always true, because getProxy() falls back to Proxy.NO_PROXY
        return getProxy() != null;
    }

    public JResult fetchAndExtract(String url, int timeout, boolean resolve) throws Exception {
        String originalUrl = url;
        url = SHelper.removeHashbang(url);
        String gUrl = SHelper.getUrlFromUglyGoogleRedirect(url);
        if (gUrl != null)
            url = gUrl;
        else {
            gUrl = SHelper.getUrlFromUglyFacebookRedirect(url);
            if (gUrl != null)
                url = gUrl;
        }

        if (resolve) {
            // check if we can avoid resolving the URL (which hits the website!)
            JResult res = getFromCache(url, originalUrl);
            if (res != null)
                return res;

            String resUrl = getResolvedUrl(url, timeout);
            if (resUrl.isEmpty()) {
                if (logger.isDebugEnabled())
                    logger.warn("resolved url is empty. Url is: " + url);

                JResult result = new JResult();
                if (cache != null)
                    cache.put(url, result);
                return result.setUrl(url);
            }

            // if resolved url is longer then use it!
            if (resUrl != null && resUrl.trim().length() > url.length()) {
                // this is necessary e.g. for some homebaken url resolvers which return
                // the resolved url relative to url!
                url = SHelper.useDomainOfFirstArg4Second(url, resUrl);
            }
        }

        // check if we have the (resolved) URL in cache
        JResult res = getFromCache(url, originalUrl);
        if (res != null)
            return res;

        JResult result = new JResult();
        // or should we use?
        result.setUrl(url);
        result.setOriginalUrl(originalUrl);
        result.setDate(SHelper.estimateDate(url));

        // Immediately put the url into the cache as extracting content takes time.
        if (cache != null) {
            cache.put(originalUrl, result);
            cache.put(url, result);
        }

        String lowerUrl = url.toLowerCase();
        if (SHelper.isDoc(lowerUrl) || SHelper.isApp(lowerUrl) || SHelper.isPackage(lowerUrl)) {
            // skip
        } else if (SHelper.isVideo(lowerUrl) || SHelper.isAudio(lowerUrl)) {
            result.setVideoUrl(url);
        } else if (SHelper.isImage(lowerUrl)) {
            result.setImageUrl(url);
        } else {
            extractor.extractContent(result, fetchAsString(url, timeout));
            if (result.getFaviconUrl().isEmpty())
                result.setFaviconUrl(SHelper.getDefaultFavicon(url));

            // some links are relative to root and do not include the domain of the url :(
            result.setFaviconUrl(fixUrl(url, result.getFaviconUrl()));
            result.setImageUrl(fixUrl(url, result.getImageUrl()));
            result.setVideoUrl(fixUrl(url, result.getVideoUrl()));
            result.setRssUrl(fixUrl(url, result.getRssUrl()));
        }
        result.setText(lessText(result.getText()));
        synchronized (result) {
            result.notifyAll();
        }
        return result;
    }

    public String lessText(String text) {
        if (text == null)
            return "";

        if (maxTextLength >= 0 && text.length() > maxTextLength)
            return text.substring(0, maxTextLength);

        return text;
    }

    private static String fixUrl(String url, String urlOrPath) {
        return SHelper.useDomainOfFirstArg4Second(url,
                urlOrPath);
    }

    public String fetchAsString(String urlAsString, int timeout)
            throws MalformedURLException, IOException {
        return fetchAsString(urlAsString, timeout, true);
    }

    public String fetchAsString(String urlAsString, int timeout, boolean includeSomeGooseOptions)
            throws MalformedURLException, IOException {
        HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, includeSomeGooseOptions);
        hConn.setInstanceFollowRedirects(true);
        String encoding = hConn.getContentEncoding();
        InputStream is;
        if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
            is = new GZIPInputStream(hConn.getInputStream());
        } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
            is = new InflaterInputStream(hConn.getInputStream(), new Inflater(true));
        } else {
            is = hConn.getInputStream();
        }

        String enc = Converter.extractEncoding(hConn.getContentType());
        String res = createConverter(urlAsString).streamToString(is, enc);
        if (logger.isDebugEnabled())
            logger.debug(res.length() + " FetchAsString:" + urlAsString);
        return res;
    }

    public Converter createConverter(String url) {
        return new Converter(url);
    }

    /**
     * On some devices we have to hack:
     * http://developers.sun.com/mobility/reference/techart/design_guidelines/http_redirection.html
     *
     * @param timeout Sets a specified timeout value, in milliseconds
     * @return the resolved url if any.
     * Or an empty string if it couldn't resolve the url
     * (within the specified time) or the same url if response code is OK
     */
    public String getResolvedUrl(String urlAsString, int timeout) {
        String newUrl = null;
        int responseCode = -1;
        try {
            HttpURLConnection hConn = createUrlConnection(urlAsString, timeout, true);
            // force no follow
            hConn.setInstanceFollowRedirects(false);
            // the program doesn't care what the content actually is !!
            // http://java.sun.com/developer/JDCTechTips/2003/tt0422.html
            hConn.setRequestMethod("HEAD");
            hConn.connect();
            responseCode = hConn.getResponseCode();
            hConn.getInputStream().close();
            if (responseCode == HttpURLConnection.HTTP_OK)
                return urlAsString;

            newUrl = hConn.getHeaderField("Location");
            if (responseCode / 100 == 3 && newUrl != null) {
                newUrl = newUrl.replaceAll(" ", "+");
                // some services use (non-standard) utf8 in their location header
                if (urlAsString.startsWith("http://bit.ly") || urlAsString.startsWith("http://is.gd"))
                    newUrl = encodeUriFromHeader(newUrl);

                // fix problems if shortened twice. as it is often the case after twitters' t.co bullshit
                if (furtherResolveNecessary.contains(SHelper.extractDomain(newUrl, true)))
                    newUrl = getResolvedUrl(newUrl, timeout);

                return newUrl;
            } else
                return urlAsString;

        } catch (Exception ex) {
            logger.warn("getResolvedUrl:" + urlAsString + " Error:" + ex.getMessage(), ex);
            return "";
        } finally {
            if (logger.isDebugEnabled())
                logger.debug(responseCode + " url:" + urlAsString + " resolved:" + newUrl);
        }
    }

    /**
     * Takes a URI that was decoded as ISO-8859-1 and applies percent-encoding
     * to non-ASCII characters. Workaround for broken origin servers that send
     * UTF-8 in the Location: header.
     */
    static String encodeUriFromHeader(String badLocation) {
        StringBuilder sb = new StringBuilder();

        for (char ch : badLocation.toCharArray()) {
            if (ch < (char) 128) {
                sb.append(ch);
            } else {
                // this is ONLY valid if the uri was decoded using ISO-8859-1
                sb.append(String.format("%%%02X", (int) ch));
            }
        }

        return sb.toString();
    }

    protected HttpURLConnection createUrlConnection(String urlAsStr, int timeout,
            boolean includeSomeGooseOptions) throws MalformedURLException, IOException {
        URL url = new URL(urlAsStr);
        //using proxy may increase latency
        Proxy proxy = getProxy();
        HttpURLConnection hConn = (HttpURLConnection) url.openConnection(proxy);
        hConn.setRequestProperty("User-Agent", userAgent);
        hConn.setRequestProperty("Accept", accept);

        if (includeSomeGooseOptions) {
            hConn.setRequestProperty("Accept-Language", language);
            hConn.setRequestProperty("content-charset", charset);
            hConn.addRequestProperty("Referer", referrer);
            // avoid the cache for testing purposes only?
            hConn.setRequestProperty("Cache-Control", cacheControl);
        }

        // suggest respond to be gzipped or deflated (which is just another compression)
        // http://stackoverflow.com/q/3932117
        hConn.setRequestProperty("Accept-Encoding", "gzip, deflate");
        hConn.setConnectTimeout(timeout);
        hConn.setReadTimeout(timeout);
        return hConn;
    }

    private JResult getFromCache(String url, String originalUrl) throws Exception {
        if (cache != null) {
            JResult res = cache.get(url);
            if (res != null) {
                // e.g. the cache returned a shortened url as original url now we want to store the
                // current original url!
It can also be that the cache responds to url but the JResult 439 | // does not contain it, so overwrite it: 440 | res.setUrl(url); 441 | res.setOriginalUrl(originalUrl); 442 | cacheCounter.addAndGet(1); 443 | return res; 444 | } 445 | } 446 | return null; 447 | } 448 | } 449 | -------------------------------------------------------------------------------- /src/test/java/de/jetwick/snacktory/ArticleTextExtractorTodoTester.java: -------------------------------------------------------------------------------- 1 | package de.jetwick.snacktory; 2 | 3 | import java.io.BufferedReader; 4 | import java.io.FileReader; 5 | import org.junit.Before; 6 | import org.junit.Test; 7 | import static org.junit.Assert.*; 8 | 9 | public class ArticleTextExtractorTodoTester { 10 | 11 | ArticleTextExtractor extractor; 12 | Converter c; 13 | 14 | @Before 15 | public void setup() throws Exception { 16 | c = new Converter(); 17 | extractor = new ArticleTextExtractor(); 18 | } 19 | 20 | @Test 21 | public void testEspn2() throws Exception { 22 | //String url = "http://sports.espn.go.com/golf/pgachampionship10/news/story?id=5463456"; 23 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("espn2.html"))); 24 | assertTrue(article.getText(), article.getText().startsWith("PHILADELPHIA -- Michael Vick missed practice Thursday because of a leg injury and is unlikely to play Sunday wh")); 25 | assertEquals("http://a.espncdn.com/media/motion/2010/0813/dm_100814_pga_rinaldi.jpg", article.getImageUrl()); 26 | } 27 | 28 | @Test 29 | public void testWashingtonpost() throws Exception { 30 | //String url = "http://www.washingtonpost.com/wp-dyn/content/article/2010/12/08/AR2010120803185.html"; 31 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("washingtonpost.html"))); 32 | assertTrue(article.getText(), article.getText().startsWith("The Supreme Court sounded ")); 33 | 
assertEquals("http://media3.washingtonpost.com/wp-dyn/content/photo/2010/10/09/PH2010100904575.jpg", article.getImageUrl()); 34 | } 35 | 36 | @Test 37 | public void testBoingboing() throws Exception { 38 | //String url = "http://www.boingboing.net/2010/08/18/dr-laura-criticism-o.html"; 39 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("boingboing.html"))); 40 | assertTrue(article.getText(), article.getText().startsWith("Dr. Laura Schlessinger is leaving radio to regain")); 41 | assertEquals("http://www.boingboing.net/images/drlaura.jpg", article.getImageUrl()); 42 | } 43 | 44 | @Test 45 | public void testReadwriteWeb() throws Exception { 46 | //String url = "http://www.readwriteweb.com/start/2010/08/pagely-headline.php"; 47 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("readwriteweb.html"))); 48 | assertTrue(article.getText(), article.getText().startsWith("In the heart of downtown Chandler, Arizona")); 49 | assertEquals("http://rww.readwriteweb.netdna-cdn.com/start/images/logopagely_aug10.jpg", article.getImageUrl()); 50 | } 51 | 52 | @Test 53 | public void testYahooNewsEvenThoughTheyFuckedUpDeliciousWeWillTestThemAnyway() throws Exception { 54 | //String url = "http://news.yahoo.com/s/ap/20110305/ap_on_re_af/af_libya"; 55 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("yahoo.html"))); 56 | assertTrue(article.getText(), article.getText().startsWith("TRIPOLI, Libya – Government forces in tanks rolled into the opposition-held city closest ")); 57 | assertEquals("http://d.yimg.com/a/p/ap/20110305/http://d.yimg.com/a/p/ap/20110305/thumb.23c7d780d8d84bc4a8c77af11ecba277-23c7d780d8d84bc4a8c77af11ecba277-0.jpg?x=130&y=90&xc=1&yc=1&wc=130&hc=90&q=85&sig=LbIZK0rnJlZAcrAWn.brLw--", 58 | article.getImageUrl()); 59 | } 60 | 61 | @Test 62 | public void testLifehacker() throws Exception { 63 | //String url = 
"http://lifehacker.com/#!5659837/build-a-rocket-stove-to-heat-your-home-with-wood-scraps"; 64 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("lifehacker.html"))); 65 | assertTrue(article.getText(), article.getText().startsWith("If you find yourself with lots of leftover wood")); 66 | assertEquals("http://cache.gawker.com/assets/images/lifehacker/2010/10/rocket-stove-finished.jpeg", article.getImageUrl()); 67 | } 68 | 69 | @Test 70 | public void testNaturalhomemagazine() throws Exception { 71 | //String url = "http://www.naturalhomemagazine.com/diy-projects/try-this-papier-mache-ghostly-lanterns.aspx"; 72 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("naturalhomemagazine.html"))); 73 | assertTrue(article.getText(), article.getText().startsWith("Guide trick or treaters and other friendly spirits to your front")); 74 | assertEquals("http://www.naturalhomemagazine.com/uploadedImages/articles/issues/2010-09-01/NH-SO10-trythis-lantern-final2_resized400X266.jpg", 75 | article.getImageUrl()); 76 | } 77 | 78 | @Test 79 | public void testSfgate() throws Exception { 80 | //String url = "http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2010/10/27/BUD61G2DBL.DTL"; 81 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("sfgate.html"))); 82 | assertTrue(article.getText(), article.getText().startsWith("Fewer homes in California and")); 83 | assertEquals("http://imgs.sfgate.com/c/pictures/2010/10/26/ba-foreclosures2_SFCG1288130091.jpg", 84 | article.getImageUrl()); 85 | } 86 | 87 | @Test 88 | public void testScientificdaily() throws Exception { 89 | //String url = "http://www.scientificamerican.com/article.cfm?id=bpa-semen-quality"; 90 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("scientificamerican.html"))); 91 | assertTrue(article.getText(), article.getText().startsWith("The common industrial chemical 
bisphenol A (BPA) ")); 92 | assertEquals("http://www.scientificamerican.com/media/inline/bpa-semen-quality_1.jpg", article.getImageUrl()); 93 | assertEquals("Everyday BPA Exposure Decreases Human Semen Quality", article.getTitle()); 94 | } 95 | 96 | @Test 97 | public void testUniverseToday() throws Exception { 98 | //String url = "http://www.universetoday.com/76881/podcast-more-from-tony-colaprete-on-lcross/"; 99 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("universetoday.html"))); 100 | assertTrue(article.getText(), article.getText().startsWith("I had the chance to interview LCROSS")); 101 | assertEquals("http://www.universetoday.com/wp-content/uploads/2009/10/lcross-impact_01_01.jpg", 102 | article.getImageUrl()); 103 | assertEquals("More From Tony Colaprete on LCROSS", article.getTitle()); 104 | } 105 | 106 | @Test 107 | public void testCNBC() throws Exception { 108 | //String url = "http://www.cnbc.com/id/40491584"; 109 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("cnbc.html"))); 110 | assertTrue(article.getText(), article.getText().startsWith("A prominent expert on Chinese works ")); 111 | assertEquals("http://media.cnbc.com/i/CNBC/Sections/News_And_Analysis/__Story_Inserts/graphics/__ART/chinese_vase_150.jpg", 112 | article.getImageUrl()); 113 | assertEquals("Chinese Art Expert 'Skeptical' of Record-Setting Vase", article.getTitle()); 114 | } 115 | 116 | @Test 117 | public void testMsnbc() throws Exception { 118 | //String url = "http://www.msnbc.msn.com/id/41207891/ns/world_news-europe/"; 119 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("msnbc.html"))); 120 | assertTrue(article.getText(), article.getText().startsWith("DUBLIM -- Prime Minister Brian Cowen announced Saturday")); 121 | assertEquals("Irish premier resigns as party leader, stays as PM", article.getTitle()); 122 | 
assertEquals("http://msnbcmedia3.msn.com/j/ap/ireland government crisis--687575559_v2.grid-6x2.jpg", 123 | article.getImageUrl()); 124 | } 125 | 126 | @Test 127 | public void testTheAtlantic() throws Exception { 128 | //String url = "http://www.theatlantic.com/culture/archive/2011/01/how-to-stop-james-bond-from-getting-old/69695/"; 129 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("theatlantic.html"))); 130 | assertTrue(article.getText(), article.getText().startsWith("If James Bond could age, he'd be well into his 90s right now")); 131 | assertEquals("http://assets.theatlantic.com/static/mt/assets/culture_test/James%20Bond_post.jpg", 132 | article.getImageUrl()); 133 | } 134 | 135 | @Test 136 | public void testGawker() throws Exception { 137 | //String url = "http://gawker.com/#!5777023/charlie-sheen-is-going-to-haiti-with-sean-penn"; 138 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("gawker.html"))); 139 | assertTrue(article.getText(), article.getText().startsWith("With a backlash brewing against the incessant media")); 140 | assertEquals("http://cache.gawkerassets.com/assets/images/7/2011/03/medium_0304_pennsheen.jpg", 141 | article.getImageUrl()); 142 | } 143 | 144 | @Test 145 | public void testNyt2() throws Exception { 146 | //String url = "http://www.nytimes.com/2010/12/22/world/europe/22start.html"; 147 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("nyt2.html"))); 148 | assertTrue(article.getText(), article.getText().startsWith("WASHINGTON — An arms control treaty paring back American")); 149 | assertEquals("http://graphics8.nytimes.com/images/2010/12/22/world/22start-span/Start-articleInline.jpg", 150 | article.getImageUrl()); 151 | } 152 | 153 | @Test 154 | public void testGettingVideosFromGraphVinyl() throws Exception { 155 | //String url = "http://grapevinyl.com/v/84/magnetic-morning/getting-nowhere"; 156 | JResult 
article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("grapevinyl.html"))); 157 | assertEquals("http://www.youtube.com/v/dsVWVtGWoa4&hl=en_US&fs=1&color1=d6d6d6&color2=ffffff&autoplay=1&iv_load_policy=3&rel=0&showinfo=0&hd=1", 158 | article.getVideoUrl()); 159 | } 160 | 161 | @Test 162 | public void testLiveStrong() throws Exception { 163 | //String url = "http://www.livestrong.com/article/395538-how-to-decrease-the-rates-of-obesity-in-children/"; 164 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("livestrong.html"))); 165 | assertTrue(article.getText(), article.getText().startsWith("Childhood obesity increases a young person")); 166 | assertEquals("http://photos.demandstudios.com/getty/article/184/46/87576279_XS.jpg", 167 | article.getImageUrl()); 168 | } 169 | 170 | @Test 171 | public void testLiveStrong2() throws Exception { 172 | //String url = "http://www.livestrong.com/article/396152-do-resistance-bands-work-for-strength-training/"; 173 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("livestrong2.html"))); 174 | assertTrue(article.getText(), article.getText().startsWith("Resistance bands or tubes are named because")); 175 | assertEquals("http://photos.demandstudios.com/getty/article/142/66/86504893_XS.jpg", article.getImageUrl()); 176 | } 177 | 178 | @Test 179 | public void testCracked() throws Exception { 180 | //String url = "http://www.cracked.com/article_19029_6-things-social-networking-sites-need-to-stop-doing.html"; 181 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("cracked.html"))); 182 | assertTrue(article.getText(), article.getText().startsWith("Social networking is here to stay")); 183 | assertEquals("http://i-beta.crackedcdn.com/phpimages/article/2/1/6/45216.jpg?v=1", article.getImageUrl()); 184 | } 185 | 186 | @Test 187 | public void testMidgetmanofsteel() throws Exception { 188 | 
//String url = "http://www.cracked.com/article_19029_6-things-social-networking-sites-need-to-stop-doing.html"; 189 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("midgetmanofsteel.html"))); 190 | assertTrue(article.getText(), article.getText().startsWith("I've decided to turn my Facebook assholishnessicicity")); 191 | assertEquals("http://4.bp.blogspot.com/_F74vJj-Clzk/TPkzP-Y93jI/AAAAAAAALKM/D3w1sfJqE5U/s200/funny-dog-pictures-will-work-for-hot-dogs.jpg", article.getImageUrl()); 192 | } 193 | 194 | @Test 195 | public void testTrailsCom() throws Exception { 196 | //String url = "http://www.trails.com/facts_41596_hot-spots-citrus-county-florida.html"; 197 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("trails.html"))); 198 | assertTrue(article.getText(), article.getText().startsWith("Snorkel and view artificial reefs or chase")); 199 | assertEquals("http://cdn-www.trails.com/imagecache/articles/295x195/hot-spots-citrus-county-florida-295x195.png", article.getImageUrl()); 200 | } 201 | 202 | @Test 203 | public void testTrailsCom2() throws Exception { 204 | //String url = "http://www.trails.com/facts_12408_history-alpine-skis.html"; 205 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("trails2.html"))); 206 | assertTrue(article.getText(), article.getText().startsWith("Derived from the old Norse word")); 207 | assertEquals("http://cdn-www.trails.com/imagecache/articles/295x195/history-alpine-skis-295x195.png", article.getImageUrl()); 208 | } 209 | 210 | @Test 211 | public void testEhow() throws Exception { 212 | //String url = "http://www.ehow.com/how_7734109_make-white-spaghetti.html"; 213 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("ehow.html"))); 214 | assertTrue(article.getText(), article.getText().startsWith("Heat the oil in the")); 215 | assertEquals("How to Make White Spaghetti", 
article.getTitle()); 216 | } 217 | 218 | @Test 219 | public void testGolfLink() throws Exception { 220 | //String url = "http://www.golflink.com/how_1496_eat-cheap-las-vegas.html"; 221 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("golflink.html"))); 222 | assertTrue(article.getText(), article.getText().startsWith("Las Vegas, while noted for its glitz")); 223 | assertEquals("http://cdn-www.golflink.com/Cms/images/GlobalPhoto/Articles/2011/2/17/1496/fotolia4152707XS-main_Full.jpg", 224 | article.getImageUrl()); 225 | } 226 | 227 | @Test 228 | public void testNewsweek() throws Exception { 229 | //String url = "http://www.newsweek.com/2010/10/09/how-moscow-s-war-on-islamist-rebels-is-backfiring.html"; 230 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("newsweek.html"))); 231 | assertTrue(article.getText(), article.getText().startsWith("At first glance, Kadyrov might seem")); 232 | // assertEquals("http://www.newsweek.com/content/newsweek/2010/10/09/how-moscow-s-war-on-islamist-rebels-is-backfiring.scaled.small.1309768214891.jpg", 233 | // article.getImageUrl()); 234 | assertEquals("http://www.newsweek.com/content/newsweek/2010/10/09/how-moscow-s-war-on-islamist-rebels-is-backfiring.scaled.small.1302869450444.jpg", 235 | article.getImageUrl()); 236 | } 237 | 238 | @Test 239 | public void testBusinessweek() throws Exception { 240 | // String url = "http://www.businessweek.com/magazine/content/10_34/b4192066630779.htm"; 241 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("businessweek.html"))); 242 | assertEquals("Olivia Munn: Queen of the Uncool - BusinessWeek", article.getTitle()); 243 | assertTrue(article.getText(), article.getText().startsWith("Six years ago, Olivia Munn arrived in Hollywood with fading ambitions of making it ")); 244 | assertEquals("http://images.businessweek.com/mz/10/34/370/1034_mz_66popmunnessa.jpg", 
article.getImageUrl()); 245 | } 246 | 247 | @Test 248 | public void testNature() throws Exception { 249 | //String url = "http://www.nature.com/news/2011/110411/full/472146a.html"; 250 | JResult article = extractor.extractContent(c.streamToString(getClass().getResourceAsStream("nature.html"))); 251 | assertTrue(article.getText(), article.getText().startsWith("As the immediate threat from Fukushima " 252 | + "Daiichi's damaged nuclear reactors recedes, engineers and scientists are")); 253 | } 254 | 255 | /** 256 | * @param filePath path of the file to open. FileReader accepts filesystem 257 | * paths only, not URLs, and reads with the platform default charset. The 258 | * buffer size is hardcoded. 259 | */ 260 | public static String readFileAsString(String filePath) throws java.io.IOException { 261 | StringBuilder fileData = new StringBuilder(1000); 262 | BufferedReader reader = new BufferedReader(new FileReader(filePath)); 263 | try { 264 | char[] buf = new char[1024]; // reused for every read 265 | int numRead; 266 | while ((numRead = reader.read(buf)) != -1) { 267 | fileData.append(buf, 0, numRead); 268 | } 269 | } finally { 270 | reader.close(); // runs even if read() throws 271 | } 272 | return fileData.toString(); 273 | } 274 | } 275 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/universetoday.html: -------------------------------------------------------------------------------- 1 | Podcast: More From Tony Colaprete on LCROSS

Podcast: More From Tony Colaprete on LCROSS

by Nancy Atkinson on October 28, 2010

Artist concept of the Centaur and LCROSS heading towards the Moon. Credit: NASA

I had the chance to interview LCROSS principal investigator Anthony Colaprete about the latest findings released from the lunar impact of the spacecraft a year ago, and in addition to the article we posted here on Universe Today, I also did a podcast for the NASA Lunar Science Institute. If you would like to actually “hear” from Colaprete, you can listen to the podcast on the NLSI website, or you can also find it on the 365 Days of Astronomy podcast.

{ 2 comments }

Aqua October 28, 2010 at 11:21 am

Very enjoyable, informative and appreciated! FOOD FOR THOUGHT!

The Methane and Hydrogen continuously generated by the Moon’s core? Why that would mean….

GBendt October 29, 2010 at 1:41 am

Molten rock being put under high pressure can store huge amounts of gases. This may apply for the moon´s core, too. But the moon´s core is not necessarily the source of the H20, OH, CO, CO2, H2S, SO2, NH3, H3, C2H4 and CH4 detected in the LCROSS impact plume. All these molecules are typical ingredients of cometary cores. It is sure that over the past billions of years thousands of comets impacted the moon, each with millions and billions of tons of material.
The puzzling thing is that some of these molecules do not freeze at the temperatures in the Cabeaus crater and thus should not be able to stay there. As they are still there, there must be a mechanism to keep them there.

Comments on this entry are closed.


8 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/facebook.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 12 | 13 | 14 | 15 | 16 | In my column... | Facebook 17 | 18 | 19 | 20 | 21 | 22 | 23 |
28 | 42 | 43 | 44 | 45 | -------------------------------------------------------------------------------- /src/test/resources/de/jetwick/snacktory/facebook2.html: -------------------------------------------------------------------------------- 1 | 3 | 4 | 5 | 6 | 7 | 12 | 13 | 14 | 15 | 16 | Sommer is the best... | Facebook 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
30 | 44 | 45 | 46 | 47 | --------------------------------------------------------------------------------