6 |
15 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/syracuse.com2.txt:
--------------------------------------------------------------------------------
1 | The Onondaga Historical Association leads groups through the Hotel Syracuse today on Historic Ghostwalks from 1 - 3:30 p.m. In what's subtitled 'Suite Stories' people tour many of the public spaces at the hotel as actors play personalities from the hotel's past. The building opened in 1924 and many famous names have been guests. A feature of the event is a view of Carl Roters 40 foot mural above the registration desk which has been covered with mirrors for the last 37 years.
2 |
3 | Here's an update from the OHA :
4 |
5 | • Reservation can only be made online at www.hotelsyracuseghostwalk.eventbrite.com, not by calling OHA
6 |
7 | • There are no walk-ins allowed
8 |
9 | • There are a limited amount of tickets being added today, December 29th, at 2 P.M. for the January 3rd tours.
10 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/slate.com2.txt:
--------------------------------------------------------------------------------
1 | The biggest challenge to Brazil’s World Cup preparations over the past few weeks has been the chaos caused by striking public workers . While São Paulo subway workers voted to at least temporarily suspend their strikes on Monday night, protests by activists angry at the money being spent by Brazil to host the event are expected to continue throughout the World Cup .
2 |
3 | Though nowhere near the size of last year’s broad-based social movement, which brought millions to the streets, protests have been building back up in recent months. The tournament, which begins Thursday in São Paulo, has reportedly cost in excess of $11 billion, money protesters say could have been spent on public infrastructure. The movement has already produced its share of powerful images. Here are some of the most striking photos.
--------------------------------------------------------------------------------
/cosrlib/sources/url.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals

from cosrlib.sources import Source
import requests


class UrlSource(Source):
    """ Source that actually fetches a URL. Use with caution! """

    def get_partitions(self):

        # One partition per URL: either the "urls" list or the single "url" argument
        if self.args.get("urls"):
            return [{
                "url": url
            } for url in self.args["urls"]]
        else:
            return [{
                "url": self.args["url"]
            }]

    def iter_items(self, partition):

        fetched = requests.get(partition["url"], headers={'user-agent': 'CommonSearch/dev'})

        # Only 200 responses are yielded; any other status code produces no item
        if fetched.status_code == 200:
            yield partition["url"], fetched.headers, "html", 2, fetched.content
--------------------------------------------------------------------------------
/tests/cosrlibtests/signals/test_dmoz.py:
--------------------------------------------------------------------------------
def test_signal_dmoz_domain(ranker):
    rank = lambda url: ranker.client.get_signal_value_from_url("dmoz_domain", url)

    assert rank("http://www.neoanime.org") == 1.
    assert rank("http://www.neoanime.org/") == 1.
    assert rank("http://www.neoanime.org/non-existing-page") == 1.
    assert rank("http://www.non-existing-domain.com") == 0.


def test_signal_dmoz_url(ranker):
    rank = lambda url: ranker.client.get_signal_value_from_url("dmoz_url", url)

    assert rank("http://www.neoanime.org") == 1.
    assert rank("http://www.neoanime.org/?") == 1.
    assert rank("http://www.hcs.harvard.edu/~anime/") == 1.
    assert rank("http://www.non-existent-domain.com") == 0.
    assert rank("http://www.hcs.harvard.edu/non-existing-page") == 0.
--------------------------------------------------------------------------------
/cosrlib/signals/wikidata_url.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals

import math

from . import BaseSignal


class Signal(BaseSignal):
    """ Ranking signal based on the URL presence in Wikidata as official website
    """

    def get_value(self, document, url_metadata):

        # Number of pages on Wikimedia projects is a rough approximation of the importance of the entity
        number_of_sitelinks = max(
            url_metadata["url"].wikidata_sitelinks,
            url_metadata["url_without_query"].wikidata_sitelinks
        )

        max_sitelinks = 200

        # http://www.wolframalpha.com/input/?i=sqrt(x%2F200)+from+0+to+200
        score = min(1., math.sqrt(float(number_of_sitelinks) / max_sitelinks))

        return score
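
The square-root scaling in get_value can be tried in isolation. The snippet below is only a sketch of that formula: the function name `sitelinks_score` and the sample counts are made up for illustration and are not part of cosrlib.

```python
import math


def sitelinks_score(number_of_sitelinks, max_sitelinks=200):
    """Hypothetical standalone version of the capped square-root curve."""
    # Counts at or above max_sitelinks are clamped to a score of 1.0
    return min(1., math.sqrt(float(number_of_sitelinks) / max_sitelinks))


# Sample counts chosen for illustration only
for count in (0, 50, 200, 1000):
    print(count, sitelinks_score(count))
```

The square root rises quickly for small counts and flattens out near the cap, so entities with a handful of sitelinks are already separated from those with none.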
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/chinese.txt:
--------------------------------------------------------------------------------
1 | 香港行政长官梁振英在各方压力下就其大宅的违章建筑(僭建)问题到立法会接受质询,并向香港民众道歉。
2 |
3 | 梁振英此前承认早在去年参选行政长官之前就已知悉其住宅的违建问题,引发诚信危机。
4 |
5 | 梁振英在星期二(12月10日)的答问大会开始之际在其演说中道歉,但强调他在违章建筑问题上没有隐瞒的意图和动机。
6 |
7 | 不过泛民主派议员们普遍指责梁振英撒谎,要求他马上辞职下台。
8 |
9 | 一些亲北京阵营议员欢迎梁振英道歉,且认为应能获得香港民众接受,但这些议员也质问梁振英有否向执法部门施压。
10 |
11 | 梁振英强调承诺将在两周内解决其住宅的违建问题。
12 |
13 | 郑重道歉
14 |
15 | 香港媒体于6月份曝光梁振英大宅的首批违建部分后,于3月的选举中被击败的民主党参选人何俊仁向法院提出选举呈请,至11月中旬被终审法院驳回。
16 |
17 | 梁振英说,其位于太平山山顶的住宅内的违建部分大都不是由他所建,此前没有马上公开交待和处理,是因为律师意见认为司法程序仍在进行,他不应评论。
18 |
19 | 梁振英在接受质询前的发言中说:回顾事件,我虽然从无任何存心隐瞒的意图,但必须承认自己有处理疏忽及交代不清之处,为此我再次向市民郑重道歉。
20 |
21 | 梁振英在选举中还击败了曾是自由党党员的前政务司司长唐英年。
22 |
23 | 在回答自由党议员的提问时,梁振英称,他从未说过其房产不存在违建问题。
24 |
25 | 现为间选议员的何俊仁说,梁振英至今仍不坦诚以对,让他震惊;亲北京政团工联会直选议员黄国健也批评梁振英抱着不服输的态度接受质询。
26 |
27 | 历时1.5小时的答问大会在进入中段之际,泛民主派人民力量的直选议员黄毓民、陈伟业和社会民主连线的梁国雄先后因播放录音、叫嚣,和向梁振英扔掷文件而被议长驱逐。
28 |
29 | 民主党此前计划在星期三(11日)对梁振英提出不信任动议。
--------------------------------------------------------------------------------
/tests/cosrlibtests/signals/test_url_length.py:
--------------------------------------------------------------------------------
def test_signal_url_total_length(ranker):
    rank = lambda url: ranker.client.get_signal_value_from_url("url_total_length", url)

    rank_long = rank("http://www.verrrryyyylongdomain.com/very-long-url-xxxxxxxxxxxxxx.html")
    rank_short = rank("https://en.wikipedia.org/wiki/Maceo_Parker")
    rank_min = rank("http://t.co")

    assert 0 <= rank_long < rank_short < rank_min <= 1


def test_signal_url_path_length(ranker):
    rank = lambda url: ranker.client.get_signal_value_from_url("url_path_length", url)

    rank_hp = rank("http://www.longdomain.com")
    rank_hp2 = rank("http://www.domain.com/")

    assert rank_hp == rank_hp2

    rank_subpage = rank("http://t.co/p")

    assert rank_subpage < rank_hp

    rank_subpage_query = rank("http://t.co/p?q=1")

    assert rank_subpage_query < rank_subpage
--------------------------------------------------------------------------------
/urlserver/import.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import sys
import time

sys.path.insert(-1, os.path.normpath(os.path.join(__file__, "../../")))
from cosrlib.dataproviders import list_dataproviders

# Import the dataproviders named on the command line, or the default set if none are given
if len(sys.argv) > 1:
    to_import = sys.argv[1:]
else:
    to_import = [
        "alexa_top1m",
        "ut1_blacklist",
        "dmoz",
        "webdatacommons_hc",
        "commonsearch_host_pagerank",
        "wikidata"
    ]

dataproviders = list_dataproviders()

for dataprovider in to_import:
    ds = dataproviders[dataprovider]

    start_time = time.time()
    print("Importing %s..." % dataprovider)

    ds.import_dump()

    print("Import of %s finished. Took %ss" % (dataprovider, int(time.time() - start_time)))
    print()
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/theonion.com1.txt:
--------------------------------------------------------------------------------
1 | CHICAGO—Admitting it took them some time to come around to the idea, the parents of local woman Laura Stevens said Wednesday that they had finally accepted their daughter’s mixed-attractiveness relationship with Kyle Baker, a man who is considerably worse-looking than she is. “To be honest, we were quite surprised when Laura brought Kyle to the house for the first time, but eventually we came around to it,” said Stevens’ mother, Janet, who noted that the pair were still met with uncomfortable stares and disapproval from other family members—especially Laura’s grandmother—at last year’s Thanksgiving dinner. “Her father was particularly upset at first, but now I think he’s learned to accept it, and he’s even grown to like Kyle. Besides, Kyle seems to make Laura happy, and that’s all that really matters.” Janet Stevens went on to say that, if the two ever got married, she would love their children no matter how average-looking they are.
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/normalize-spaces/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tab here incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
4 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
4 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
5 |
6 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/glamour.com1.txt:
--------------------------------------------------------------------------------
1 | Oh, RiRi. You truly are an icon.
2 |
3 | When Rihanna walks into a fashion show, even editors, publicists, other celebs gasp. Gasp they did on Friday and Saturday as the cooler-than-thou-crooner sashayed into her costume designer Adam Selman's show, as well as those of Joseph Altuzarra and Alexander Wang. And she did not disappoint. Let's take a look at what she wore:
4 |
5 |
6 |
7 | At Adam Selman, RiRi rocked an elegant and girly white A-line dress with pearls and my favorite Christian Louboutin strappy pumps.
8 |
9 |
10 |
11 | For Altuzarra, she glammed out in super sophistication with a plunging black jacket with fringe from the designer's collection—WITHOUT PANTS. I LOVE IT.
12 |
13 |
14 |
15 | At Alexander Wang, she brought out her inner '90s child with a matching midnight blue hoodie and skirt.
16 |
17 | Are you a Rihhana fan? I love her. Now where's my plunging black jacket? Not to fret—I will wear pants.
18 |
19 | Photos: Getty Images
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/self.com1.txt:
--------------------------------------------------------------------------------
1 | Blair Waldorf who? The headband has come a long way from its schoolgirl reputation, thanks to the recent efforts of the always-stylish Taylor Swift. When she recently stepped out in NYC wearing a sparkly hair accessory, she demonstrated three must-know rules to follow when pulling off the accessory:
2 |
3 | Keep Your Hair’s Movement: Unless you’re fighting with grown-out bangs, refrain from pulling your headband straight back. Instead, keep your hair’s natural movement by placing your style on top of the hair, not under and behind the ears.
4 |
5 | Go Easy on Styling: The simplest way to keep the accessory from looking stuffy is to skip elaborate updos and ringlet curls. Keep the hair more natural and use the headband to add structure.
6 |
7 | Make It the Focal Point: With a piece of bling like Taylor’s, keep the rest of the beauty look soft. Although she wore fuchsia lipstick, the singer opted for a matte variety to tone-down the vivid hue.
8 |
9 | RELATED:
10 |
11 | Image Credit: Getty
--------------------------------------------------------------------------------
/tests/sparktests/test_plugin_grep.py:
--------------------------------------------------------------------------------
import pytest
from test_index import CORPUSES
import tempfile
import shutil
import pipes
import os
import ujson as json


def test_spark_plugin_grep(sparksubmit):

    tmp_dir = tempfile.mkdtemp()

    try:

        sparksubmit("spark/jobs/pipeline.py --source corpus:%s --plugin 'plugins.grep.Words:words=c1 d1 world,output=%s/out/,coalesce=1'" % (
            pipes.quote(json.dumps(CORPUSES["simple_link_graph_domain"])),
            tmp_dir
        ))

        parts = [os.path.join(tmp_dir, "out", f) for f in os.listdir(tmp_dir + "/out/") if f.startswith("part-")]
        assert len(parts) == 1
        with open(parts[0], "r") as f:
            data = set(f.read().strip().split("\n"))
        assert data == set([
            "d1,world http://example-d.com/page1",
            "c1 http://example-c.com/page1"
        ])

    finally:
        shutil.rmtree(tmp_dir)
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/lifebuzz.com2.txt:
--------------------------------------------------------------------------------
1 | What Do You Think This Guy Is Doing? You Will Never Guess…. And It’s Going To Break Your Heart
2 |
3 | Sylvie a 13 year old Husky went on her morning walk with her owner when she fell through the ice. After being trapped in the water for about 30 minutes, Firefighter Sean Coyle came to the rescue. The pictures below capture the event as it transpires. Exhausted Sylvie had been in the frigid water for more than 30 minutes
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 | Boston Herald / Polaris Firefighter Sean Coyle uses a basket to slide out to Sylvie
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | Boston Herald / Polaris The husky clings to the ice as firefighter Sean Coyle inches out to the hole
20 |
21 |
22 |
23 |
24 |
25 |
26 |
27 | Boston Herald / Polaris Fireman Coyle grabs Sylvie by the scuff of the neck as he attempts to lift her from the water
28 |
29 |
30 |
31 |
32 |
33 |
34 |
35 | Boston Herald / Polaris
36 |
37 | Next › Page 1 of 2
38 |
39 | 16shares Share on Facebook
40 |
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/basic-tags-cleaning/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
4 |
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
5 |
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
6 |
7 |
8 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
9 |
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
10 |
11 |
--------------------------------------------------------------------------------
/cosrlib/sources/wikidata.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals

from cosrlib.sources import Source
from cosrlib.document import Document
from cosrlib.dataproviders import load_dataprovider


class WikidataSource(Source):
    """ Source that reads 'fake' documents from the WikiData dump """

    def get_partitions(self):

        return ["__wikidata_single_dump__"]

    def iter_documents(self, partition):

        dataprovider = load_dataprovider("wikidata")

        i = 0
        maxdocs = int(self.args.get("maxdocs") or 0)

        for key, _ in dataprovider.iter_rows():

            doc = Document(None, url=b"http://%s" % key)  # TODO get the original URL instead?

            # Summary & title will be inferred from the Wikidata *dataprovider* via url_metadata
            # doc._title = values["wikidata_title"]

            yield doc

            # Stop early once maxdocs documents were yielded (0 means no limit)
            i += 1
            if i >= maxdocs > 0:
                return
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/remove-extra-brs/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
4 |
5 |
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
6 |
7 |
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
8 |
9 |
10 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
11 |
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
12 |
13 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/backstage.com2.txt:
--------------------------------------------------------------------------------
1 | To find Los Angeles–area stage and film acting schools, teachers, and coaches, click here to search Backstage's Acting Schools database.
2 |
3 | Each of the entries contains the following information, if applicable: name of teacher or school, address, phone and fax numbers, email address and/or website, average number of students per class, whether beginning, intermediate, or advanced students are taught, whether auditing is permitted, whether classes are ongoing or by sessions, any special emphasis used in classes or coaching, whether a work/study program is offered. Descriptions of the class, school, or coaching are provided by the instructor or institution and edited by Backstage.
4 |
5 | Also, you can find additional schools and coaches in the Backstage Yellow Pages.
6 |
7 | Schools and coaches who have been omitted may contact listings {at} backstage.com regarding inclusion in the Acting Schools database. Schools and coaches can also self-post listings and media-enhanced ads in the Backstage Yellow Pages. Contact Backstage's Advertising Department for additional marketing options.
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/replace-font-tags/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
4 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
5 |
6 |
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
version: '2'
services:

  elasticsearch:
    image: commonsearch/local-elasticsearch

  urlserver:
    image: commonsearch/local-back
    command: python urlserver/server.py
    volumes:
      - .:/cosr/back:rw
    working_dir: /cosr/back
    depends_on:
      - elasticsearch
    links:
      - elasticsearch
    environment:
      # Use the elasticsearch instance running in the linked container
      - "COSR_ELASTICSEARCHDOCS=elasticsearch:9200"
      - "COSR_ELASTICSEARCHTEXT=elasticsearch:9200"

  cosrback:
    image: commonsearch/local-back
    command: bash
    environment:
      - "TRAVIS"
      - "TRAVIS_BRANCH"
      - "TRAVIS_JOB_ID"
      - "COSR_ENV"
      - "TERM=xterm-256color"
      - "COSR_ELASTICSEARCHDOCS=elasticsearch:9200"
      - "COSR_ELASTICSEARCHTEXT=elasticsearch:9200"
    depends_on:
      - elasticsearch
      - urlserver
    links:
      - elasticsearch
      - urlserver
    volumes:
      - .:/cosr/back:rw
    working_dir: /cosr/back
--------------------------------------------------------------------------------
/cosrlib/utils.py:
--------------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function, unicode_literals

import traceback

#
# Use this file for utility Python functions that are independent from the rest of cosrlib
#


def ignore_exceptions(default):
    """ Wraps a function and catches any exceptions it may raise """
    def outer(fn):
        def wrapped(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:  # pylint: disable=broad-except
                print("Caught Python exception!")
                traceback.print_exc()
                return default
        return wrapped
    return outer


def ignore_exceptions_generator(fn):
    """ Wraps a generator and catches any exceptions it may raise """
    def wrapped(*args, **kwargs):
        try:
            for x in fn(*args, **kwargs):
                yield x
        except Exception:  # pylint: disable=broad-except
            print("Caught Python exception in generator!")
            traceback.print_exc()
    return wrapped
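
A minimal usage sketch for ignore_exceptions follows. The decorator is copied from the file above so the snippet runs on its own; the `parse_int` helper and the default of -1 are hypothetical and only there to trigger the error path.

```python
import traceback


def ignore_exceptions(default):
    """ Same decorator as above, copied so the example is self-contained """
    def outer(fn):
        def wrapped(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:  # pylint: disable=broad-except
                print("Caught Python exception!")
                traceback.print_exc()
                return default
        return wrapped
    return outer


@ignore_exceptions(-1)
def parse_int(value):
    # Hypothetical helper: int() raises ValueError on malformed input
    return int(value)


print(parse_int("42"))            # normal path, the wrapped result is returned
print(parse_int("not a number"))  # error path, the traceback is printed and the default is returned
```

The generator variant behaves similarly, except that it has no default: items yielded before the exception are kept and iteration simply ends.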
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/graziadaily.co.uk1.txt:
--------------------------------------------------------------------------------
1 | Lady Gaga's no stranger to the world of beauty, having worked on a MAC makeup collection before releasing two of her own fragrances. Now she's dipping her heel into the realm once more as the face of Shiseido's 2015 New Year's ad campaign.
2 |
3 | Not only that but Gaga turned her hand to photography for the Japanese beauty brand, shooting the campaign images herself. And yes, that means selfies ahoy!
4 |
5 | 50 selfies to be exact. According to WWD, the pop star shot a different image for a whooping 50 different Japanese newspapers. And it's no surprise considering selfies have dominated social media in 2014 and there's no doubt Gaga's had plenty of practice - have you seen her selfie-packed Instagram profile? Exactly.
6 |
7 | Each image will appear across the New Year's busy shopping period, encouraging shoppers to look as selfie-perfect as Gaga and we have every faith they'll be as wacktacular as ever. Sadly, Shiseido has no plans to publish the campaign outside of Japan but if you'd like to see Gaga's 50 selfies, they're due to be released on the brand's website in the New Year. Failing that, we're sure one or two will pop up on Instagram.
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/dallasnews.com1.txt:
--------------------------------------------------------------------------------
1 | On Saturday, I went to volunteer at the Operation Care International’s Christmas Party For Jesus at the Kay Bailey Hutchison Convention Center, and being directionally impaired, I got lost and had no idea where to park.
2 |
3 | I saw two female Dallas police officers and rolled down my window to ask about parking. Upon rolling down my window, the antler on my window fell off. Not wanting to backtrack and tie up traffic, I drove off leaving the antler on the street. I found the parking meter lot that the officer had given me directions to, but then realized I had no change.
4 |
5 | I was standing by the meter figuring out how to pay when I hear someone saying, “Hey, you lost your antler.” I turned around, and the female officer was in her squad car holding out the antler that had fallen off my window. I went over and got my antler, explaining that I was figuring out how to pay for parking because I had no change. The officer said they had just announced that they would not be giving out parking-meter ticket violations that day during the event hours. I’m thankful to our Dallas Police Department for all they do.
6 |
7 | Kim Gump, Dallas
8 |
9 | Top Picks
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/replace-brs/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum
4 |
dolor sit
5 |
amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
6 |
7 |
8 |
Tempor
9 |
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
10 |
11 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/gq.com1.txt:
--------------------------------------------------------------------------------
1 | By
2 |
3 |
4 |
5 | Ahhh, at last, Memorial Day weekend. The sweet sweet reminder that summer has officially arrived. It's just crab cakes and football straight through Labor Day, right? Yeah. Well, if you find yourself celebrating the return of the season with a road trip, but have yet to pack, we've got your back. You'll be traveling with everything you need in no time. All you have to do is pay close attention to the following.
6 |
7 |
8 |
9 | First, make sure your trusty weekender is ready to go. If you're bringing a suit, better go with a garment bag.
10 |
11 | Grab your toiletry kit, a 3-day weekend is not the time to let your skin down.
12 |
13 | Make sure you bring sunscreen. It's about your health, it's important.
14 |
15 | Don't just throw everything in your weekender, guys. There's an art to folding. Learn it. Live it. Love it.
16 |
17 | Planning on going for a swim? You'll need one of these. Not the board shorts you bought a few summers ago. Forget about those. Forever.
18 |
19 | Skip the running shoes, you don't need those when you've got this.
20 |
21 | Throw in a good book for the beach, the train, or wherever.
22 |
23 | And remember, don't show up at somebody's house empty-handed.
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/mlive.com1.txt:
--------------------------------------------------------------------------------
1 | Police have released surveillance photos in hopes of garnering help to track down a suspect in the University Bank robbery on Dec. 22.
2 |
3 | Although the photos do not show clear images of the suspect’s face, they show what appears to be a white male in dark colored clothing.
4 |
5 | The man is described as being in his 20s, about 5 feet 11 inches tall and 200 pounds, Ann Arbor Police Department Detective Lt. Robert Pfannes said. He is described as wearing dark rimmed glasses, a dark grey or khaki hooded sweater, black pants and white gym shoes.
6 |
7 | He was wearing an open-face knit ski hat that encircled his face.
8 |
9 | The robbery took place at about 3:15 p.m. Dec. 22 when a man entered University Bank at 2015 Washtenaw Avenue and handed a note to a bank teller, Pfannes previously said.
10 |
11 | The note implied the suspect, described as a white male, had a weapon and demanded money, he said. After taking money, the man then fled on foot.
12 |
13 | The man may have then entered a car in 2100 block of Washtenaw, Pfannes said Monday.
14 |
15 | Those with information on the bank robbery are asked to contact the Ann Arbor Police Department tip line at 734-794-6939 or e-mail TIPS@a2gov.org.
--------------------------------------------------------------------------------
/cosrlib/signals/alexa_top1m.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | import math
4 |
5 | from . import BaseSignal
6 |
7 |
8 | class Signal(BaseSignal):
9 |     """ Ranking signal based on the list of the top 1 million domains from Alexa
10 |     """
11 |
12 |     def get_value(self, document, url_metadata):
13 |
14 |         factor = 1.0
15 |         alexa_rank = url_metadata["domain"].alexa_top1m
16 |
17 |         # Sub-domain matches have a slight penalty
18 |         # This works well for en.wikipedia.org but not for xxxx.tumblr.com
19 |         # Ideally we should divide it by the number of known subdomains...
20 |         if not alexa_rank:
21 |             factor = 0.7
22 |             alexa_rank = url_metadata["pld"].alexa_top1m
23 |
24 |         if not alexa_rank:
25 |             return None
26 |
27 |         max_rank = 1000001
28 |
29 |         # Attenuate differences in the top 100
30 |         k1 = 2
31 |
32 |         # http://www.wolframalpha.com/input/?i=1+-+(ln(x%2B1)%2Fln(1000000))%5E2+with+x+from+0+to+1000
33 |         # http://www.wolframalpha.com/input/?i=1+-+(ln(x%2B1)%2Fln(1000000))%5E2+with+x+from+0+to+1000000
34 |
35 |         return (1 - (math.log(float(alexa_rank), max_rank)) ** k1) * factor
36 |
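The scoring curve above can be tried in isolation. The following is a minimal sketch, not part of the repository: the function name `alexa_signal` and its `is_subdomain_fallback` flag are hypothetical stand-ins for the `url_metadata` lookups, while the constants and the formula are copied from `get_value`.

```python
import math

MAX_RANK = 1000001  # same value as max_rank in the signal
K1 = 2              # exponent that attenuates differences among the top ranks


def alexa_signal(rank, is_subdomain_fallback=False):
    """Score a 1-based Alexa rank; return None when no rank is known.

    is_subdomain_fallback is a hypothetical flag standing in for the case
    where the rank came from the "pld" entry instead of the exact domain.
    """
    if not rank:
        return None
    factor = 0.7 if is_subdomain_fallback else 1.0
    return (1 - math.log(float(rank), MAX_RANK) ** K1) * factor


if __name__ == "__main__":
    # Print the score for a few ranks to see how slowly it decays near the top.
    for rank in (1, 100, 10000, 1000000):
        print(rank, round(alexa_signal(rank), 3))
```

Because the logarithm is taken in base `max_rank`, rank 1 maps to the full score and ranks close to one million map to a score near zero; squaring the ratio keeps the top of the list close together.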
--------------------------------------------------------------------------------
/scripts/import_commoncrawl.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | #
4 | # This script imports Common Crawl WARC files and the URL list
5 | # Usage: ./scripts/import_commoncrawl.sh 2
6 | #
7 |
8 |
9 | # Number of ~1 GB WARC files to download. Default is just 1 WARC
10 | WARC_COUNT=${1:-1}
11 |
12 | # Common Crawl ID. See http://blog.commoncrawl.org/ for latest dumps
13 | COMMONCRAWL_ID=${COMMONCRAWL_ID:-CC-MAIN-2016-44}
14 |
15 | mkdir -p local-data/common-crawl/crawl-data
16 |
17 | echo "Downloading file list from Common Crawl: $COMMONCRAWL_ID"
18 | curl 'https://commoncrawl.s3.amazonaws.com/crawl-data/'$COMMONCRAWL_ID'/warc.paths.gz' | gzip -d > local-data/common-crawl/warc.paths.txt
19 |
20 | ccfiles="$(cat local-data/common-crawl/warc.paths.txt | head -$WARC_COUNT)"
21 |
22 | # Cleanup if there were leftovers from a previous download
23 | find -L local-data/common-crawl/crawl-data -name "*.tmp" | xargs rm -f
24 |
25 | for f in ${ccfiles[@]}
26 | do
27 |   # Check the same location the download below writes to
28 |   if [ -f local-data/common-crawl/$f ]; then
29 |     echo "Already downloaded: `basename $f` ..."
30 |   else
31 |     echo "Downloading: `basename $f` ..."
32 |     echo "---"
33 |     curl --create-dirs https://commoncrawl.s3.amazonaws.com/$f -o local-data/common-crawl/$f.tmp
34 |     mv local-data/common-crawl/$f.tmp local-data/common-crawl/$f
35 |   fi
36 | done
36 |
37 |
--------------------------------------------------------------------------------
/tests/README.md:
--------------------------------------------------------------------------------
1 | # Test suite for cosr-back
2 |
3 | We have an extensive series of tests that are run on [Travis CI](https://travis-ci.org/commonsearch/cosr-back) at each commit.
4 |
5 | ## How to run everything
6 |
7 | Just run this command:
8 |
9 | ```
10 | make docker_test
11 | ```
12 |
13 | This will start our dependencies like Elasticsearch (with docker-compose) and run the whole test suite in a container. This will take a while!
14 |
15 | ## How to run just one test
16 |
17 | If you modified one part of the code and just want to run the associated tests again, you can open a shell into the container and invoke our test runner [pytest](https://pytest.org) directly.
18 |
19 | First, open the shell:
20 |
21 | ```
22 | make docker_shell
23 | ```
24 |
25 | You are now inside the main container environment, with additional containers linked as dependencies available for your tests.
26 |
27 | Then, run one of these commands depending on your needs:
28 |
29 | ```
30 | # Just one directory
31 | py.test tests/cosrlibtests/ -v
32 |
33 | # Just one file, with realtime output
34 | py.test tests/cosrlibtests/signals/test_wikidata.py -v -s
35 |
36 | # By keyword
37 | py.test tests/ -k "rank"
38 |
39 | # Simple speed benchmark
40 | py.test tests/cosrlibtests/document/html/ -v --repeat 50 --profile
41 | ```
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/reuters.com1.txt:
--------------------------------------------------------------------------------
1 | People walk past the Bombay Stock Exchange (BSE) building in Mumbai May 13, 2014.
2 |
3 | The best and worst of Bollywood in 2014.
4 |
5 | MUMBAI (Reuters) - The BSE Sensex and Nifty ended flat on Tuesday as weaker regional shares offset optimism over additional reforms a day after the government passed an executive order to ease land-acquisition rules.
6 |
7 | Monday's announcement could kick-start hundreds of billions of dollars worth of stalled projects, and comes after separate orders were passed to implement coal and insurance reforms.
8 |
9 | However, weak sentiment across the region on a sharp selloff in commodities overnight and political uncertainty in Greece left investors hesitant to take big bets.
10 |
11 | The BSE Sensex gained 0.03 percent at 27,403.54 points, while the broader Nifty ended 0.02 higher percent at 8,248.25 points.
12 |
13 | Infrastructure stocks gained with Larsen and Toubro (LART.NS) adding 0.5 percent while Lanco Infratech (LAIN.NS) rose 4.3 percent.
14 |
15 | However, energy firms fell as Brent crude fell to a 5-1/2 year low below $57 a barrel. Reliance Industries (RELI.NS) closed down 1.9 percent and Oil and Natural Gas Corp (ONGC.NS) ended down 1.5 percent.
16 |
17 | (Reporting by Indulal PM; Editing by Sunil Nair)
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/missing-paragraphs/expected-metadata.json:
--------------------------------------------------------------------------------
1 | {
2 | "title": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy\n eirmod tempor invidunt",
3 | "byline": "Henri Sivonen",
4 | "excerpt": "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy\n eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam\n voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet\n clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit\n amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam\n nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,\n sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.\n Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor\n sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed\n diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,\n sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum.\n Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor\n sit amet.",
5 | "readerable": true
6 | }
7 |
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/replace-brs/source.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Replace brs test
6 |
7 |
8 |
9 |
Lorem
10 |
11 | Lorem ipsum dolor sit
amet, consectetur adipisicing elit, sed do eiusmod
12 | tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
13 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
14 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
15 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
16 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
17 |
18 |
Foo
19 |
20 | Tempor
incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
21 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
22 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
23 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
24 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
25 |
11 | Lorem
12 | ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
13 | tab here
14 | incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
15 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
16 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
17 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
18 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
19 |
20 |
Foo
21 |
22 | Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
23 | quis nostrud exercitation
24 |
25 |
26 |
27 |
28 | ullamco laboris nisi ut aliquip ex ea commodo
29 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
30 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
31 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
32 |
33 |
34 |
35 |
36 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/cleveland.com1.txt:
--------------------------------------------------------------------------------
1 | Chia seeds and goji berries are the new kale and quinoa, according to Google's recent parsing of food-focused searches from 2014. Each year, the search giant pours through some of our more fascinating queries to come up with their Year in Search.
2 |
3 | Some of the other more interesting food-related data points include:
4 |
5 | Pizza was searched more than the World Cup.
6 |
7 | The Cronut rose to 17th on the global recipe list after its arrival last year.
8 |
9 | Our favorite ways to eat eggs are: 1) Deviled, 2) Scotch, 3) Scrambled, 4) Pickled, 5) Boiled.
10 |
11 | This year we searched for 'recipes' less and 'restaurant' significantly more.
12 |
13 | Our top slimming questions were 'how many calories should i eat in a day' and 'how to lose weight,' and the Paleo diet was the top searched way to trim down.
14 |
15 | Foodies in Japan searched French food more than France.
16 |
17 | Hungry folk in Australia searched Argentine food more than Argentina.
18 |
19 | Spice-loving Brits searched Indian food more than India.
20 |
21 | In 2014 'i am hungry' was searched a button-popping 7x more than 'i am thirsty.'
22 |
23 | Oh, and nine million people watched a tiny hampster eating a tiny burrito.
24 |
25 | The lesson, as always: you are what you search.
26 |
27 | -- By Michael Russell Follow @tdmrussell
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/remove-extra-brs/source.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Remove trailing brs test
6 |
7 |
8 |
9 |
Lorem
10 |
11 |
12 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
13 | tempor incididunt ut labore et dolore magna aliqua.
14 |
Ut enim ad minim veniam,
15 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
16 | consequat.
17 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
18 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
19 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
20 |
21 |
Foo
22 |
23 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
24 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
25 | consequat.
26 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
27 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
28 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
29 |
30 |
31 |
32 |
33 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/bhg.com2.txt:
--------------------------------------------------------------------------------
1 | BHG.com / Videos / / / How to Hang a Christmas Wreath (Two No-Fail Secrets!)
2 |
3 | How to Hang a Christmas Wreath (Two No-Fail Secrets!)
4 |
5 | It's not a holiday home until you've hung a Christmas wreath. This quick video offers nifty tricks for damage-free wreath hanging on doors and windows. Get our secrets to success!
6 |
7 | A pretty wreath, is just the trick to take your Christmas decorations from basic to bold. Here are the no fail secrets, for hanging window, and door wreaths. For outdoor wreaths, choose a ribbon that's strong, durable, and will make a high impact statement. Your best bet, is a two and half inch satin ribbon. To avoid a droopy look, hang your wreath in the top half of the window. On a door, center the wreath at about eye level. To hang a wreath in a window, lower the top window sash and place the wreath outside of the window while holding the ends of the length of the hanging ribbon. [MUSIC] A piece of paint-friendly tape provides stability. Tape just the very end of the ribbon before sliding your window securely close. When hanging on a door, use a staple gun, to staple the ends of the hanging ribbon to the center top of the door. That way, the staples will never be seen. To keep a lightweight wreath in place, use double-sided foam tape on the back. It's that simple. Use this method every Christmas, to hang your wreath without damaging your home. [MUSIC]
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/replace-font-tags/source.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Replace font tags test
6 |
7 |
8 |
9 |
Lorem
10 |
11 | Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
12 | tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
13 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
14 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
15 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
16 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
17 |
18 |
Foo
19 |
20 | Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
21 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
22 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
23 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
24 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
25 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
14 | tempor incididunt ut labore et dolore magna aliqua.
15 |
Ut enim ad minim veniam,
16 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
17 | consequat.
18 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
19 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
20 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
21 |
22 |
Foo
23 |
24 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
25 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
26 | consequat.
27 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
28 | cillum dolore eu fugiat nulla pariatur.
29 | Excepteur sint occaecat cupidatat non
30 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
31 |
32 |
33 |
34 |
35 |
--------------------------------------------------------------------------------
/cosrlib/sources/corpus.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | import json
4 |
5 | from cosrlib.sources import Source
6 | from cosrlib.url import URL
7 |
8 |
9 | class CorpusSource(Source):
10 |     """ Source that yields documents from a static corpus. Mostly used in tests """
11 |
12 |     def get_partitions(self):
13 |
14 |         if self.args.get("path"):
15 |             return [{
16 |                 "path": self.args["path"]
17 |             }]
18 |         else:
19 |             return [{
20 |                 "doc": doc
21 |             } for doc in self.args.get("docs") or []]
22 |
23 |     def iter_items(self, partition):
24 |         """ Partition can be either a single raw document, or a filepath to a JSON file """
25 |
26 |         if partition.get("path"):
27 |             with open(partition["path"], "r") as f:
28 |                 docs = json.load(f)
29 |         else:
30 |             docs = [partition["doc"]]
31 |
32 |         for doc in docs:
33 |
34 |             url = URL(doc["url"].encode("utf-8"))
35 |
36 |             do_parse, index_level = self.qualify_url(url)
37 |
38 |             if do_parse:
39 |
40 |                 yield (
41 |                     url,
42 |                     {"Content-Type": "text/html"},
43 |                     "html",
44 |                     index_level,
45 |                     doc["content"].encode("utf-8")
46 |                 )
47 |
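The two partition shapes handled above can be illustrated without the rest of `cosrlib`. This is a minimal sketch under stated assumptions: `get_partitions` and `load_docs` are hypothetical free functions that mirror the method bodies, the example URLs and contents are invented, and the `URL` wrapping and `qualify_url` filtering are left out. The only facts taken from the source are that each document is a dict with `url` and `content` keys and that a `path` partition points at a JSON file holding a list of such dicts.

```python
import json
import os
import tempfile


def get_partitions(args):
    """One partition for a JSON file path, otherwise one partition per inline doc."""
    if args.get("path"):
        return [{"path": args["path"]}]
    return [{"doc": doc} for doc in args.get("docs") or []]


def load_docs(partition):
    """Return the raw documents a partition refers to."""
    if partition.get("path"):
        with open(partition["path"], "r") as f:
            return json.load(f)
    return [partition["doc"]]


# Invented sample corpus: each entry carries the "url" and "content" keys.
docs = [
    {"url": "http://example.com/a", "content": "<html><title>A</title></html>"},
    {"url": "http://example.com/b", "content": "<html><title>B</title></html>"},
]

# Inline documents: one partition per document.
inline_partitions = get_partitions({"docs": docs})

# File-based corpus: a single partition that loads every document in the file.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(docs, f)
    corpus_path = f.name

file_partitions = get_partitions({"path": corpus_path})
file_docs = load_docs(file_partitions[0])
os.remove(corpus_path)
```

Inline documents spread across partitions, while a file-based corpus is always read as one partition.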
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/ok.co.uk1.txt:
--------------------------------------------------------------------------------
1 | WE can't keep up with all the engagements happening at the moment.
2 |
3 | And it seems that there could be another one to add to our list.
4 |
5 | Has Sean Penn popped the question? [Wenn]
6 |
7 | Sean Penn has reportedly popped the question to Charlize Theron.
8 |
9 | The I Am Sam actor is said to have proposed while on a romantic trip to Paris in November.
10 |
11 | US sources claimed that the 54-year-old was keen to take his relationship with the mum-of-one to the next level.
12 |
13 | A source added that Charlize isn't yet wearing a ring to symbolise their union.
14 |
15 | The couple have been dating for the last year [Splash]
16 |
17 | They said: "There's no ring, but they are committed."
18 |
19 | The pair have been friends for a long time but only became romantically involved last year.
20 |
21 | It will be the third time Sean has walked down the aisle. He was previously married to Madonna and split from second wife Robin Wright in 2010.
22 |
23 | The couple had two children together, Hopper and Dylan.
24 |
25 | Will it be third time lucky for Madonna's ex-husband? [Splash]
26 |
27 | Related article: Charlize Theron and Sean Penn plan to marry and adopt
28 |
29 | Related article: Charlize Theron causes outrage as she compares 'Googling herself to being raped'
30 |
31 | Related article: Josie Cunningham is engaged
32 |
33 | Related article: Hollyoaks co-stars Emmett Scanlan and Claire Cooper get engaged
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/oregonlive.com2.txt:
--------------------------------------------------------------------------------
1 | A 36-year-old part-time missionary who served a year in a Cambodian prison for sexually abusing boys in an orphanage pleaded not guilty on Monday in a Eugene courtroom to a rarely imposed federal charge of engaging in illicit sexual conduct in a foreign place.
2 |
3 | Daniel Stephen Johnson faces a potential 30-year prison term on the new charge, which accuses him of having sex with a boy in the Kingdom of Cambodia sometime between Nov. 28, 2005, and Oct. 12, 2006.
4 |
5 | U.S. Magistrate Judge Thomas Coffin set Johnson's trial for Feb. 25.
6 |
7 | A federal grand jury indicted Johnson on Dec. 10. He's awaiting trial in the Lane County Jail.
8 |
9 | Johnson served a one-year sentence in Cambodia for sexually abusing boys in his care at an orphanage, The Register-Guard newspaper reported. He worked as a Christian missionary in the Southeast Asian country for about a decade, according to the Cambodia-based anti-pedophile group Action pour les Enfants.
10 |
11 | A 2003 federal law aimed at preventing child abuse made it a crime for any U.S. citizen to have illegal sexual contact with a minor in a foreign country.
12 |
13 | More than a decade ago, Johnson was accused in Oregon of molesting three children in his sister's care.
14 |
15 | Lincoln County prosecutors dismissed charges after investigators began to doubt the alleged victims' statements, according to a 2003 article in the Yamhill Valley News-Register.
16 |
17 | -- Bryan Denson
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/remove-extra-paragraphs/source.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Replace font tags test
6 |
7 |
8 |
9 |
Lorem
10 |
11 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
12 | tempor incididunt ut labore et dolore magna aliqua.
13 |
14 |
Ut enim ad minim veniam,
15 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
16 | consequat.
17 |
18 |
19 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
20 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
21 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
22 |
23 |
24 |
Foo
25 |
26 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
27 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
28 | consequat.
29 |
30 |
31 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
32 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
33 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
34 |
35 |
36 |
37 |
38 |
39 |
40 |
41 |
42 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/businessinsider.com1.txt:
--------------------------------------------------------------------------------
1 | The UK may be forced to review its Falkland Islands air defenses to face a renewed threat in the South Atlantic.
2 |
3 | According to a report in the Daily Express newspaper, the Argentine Air Force is set to get a dozen Sukhoi Su-24 Fencer attack planes from Russia in return for foodstuff.
4 |
5 | Due to this, the UK Ministry of Defense is in the process of reviewing the Falkland Islands air defenses. The delivery of the supersonic, all-weather attack aircraft could pose a threat to the islands, referred to as “Malvinas” by Argentina.
6 |
7 | According to Jane’s, the islands current British air defenses include four Eurofighter Typhoon jets, Rapier SAM (Surface to Air Missile) systems, along with about 1,200 troops permanently stationed in the South Atlantic base.
8 |
9 | Even though the Typhoons are modern enough to deal with a dozen Su-24s, the Soviet-era twin-engined two-seater are able to perform ultra-low level surface and maritime strike missions. The planes can be outfitted with a wide variety of General Purpose as well as Laser Guided Bombs and stand-off missiles, such as the Kh-31 (AS-17 “Krypton”) anti-radiation and anti-shipping sea-skimming missiles.
10 |
11 | We don’t know whether the potential deal includes armament; still the possible delivery of Su-24s to Argentina makes the Falkland Islands a bit more vulnerable to an attack by the Fuerza Aérea Argentina.
12 |
13 | This article originally appeared at The Aviationist. Copyright 2015. Follow The Aviationist on Twitter.
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-015.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | No encoding declaration
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
15 | Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
16 | tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
17 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
18 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
19 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
20 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
21 |
22 |
27 |
Foo
28 |
29 | Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
30 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
31 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
32 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
33 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
34 |
35 |
36 |
41 |
42 |
43 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/247wallst.com2.txt:
--------------------------------------------------------------------------------
1 | December 28, 2014: Here are four stocks among the 66 equities making new 52-week lows today. Volumes continue to be low as we wait to ring in the new year.
2 |
3 | Transocean Partners LLC (NYSE: RIGP) posted a new 52-week — and all-time — low on Monday of $13.18. Based on Friday night’s closing price of $14.32 that’s a drop of nearly 8%. The stock’s 52-week high is $29.43. Volume is about a third higher than the daily average of around 325,000 shares. The offshore drilling firm had no specific news today, but crude prices are down about 2%.
4 |
5 | SandRidge Mississippian Trust II (NYSE: SDR) dropped about 5.3% on Monday to post a new 52-week — and all-time — low of $3.91 after closing at $4.13 on Wednesday. The stock’s 52-week high is $10.12. Share volume is nearly double the daily average of 400,000 shares traded. The royalty company had no specific news today but is feeling pressure from low crude prices.
6 |
7 | Vivus Inc. (NASDAQ: VVUS) dropped about 6% Monday to post a new 52-week low of $2.78. The stock’s 52-week high is $9.80. Volume totaled 25% higher than the daily average of around 1.9 million shares. The U.S. FDA approved a competing diabetes drug last week to compete with Vivus’s Qsymia treatment.
8 |
9 | Zulilly Inc. (NASDAQ: ZU) dropped about 3.8% Monday to establish a new 52-week — and all-time — low at $22.39 against a high of $73.50. Volume was 25% below the daily average of around 1.4 million shares. The online retailer had no specific news today.
10 |
11 | ALSO READ: States Where People Live Longest
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/self.com2.txt:
--------------------------------------------------------------------------------
1 | Starting the New Year on a good note means skipping the stress—especially when it comes to your hair! But since many of us will be working in the hours leading up to the ball drop (often with little to no time to adjust between office and party) our suggestion is to prep your locks in the AM. With the right style and a few night-appropriate tricks, your hair won't need much upkeep to make it to midnight.
2 |
3 | Loose Waves
4 |
5 | Start with structured waves in the morning, wrapping small sections of hair (1-1 1/2-inch pieces) around a 1-inch curling iron for 20 to 30 seconds. Brush the ringlets out and throughout the day, they’ll smooth to soft waves. After work, apply a texturizing spray into the hair, spraying and scrunching each section to increase volume.
6 |
7 | Glam Topknot
8 |
9 | The secret to a topknot is in the volume. In the morning apply a generous amount of dry shampoo from the scalp to the tip, scrunching the hair to add fullness. Then pull the hair up into a twisted topknot towards the top of the head. Hold with hairspray. At night, either pull a few front pieces out to soften the look or add a glitzy headband.
10 |
11 | Messy Fishtail
12 |
13 | Pull off a lived-in braid the natural way. Start with a clean fishtail braid in the morning and let it naturally transition to messy-chic—pieces will fall out of the plait over time. At the end of the day, use a dry shampoo at the roots to make the entire style look lived in. Add a bow or leather cuff at the tip of the braid for subtle structure.
14 |
15 | RELATED:
--------------------------------------------------------------------------------
/scripts/git-set-file-times:
--------------------------------------------------------------------------------
1 | #!/usr/bin/perl -w
2 | use strict;
3 |
4 | # sets mtime and atime of files to the latest commit time in git
5 | #
6 | # This is useful for serving static content (managed by git)
7 | # from a cluster of identically configured HTTP servers. HTTP
8 | # clients and content delivery networks can get consistent
9 | # Last-Modified headers no matter which HTTP server in the
10 | # cluster they hit. This should improve caching behavior.
11 | #
12 | # This does not take into account merges, but if you're updating
13 | # every machine in the cluster from the same commit (A) to the
14 | # same commit (B), the mtimes will be _consistent_ across all
15 | # machines if not necessarily accurate.
16 | #
17 | # THIS IS NOT INTENDED TO OPTIMIZE BUILD SYSTEMS SUCH AS 'make'
18 | # YOU HAVE BEEN WARNED!
19 |
20 | my %ls = ();
21 | my $commit_time;
22 |
23 | if ($ENV{GIT_DIR}) {
24 |     chdir($ENV{GIT_DIR}) or die $!;
25 | }
26 |
27 | $/ = "\0";
28 | open FH, 'git ls-files -z|' or die $!;
29 | while (<FH>) {
30 |     chomp;
31 |     $ls{$_} = $_;
32 | }
33 | close FH;
34 |
35 |
36 | $/ = "\n";
37 | open FH, "git log -m -r --name-only --no-color --pretty=raw -z @ARGV |" or die $!;
38 | while (<FH>) {
39 |     chomp;
40 |     if (/^committer .*? (\d+) (?:[\-\+]\d+)$/) {
41 |         $commit_time = $1;
42 |     } elsif (s/\0\0commit [a-f0-9]{40}( \(from [a-f0-9]{40}\))?$// or s/\0$//) {
43 |         my @files = delete @ls{split(/\0/, $_)};
44 |         @files = grep { defined $_ } @files;
45 |         next unless @files;
46 |         utime $commit_time, $commit_time, @files;
47 |     }
48 |     last unless %ls;
49 |
50 | }
51 | close FH;
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/basic-tags-cleaning/source.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Basic tag cleaning test
6 |
7 |
8 |
9 |
Lorem
10 |
11 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
12 | tempor incididunt ut labore et dolore magna aliqua.
13 |
Ut enim ad minim veniam,
14 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
15 | consequat.
16 |
17 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
18 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
19 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
20 |
21 |
Foo
22 |
23 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
24 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
25 | consequat.
26 |
29 |
30 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
31 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
32 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
19 |
20 |
21 |
28 |
33 |
34 |
35 |
36 |
37 |
38 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/youbeauty.com2.txt:
--------------------------------------------------------------------------------
1 | I have confused skin. One moment it is dry and needs hydration and the next it is oily and greasy. So I need a powder that will cut the oil, but still let my skin breathe. Clinique Acne Solutions Powder Makeup gently covers blemishes, evens skin and even absorbs oil to leave skin looking fresh. It's made specifically for "dry combination to oily skin types," so it may be just what I have been looking for.
2 |
3 | Product: Clinique Acne Solutions Powder Makeup
4 |
5 | Price: $31.00
6 |
7 | Tags:
8 |
9 | How to Use it: The powder comes with a sponge applicator, but feel free to apply it with a powder brush as well. Apply to oily spots only or use the powder on your entire face for full-coverage. (It's also available in liquid form.)
10 |
11 | Results: The shade that I received as a trial, Golden, was a bit too dark for my complexion, but I'm expecting to get a tan soon on a trip to the tropics. That aside, my skin has been in its oily phase for the past few days, so I was super excited to see if Clinique's Acne Solutions Powder Makeup really delivered — and it definitely did! The powder immediately cut the shine and gave my skin an even, natural look. It is very light and even thought it builds on your skin, it doesn't look or feel "caked on" like some powders do. I will be purchasing it for myself in the future, although maybe in a lighter shade.
12 |
13 | Disclaimer: I received Clinique Acne Solutions Powder as a sample.
14 |
15 | Related Articles:
16 |
17 | Could Resveratrol Be the Secret to Clearer Skin?
18 |
19 | Neroli Oil: The Ideal Face Oil for Oily Skin
20 |
21 | 5 Stubborn Skin Issues and How to Fix Them
--------------------------------------------------------------------------------
/plugins/filter.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | from cosrlib.plugins import Plugin
4 | from cosrlib import re
5 |
6 |
7 | class FilterPlugin(Plugin):
8 |     """ Abstract class for plugins that filter URLs """
9 |
10 |     def __init__(self, args):
11 |         Plugin.__init__(self, args)
12 |         self.do_index_body = bool(int(self.args.get("index_body", "0")))
13 |         self.do_index = bool(int(self.args.get("index", "0"))) or self.do_index_body
14 |         self.do_parse = (not bool(int(self.args.get("skip", "0")))) or self.do_index
15 |
16 |     def hook_filter_url(self, url):
17 |         """ Returns what to do with this URL: (do_parse, do_index, do_index_body) """
18 |
19 |         if self.match_url(url):
20 |             return (self.do_parse, self.do_index, self.do_index_body)
21 |         return (None, None, None)
22 |
23 |     def match_url(self, url):
24 |         return True
25 |
26 |
27 | class All(FilterPlugin):
28 |     """ Filters all documents """
29 |
30 |     def match_url(self, url):
31 |         return True
32 |
33 |
34 | class Homepages(FilterPlugin):
35 |     """ Filters homepages """
36 |
37 |     def match_url(self, url):
38 |         return (url.parsed.path == "/" and url.parsed.query == "")
39 |
40 |
41 | class Domains(FilterPlugin):
42 |     """ Domain filter """
43 |
44 |     def init(self):
45 |         # Match based on domain suffixes
46 |         self.regex_source = "|".join([re.escape(d) + "$" for d in self.args["domains"].split(" ")])
47 |         self.regex = re.compile(self.regex_source)
48 |
49 |     def match_url(self, url):
50 |         return self.regex.search(url.domain)
51 |
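The suffix matching used by `Domains` can be tried in isolation. The following is a minimal sketch, assuming the standard-library `re` module behaves like the `re` imported from `cosrlib` for this pattern; `build_domain_regex` and the sample domains are hypothetical names used only for illustration.

```python
import re


def build_domain_regex(domains):
    """Compile one regex that matches any of the space-separated domain suffixes."""
    # Same construction as Domains.init(): escape each domain, anchor it at the end
    source = "|".join(re.escape(d) + "$" for d in domains.split(" "))
    return re.compile(source)


regex = build_domain_regex("example.com wikipedia.org")

# Subdomains match because only the end of the string is anchored
matches_subdomain = bool(regex.search("en.wikipedia.org"))
# A domain with a different suffix does not match
matches_other = bool(regex.search("example.net"))
# Caveat of pure suffix matching: "notexample.com" also ends with "example.com"
matches_lookalike = bool(regex.search("notexample.com"))
```

If look-alike domains matter, one option would be to anchor each alternative on a preceding dot or the start of the string; that is a design suggestion, not something the original plugin does.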
--------------------------------------------------------------------------------
/cosrlib/config.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | import os
4 | import json
5 | import re
6 |
7 | #
8 | # Loads the configuration from (in increasing order of priority; later sources override earlier ones):
9 | # - Default values below (that work well in a local Docker install)
10 | # - Values in cosr-config.json
11 | # - COSR_* Environment variables
12 | #
13 |
14 | _defaults = {
15 |
16 |     # HTTP URL of both ElasticSearch servers
17 |     "ELASTICSEARCHTEXT": "http://172.17.0.1:39200",
18 |     "ELASTICSEARCHDOCS": "http://172.17.0.1:39200",
19 |
20 |     # Host:port of the URLserver instance, or "local" for direct import on the same node
21 |     "URLSERVER": "local",  # "172.17.0.1:9702"
22 |
23 |     # Host:port of the Explainer instance
24 |     "EXPLAINER": "0.0.0.0:9703",  # "127.0.0.1:9703"
25 |
26 |     # Environment type: prod, staging, local, ci, ...
27 |     "ENV": "local",
28 |
29 |     # Should we use files in tests/testdata/ as datasources? ("0" or "1")
30 |     "TESTDATA": "0",
31 |
32 |     # Path to the parent directory of cosrlib
33 |     "PATH_BACK": os.path.dirname(os.path.dirname(__file__)),
34 |
35 |     # Path to the local-data directory
36 |     "PATH_LOCALDATA": os.path.join(os.path.dirname(os.path.dirname(__file__)), "local-data")
37 | }
38 |
39 | _config_file = os.path.normpath(os.path.join(__file__, "../../cosr-config.json"))
40 | if os.path.isfile(_config_file):
41 |     with open(_config_file, "r") as f:
42 |         _cnt = re.sub(r"\/\*.*?\*\/", "", f.read().replace("\n", ""), flags=re.DOTALL)
43 |         _defaults.update(json.loads(_cnt))
44 |
45 | config = {}
46 | for k, default in _defaults.items():
47 |     config[k] = os.getenv("COSR_%s" % k, default)
48 |
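The layering above (defaults, then `cosr-config.json`, then `COSR_*` environment variables) can be illustrated without touching the filesystem. This is a sketch under stated assumptions: `load_config` and its parameters are hypothetical, and the JSON text and environment are passed in explicitly instead of being read from disk and `os.environ`.

```python
import json
import os
import re


def load_config(defaults, json_text=None, environ=None, prefix="COSR_"):
    """Layer config sources: defaults, then JSON file content, then environment."""
    environ = os.environ if environ is None else environ
    merged = dict(defaults)
    if json_text is not None:
        # Strip /* ... */ comments before parsing, as config.py does
        cleaned = re.sub(r"/\*.*?\*/", "", json_text.replace("\n", ""), flags=re.DOTALL)
        merged.update(json.loads(cleaned))
    # Environment variables win over both other sources
    return {k: environ.get(prefix + k, v) for k, v in merged.items()}


config = load_config(
    {"ENV": "local", "TESTDATA": "0"},
    json_text='{\n  /* staging override */\n  "ENV": "staging"\n}',
    environ={"COSR_TESTDATA": "1"},
)
```

Note that, as in the original, only keys already present in the defaults or the JSON file are looked up in the environment, and environment values always arrive as strings.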
--------------------------------------------------------------------------------
/cosrlib/dataproviders/wikidata.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | from cosrlib.url import URL
4 | from . import BaseDataProvider
5 |
6 |
7 | class DataProvider(BaseDataProvider):
8 |     """ Returns the Wikidata title and description for entities that declare an "official website" (P856)
9 |     """
10 |
11 |     dump_testdata = "tests/testdata/wikidata.json"
12 |     dump_url = "https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz"
13 |     dump_compression = "gz"
14 |     dump_format = "json-lines"
15 |     dump_batch_size = 10000
16 |     dump_count_estimate = 1000000
17 |
18 |     def import_row(self, i, row):
19 |         """ Maps a raw data row into a list of (key, values) pairs """
20 |
21 |         # https://www.wikidata.org/wiki/Property:P856
22 |         official_website = None
23 |         for claim in row["claims"].get("P856", []):
24 |             if (
25 |                 claim["type"] == "statement" and
26 |                 claim["mainsnak"]["datatype"] == "url" and
27 |                 claim["mainsnak"].get("datavalue")
28 |             ):
29 |                 official_website = URL(claim["mainsnak"]["datavalue"]["value"].encode("utf-8")).normalized
30 |
31 |         # TODO: other languages!
32 |         label_en = row["labels"].get("en", {}).get("value") or ""
33 |         description_en = row["descriptions"].get("en", {}).get("value") or ""
34 |
35 |         if official_website:
36 |             yield official_website, {
37 |                 "wikidata_title": label_en,
38 |                 "wikidata_description": description_en,
39 |                 "wikidata_sitelinks": len(row.get("sitelinks") or [])
40 |             }
41 |
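To see what `import_row` produces, the mapping can be sketched on a hand-written entity. Everything here is an assumption for illustration: `official_website_record` is a hypothetical stand-alone function, the sample row is heavily trimmed, and the URL is kept raw instead of being normalized through `cosrlib.url.URL`.

```python
def official_website_record(row):
    """Return (website, fields) for a Wikidata entity dict, or None without a P856 URL."""
    website = None
    for claim in row.get("claims", {}).get("P856", []):
        snak = claim.get("mainsnak", {})
        if claim.get("type") == "statement" and snak.get("datatype") == "url" and snak.get("datavalue"):
            # The original normalizes this value with cosrlib's URL class; kept raw here
            website = snak["datavalue"]["value"]
    if website is None:
        return None
    return website, {
        "wikidata_title": row.get("labels", {}).get("en", {}).get("value") or "",
        "wikidata_description": row.get("descriptions", {}).get("en", {}).get("value") or "",
        "wikidata_sitelinks": len(row.get("sitelinks") or []),
    }


sample_row = {  # hypothetical, heavily trimmed entity
    "claims": {"P856": [{
        "type": "statement",
        "mainsnak": {"datatype": "url", "datavalue": {"value": "http://example.com/"}},
    }]},
    "labels": {"en": {"value": "Example"}},
    "descriptions": {},
    "sitelinks": {"enwiki": {}, "frwiki": {}},
}
record = official_website_record(sample_row)
```

As in the original loop, when an entity carries several P856 claims the last qualifying one wins.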
--------------------------------------------------------------------------------
/explainer/static/js/explainer.js:
--------------------------------------------------------------------------------
1 | $(function() {
2 |
3 |   /*
4 |     Quick script to manage the Explainer frontend. Could be improved in many ways!
5 |   */
6 |
7 |   $.urlParam = function(name){
8 |     var results = new RegExp('[\?&]' + name + '=([^&#]*)').exec(window.location.href);
9 |     if (results==null){
10 |       return null;
11 |     }
12 |     else{
13 |       return decodeURIComponent(results[1]).replace(/\+/g, " ") || 0;
14 |     }
15 |   }
16 |
17 |   var urldebug = function(url) {
18 |     $("#url").val(url);
19 |
20 |     $.get("/api/urldebug", {"url": url}, function(data) {
21 |       data = JSON.parse(data);
22 |       console.log(data);
23 |
24 |       var tpl = $("#tpl_urldebug").html();
25 |
26 |       var html = _.template(tpl)({"data": data});
27 |
28 |       $("#urldebug_content").html(html);
29 |
30 |     });
31 |
32 |   };
33 |
34 |   var searchdebug = function(query, lang) {
35 |     $("#q").val(query);
36 |     $("#g").val(lang);
37 |
38 |     $.get("/api/searchdebug", {"q": query, "g": lang}, function(data) {
39 |       data = JSON.parse(data);
40 |       console.log(data);
41 |
42 |       var tpl = $("#tpl_searchdebug").html();
43 |
44 |       var html = _.template(tpl)({"data": data});
45 |
46 |       $("#searchdebug_content").html(html);
47 |
48 |     });
49 |
50 |   };
51 |
52 |   // TODO HTML5 history to avoid the submit request
53 |   // $("#urlform").on("submit", function(evt) {
54 |   //   var url = $("#url").val();
55 |   //   urldebug(url);
56 |   //   return false;
57 |   // });
58 |
59 |   var initUrl = $.urlParam("url");
60 |   if (initUrl) {
61 |     urldebug(initUrl);
62 |   }
63 |
64 |   var initSearch = $.urlParam("q");
65 |   if (initSearch) {
66 |     searchdebug($.urlParam("q"), $.urlParam("g") || "en");
67 |   }
69 |
70 | });
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-009.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | meta charset attribute
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
28 |
33 |
34 |
35 |
36 |
37 |
38 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/dailycaller.com2.txt:
--------------------------------------------------------------------------------
1 | Sports
2 |
3 | Dec 27, 2014; Louisville, KY, USA; Louisville Cardinals forward Montrezl Harrell (24) posts up against Kentucky Wildcats guard Aaron Harrison (2) during the first half at KFC Yum! Center. Mandatory Credit: Jamie Rhodes-USA TODAY Sports - RTR4JDW4
4 |
5 | 4419915
6 |
7 | The Louisville Cardinals hosted the Kentucky Wildcats on Saturday for one of the 2014 season’s most highly anticipated college basketball games.
8 |
9 | Rick Pitino’s 4th ranked Cardinals have had a great start to the season, but they were still huge underdogs against John Calipari’s unbeaten powerhouse. (RELATED: Kentucky Basketball’s Christmas Card Must Be Pretty Intimidating For Opposing Teams)
10 |
11 | Louisville trailed by 4 points at halftime, but they couldn’t close the gap thanks to plays like this.
12 |
13 | WATCH:
14 |
15 | Slow it down one time.
16 |
17 | What a joke.
18 |
19 | In games like these, I usually cheer for the underdog, but that flop had me wanting Kentucky to win by fifty. (RELATED: Forget ISIS, China And Immigrants; Flopping Is Threatening To Destroy America)
20 |
21 | Like the rest of their games this season, the Wildcats eventually pulled away from their cross-state rival. UK won by a final score of 58-50.
22 |
23 | On a side note, apparently Jennifer Lawrence is a Cardinals fan.
24 |
25 | NEW: Jennifer Lawrence at the Louisville Cardinals basketball game today (Dec. 27) pic.twitter.com/soxnfJGWwd — Jennifer Lawrence (@JenniferUpdates) December 27, 2014
26 |
27 | "May the odds be ever in Louisville's favor!" – Jennifer Lawrence http://t.co/rE6Li6QWGp pic.twitter.com/zK8QMCzPDf — Sporting News (@sportingnews) December 27, 2014
28 |
29 | I love you too, J. Law.
30 |
31 | Follow Datoc on Twitter
--------------------------------------------------------------------------------
/explainer/templates/layout.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | Explainer - Common Search
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
13 | tempor incididunt ut labore et dolore magna aliqua.
14 |
Ut enim ad minim veniam,
15 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
16 | consequat.
17 |
20 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
21 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
22 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
23 |
24 |
27 |
Foo
28 |
29 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
30 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
31 | consequat.
32 |
33 |
Duis aute irure dolor in reprehenderit in voluptate velit esse
34 | cillum dolore eu fugiat nulla pariatur.
35 |
38 | Excepteur sint occaecat cupidatat non
39 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
20 |
21 |
22 |
29 |
34 |
35 |
36 |
37 |
38 |
39 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/foxnews.com2.txt:
--------------------------------------------------------------------------------
1 | In a page taken from the most gruesome of books, a 17-year-old girl from Reynosa, Mexico lured an 8-months-pregnant woman to her home and killed her to try and steal her unborn baby.
2 |
3 | Nathaly Cartas Leon, 20, had “met” Guadalupe Salinas Hernández on Facebook and had gratefully accepted her offer to give the mom-to-be a few baby items she needed. They met at a mall and then went to Salinas’ home, where she said she had more stuff to give her.
4 |
5 | Once there, investigators said, Salinas beat Cartas to death with a blunt object and then proceeded to open her belly with a kitchen knife. The motive, police said: obsessive love. According to authorities, Salinas had an abortion back in June and had not told her boyfriend, who still believed she was pregnant and was looking forward to the birth.
6 |
7 | It is not clear whether Salinas’ abortion was intentional or not.
8 |
9 | After the gruesome killing and removal of the newborn, the teen hurried to the hospital and tried to save him, claiming the child was stillborn at home. But the fetus had stopped breathing when her mother died and could not be saved.
10 |
11 | It didn’t take much for the doctors at the hospital to realize that the baby had not come from Salinas’ womb and they contacted police.
12 |
13 | “I don’t regret it, I don’t regret,” local newspapers quote Salinas as saying. The body of the young mother was found in a shrub close to the killer’s house. Investigators are still trying to determine if she had an accomplice.
14 |
15 | Reynosa is a border city in the northern part of Tamaulipas, Mexico. It is located on the southern bank of the Rio Grande, directly across the border from Hidalgo, Texas.
16 |
17 | Like us on Facebook
18 |
19 | Follow us on Twitter & Instagram
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-007.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | meta content attribute
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
28 |
33 |
34 |
35 |
36 |
37 |
38 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/ok.co.uk2.txt:
--------------------------------------------------------------------------------
1 | EastEnders are set to show a controversial plot line in the near future.
2 |
3 | Kat Moon will discover that her late uncle Harry is a serial sex offender.
4 |
5 | Kat Slater will discover her uncle was a serial sex offender
6 |
7 | Alfie Moon's wife was groomed by the relative from a young age.
8 |
9 | As a result of rape, she gave birth to her daughter Zoe Slater at the age of 13.
10 |
11 | However, Kat pretended to be her sister until the secret was revealed years later after Zoe said she wanted to live with Harry in Spain.
12 |
13 | Police will turn up at Kat's house telling her that a number of victims have come forward with allegations against Harry, despite him passing away 10 years ago.
14 |
15 | Previous victims of Harry Slater will come forward
16 |
17 | Kat, played by Jessie Wallace, will be drawn to helping the investigation after discovering that she has been left money in his will.
18 |
19 | She then confides in her husband who believes that the heart-to-heart means a reunion is possible. Struggling to cope, Kat has a one-night stand in an attempt to take her mind off the situation.
20 |
21 | A show insider told The Sun: "EastEnders has a rich history of tackling difficult social issues and Kat's continued story is one of these."
22 |
23 | Played by the late Michael Elphick, Harry passed away off screen when he suffered a heart attack.
24 |
25 | She gave birth to daughter Zoe Slater at 13 after being raped
26 |
27 | Related article: Laurie Brett splits from husband John Milroy
28 |
29 | Related article: EastEnders spoiler - Character to be killed off on New Year's Day
30 |
31 | Related article: Could Kat and Alfie Moon's marriage go up in smoke?
32 |
33 | Related article: Lucy Beale killer discovered by Emma Summerhayes
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-037.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | UTF-8 BOM vs meta content
5 |
6 |
7 |
8 |
9 |
10 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
28 |
33 |
34 |
35 |
36 |
37 |
38 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/thedebrief.co.uk1.txt:
--------------------------------------------------------------------------------
1 | Lena deCasparis | Contributing Writer | 5 days ago
2 |
3 | Ed Sheeran Wants To Set Up Taylor Swift, But We’re Not So Sure About His Matchmaking Skills The Debrief: Ed put the cupid bow and arrow down, and stick to playing the guitar
4 |
5 | So in our eyes Ed Sheeran can do no wrong. He’s the most loveable, adorkable (that’s adorable crossed with dorky FYI) pop star EVER. And we love him, we really do. However when it comes to his matchmaking skills for our fave gal, and his BFF, Taylor Swift – well we think he’s a little bit off.
6 |
7 |
8 |
9 | This week Ed told Now magazine that he hoped he could set up Taylor with none other that Orlando Bloom. He said ‘He’s lovely, and they live in the same building. [I'm hoping that] the magic might present itself eventually.’
10 |
11 |
12 |
13 | Er, we’re not so sure.
14 |
15 | Now don’t get us wrong we’ve had a soft spot for floppy haired Mr Bloom since his Pirate days, and when he took a swing at Bieber earlier this year well we know which side of the ring we were on.
16 |
17 | But is he worthy of all that is T-Swift? The jury is out. And let’s not forget Bloom's also already dated Taylor’s BFF Selena, we predict that might make this matchmake very unlikely as Taylor is 100% 'Sisters Before Misters'.
18 |
19 | Ultimately we think Tay can do better, but then we're not sure there is any man who is worthy!
20 |
21 | Like this? You might also be interested in:
22 |
23 | Lena Dunham Has Declared Today Taylor Swift Day. Here’s Why We All Need To Get Better At Celebrating Female Success
24 |
25 | South Park Does A Lorde Skit. But Who’s It Taking The Piss Out Of?
26 |
27 | Lorde Gave Taylor Swift The First Listen To Her New Song
28 |
29 | Follow Lena on Twitter: @Lenadecasparis
30 |
31 | Pictures: Getty
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-018.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | HTTP vs meta charset
5 |
6 |
7 |
8 |
9 |
10 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
29 |
34 |
35 |
36 |
37 |
38 |
39 |
--------------------------------------------------------------------------------
/tests/cosrlibtests/signals/test_language.py:
--------------------------------------------------------------------------------
1 |
2 |
3 | def _make_document(url, text, title=""):
4 |     from cosrlib.document.html import HTMLDocument
5 |     html = "<html><head><title>%s</title></head><body>%s</body></html>" % (
6 |         title, text
7 |     )
8 |     return HTMLDocument(html, url=url).parse()
9 |
10 |
11 | def test_signal_language(ranker):
12 |     detect = lambda doc: ranker.client.get_signal_value("language", doc)
13 |
14 |     langs = detect(_make_document("http://example.com", "This is obviously english text"))
15 |     assert len(langs) == 1
16 |     assert langs["en"]
17 |
18 |     langs = detect(_make_document("http://example.com", "Ceci est du francais, biensur."))
19 |     assert len(langs) == 1
20 |     assert langs["fr"]
21 |
22 |     langs = detect(_make_document("http://example.com", "Esto es espanol, por seguro"))
23 |     assert len(langs) == 1
24 |     assert langs["es"]
25 |
26 |     langs = detect(_make_document("http://example.com", "Deutsch ist die meistverbreitete Muttersprache"))
27 |     assert len(langs) == 1
28 |     assert langs["de"]
29 |
30 |     # Test a mixed lang document
31 |     langs = detect(_make_document(
32 |         "http://example.fr",
33 |         "Die Standardsprache Standarddeutsch setzt sich aus den Standardvarietaten zusammen und einer Vielzahl von hochdeutschen und niederdeutschen Mundarten, die in einem Dialektkontinuum miteinander verbunden sind. Die Standardvarietaten entstanden aus deutschen Mundarten und Kanzleisprachen. " +
34 |         "Du fait de ses nombreux dialectes, l'allemand constitue dans une certaine mesure une langue-toit. Ceci est du francais, biensur. Nous pourrions continuer mais cela suffit."
35 |     ))
36 |     assert len(langs) == 2
37 |     assert langs["de"]
38 |     assert langs["fr"]
39 |     assert langs["de"] > langs["fr"]
40 |
--------------------------------------------------------------------------------
/tests/testdata/html_page_samples/__init__.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | SAMPLES = {
4 |     "nytimes-article-2006.html": {
5 |         "url": "http://www.nytimes.com/2006/06/16/arts/music/16prince.web.html",
6 |         "title": "Prince a Surprise Guest at Celebrate Brooklyn - New York Times",
7 |         "assert_words_missing": ["servedbyopenx", "newspaper", "technology"],
8 |         "assert_words": ["prince", "maceo"]
9 |     },
10 |     "nytimes-article-2011.html": {
11 |         "url": "http://www.nytimes.com/2011/10/06/arts/music/fred-wesley-pee-wee-ellis-and-maceo-parker-reunite.html",
12 |         "title": "Fred Wesley, Pee Wee Ellis and Maceo Parker Reunite - The New York Times",
13 |         "assert_words_missing": ["sections", "advertisement", "tweet", "browser"],
14 |         "assert_words": ["godsons", "maceo", "wesley"]
15 |     },
16 |     "wordpress-2014.html": {
17 |         "url": "https://thesunbreak.com/2014/06/23/live-show-review-maceo-parker-at-jazz-alley/",
18 |         "title": "Live Show Review: Maceo Parker at Jazz Alley | The SunBreak",
19 |         "assert_words_missing": ["weather", "primary", "notifications", "timberfest"],
20 |         "assert_words": ["sunbreak", "maceo", "believe"]
21 |     },
22 |     "wikipedia-2015.html": {
23 |         "url": "https://en.wikipedia.org/wiki/Nine_Inch_Nails",
24 |         "title": "Nine Inch Nails - Wikipedia, the free encyclopedia",
25 |         "assert_words_missing": ["random", "nederlands", "developers", "logged"],
26 |         "assert_words": ["nine", "kiss", "velvet", "rockbeat", "snapped"]
27 |     },
28 |     "nutch-nested-spider-trap.html": {
29 |         "url": "http://svn.apache.org/viewvc/nutch/trunk/src/testresources/fetch-test-site/nested_spider_trap.html?revision=1175075&view=co",
30 |         "title": "nested spider trap",
31 |         "assert_words_missing": ["b", "i"],
32 |         "assert_words": ["nutch", "fetcher", "test", "page"]
33 |     }
34 | }
35 |
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/reordering-paragraphs/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
Regarding item# 11111, under sufficiently extreme conditions, quarks may become deconfined and exist as free particles. In the course of asymptotic freedom, the strong interaction becomes weaker at higher temperatures. Eventually, color confinement would be lost and an extremely hot plasma of freely moving quarks and gluons would be formed. This theoretical phase of matter is called quark-gluon plasma.[81] The exact conditions needed to give rise to this state are unknown and have been the subject of a great deal of speculation and experimentation.
3 |
Regarding item# 22222, under sufficiently extreme conditions, quarks may become deconfined and exist as free particles. In the course of asymptotic freedom, the strong interaction becomes weaker at higher temperatures. Eventually, color confinement would be lost and an extremely hot plasma of freely moving quarks and gluons would be formed. This theoretical phase of matter is called quark-gluon plasma.[81] The exact conditions needed to give rise to this state are unknown and have been the subject of a great deal of speculation and experimentation.
4 |
Regarding item# 33333, under sufficiently extreme conditions, quarks may become deconfined and exist as free particles. In the course of asymptotic freedom, the strong interaction becomes weaker at higher temperatures. Eventually, color confinement would be lost and an extremely hot plasma of freely moving quarks and gluons would be formed. This theoretical phase of matter is called quark-gluon plasma.[81] The exact conditions needed to give rise to this state are unknown and have been the subject of a great deal of speculation and experimentation.
19 |
20 |
21 |
30 |
35 |
36 |
37 |
38 |
39 |
40 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/reuters.com4.txt:
--------------------------------------------------------------------------------
1 | 1 of 2. A general view of Gartnavel General Hospital is seen in Glasgow, Scotland December 29, 2014.
2 |
3 | LONDON (Reuters) - A healthcare worker has been diagnosed with Ebola a day after flying home to Glasgow from Sierra Leone, the Scottish government said on Monday.
4 |
5 | The patient is being treated in isolation at Glasgow's Gartnavel Hospital, having flown back to Scotland's largest city late on Sunday on a British Airways flight via Casablanca in Morocco and London's Heathrow.
6 |
7 | "All possible contacts with the patient are now being investigated and anyone deemed to be at risk will be contacted and closely monitored," the Scottish government said in a statement.
8 |
9 | "However, having been diagnosed in the very early stages of the illness, the risk to others is considered extremely low."
10 |
11 | The patient, whom BBC sources described as a female aid worker, will be transferred to a high-level isolation unit in the Royal Free hospital in London.
12 |
13 | British Prime Minister Cameron has been informed, the Scottish government added.
14 |
15 | In August, another British aid worker, William Pooley, contracted the disease after working Sierra Leone. He was discharged in September after treatment at the Royal Free hospital.
16 |
17 | With more than 9,000 cases, Sierra Leone now accounts for nearly half of the known cases of Ebola in this year's West African outbreak, the worst ever. Neighbouring Liberia and Guinea have also been badly hit.
18 |
19 | The World Health Organization on Monday said the number of people infected by Ebola in Liberia, Sierra Leone and Guinea -- the worst affected by the outbreak -- has passed 20,000, with more than 7,842 deaths in the epidemic so far.
20 |
21 | (Reporting by Andy Bruce in London and Ankur Banerjee in Bengaluru; Editing by Joyjeet Das)
--------------------------------------------------------------------------------
/plugins/grep.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | from pyspark.sql import types as SparkTypes
4 | from cosrlib.spark import sql, SparkPlugin
5 |
6 |
7 | class Words(SparkPlugin):
8 |     """ Finds documents containing a list of words in their indexable text (visible, non-boilerplate) """
9 |
10 |     def init(self):
11 |         if not self.args.get("words"):
12 |             raise Exception("grep.Words plugin needs words!")
13 |
14 |         self.words = frozenset(self.args["words"].split(" "))
15 |
16 |     def hook_spark_pipeline_init(self, sc, sqlc, schema, indexer):
17 |         schema.append(SparkTypes.StructField(
18 |             "grep_words",
19 |             SparkTypes.ArrayType(SparkTypes.StringType()),
20 |             nullable=True
21 |         ))
22 |
23 |     def hook_document_post_index(self, document, metadata):
24 |         """ Filters a document post-indexing """
25 |
26 |         doc_words = document.get_all_words()
27 |
28 |         match = doc_words.intersection(self.words)
29 |         if len(match) > 0:
30 |             print("WORD MATCH", match, document.source_url.url)
31 |             metadata["grep_words"] = list(match)
32 |
33 |     def hook_spark_pipeline_action(self, sc, sqlc, df, indexer):
34 |
35 |         lines_df = sql(sqlc, """
36 |             SELECT CONCAT(CONCAT_WS(",", SORT_ARRAY(grep_words)), " ", url) r
37 |             FROM df
38 |             WHERE size(grep_words) > 0
39 |         """, {"df": df})
40 |
41 |         self.save_dataframe(lines_df, "text")
42 |
43 |         return True
44 |
45 |
46 | # class TextRegex(SparkPlugin):
47 | # """ Finds documents matching a regex on their visible text """
48 | # pass
49 |
50 |
51 | # class HTMLRegex(SparkPlugin):
52 | # """ Finds documents matching a regex on their raw HTML """
53 | # pass
54 |
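The matching step in `hook_document_post_index` is a plain set intersection, which can be sketched without Spark. `grep_words` below is a hypothetical helper, and it assumes `document.get_all_words()` returns a set-like collection of words; the result is sorted here only to make the output deterministic, mirroring the `SORT_ARRAY` applied later in the SQL step.

```python
def grep_words(doc_words, wanted):
    """Return the sorted list of wanted words present in a document's word set."""
    match = set(doc_words).intersection(wanted)
    return sorted(match)


# Same parsing as Words.init(): a space-separated string becomes a frozenset
wanted = frozenset("maceo prince".split(" "))

hits = grep_words({"maceo", "parker", "jazz"}, wanted)
misses = grep_words({"weather", "report"}, wanted)
```

Matching is exact and case-sensitive on whole words, so any lowercasing or tokenization has to happen before the sets are compared.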
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-016.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | HTTP vs meta content
5 |
6 |
7 |
8 |
9 |
10 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
21 |
22 |
29 |
34 |
35 |
36 |
37 |
38 |
39 |
--------------------------------------------------------------------------------
/explainer/templates/search.html:
--------------------------------------------------------------------------------
1 | {% extends "layout.html" %}
2 | {% block content %}
3 |
4 |
27 |
28 |
29 |
30 |
60 |
61 | {% endblock %}
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/nydailynews.com1.txt:
--------------------------------------------------------------------------------
1 | Kevin C. Downs/for New York Daily News London Rene and 'Mob Wives' star Natalie Guercio pose in the office of their lawyer, Scott Rynecki. The couple was outside a Brooklyn club on Dec. 28 when a man with a box cutter attacked Rene.
2 |
3 | The boyfriend of sexy “Mob Wives” star Natalie Guercio is suing the Brooklyn club where he was brutally slashed by a box cutter-wielding assailant, the Daily News has learned.
4 |
5 | London Rene is alleging Club Output's "inept" and "incompetent" security was negligent in failing to provide proper safety for its patrons on Dec. 28 when he was attacked outside the Williamsburg nightspot.
6 |
7 | Rene's lawyer Scott Rynecki said there was "little or no security" watching over the club, which has a capacity of 600-persons.
8 |
9 | The suit, which will be filed Thursday in Brooklyn Supreme Court, seeks unspecified monetary damages for Rene's physical and emotional scars. The court papers also name the security guard firm hired by the club.
10 |
11 | "When I look in the mirror I'm angry," Rene, 37, said Wednesday. "I'm scarred for life; the scars are going to be there until the day I die."
12 |
13 | Kevin C. Downs/for New York Daily News Rene seeks unspecified monetary damages for his physical and emotional scars. He plans to file the lawsuit Thursday. Kevin C. Downs/for New York Daily News London Rene suffered scars from the attack. Previous Next
14 |
15 | Enlarge
16 |
17 | The alleged slasher, Rodolfo Lopez, fled the scene and later surrendered to the NYPD.
18 |
19 | Rene's face and torso were cut in the attack. Guercio, whose family owns a funeral home in Philadelphia where mobsters and their families were laid out, witnessed the slashing but she was not hurt.
20 |
21 | "Any one of us could have been a victim," she said.
22 |
23 | There was no listed phone number for the club.
24 |
25 | jmarzulli@nydailynews.com
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/al.com2.txt:
--------------------------------------------------------------------------------
1 | HUNTSVILLE, Alabama - Hockey players, roller skating enthusiasts and Future City competitors are among groups expected to bring more than 5,900 people together next month in Huntsville/Madison County.
2 |
3 | Here is a Huntsville/Madison County Convention & Visitors Bureau calendar of January events and conventions with host hotel if applicable and number of expected attendees:
4 |
5 | Jan. 1-3: UAH - Charger Hockey, UAH Chargers vs. Anchorage Alaska (January 2015), NHH, 200
6 |
7 | Jan. 14-19: National Supreme Council Ancient & Accepted Scottish Rites Masons, National Supreme Council Mid Winter Meeting 2015, Marriott-Huntsville, 225
8 |
9 | Jan. 15-18: BK Productions, 2015 Boat Show, NHH, 500
10 |
11 | Jan. 15-17: Alabama Military Collectors' Association, 2015 Winter Military Collector's Show, Hampton Inn - Arsenal/South Parkway, 700
12 |
13 | Jan. 15-19: Dogg Pound's Martin Luther King Jr Skate-A-Thon, 2015 Annual MLK Freestyle Skate Jam, Holiday Inn Research Park, 2000
14 |
15 | Jan. 15-17: UAH - Charger Hockey, UAH Chargers vs. Northern Michigan (January 2015), NHH, 200
16 |
17 | Jan. 16-19: North Alabama Hockey Association (NAHA), 2015 Freedom Tournament, Hilton Garden Inn-Space Center, 800
18 |
19 | Jan. 17-18: Alabama Regional Future City Competition, 2015 Future City Competition, NHH, 200
20 |
21 | Jan. 21-25: American Medical Association Alliance, Southern Regional Meeting, Embassy Suites Hotel & Spa, 100
22 |
23 | Jan. 23-25: Huntsville Track Club, 21st Annual Mountain Mist Trail Run (2015), Embassy Suites Hotel & Spa, 300
24 |
25 | Jan. 23-25: UAH - Charger Hockey, UAH Chargers vs. U.S. Under-18 (January 2015), NHH, 200
26 |
27 | Jan. 27-28: UAH - Student Success Center (Career Fair), 2015 UAH Spring Career Fair, Courtyard by Marriott, 500
28 |
29 | Jan. 27-30: The Aerospace Corporation, 2015 Executive Board Meeting, Embassy Suites Hotel & Spa, 15
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/base-url/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
20 |
21 |
22 |
--------------------------------------------------------------------------------
/spark/conf/log4j.properties:
--------------------------------------------------------------------------------
1 | #
2 | # Licensed to the Apache Software Foundation (ASF) under one or more
3 | # contributor license agreements. See the NOTICE file distributed with
4 | # this work for additional information regarding copyright ownership.
5 | # The ASF licenses this file to You under the Apache License, Version 2.0
6 | # (the "License"); you may not use this file except in compliance with
7 | # the License. You may obtain a copy of the License at
8 | #
9 | # http://www.apache.org/licenses/LICENSE-2.0
10 | #
11 | # Unless required by applicable law or agreed to in writing, software
12 | # distributed under the License is distributed on an "AS IS" BASIS,
13 | # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 | # See the License for the specific language governing permissions and
15 | # limitations under the License.
16 | #
17 |
18 | # Set everything to be logged to the console
19 | log4j.rootCategory=WARN, console
20 |
21 | log4j.appender.console=org.apache.log4j.ConsoleAppender
22 | log4j.appender.console.target=System.err
23 | log4j.appender.console.layout=org.apache.log4j.PatternLayout
24 | log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
25 |
26 | # Settings to quiet third party logs that are too verbose
27 | log4j.logger.org.spark-project.jetty=WARN
28 | log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
29 | log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
30 | log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
31 | log4j.logger.org.apache.parquet=ERROR
32 | log4j.logger.parquet=ERROR
33 |
34 | # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
35 | log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
36 | log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
37 |
38 | # log4j.logger.org.apache.spark.storage.BlockManager=TRACE
--------------------------------------------------------------------------------
/tests/cosrlibtests/test_indexer.py:
--------------------------------------------------------------------------------
1 | import pytest
2 |
3 |
4 | @pytest.mark.elasticsearch
5 | def test_simple_insert_and_search(indexer, searcher):
6 |
7 |     res = indexer.client.index_corpus([{
8 |         "content": """hello worldhello body""",
9 |         "url": "http://example.com"
10 |     }, {
11 |         "content": """another worlddocument body""",
12 |         "url": "http://example2.com/page2"
13 |     }, {
14 |         "content": """ngfr""",
15 |         "url": "http://nord.gouv.fr/page3"
16 |     }])
17 |
18 |     indexed = res[0]
19 |     indexed2 = res[1]
20 |     indexed3 = res[2]
21 |
22 |     indexer.client.flush()
23 |     indexer.client.refresh()
24 |
25 |     assert indexed["id"]
26 |     assert indexed["url"] == "http://example.com"
27 |     assert indexed["rank"] > 0
28 |
29 |     assert indexed2["id"] != indexed["id"]
30 |
31 |     search_results = searcher.client.search("hello")
32 |     assert len(search_results["hits"]) == 1
33 |     assert search_results["hits"][0]["id"] == indexed["id"]
34 |     assert search_results["hits"][0]["score"] > 0
35 |
36 |     search_results = searcher.client.search("world")
37 |     assert len(search_results["hits"]) == 2
38 |
39 |     search_results = searcher.client.search("world", domain="example2.com")
40 |     assert len(search_results["hits"]) == 1
41 |     assert search_results["hits"][0]["id"] == indexed2["id"]
42 |
43 |     # Make sure we index domain suffixes
44 |     # https://github.com/commonsearch/cosr-back/issues/31
45 |     search_results = searcher.client.search("fr")
46 |     assert len(search_results["hits"]) == 1
47 |     assert search_results["hits"][0]["id"] == indexed3["id"]
48 |
49 |     search_results = searcher.client.search("gouv")
50 |     assert len(search_results["hits"]) == 1
51 |     assert search_results["hits"][0]["id"] == indexed3["id"]
52 |
--------------------------------------------------------------------------------
/tests/testdata/html_w3c_encoding_testcases/the-input-byte-stream-030.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | meta charset, then meta content
5 |
6 |
7 |
8 |
9 |
10 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
Lupita Nyong'o's now-famous Oscar dress -- adorned in pearls -- was stolen right out of her hotel room ... TMZ has learned.
11 |
Law enforcement sources tell TMZ ... the dress was taken out of Lupita's room at The London West Hollywood. The dress is made of pearls ... 6,000 white Akoya pearls. It's valued at $150,000.
12 |
Our sources say Lupita told cops it was taken from her room sometime between 8 AM and 9 PM Wednesday ... while she was gone.
13 |
We're told there is security footage that cops are looking at that could catch the culprit right in the act.
14 |
12:00 PM PT -- Sheriff's deputies were at The London Thursday morning. We know they were in the manager's office and we're told they have looked at security footage to determine if they can ID the culprit.
11 | Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
12 | tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
13 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
14 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
15 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
16 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
17 |
35 | Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
36 | quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
37 | consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
38 | cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
39 | proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
40 |
41 |
42 |
43 |
44 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/arabic.txt:
--------------------------------------------------------------------------------
1 | دمشق، سوريا (CNN) -- أكدت جهات سورية معارضة أن فصائل مسلحة معارضة لنظام الرئيس بشار الأسد وعلى صلة بـ"الجيش الحر" تمكنت من السيطرة على مستودعات للأسلحة بريف دمشق تضم كميات من الصواريخ ومضادات الدروع، في الوقت الذي حض فيه الائتلاف الوطني السوري المعارض الفصائل الكردية والإسلامية المتقاتلة في شمالي البلاد إلى "ضبط النفس."
2 |
3 | وقال المرصد السوري لحقوق الإنسان، وهو هيئة معارضة مقرها لندن، إن مقاتلين من لواء الاسلام - جبهة النصرة- كتيبة التوحيد- قوات المغاوير - كتائب شهداء القلمون، وعدة كتائب أخرى، سيطروا على ثلاثة مستودعات للذخيرة بالقرب من بلدة قلدون في منطقة القلمون بريف دمشق.
4 |
5 | وبحسب المرصد فقد اغتنم مقاتلو الكتائب المقاتلة أسلحة مضادة للدروع وصواريخ أرض- أرض (غراد) وذخائر أخرى متنوعة, كما تجددت الاشتباكات بين مقاتلين من الكتائب المقاتلة من طرف والقوات النظامية ومسلحين من اللجان الشعبية التابعة لها من الطائفة الشيعية من طرف آخر في منطقة السيدة زينب.
6 |
7 | وفي محافظة الحسكة شمال شرقي البلاد، أفاد المرصد عن اشتباكات دارت بعد منتصف ليل الجمعة - السبت، في محيط بلدة تل حلف قرب مدينة رأس العين بين و"حدات حماية الشعب" الكردية، ومقاتلي ما يعرف بـ"الدولة الإسلامية في العراق والشام" وجبهة النصرة وبعض الكتائب المقاتلة من طرف آخر.
8 |
9 | ولم ترد تقارير حول الخسائر البشرية، في في حين دارت اشتباكات عنيفة بين الطرفين في وقت متأخر من ليل الجمعة، في قرية التويمية، الواقعة بين منطقة أصفر ونجار وقرية مشرافة في جنوب مدينة راس العين، إثر محاولة مقاتلي الجبهة و"الدولة الإسلامية" التقدم باتجاه المدينة.
10 |
11 | أما الائتلاف الوطني السوري المعارض، فقد دعا في بيان له كافة الكتائب والفصائل المقاتلة في الشمال السوري إلى "ضرورة الوعي بأهمية المرحلة الراهنة، وبضبط النفس والتحلي بالحكمة لضمان سلامة المدنيين وإخلاء سبيل أي أشخاص موقوفين أو معتقلين."
12 |
13 | وشدد الائتلاف على "ضرورة الابتعاد عن الأعمال الاستفزازية بكافة أشكالها، ويحذر كل من يستغل المرحلة الراهنة لتطبيق أجندات سياسية، وترك القرار للشعب السوري الحر ليختار مصيره بملء إرادته" في بيان يأتي بالترافق مع الحديث عن كون تلك المواجهات مقدمة لولادة حكومة تدير المناطق التي يقطنها الأكراد في سوريا بشكل مستقل.
--------------------------------------------------------------------------------
/urlserver/storage.py:
--------------------------------------------------------------------------------
1 | from __future__ import absolute_import, division, print_function, unicode_literals
2 |
3 | import os
4 |
5 | import rocksdb
6 | from cosrlib.config import config
7 |
8 |
9 | class Storage(object):
10 |     """ Key-value storage class using a RocksDB database on disk as a source. """
11 |
12 |     _rocksdb_options_readonly = {
13 |         "db_log_dir": "/dev/null",  # TODO: Is this the right way of disabling logs?
14 |         "max_open_files": 100,
15 |         "max_background_compactions": 0,
16 |         "max_background_flushes": 0,
17 |         "keep_log_file_num": 0,
18 |         "disable_auto_compactions": True,
19 |         "advise_random_on_open": True
20 |     }
21 |
22 |     _db_dir = os.path.join(config["PATH_LOCALDATA"], "urlserver-rocksdb")
23 |
24 |     def __init__(self, read_only=True):
25 |         self.read_only = read_only
26 |         self.db = None
27 |
28 |         if self.read_only:
29 |             if os.path.isdir(self._db_dir):
30 |                 self.db = rocksdb.DB(
31 |                     self._db_dir,
32 |                     rocksdb.Options(**self._rocksdb_options_readonly),
33 |                     read_only=True
34 |                 )
35 |             else:
36 |                 print("WARNING: RocksDB data not found (%s). Run make import_local_data" % self._db_dir)
37 |         else:
38 |             self.db = rocksdb.DB(self._db_dir, rocksdb.Options(create_if_missing=True), read_only=False)
39 |
40 |     def get(self, key):
41 |         """ Returns the value of a key """
42 |         return self.db.get(key)
43 |
44 |     def close(self):
45 |         """ Closes the connection to the DB """
46 |         if self.db is not None:
47 |             self.db = None  # Drop the reference but keep the attribute, so close() can be called twice
48 |
49 |     def write_batch(self, batch):
50 |         """ Write a batch of data to the DB and return a new empty one """
51 |         if batch is not None:
52 |             self.db.write(batch, sync=True)
53 |             self.db.compact_range()
54 |         return rocksdb.WriteBatch()
55 |
--------------------------------------------------------------------------------
/tests/testdata/html_mozilla_readability_testcases/embedded-videos/expected.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
4 |
Videos
5 |
At root
6 |
7 |
8 |
9 |
In a paragraph
10 |
11 |
12 |
13 |
In a div
14 |
15 |
16 |
17 |
Foo
18 |
Tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
19 |
20 |
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/theatlanticcities.com1.txt:
--------------------------------------------------------------------------------
1 | Antarctica remained largely untouched until roughly 200 years ago, and now, more than 10,000 people travel there every year. But tourists bring more than cameras. Scientists are warning that pathogens brought by visitors could threaten the continent’s most iconic inhabitant: the penguin.
2 |
3 | Isolation has left local wildlife populations particularly vulnerable to diseases commonplace elsewhere in the world. “The effects of both a growing tourism industry and research presence will not be without consequences,” Wray Grimaldi of the University of Otago in Dunedin, New Zealand, said to New Scientist. “Penguins are highly susceptible to infectious diseases.”
4 |
5 | One group of researchers found 20 different fecal pathogens on just 15 pairs of tourist boots.
6 |
7 | Her team of Antarctic researchers found multiple infectious agents—bacteria such as salmonella and E. coli, viruses such as West Nile and the Avian pox virus—in captive penguins dating back to 1947. Outbreaks from those diseases have killed thousands of penguins over the years, the team reported in a paper published this month in the journal Polar Biology.
8 |
9 | Another theory is that migrating animals may have brought diseases to Antarctica, as the warming climate is attracting more species than ever before. But previous studies have identified tourist boots as vectors for disease transmission. One group of researchers tested 72 tourists' boots and found 20 different fecal pathogens on just 15 pairs of shoes.
10 |
11 | Norman Ratcliffe, an Antarctic ecologist from the Antarctic Survey in Cambridge, United Kingdom, told New Scientist that the evidence blaming tourists for sick penguins is lacking. He said that tourism companies are very strict on what they let visitors bring on their journey. “The tour companies are quite careful to make sure everyone cleans their boots before they go ashore,” he said. “They don't allow any animal products to be taken ashore.”
12 |
13 | This story originally appeared on The Atlantic.
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/technologyreview.com2.txt:
--------------------------------------------------------------------------------
1 | A street vendor in India uses a light powered by refurbished battery cells.
2 |
3 | Many of the estimated 50 million lithium-ion laptop batteries discarded every year could provide electricity storage sufficient to light homes in poor countries, researchers at IBM say.
4 |
5 | In work being aired this week at a conference in San Jose, researchers at IBM Research India in Bangalore found that at least 70 percent of all discarded batteries have enough life left to power an LED light at least four hours a day for a year.
6 |
7 | While it’s possible to combine LED lights with solar panels and rechargeable batteries (see “Innovators Under 35: Evans Wadongo”), using discarded batteries could make the approach far cheaper.
8 |
9 | “The most costly component in these systems is often the battery,” says Vikas Chandan, a research scientist at the lab’s Smarter Energy Group, who led the project. “In this case, the most expensive part of your storage solution is coming from trash.”
10 |
11 | The IBM group, working with a hardware R&D firm called RadioStudio, tore open discarded laptop battery packaging and extracted individual storage units called cells, tested those individually to pick out the good ones, and recombined them to form refurbished battery packs. Then, after adding charging dongles as well as circuitry to prevent overheating, they gave them to five users in Bangalore who lived in slums or operated sidewalk carts.
12 |
13 | Three months later, the users said the battery packs had worked well; the main request was for rat-resistant wires and brighter bulbs, says Mohit Jain, a research engineer with the group. A revised setup is now being tested.
14 |
15 | Around 50 million laptop and desktop computers are discarded in the United States every year, according to the Environmental Protection Agency. Meanwhile, in India alone, about 400 million people lack grid-connected electricity.
16 |
17 | IBM is not considering this as a business but says the technology could be offered free to poor countries.
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/al.com1.txt:
--------------------------------------------------------------------------------
1 | NEW ORLEANS - Both college football analysts on ESPN, Lou Holtz and Mark May are paid to discuss their unbiased, wide-ranging opinions about the sport on the air.
2 |
3 | They disagree a lot, especially when it comes to Ohio State.
4 |
5 | Holtz, a Notre Dame coaching legend, typically has been pro-Ohio State this year while May, a former player at Pittsburgh, has been down on the Buckeyes and the Big Ten.
6 |
7 | As No. 4 Ohio State prepares to take on No. 1 Alabama in the Sugar Bowl - a semifinal game in the inaugural College Football Playoff - Holtz didn't mind boasting a little about being right about the Buckeyes this year.
8 |
9 | "Mark May is a great guy - We have no teleprompter, no script, no rehearsal, but we have a difference opinion," Holtz said. "I love him, but he was a player, I was a coach. He made suggestions, I made decisions. He showered after work, I showered before work. I signed the the paycheck on the front, he signed the back.
10 |
11 | "We just have a different way of looking at things."
12 |
13 | During a segment called "Final Verdict" on ESPN's College Football Final, Holtz bantered with May about whether or not the Big Ten would have a team in the playoff and whether Ohio State had a shot of cracking the top four.
14 |
15 | In those segments, co-host Rece Davis, dressed like a judge, rules either in favor of Holtz or May. Both times, May, who said Ohio State and the Big Ten were out of the College Football Playoff hunt, got the ruling.
16 |
17 | Holtz hasn't forgotten.
18 |
19 | "I lost two 'Final Verdicts' and doggone it both of them turned out that Rece was wrong," Holtz said. "No. 1 the Big Ten would have somebody in (the playoff) and Ohio State had a chance. Both times he ruled against me."
20 |
21 | As for why May tends to have an anti-Big Ten opinion - something many Ohio State fans feel is a trend - Holtz decided to sidestep that question.
22 |
23 | "You would have to ask Mark May," Holtz said. "One thing I learned, I don't speak for Mark May. I have a hard time speaking for Lou Holtz."
--------------------------------------------------------------------------------
/tests/testdata/html_newspaper_testcases/text/247wallst.com1.txt:
--------------------------------------------------------------------------------
1 | December 28, 2014: Markets opened lower on Monday as neither the bulls nor the bears can gin up a lot of support. Volumes are low and corporate news is nearly non-existent. Crude oil started the day higher, but has drifted to a loss of about 2% as the day wore on. Shortly before the closing bell the DJIA traded down 0.08% for the day, the S&P 500 traded up 0.09%, and the Nasdaq Composite traded flat. International Business Machines Corp. (NYSE: IBM) dropped more than 1% today and its high per share price pushed the DJIA lower just as the market closed.
2 |
3 | The DJIA stock posting the largest daily gain ahead of the close Monday was The Home Depot Inc. (NYSE: HD) which traded up 0.87% at $104.64. The stock’s 52-week range is $73.96 to $104.80, and the high was posted early Monday. Trading volume was less than half the daily average of around 5.8 million shares. Reports of solid sales at home improvement stores during the holidays have helped push this stock higher today.
4 |
5 | JPMorgan Chase & Co. (NYSE: JPM) traded up 0.77% at $63.03. The stock’s 52-week range is $52.97 to $63.34, and the high was set today. Trading volume was about 50% below the daily average of around 26 million shares. The big bank is one of the underwriters for the IPO of Shake Shack which was announced today.
6 |
7 | The Boeing Co. (NYSE: BA) traded higher by 0.71% at $132.57. The stock’s 52-week range is $116.32 to $144.57. Volume was less than half the daily average of around 4.4 million shares. The company held its first test flight on Sunday of a new Air Force Tanker.
8 |
9 | Cisco Systems Inc. (NASDAQ: CSCO) traded up 0.48% at $28.49. The stock’s 52-week range is $21.27 to $28.59. Trading volume about 60% below the daily average of around 29 million shares. The stock missed matching its 52-week high by just a couple of pennies and was touted as a Rocket Stock at thestreet.com today.
10 |
11 | Of the Dow 30 stocks 13 are set to close higher today and 17 are on track to close lower.
12 |
13 | ALSO READ: 5 Top Tech Stocks for 2015 With Potential Big Catalysts
--------------------------------------------------------------------------------