├── .gitignore
├── LICENSE
├── README.md
├── mercury.py
├── reader.py
└── requirements.txt
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Zachary Yocum
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # reader
2 | Extract clean(er), readable text from web pages via [Mercury Web Parser](https://github.com/postlight/mercury-parser).
3 |
4 | ## A note on the Mercury Web Parser
5 | The creators of the Mercury Web Parser initially offered it as a free service via a ReSTful API, but have since open sourced it. The API was shut down April 15, 2019. To continue using the parser, install its command-line driver using [`yarn`](https://github.com/yarnpkg/yarn) or [`npm`](https://github.com/npm/cli) package managers:
6 |
7 | ```
8 | # Install Mercury globally
9 | yarn global add @postlight/mercury-parser
10 | # or
11 | npm -g install @postlight/mercury-parser
12 | ```
13 |
14 | ## Install
15 |
16 | Clone this repository, create a virtual environment, and install the Python requirements:
17 |
18 | ```
19 | $ python3 -m venv .
20 | ...
21 | $ source bin/activate
22 | (reader) $ pip install -r requirements.txt
23 | ...
24 | ```
25 |
26 | ## Usage
27 |
28 | ```
29 | (reader) $ ./reader.py -h
30 | usage: reader.py [-h] [-f {json,md,txt}] [-w BODY_WIDTH] filename
31 |
32 | Get a cleaner version of a web page for reading purposes. This script reads
33 | JSON input from the Mercury Web Parser (https://github.com/postlight/mercury-
34 | parser) and performs conversion of HTML to markdown and plain-text via
35 | html2text.
36 |
37 | positional arguments:
38 | filename load Mercury Web Parser JSON result from file (use "-"
39 | to read from stdin)
40 |
41 | optional arguments:
42 | -h, --help show this help message and exit
43 | -f {json,md,txt}, --format {json,md,txt}
44 | output format (default: json)
45 | -w BODY_WIDTH, --body-width BODY_WIDTH
46 | character offset at which to wrap lines for plain-text
47 | (default: None)
48 | ```
49 |
50 | Alternatively, there is a `mercury.py` script that acts just like `reader.py`, except it wraps the `mercury-parser` command line on your behalf, so instead of loading the JSON from stdin or a file, it runs the Node.js javascript internally, so all it requires is a URL:
51 |
52 | ```
53 | (reader) $ ./mercury.py -h
54 | usage: mercury.py [-h] [-f {json,md,txt}] [-w BODY_WIDTH] [-p MERCURY_PATH]
55 | url
56 |
57 | Python wrapper of the Mercury Parser command line This requires you've
58 | installed Node.js (https://nodejs.org/en/) and the mercury-parser
59 | (https://github.com/postlight/mercury-parser): # Install Mercury globally $
60 | yarn global add @postlight/mercury-parser # or $ npm -g install
61 | @postlight/mercury-parser
62 |
63 | positional arguments:
64 | url URL to parse
65 |
66 | optional arguments:
67 | -h, --help show this help message and exit
68 | -f {json,md,txt}, --format {json,md,txt}
69 | output format (default: json)
70 | -w BODY_WIDTH, --body-width BODY_WIDTH
71 | character offset at which to wrap lines for plain-text
72 | (default: None)
73 | -p MERCURY_PATH, --mercury-path MERCURY_PATH
74 | path to mercury-parser command line driver (default:
75 | /usr/local/bin/mercury-parser)
76 | ```
77 |
78 | If you installed `mercury-parser` somewhere other than the default path, just supply the path with the `-p/--mercury-path` option.
79 |
80 | ## Examples
81 |
82 | ### Mercury Web Parser JSON
83 |
84 | The Mercury Web Parser's raw JSON results are useful on their own:
85 |
86 | ```
87 | (reader) $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source | jq .
88 | {
89 | "title": "Mercury Goes Open Source! — Postlight — Digital product studio",
90 | "author": "Adam Pash",
91 | "date_published": "2019-02-06T14:36:45.000Z",
92 | "dek": null,
93 | "lead_image_url": "https://postlight.com/wp-content/uploads/2019/02/mercury-open-source-social-card-e1550670446269.png",
94 | "content": "
It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser.
\n
Written in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, Mercury AMP Converter, Mercury Reader, and even more third-party software and services.
\n
Mercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.
\n
Get Mercury Parser for use in your projects on GitHub:
\n
📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.
\n
Try Mercury Parser
\n
Wanna see Mercury Parser in action in your own command line? First install it:
\n
$ yarn global add @postlight/mercury-parser
\n
Then parse an article and check out the results:
\n
$ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
\n
Now, as an open-source project — and with your help — we hope to make the Mercury Parser even better. Say, for example, Mercury’s done a less-than-perfect job parsing an article from your favorite web site. You can write and submit a custom site parser guaranteed to get it right quickly, every time. We’re excited about all sorts of ways the Mercury community will contribute to this project.
\n
What about the API?
\n
Over time, we will deprecate the Mercury Parser API. We’ll do it slowly, with lots of warning and advance email notifications, and drop-in replacement code. We’ve committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together—not behind a private, hosted API.
\n
Indeed, one of the main drivers for this choice was API users asking us to open source Mercury—and asking how they could help improve it.
\n
Today we’ve done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you’d like to chat about the Mercury Parser or need some help getting started, join the community in the Mercury Gitter channel.
\n
Adam Pash is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [email protected].
",
95 | "next_page_url": null,
96 | "url": "https://postlight.com/trackchanges/mercury-goes-open-source",
97 | "domain": "postlight.com",
98 | "excerpt": "It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser. Written in JavaScript and running on both Node and in the ...",
99 | "word_count": 436,
100 | "direction": "ltr",
101 | "total_pages": 1,
102 | "rendered_pages": 1
103 | }
104 | ```
105 |
106 | ### Full JSON
107 |
108 | `reader.py` augments the Mercury Web Parser's results with addition Markdown (`.content.mardkwon`) and plain-text (`.content.text`) conversions of the original HTML content:
109 |
110 | ```
111 | (reader) $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source | ./reader.py - | jq .
112 | {
113 | "title": "Mercury Goes Open Source! — Postlight — Digital product studio",
114 | "author": "Adam Pash",
115 | "date_published": "2019-02-06T14:36:45.000Z",
116 | "dek": null,
117 | "lead_image_url": "https://postlight.com/wp-content/uploads/2019/02/mercury-open-source-social-card-e1550670446269.png",
118 | "content": {
119 | "html": " It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser.
\n
Written in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, Mercury AMP Converter, Mercury Reader, and even more third-party software and services.
\n
Mercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.
\n
Get Mercury Parser for use in your projects on GitHub:
\n
📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.
\n
Try Mercury Parser
\n
Wanna see Mercury Parser in action in your own command line? First install it:
\n
$ yarn global add @postlight/mercury-parser
\n
Then parse an article and check out the results:
\n
$ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
\n
Now, as an open-source project — and with your help — we hope to make the Mercury Parser even better. Say, for example, Mercury’s done a less-than-perfect job parsing an article from your favorite web site. You can write and submit a custom site parser guaranteed to get it right quickly, every time. We’re excited about all sorts of ways the Mercury community will contribute to this project.
\n
What about the API?
\n
Over time, we will deprecate the Mercury Parser API. We’ll do it slowly, with lots of warning and advance email notifications, and drop-in replacement code. We’ve committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together—not behind a private, hosted API.
\n
Indeed, one of the main drivers for this choice was API users asking us to open source Mercury—and asking how they could help improve it.
\n
Today we’ve done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you’d like to chat about the Mercury Parser or need some help getting started, join the community in the Mercury Gitter channel.
\n
Adam Pash is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [email protected].
",
120 | "markdown": "It's my pleasure to announce that today, Postlight is open-sourcing the [Mercury Web Parser](https://mercury.postlight.com/web-parser/).\n\nWritten in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, [Mercury AMP Converter](https://mercury.postlight.com/amp-converter/), [Mercury Reader](https://mercury.postlight.com/reader/), and [even more third-party software and services.](https://postlight.com/trackchanges/the-secret-engines-of-the-internet)\n\nMercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.\n\nGet [Mercury Parser](https://github.com/postlight/mercury-parser) for use in your projects on GitHub:\n\n> 📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.\n\n### Try Mercury Parser\n\nWanna see Mercury Parser in action in your own command line? First install it:\n \n \n $ yarn global add @postlight/mercury-parser\n\nThen parse an article and check out the results:\n \n \n $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source\n\nNow, as an open-source project -- and with your help -- we hope to make the Mercury Parser even better. Say, for example, Mercury's done a less-than-perfect job parsing an article from your favorite web site. You can [write and submit a custom site parser](https://github.com/postlight/mercury-parser/blob/master/src/extractors/custom/README.md) guaranteed to get it right quickly, every time. We're excited about [all sorts of ways](https://github.com/postlight/mercury-parser/blob/master/CONTRIBUTING.md) the Mercury community will contribute to this project.\n\n### What about the API?\n\nOver time, we will deprecate the Mercury Parser API. We'll do it slowly, with lots of warning and advance email notifications, and [drop-in replacement code](https://github.com/postlight/mercury-parser-api). We've committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together--not behind a private, hosted API.\n\nIndeed, one of the main drivers for this choice was API users asking us to open source Mercury--and asking how they could help improve it.\n\nToday we've done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you'd like to chat about the Mercury Parser or need some help getting started, join the community in the [Mercury Gitter channel](https://gitter.im/postlight/mercury).\n\n_[Adam Pash](https://postlight.com/trackchanges/authors/adam-pash) is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [ [email protected]](https://postlight.com/cdn-cgi/l/email-protection#1a727f7676755a6a75696e76737d726e34797577)._\n",
121 | "text": "It's my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser.\n\nWritten in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, Mercury AMP Converter, Mercury Reader, and even more third-party software and services.\n\nMercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.\n\nGet Mercury Parser for use in your projects on GitHub:\n\n> 📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.\n\n### Try Mercury Parser\n\nWanna see Mercury Parser in action in your own command line? First install it:\n \n \n $ yarn global add @postlight/mercury-parser\n\nThen parse an article and check out the results:\n \n \n $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source\n\nNow, as an open-source project -- and with your help -- we hope to make the Mercury Parser even better. Say, for example, Mercury's done a less-than-perfect job parsing an article from your favorite web site. You can write and submit a custom site parser guaranteed to get it right quickly, every time. We're excited about all sorts of ways the Mercury community will contribute to this project.\n\n### What about the API?\n\nOver time, we will deprecate the Mercury Parser API. We'll do it slowly, with lots of warning and advance email notifications, and drop-in replacement code. We've committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together--not behind a private, hosted API.\n\nIndeed, one of the main drivers for this choice was API users asking us to open source Mercury--and asking how they could help improve it.\n\nToday we've done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you'd like to chat about the Mercury Parser or need some help getting started, join the community in the Mercury Gitter channel.\n\nAdam Pash is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [email protected].\n"
122 | },
123 | "next_page_url": null,
124 | "url": "https://postlight.com/trackchanges/mercury-goes-open-source",
125 | "domain": "postlight.com",
126 | "excerpt": "It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser. Written in JavaScript and running on both Node and in the ...",
127 | "word_count": 436,
128 | "direction": "ltr",
129 | "total_pages": 1,
130 | "rendered_pages": 1
131 | }
132 | ```
133 |
134 | ### HTML
135 | The original extracted HTML content from the Mercury Web Parser is accessible from `.content.html`:
136 |
137 | ```
138 | (reader) $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source | ./reader.py - | jq -r .content.html
139 | It’s my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser.
140 |
Written in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, Mercury AMP Converter, Mercury Reader, and even more third-party software and services.
141 |
Mercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.
142 |
Get Mercury Parser for use in your projects on GitHub:
143 |
📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.
144 |
Try Mercury Parser
145 |
Wanna see Mercury Parser in action in your own command line? First install it:
146 |
$ yarn global add @postlight/mercury-parser
147 |
Then parse an article and check out the results:
148 |
$ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
149 |
Now, as an open-source project — and with your help — we hope to make the Mercury Parser even better. Say, for example, Mercury’s done a less-than-perfect job parsing an article from your favorite web site. You can write and submit a custom site parser guaranteed to get it right quickly, every time. We’re excited about all sorts of ways the Mercury community will contribute to this project.
150 |
What about the API?
151 |
Over time, we will deprecate the Mercury Parser API. We’ll do it slowly, with lots of warning and advance email notifications, and drop-in replacement code. We’ve committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together—not behind a private, hosted API.
152 |
Indeed, one of the main drivers for this choice was API users asking us to open source Mercury—and asking how they could help improve it.
153 |
Today we’ve done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you’d like to chat about the Mercury Parser or need some help getting started, join the community in the Mercury Gitter channel.
154 |
Adam Pash is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [email protected].
155 | ```
156 |
157 | ### Markdown
158 | A Markdown conversion from the HTML is added in `.content.markdown` which can be extracted just like the HTML via `jq` in the previous example. However, as a convenience `reader.py` can output the document as Markdown (as opposed to JSON) including some of the human-relevant metadata using the `-f/--format` option:
159 |
160 | ```
161 | (reader) $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source | ./reader.py - --format=md
162 |
163 | date: 2019-02-06 14:36:45
164 | author(s): Adam Pash
165 |
166 | # [Mercury Goes Open Source! — Postlight — Digital product studio](https://postlight.com/trackchanges/mercury-goes-open-source)
167 |
168 | It's my pleasure to announce that today, Postlight is open-sourcing the [Mercury Web Parser](https://mercury.postlight.com/web-parser/).
169 |
170 | Written in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, [Mercury AMP Converter](https://mercury.postlight.com/amp-converter/), [Mercury Reader](https://mercury.postlight.com/reader/), and [even more third-party software and services.](https://postlight.com/trackchanges/the-secret-engines-of-the-internet)
171 |
172 | Mercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.
173 |
174 | Get [Mercury Parser](https://github.com/postlight/mercury-parser) for use in your projects on GitHub:
175 |
176 | > 📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.
177 |
178 | ### Try Mercury Parser
179 |
180 | Wanna see Mercury Parser in action in your own command line? First install it:
181 |
182 |
183 | $ yarn global add @postlight/mercury-parser
184 |
185 | Then parse an article and check out the results:
186 |
187 |
188 | $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
189 |
190 | Now, as an open-source project -- and with your help -- we hope to make the Mercury Parser even better. Say, for example, Mercury's done a less-than-perfect job parsing an article from your favorite web site. You can [write and submit a custom site parser](https://github.com/postlight/mercury-parser/blob/master/src/extractors/custom/README.md) guaranteed to get it right quickly, every time. We're excited about [all sorts of ways](https://github.com/postlight/mercury-parser/blob/master/CONTRIBUTING.md) the Mercury community will contribute to this project.
191 |
192 | ### What about the API?
193 |
194 | Over time, we will deprecate the Mercury Parser API. We'll do it slowly, with lots of warning and advance email notifications, and [drop-in replacement code](https://github.com/postlight/mercury-parser-api). We've committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together--not behind a private, hosted API.
195 |
196 | Indeed, one of the main drivers for this choice was API users asking us to open source Mercury--and asking how they could help improve it.
197 |
198 | Today we've done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you'd like to chat about the Mercury Parser or need some help getting started, join the community in the [Mercury Gitter channel](https://gitter.im/postlight/mercury).
199 |
200 | _[Adam Pash](https://postlight.com/trackchanges/authors/adam-pash) is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [ [email protected]](https://postlight.com/cdn-cgi/l/email-protection#86eee3eaeae9c6f6e9f5f2eaefe1eef2a8e5e9eb)._
201 |
202 | ```
203 | ### Plain-text
204 | Similarly to the previous example, `reader.py` can also format the whole document, along with a subset of the metadata, as plain-text:
205 |
206 | ```
207 | (reader) $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source | ./reader.py - --format=txt
208 |
209 | url: https://postlight.com/trackchanges/mercury-goes-open-source
210 | date: 2019-02-06 14:36:45
211 | author(s): Adam Pash
212 |
213 | Mercury Goes Open Source! — Postlight — Digital product studio
214 |
215 | It's my pleasure to announce that today, Postlight is open-sourcing the Mercury Web Parser.
216 |
217 | Written in JavaScript and running on both Node and in the browser, Mercury Parser is the engine that powers the Mercury Parser API, Mercury AMP Converter, Mercury Reader, and even more third-party software and services.
218 |
219 | Mercury Parser allows for better reading experiences, easier content migration, and endless opportunities for remixing the web, by making semantic sense out of any article. Mercury Parser sees web pages the same way you do: It sees titles, content, authors, and lead images, and makes all of that extracted data easily available to your software, which, unfortunately, sees only a sea of HTML markup, where page navigation, advertising, and the like are indistinguishable from content.
220 |
221 | Get Mercury Parser for use in your projects on GitHub:
222 |
223 | > 📜 Extracting content from the chaos of the web. Contribute to postlight/mercury-parser development by creating an account on GitHub.
224 |
225 | ### Try Mercury Parser
226 |
227 | Wanna see Mercury Parser in action in your own command line? First install it:
228 |
229 |
230 | $ yarn global add @postlight/mercury-parser
231 |
232 | Then parse an article and check out the results:
233 |
234 |
235 | $ mercury-parser https://postlight.com/trackchanges/mercury-goes-open-source
236 |
237 | Now, as an open-source project -- and with your help -- we hope to make the Mercury Parser even better. Say, for example, Mercury's done a less-than-perfect job parsing an article from your favorite web site. You can write and submit a custom site parser guaranteed to get it right quickly, every time. We're excited about all sorts of ways the Mercury community will contribute to this project.
238 |
239 | ### What about the API?
240 |
241 | Over time, we will deprecate the Mercury Parser API. We'll do it slowly, with lots of warning and advance email notifications, and drop-in replacement code. We've committed to creating an easy path for people who want to use Mercury in any way they see fit, using open source, well-documented code that can be easily rolled into any other service or API. We want to put our energy there, making a more tractable web together--not behind a private, hosted API.
242 |
243 | Indeed, one of the main drivers for this choice was API users asking us to open source Mercury--and asking how they could help improve it.
244 |
245 | Today we've done exactly that. You can use Mercury Parser directly in any JavaScript project, whether on Node or in your browser, starting today, with no API required. If you'd like to chat about the Mercury Parser or need some help getting started, join the community in the Mercury Gitter channel.
246 |
247 | Adam Pash is a Director of Engineering at Postlight. Want help making sense of big messy data? Get in touch: [email protected].
248 |
249 | ```
250 |
251 | ### Read Web Content in Your Terminal
252 | One use case for this script is to convert content from the web to a format that is suitable for reading in your terminal. Here's a short shell pipeline to extract the content and feed the converted plain-text to your `$PAGER` of choice for easy reading:
253 |
254 | ```
255 | #!/bin/bash
256 | url=$1
257 | reader=path/to/reader.py
258 | mercury-parser "$url" | "$reader" - -w 80 -f txt | "$PAGER"
259 | ```
260 |
--------------------------------------------------------------------------------
/mercury.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | """Python wrapper of the Mercury Parser command line
4 |
5 | This requires you've installed Node.js
6 | (https://nodejs.org/en/)
7 | and the mercury-parser
8 | (https://github.com/postlight/mercury-parser):
9 |
10 | # Install Mercury globally
11 | $ yarn global add @postlight/mercury-parser
12 | # or
13 | $ npm -g install @postlight/mercury-parser
14 |
15 | """
16 |
17 | import json
18 | import sys
19 |
20 | from reader import HTML2Text, Format, unescape, main
21 |
22 | from Naked.toolshed.shell import muterun_js
23 |
24 | def mercury(url, mercury_cli_path):
25 | """Wrap the Mercury Parser command line driver
26 |
27 | url: URL string to parse
28 | mercur_cli_path: path to mercury-parser command line driver
29 | """
30 | response = muterun_js(
31 | mercury_cli_path,
32 | url
33 | )
34 | if response.exitcode != 0:
35 | print('[ERROR] URL: {}'.format(url), file=sys.stderr)
36 | print('[ERROR]', response.stderr.decode('utf-8'), file=sys.stderr)
37 | sys.exit(response.exitcode)
38 | else:
39 | result = json.loads(response.stdout.decode('utf-8'))
40 | if 'error' in result:
41 | print('[ERROR] URL: {}'.format(url), file=sys.stderr)
42 | print('[ERROR]', result['messages'], file=sys.stderr)
43 | sys.exit(1)
44 | return result
45 |
46 | if __name__ == '__main__':
47 | import argparse
48 | parser = argparse.ArgumentParser(
49 | formatter_class=argparse.ArgumentDefaultsHelpFormatter,
50 | description=__doc__
51 | )
52 | parser.add_argument(
53 | 'url',
54 | help='URL to parse',
55 | )
56 | parser.add_argument(
57 | '-f', '--format',
58 | choices=list(Format.formatter),
59 | default='json',
60 | help='output format'
61 | )
62 | parser.add_argument(
63 | '-w', '--body-width',
64 | type=int,
65 | default=None,
66 | help='character offset at which to wrap lines for plain-text'
67 | )
68 | parser.add_argument(
69 | '-p', '--mercury-path',
70 | default='/opt/homebrew/bin/mercury-parser',
71 | help='path to mercury-parser command line driver'
72 | )
73 | args = parser.parse_args()
74 | obj = main(
75 | mercury(args.url, args.mercury_path),
76 | args.body_width
77 | )
78 | print(Format.formatter[args.format](obj))
79 |
--------------------------------------------------------------------------------
/reader.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 |
3 | """Get a cleaner version of a web page for reading purposes.
4 |
5 | This script reads JSON input from the Mercury Web Parser
6 | (https://github.com/postlight/mercury-parser) and performs conversion of HTML
7 | to markdown and plain-text via html2text.
8 | """
9 |
10 | import sys
11 | import json
12 | import textwrap
13 |
14 | from datetime import datetime
15 | from html import unescape
16 | from html2text import HTML2Text
17 |
18 | class Format():
19 | """This is a decorator class for registering document format methods.
20 |
21 | You can register additional document formatter functions by decorating
22 | them with @Format.
23 |
24 | A formatter should be a function that takes as input a response object
25 | from the Mercury API. It's output can be any string derived from that
26 | input.
27 |
28 | By convention formatters should have a '_format' suffix in their function
29 | name. By this convention, if you have a formatter named 'json_format',
30 | then you can call this with Format.formatter['json']().
31 | """
32 | formatter = {}
33 | def __init__(self, f):
34 | key, _ = f.__name__.rsplit('_', 1)
35 | self.formatter.update({key: f})
36 | self.format = f
37 |
38 | def __call__(self):
39 | self.format()
40 |
41 | def format_date(obj):
42 | date = obj.get('date_published')
43 | if date is not None:
44 | obj['date_published'] = datetime.strptime(
45 | obj['date_published'],
46 | "%Y-%m-%dT%H:%M:%S.%fZ"
47 | )
48 |
49 | @Format
50 | def json_format(obj):
51 | """Formatter that formats as JSON"""
52 | return json.dumps(obj, ensure_ascii=False)
53 |
54 | @Format
55 | def md_format(obj):
56 | """Formatter that formats as markdown"""
57 | format_date(obj)
58 | content = '''
59 | date: {date_published}
60 | author(s): {author}
61 |
62 | # [{title}]({url})
63 | '''
64 | return '\n'.join((
65 | textwrap.dedent(content.format(**obj)),
66 | obj['content'].get('markdown', '')
67 | ))
68 |
69 | @Format
70 | def txt_format(obj):
71 | """Formatter that formats as plain-text"""
72 | format_date(obj)
73 | content = '''
74 | url: {url}
75 | date: {date_published}
76 | author(s): {author}
77 |
78 | {title}
79 | '''
80 | return '\n'.join((
81 | textwrap.dedent(content.format(**obj)),
82 | obj['content'].get('text', '')
83 | ))
84 |
85 | def load(filename):
86 | """Load Mercury Web Parser JSON results from file as a Python dict"""
87 | try:
88 | if filename in {"-", None}:
89 | return json.loads(sys.stdin.read())
90 | with open(filename, mode='r') as f:
91 | return json.load(f)
92 | except json.JSONDecodeError:
93 | print(f'failed to load JSON from file: {filename}', file=sys.stderr)
94 | sys.exit(1)
95 |
96 | def main(result, body_width):
97 | """Convert Mercury parse result dict to Markdown and plain-text
98 |
99 | result: a mercury-parser result (as a Python dict)
100 | """
101 | text = HTML2Text()
102 | text.body_width = body_width
103 | text.ignore_emphasis = True
104 | text.ignore_images = True
105 | text.ignore_links = True
106 | text.convert_charrefs = True
107 | markdown = HTML2Text()
108 | markdown.body_width = body_width
109 | markdown.convert_charrefs = True
110 | result['content'] = {
111 | 'html': result['content'],
112 | 'markdown': unescape(markdown.handle(result['content'])),
113 | 'text': unescape(text.handle(result['content']))
114 | }
115 | return result
116 |
117 | if __name__ == '__main__':
118 | import argparse
119 | parser = argparse.ArgumentParser(
120 | formatter_class=argparse.ArgumentDefaultsHelpFormatter,
121 | description=__doc__
122 | )
123 | parser.add_argument(
124 | 'filename',
125 | help=(
126 | 'load Mercury Web Parser JSON result from file (use "-" '
127 | 'to read from stdin)'
128 | )
129 | )
130 | parser.add_argument(
131 | '-f', '--format',
132 | choices=list(Format.formatter),
133 | default='json',
134 | help='output format'
135 | )
136 | parser.add_argument(
137 | '-w', '--body-width',
138 | type=int,
139 | default=None,
140 | help='character offset at which to wrap lines for plain-text'
141 | )
142 | args = parser.parse_args()
143 | obj = main(
144 | load(args.filename),
145 | args.body_width,
146 | )
147 | print(Format.formatter[args.format](obj))
148 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | certifi==2021.10.8
2 | charset-normalizer==2.0.12
3 | html2text==2020.1.16
4 | idna==3.3
5 | requests==2.27.1
6 | urllib3==1.26.9
7 |
--------------------------------------------------------------------------------