├── README.org
├── org-protocol-capture-html.el
├── org-protocol-capture-html.sh
└── screenshot.png


/README.org:
--------------------------------------------------------------------------------
  1 | #+PROPERTY: LOGGING nil
  2 | 
  3 | * org-protocol-capture-html                                        :noexport:
  4 | 
  5 | org-protocol is awesome, but browsers do a pretty poor job of turning a page's HTML content into plain-text.  However, Pandoc supports converting /from/ HTML /to/ org-mode, so we can use it to turn HTML into Org-mode content!  It can even turn HTML tables into Org tables!
  6 | 
  7 | * Screenshot                                                       :noexport:
  8 | 
  9 | Here's an example of what you get in Emacs from capturing [[http://kitchingroup.cheme.cmu.edu/blog/2014/07/17/Pandoc-does-org-mode-now/][this page]]:
 10 | 
 11 | [[screenshot.png]]
 12 | 
 13 | * Contents :TOC:
 14 |  - [[#requirements][Requirements]]
 15 |  - [[#installation][Installation]]
 16 |      - [[#emacs][Emacs]]
 17 |      - [[#bookmarklets][Bookmarklets]]
 18 |  - [[#shell-script][Shell script]]
 19 |  - [[#usage][Usage]]
 20 |  - [[#changelog][Changelog]]
 21 |  - [[#credits][Credits]]
 22 |  - [[#appendix][Appendix]]
 23 |      - [[#org-protocol-instructions][org-protocol Instructions]]
 24 |      - [[#selection-grabbing-function][Selection-grabbing function]]
 25 |  - [[#to-do][To-Do]]
 26 | 
 27 | * Requirements
 28 | 
 29 | + *[[http://orgmode.org/worg/org-contrib/org-protocol.html][org-protocol]]*: This is what connects org-mode to the "outside world" using a MIME protocol handler.  The instructions on the org-protocol page are a bit out of date, so you might want to try [[#org-protocol-instructions][these instructions]] instead.
 30 | + [[https://github.com/magnars/s.el][s.el]]
 31 | + *Pandoc*: Version 1.8 or later is required.
 32 | + The shell script uses =curl= to download URLs (if you use it in that mode).
 33 | 
 34 | * Installation
 35 | ** Emacs
 36 | 
 37 | Put =org-protocol-capture-html.el= in your =load-path= and add to your init file:
 38 | 
 39 | #+BEGIN_SRC elisp
 40 | (require 'org-protocol-capture-html)
 41 | #+END_SRC
 42 | 
 43 | *** org-capture Template
 44 | 
 45 | You need a suitable =org-capture= template.  I recommend this one.  Whatever you choose, the default selection key is =w=, so if you want to use a different key, you'll need to modify the script and the bookmarklets.
 46 | 
 47 | #+BEGIN_SRC elisp
 48 | ("w" "Web site" entry
 49 |   (file "")
 50 |   "* %a :website:\n\n%U %?\n\n%:initial")
 51 | #+END_SRC
 52 | 
 53 | ** Bookmarklets
 54 | 
 55 | Now you need to make a bookmarklet in your browser(s) of choice.  You can select text in the page when you capture and it will be copied into the template, or you can just capture the page title and URL.  A [[#selection-grabbing-function][selection-grabbing function]] is used to capture the selection.
 56 | 
 57 | *Note:* The =w= in the URL in these bookmarklets chooses the corresponding capture template. You can leave it out if you want to be prompted for the template, or change it to another letter for a different template key.
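
All of the bookmarklets in this section assemble the same kind of URL inline.  As a rough sketch of that structure, the helper below builds one from its parts.  The name =buildCaptureUrl= and its option names are made up for illustration and are not part of this package; only the =capture-html= and =capture-eww-readable= sub-protocols and the =template=, =url=, =title= and =body= parameters come from the bookmarklets themselves.

```javascript
// Hypothetical helper, not part of this package: assembles the same kind of
// org-protocol URL that the bookmarklets build inline.
function buildCaptureUrl(opts) {
  var params = [];
  // The template key is optional; see the note above about leaving it out.
  if (opts.template) {
    params.push('template=' + encodeURIComponent(opts.template));
  }
  params.push('url=' + encodeURIComponent(opts.url));
  // Fall back to a placeholder title, as the bookmarklets do.
  params.push('title=' + encodeURIComponent(opts.title || '[untitled page]'));
  // Only the capture-html bookmarklets send the selected HTML in "body".
  if (opts.body !== undefined) {
    params.push('body=' + encodeURIComponent(opts.body));
  }
  return 'org-protocol://' + (opts.subprotocol || 'capture-html') +
      '?' + params.join('&');
}

console.log(buildCaptureUrl({
  template: 'w',
  url: 'http://example.com/a page',
  title: '',
  body: '<b>hi</b>'
}));
```

Every value goes through =encodeURIComponent=, so characters such as =&= or =?= in the page URL or the selection cannot be mistaken for parameter separators.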
 58 | 
 59 | *** Firefox
 60 | 
 61 | This bookmarklet captures what is currently selected in the browser or, if nothing is selected, just the page's URL and title.
 62 | 
 63 | #+BEGIN_SRC js
 64 |   javascript:location.href = 'org-protocol://capture-html?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof document.getSelection != "undefined") {var sel = document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
 65 | #+END_SRC
 66 | 
 67 | This one uses =eww='s built-in readability-scoring function in Emacs 25.1 and up to capture the article or main content of the page.
 68 | 
 69 | #+BEGIN_SRC js
 70 |   javascript:location.href = 'org-protocol://capture-eww-readable?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]");
 71 | #+END_SRC
 72 | 
 73 | *Note:* When you click on one of these bookmarklets for the first time, Firefox will ask what program to use to handle the =org-protocol= protocol.  You can simply choose the default program that appears (=org-protocol=).
 74 | 
 75 | *** Pentadactyl
 76 | 
 77 | If you use [[http://5digits.org/pentadactyl/][Pentadactyl]], you can use the Firefox *bookmarklets* above, or you can put these *commands* in your =.pentadactylrc=:
 78 | 
 79 | #+BEGIN_SRC js
 80 |   map -modes=n,v ch -javascript content.location.href = 'org-protocol://capture-html?template=w&url=' + encodeURIComponent(content.location.href) + '&title=' + encodeURIComponent(content.document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}())
 81 | 
 82 |   map -modes=n,v ce -javascript location.href='org-protocol://capture-eww-readable?template=w&url='+encodeURIComponent(content.location.href)+'&title='+encodeURIComponent(content.document.title || "[untitled page]")
 83 | #+END_SRC
 84 | 
 85 | *Note:* The JavaScript objects are slightly different for running as Pentadactyl commands since it has its own chrome.
 86 | 
 87 | *** Chrome
 88 | 
 89 | These bookmarklets work in Chrome:
 90 | 
 91 | #+BEGIN_SRC js
 92 |   javascript:location.href = 'org-protocol:///capture-html?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]") + '&body=' + encodeURIComponent(function () {var html = ""; if (typeof window.getSelection != "undefined") {var sel = window.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}());
 93 | 
 94 |   javascript:location.href = 'org-protocol:///capture-eww-readable?template=w&url=' + encodeURIComponent(location.href) + '&title=' + encodeURIComponent(document.title || "[untitled page]");
 95 | 
 96 | #+END_SRC
 97 | 
 98 | *Note:* The first sets of slashes are tripled compared to the Firefox bookmarklets.  When testing with Chrome, I found that =xdg-open= was collapsing the double-slashes into single-slashes, which breaks =org-protocol=.  I'm not sure why that doesn't seem to be necessary for Firefox.  If you have any trouble with this, you might try removing the extra slashes.
 99 | 
100 | * Shell script
101 | 
102 | The [[org-protocol-capture-html.sh][shell script]] is handy for piping any HTML (or plain-text) content to Org through the shell, or downloading and capturing any URL directly (without a browser), but it's not required.  It requires =getopt=, part of the =util-linux= package which should be standard on most Linux distros.  On OS X you may need to install =getopt= or =util-linux= from MacPorts or Homebrew, etc.
103 | 
104 | You can use it like this:
105 | 
106 | #+BEGIN_EXAMPLE
107 | org-protocol-capture-html.sh [OPTIONS] [HTML]
108 | cat html | org-protocol-capture-html.sh [OPTIONS]
109 | 
110 | Send HTML to Emacs through org-protocol, passing it through Pandoc to
111 | convert HTML to Org-mode.  HTML may be passed as an argument or
112 | through STDIN.  If only URL is given, it will be downloaded and its
113 | contents used.
114 | 
115 | Options:
116 |     -h, --heading HEADING     Heading
117 |     -r, --readability         Capture web page article with eww-readable
118 |     -t, --template TEMPLATE   org-capture template key (default: w)
119 |     -u, --url URL             URL
120 | 
121 |     --debug  Print debug info
122 |     --help   I need somebody!
123 | #+END_EXAMPLE
124 | 
125 | * Usage
126 | 
127 | After installing the bookmarklets, you can select some text on a web page with your mouse, click the bookmarklet in the browser, and Emacs should pop up an Org capture buffer.  You can also do it without selecting text first, if you just want to capture a link to the page.
128 | 
129 | You can also pass data through the shell script, for example:
130 | 
131 | #+BEGIN_SRC sh
132 | dmesg | grep -i sata | org-protocol-capture-html.sh --heading "dmesg SATA messages" --template i
133 | 
134 | org-protocol-capture-html.sh --readability --url "https://lwn.net/Articles/615220/"
135 | 
136 | org-protocol-capture-html.sh -h "TODO Feed the cat!" -t i "He gets grouchy if I forget!"
137 | #+END_SRC
138 | 
139 | * Changelog                                                      :noexport_1:
140 | 
141 | ** <2019-05-12>
142 | 
143 | +  Python 2-3 compatibility fixes in =org-protocol-capture-html.sh=.  ([[https://github.com/alphapapa/org-protocol-capture-html/pull/31][#31]].  Thanks to [[https://github.com/samspills][Sam Pillsworth]].)
144 | 
145 | ** <2017-04-17>
146 | 
147 | +  Use [[https://github.com/magnars/s.el][s.el]].
148 | +  Handle empty titles from =dom=.
149 | +  Skip HTTP headers more reliably in the =eww-readable= support.
150 | 
151 | ** <2017-04-15>
152 | 
153 | +  Switch from old-style =org-protocol= links to the new-style ones used in Org 9.  *Note*: This requires updating existing bookmarklets to use the new-style links.  See the examples in the usage instructions.  Users who are unable to upgrade to Org 9 should use the previous version of this package.
154 | +  Remove =python-readability= support and just use =eww-readable=.  =eww-readable= seems to work so well that it seems unnecessary to bother with external tools.  Of course, this does require Emacs 25.1, so users on Emacs 24 may wish to use the previous version.
155 | 
156 | ** <2017-04-11>
157 | 
158 | + Add =org-protocol-capture-eww-readable=.  For Emacs 25.1 and up, this uses =eww='s built-in readability-style function instead of calling external Python scripts.
159 | 
160 | ** <2016-10-23 Sun>
161 | 
162 | + Add =org-protocol-capture-html-demote-times= variable, which controls how many times headings in captured pages are demoted.  This is handy if you use a sub-heading in your capture template, so you can make all the headings in captured pages lower than the lowest-level heading in your capture template.
163 | 
164 | ** <2016-10-05 Wed>
165 | 
166 | +  Check Pandoc's no-wrap option lazily (upon first capture), and if Pandoc takes too long for some reason, try again next time a capture is run.
167 | +  If Pandoc does take too long, kill the buffer and process without prompting.
168 | +  Use ~sleep-for~ instead of ~sit-for~ to work around any potential issues with whatever "input" may interrupt ~sit-for~.
169 | 
170 | Hopefully this puts issue #12 to rest for good.  Thanks to [[https://github.com/jguenther][@jguenther]] for his help fixing and reporting bugs.
171 | 
172 | ** <2016-10-03 Mon>
173 | 
174 | + Handle pages without titles in bookmarklet examples.  If a page lacks an HTML title, the string passed to =org-protocol= would have nothing where the title should go, and this would cause the capture to fail.  Now the bookmarklets will use =[untitled page]= instead of an empty string.  (No Elisp code changed, only the examples in the readme.)
175 | 
176 | ** <2016-10-01 Sat>
177 | 
178 | + Use a temp buffer for the Pandoc test, thanks to [[https://github.com/jguenther][@jguenther]].
179 | 
180 | ** <2016-09-29 Thu>
181 | 
182 | +  Fix issue #12 (i.e. /really/ fix the =--no-wrap= deprecation), thanks to [[https://github.com/jguenther][@jguenther]].
183 | +  Require =cl= and use =cl-incf= instead of =incf=.
184 | 
185 | ** <2016-09-23 Fri>
186 | 
187 | + Fix for Pandoc versions =>== 1.16, which deprecate =--no-wrap= in favor of =--wrap=none=.
188 | 
189 | ** <2016-04-03 Sun>
190 | 
191 | + Add support for [[https://github.com/buriy/python-readability][python-readability]].
192 | + Improve instructions.
193 | 
194 | ** <2016-03-23 Wed>
195 | 
196 | + Add URL downloading to the shell script.  Now you can run =org-protocol-capture-html.sh -u http://example.com= and it will download and capture the page.
197 | + Add =org-capture= template to the readme.  This will make it much easier for new users.
198 | 
199 | * Credits
200 | 
201 | + Thanks to [[https://github.com/jguenther][@jguenther]] for helping to fix issue #12.
202 | + Thanks to [[https://github.com/xuchunyang][@xuchunyang]] for finding and fixing #17 and #19.
203 | 
204 | * Appendix
205 | 
206 | ** org-protocol Instructions
207 | 
208 | *** 1. Add protocol handler
209 | 
210 | Create the file =~/.local/share/applications/org-protocol.desktop= containing:
211 | 
212 | #+BEGIN_SRC conf
213 |   [Desktop Entry]
214 |   Name=org-protocol
215 |   Exec=emacsclient %u
216 |   Type=Application
217 |   Terminal=false
218 |   Categories=System;
219 |   MimeType=x-scheme-handler/org-protocol;
220 | #+END_SRC
221 | 
222 | *Note:* Each line's key must be capitalized exactly as displayed, or it will be an invalid =.desktop= file.
223 | 
224 | Then update =~/.local/share/applications/mimeinfo.cache= by running:
225 | 
226 | -  On KDE: =kbuildsycoca4=
227 | -  On GNOME: =update-desktop-database ~/.local/share/applications/=
228 | 
229 | *** 2. Configure Emacs
230 | 
231 | **** Init file
232 | 
233 | Add to your Emacs init file:
234 | 
235 | #+BEGIN_SRC elisp
236 |     (server-start)
237 |     (require 'org-protocol)
238 | #+END_SRC
239 | 
240 | **** Capture template
241 | 
242 | You'll probably want to add a capture template something like this:
243 | 
244 | #+BEGIN_SRC elisp
245 |   ("w" "Web site"
246 |    entry (file+olp "~/org/inbox.org" "Web")
247 |    "* %c :website:\n%U %?%:initial")
248 | #+END_SRC
249 | 
250 | *Note:* Using =%:initial= instead of =%i= seems to handle multi-line content better.
251 | 
252 | This will result in a capture like this:
253 | 
254 | #+BEGIN_SRC org
255 |    * [[http://orgmode.org/worg/org-contrib/org-protocol.html][org-protocol.el – Intercept calls from emacsclient to trigger custom actions]] :website:
256 |    [2015-09-29 Tue 11:09] About org-protocol.el org-protocol.el is based on code and ideas from org-annotation-helper.el and org-browser-url.el.
257 | #+END_SRC
258 | 
259 | *** 3. Configure Firefox
260 | 
261 | On some versions of Firefox, it may be necessary to add this setting. You may skip this step and come back to it if you get an error saying that Firefox doesn't know how to handle =org-protocol= links.
262 | 
263 | Open =about:config= and create a new =boolean= value named =network.protocol-handler.expose.org-protocol= and set it to =true=.
264 | 
265 | *Note:* If you do skip this step, and you do encounter the error, Firefox may replace all open tabs in the window with the error message, making it difficult or impossible to recover those tabs. It's best to use a new window with a throwaway tab to test this setup until you know it's working.
266 | 
267 | ** Selection-grabbing function
268 | 
269 | This function gets the HTML from the browser's selection.  It's from [[http://stackoverflow.com/a/6668159/712624][this answer]] on StackOverflow.
270 | 
271 | #+BEGIN_SRC js
272 |   function () {
273 |       var html = "";
274 | 
275 |       if (typeof content.document.getSelection != "undefined") {
276 |           var sel = content.document.getSelection();
277 |           if (sel.rangeCount) {
278 |               var container = document.createElement("div");
279 |               for (var i = 0, len = sel.rangeCount; i < len; ++i) {
280 |                   container.appendChild(sel.getRangeAt(i).cloneContents());
281 |               }
282 |               html = container.innerHTML;
283 |           }
284 |       } else if (typeof document.selection != "undefined") {
285 |           if (document.selection.type == "Text") {
286 |               html = document.selection.createRange().htmlText;
287 |           }
288 |       }
289 | 
290 |       var relToAbs = function (href) {
291 |           var a = content.document.createElement("a");
292 |           a.href = href;
293 |           var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash;
294 |           a.remove();
295 |           return abs;
296 |       };
297 |       var elementTypes = [
298 |           ['a', 'href'],
299 |           ['img', 'src']
300 |       ];
301 | 
302 |       var div = content.document.createElement('div');
303 |       div.innerHTML = html;
304 | 
305 |       elementTypes.map(function(elementType) {
306 |           var elements = div.getElementsByTagName(elementType[0]);
307 |           for (var i = 0; i < elements.length; i++) {
308 |               elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));
309 |           }
310 |       });
311 |       return div.innerHTML;
312 |   }
313 | #+END_SRC
314 | 
315 | Here's a one-line version of it, better for pasting into bookmarklets and such:
316 | 
317 | #+BEGIN_SRC js
318 |   function () {var html = ""; if (typeof content.document.getSelection != "undefined") {var sel = content.document.getSelection(); if (sel.rangeCount) {var container = document.createElement("div"); for (var i = 0, len = sel.rangeCount; i < len; ++i) {container.appendChild(sel.getRangeAt(i).cloneContents());} html = container.innerHTML;}} else if (typeof document.selection != "undefined") {if (document.selection.type == "Text") {html = document.selection.createRange().htmlText;}} var relToAbs = function (href) {var a = content.document.createElement("a"); a.href = href; var abs = a.protocol + "//" + a.host + a.pathname + a.search + a.hash; a.remove(); return abs;}; var elementTypes = [['a', 'href'], ['img', 'src']]; var div = content.document.createElement('div'); div.innerHTML = html; elementTypes.map(function(elementType) {var elements = div.getElementsByTagName(elementType[0]); for (var i = 0; i < elements.length; i++) {elements[i].setAttribute(elementType[1], relToAbs(elements[i].getAttribute(elementType[1])));}}); return div.innerHTML;}
319 | #+END_SRC
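
The =relToAbs= helper in the function above relies on the browser resolving =a.href= against the current page.  Outside a browser, much the same result can be sketched with the standard =URL= constructor, which takes the base URL explicitly.  This is only an illustration of the idea, not what the bookmarklets do, and the two-argument signature here is an assumption made for the example.

```javascript
// Sketch only: resolve a possibly-relative link against an explicit base URL.
// The bookmarklets instead let the browser do this via a temporary <a> element.
function relToAbs(href, base) {
  try {
    return new URL(href, base).href;
  } catch (e) {
    // Leave anything that cannot be parsed untouched.
    return href;
  }
}

var page = 'http://example.com/blog/2014/post.html';
console.log(relToAbs('../img/plot.png', page));         // http://example.com/blog/img/plot.png
console.log(relToAbs('/about?x=1#top', page));          // http://example.com/about?x=1#top
console.log(relToAbs('https://other.example/x', page)); // https://other.example/x
```

Links that are already absolute pass through unchanged, which is why the function can safely be applied to every =href= and =src= in the selection.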
320 | 
321 | * To-Do                                                          :noexport_1:
322 | 
323 | ** TODO Add link to Mac OS X article
324 | 
325 | [[https://blog.aaronbieber.com/2016/11/24/org-capture-from-anywhere-on-your-mac.html][This article]] would be helpful for Mac users in setting up org-protocol.
326 | 
327 | ** TODO File-based capturing
328 | 
329 | Pentadactyl has the =:write= command, which can write a page's HTML to a file, or to a command, like =:write !org-protocol-capture-html.sh=.  This should make it easy to implement file-based capturing, which would pass HTML through a temp file rather than as an argument, and this would work around the argument-length limit that we occasionally run into.
330 | 
331 | All that should be necessary is to:
332 | 
333 | 1. Add a new sub-protocol =capture-file= that receives a path to a file instead of a URL to a page.
334 |      - It should probably delete the file after finishing the capture, to avoid leaving temp files lying around, so it should protect against deleting random files.  Probably the best way to do this would be to define a directory and a prefix, and only files that are in that directory and have that prefix may be deleted.
335 | 2. Add options to =org-protocol-capture-html.sh= to capture with files.
336 |      - This should have two methods:
337 |          + Pass the path to an existing file, which will then be passed to Emacs.
338 |          + Pass content via =STDIN=, write it to a tempfile, and pass the tempfile's path to Emacs.  The tempfile should go in the directory and have the prefix so that Emacs knows it's safe to delete that file.
339 | 3. Document how to integrate this with Pentadactyl.  It should be very simple, like =:write !org-protocol-capture-html --tempfile=.
340 |      - This would, by default, pass the entire content of the page.  It would be good to also be able to capture only the selection, and to be able to use Readability on the result.  Here's an example from the Pentadactyl manual that seems to show using JavaScript to fill arguments to the command:
341 | 
342 | #+BEGIN_EXAMPLE txt
343 |   :com! search-selection,ss -bang -nargs=? -complete search
344 |   \ -js commands.execute((bang ? open : tabopen )
345 |   \ + args + + buffer.currentWord)
346 | #+END_EXAMPLE
347 | 
348 |         However, I don't see how this would allow writing different content to =STDIN=, only arguments.  So this might not be possible without modifying Pentadactyl and/or using a separate Firefox extension.  [[file:~/src/dactyl/common/modules/buffer.jsm::commands.add(%5B"sav%5Beas%5D",%20"w%5Brite%5D"%5D,][Here]] is the source for the =:write= command, and [[file:~/Temp/src/dactyl/common/modules/storage.jsm::write:%20function%20write(buf,%20mode,%20perms,%20encoding)%20{][here]] for the underlying JS function.  And you can see [[file:~/src/dactyl/common/modules/io.jsm::%5B"exec",%20">"%20%2B%20shellEscape(stdout.path),%20"2>&1",%20"<"%20%2B%20shellEscape(stdin.path),][here]] how it uses temp files to pass =STDIN= to commands.
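
The directory-and-prefix rule from step 1 can be sketched as follows.  Everything named here is hypothetical: the to-do item does not fix a directory or a prefix, and a real check would live on the Emacs side rather than in JavaScript.  The sketch only illustrates the rule that a file may be deleted only when both conditions hold.

```javascript
var path = require('path');

// Hypothetical names throughout: the capture directory and the prefix are
// placeholders, since the to-do item does not specify either.
function isSafeToDelete(file, captureDir, prefix) {
  // Resolve first so that ".." segments cannot escape the directory.
  var resolved = path.resolve(file);
  var dir = path.resolve(captureDir);
  // Both conditions must hold: directly inside the directory AND prefixed.
  return path.dirname(resolved) === dir &&
      path.basename(resolved).indexOf(prefix) === 0;
}

console.log(isSafeToDelete('/tmp/opch/capture-123.html', '/tmp/opch', 'capture-'));
console.log(isSafeToDelete('/tmp/opch/../capture-123.html', '/tmp/opch', 'capture-'));
```

This version does not follow symbolic links, so a stricter implementation might also compare the files' real paths.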
349 | 
350 | 
351 | ** Handle long chunks of HTML
352 | 
353 | If you try to capture too long a chunk of HTML, it will fail with "argument list too long" errors from =emacsclient=.  Working around this will require capturing via STDIN instead of arguments.  Since org-protocol is based on using URLs, this will probably require using a shell script and a new Emacs function, and perhaps another MIME protocol-handler.  Even then, it might still run into problems, because the data is passed to the shell script as an argument in the protocol-handler.  Working around that would probably require a non-protocol-handler-based method using a browser extension to send the HTML directly via STDIN.  It might be possible with Pentadactyl instead of making an entirely new browser extension.  Also, maybe the [[https://addons.mozilla.org/en-US/firefox/addon/org-mode-capture/][Org-mode Capture]] Firefox extension could be extended (...) to do this.
354 | 
355 | However, most of the time, this is not a problem.
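
Whether a capture runs into this limit depends on the length of the URL after encoding, which can be several times the length of the selected HTML.  A rough pre-check might look like the sketch below.  The 128 KiB default is an assumption (it is roughly the per-argument limit on common Linux systems, and other platforms differ), and the function names are hypothetical.

```javascript
// Assumed limit: roughly the per-argument cap on common Linux systems.
// Treat it as a placeholder and adjust it for your platform.
var ASSUMED_MAX_ARG_BYTES = 128 * 1024;

function encodedCaptureLength(url, title, body) {
  var s = 'org-protocol://capture-html?template=w' +
      '&url=' + encodeURIComponent(url) +
      '&title=' + encodeURIComponent(title) +
      '&body=' + encodeURIComponent(body);
  // encodeURIComponent output is plain ASCII, so length equals byte count.
  return s.length;
}

function captureTooLong(url, title, body, limit) {
  return encodedCaptureLength(url, title, body) >= (limit || ASSUMED_MAX_ARG_BYTES);
}

console.log(captureTooLong('http://example.com', 'Example', '<p>hi</p>'));
```

Markup-heavy selections grow the most, because each =<=, =>= and ="= becomes three characters once encoded.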
356 | 
357 | ** Package for MELPA
358 | 
359 | This would be nice.
360 | 


--------------------------------------------------------------------------------
/org-protocol-capture-html.el:
--------------------------------------------------------------------------------
  1 | ;;; org-protocol-capture-html.el --- Capture HTML with org-protocol
  2 | 
  3 | ;; URL: https://github.com/alphapapa/org-protocol-capture-html
  4 | ;; Version: 0.1-pre
  5 | ;; Package-Requires: ((emacs "24.4"))
  6 | 
  7 | ;;; Commentary:
  8 | 
  9 | ;; This package captures Web pages into Org-mode using Pandoc to
 10 | ;; process HTML.  It can also use eww's eww-readable functionality to
 11 | ;; get the main content of a page.
 12 | 
 13 | ;; These are the helper functions that run in Emacs.  To capture pages
 14 | ;; into Emacs, you can use either a browser bookmarklet or the
 15 | ;; org-protocol-capture-html.sh shell script.  See the README.org file
 16 | ;; for instructions.
 17 | 
 18 | ;;; License:
 19 | 
 20 | ;; This program is free software; you can redistribute it and/or modify
 21 | ;; it under the terms of the GNU General Public License as published by
 22 | ;; the Free Software Foundation, either version 3 of the License, or
 23 | ;; (at your option) any later version.
 24 | 
 25 | ;; This program is distributed in the hope that it will be useful,
 26 | ;; but WITHOUT ANY WARRANTY; without even the implied warranty of
 27 | ;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 28 | ;; GNU General Public License for more details.
 29 | 
 30 | ;; You should have received a copy of the GNU General Public License
 31 | ;; along with this program.  If not, see <http://www.gnu.org/licenses/>.
 32 | 
 33 | ;;; Code:
 34 | 
 35 | ;;;; Require
 36 | 
 37 | (require 'org-protocol)
 38 | (require 'cl-lib)
 39 | (require 'subr-x)
 40 | (require 's)
 41 | 
 42 | ;;;; Vars
 43 | 
 44 | (defcustom org-protocol-capture-html-demote-times 1
 45 |   "How many times to demote headings in captured pages.
 46 | You may want to increase this if you use a sub-heading in your capture template."
 47 |   :group 'org-protocol-capture-html :type 'integer)
 48 | 
 49 | ;;;; Test Pandoc
 50 | 
 51 | (defconst org-protocol-capture-html-pandoc-no-wrap-option nil
 52 |   ;; Set this so it won't be unbound
 53 |   "Option to pass to Pandoc to disable wrapping.
 54 | Pandoc >= 1.16 deprecates `--no-wrap' in favor of
 55 | `--wrap=none'.")
 56 | 
 57 | (defun org-protocol-capture-html--define-pandoc-wrap-const ()
 58 |   "Set `org-protocol-capture-html-pandoc-no-wrap-option'."
 59 |   (setq org-protocol-capture-html-pandoc-no-wrap-option
 60 |         ;; Pandoc >= 1.16 deprecates the --no-wrap option, replacing it with
 61 |         ;; --wrap=none.  Sending the wrong option causes output to STDERR,
 62 |         ;; which `call-process-region' doesn't like.  So we test Pandoc to see
 63 |         ;; which option to use.
 64 |         (with-temp-buffer
 65 |           (let* ((process (start-process "test-pandoc" (current-buffer) "pandoc" "--dump-args" "--no-wrap"))
 66 |                  (limit 3)
 67 |                  (checked 0))
 68 |             (while (process-live-p process)
 69 |               (if (= checked limit)
 70 |                   (progn
 71 |                     ;; Pandoc didn't exit in time.  Signal an error;
 72 |                     ;; leaving `with-temp-buffer' kills the buffer and
 73 |                     ;; the process without prompting.  The `setq' never
 74 |                     ;; completes, so
 75 |                     ;; `org-protocol-capture-html-pandoc-no-wrap-option'
 76 |                     ;; stays nil and this check runs again on the next capture.
 77 |                     (set-process-query-on-exit-flag process nil)
 78 |                     (error "Unable to test Pandoc!  Please report this bug! (include the output of \"pandoc --dump-args --no-wrap\")"))
 79 |                 (sleep-for 0.2)
 80 |                 (cl-incf checked)))
 81 |             (if (and (zerop (process-exit-status process))
 82 |                      (not (string-match "--no-wrap is deprecated" (buffer-string))))
 83 |                 "--no-wrap"
 84 |               "--wrap=none")))))
 85 | 
 86 | ;;;; Direct-to-Pandoc
 87 | 
 88 | (defun org-protocol-capture-html--with-pandoc (data)
 89 |   "Process an org-protocol://capture-html:// URL using DATA.
 90 | 
 91 | This function is basically a copy of `org-protocol-do-capture', but
 92 | it passes the captured content (not the URL or title) through
 93 | Pandoc, converting HTML to Org-mode."
 94 | 
 95 |   ;; It would be nice to not basically duplicate
 96 |   ;; `org-protocol-do-capture', but passing the data back to that
 97 |   ;; function would require re-encoding the data into a URL string
 98 |   ;; with Emacs after Pandoc converts it.  Since we've already split
 99 |   ;; it up, we might as well go ahead and run the capture directly.
100 | 
101 |   (unless org-protocol-capture-html-pandoc-no-wrap-option
102 |     (org-protocol-capture-html--define-pandoc-wrap-const))
103 | 
104 |   (let* ((template (or (plist-get data :template)
105 |                        org-protocol-default-template-key))
106 |          (url (org-protocol-sanitize-uri (plist-get data :url)))
107 |          (type (if (string-match "^\\([a-z]+\\):" url)
108 |                    (match-string 1 url)))
109 |          (title (or (org-protocol-capture-html--nbsp-to-space (string-trim (plist-get data :title))) ""))
110 |          (content (or (org-protocol-capture-html--nbsp-to-space (string-trim (plist-get data :body))) ""))
111 |          (orglink (org-make-link-string
112 |                    url (if (string-match "[^[:space:]]" title) title url)))
113 |          (org-capture-link-is-already-stored t)) ; avoid call to org-store-link
114 | 
115 |     (setq org-stored-links
116 |           (cons (list url title) org-stored-links))
117 |     (kill-new orglink)
118 | 
119 |     (with-temp-buffer
120 |       (insert content)
121 |       (if (not (zerop (call-process-region
122 |                        (point-min) (point-max)
123 |                        "pandoc" t t nil "-f" "html" "-t" "org" org-protocol-capture-html-pandoc-no-wrap-option)))
124 |           (message "Pandoc failed: %s" (buffer-string))
125 |         (progn
126 |           ;; Pandoc succeeded
127 |           (org-store-link-props :type type
128 |                                 :annotation orglink
129 |                                 :link url
130 |                                 :description title
131 |                                 :orglink orglink
132 |                                 :initial (buffer-string)))))
133 |     (org-protocol-capture-html--do-capture)
134 |     nil))
135 | 
136 | (add-to-list 'org-protocol-protocol-alist
137 |              '("capture-html"
138 |                :protocol "capture-html"
139 |                :function org-protocol-capture-html--with-pandoc
140 |                :kill-client t))
141 | 
142 | ;;;; eww-readable
143 | 
144 | (defvar url-http-end-of-headers)
145 | 
146 | (eval-when-compile
147 |   ;; eww-readable only works on Emacs >=25.1, but I think it's better
148 |   ;; to check for the actual symbols.  I think using
149 |   ;; `eval-when-compile' is the right way to do this, but I'm not
150 |   ;; sure.
151 |   (when (and (require 'eww nil t)
152 |              (require 'dom nil t)
153 |              (fboundp 'eww-score-readability))
154 | 
155 |     (defun org-protocol-capture-html--capture-eww-readable (data)
156 |       "Capture content of URL in DATA with eww-readable."
157 | 
158 |       (unless org-protocol-capture-html-pandoc-no-wrap-option
159 |         (org-protocol-capture-html--define-pandoc-wrap-const))
160 | 
161 |       (let* ((template (or (plist-get data :template)
162 |                            org-protocol-default-template-key))
163 |              (url (org-protocol-sanitize-uri (plist-get data :url)))
164 |              (type (if (string-match "^\\([a-z]+\\):" url)
165 |                        (match-string 1 url)))
166 |              (html (org-protocol-capture-html--url-html url))
167 |              (result (org-protocol-capture-html--eww-readable html))
168 |              (title (cdr result))
169 |              (content (with-temp-buffer
170 |                         (insert (org-protocol-capture-html--nbsp-to-space (car result)))
171 |                         ;; Convert to Org with Pandoc
172 |                         (unless (= 0 (call-process-region (point-min) (point-max)
173 |                                                           "pandoc" t t nil "-f" "html" "-t" "org"
174 |                                                           org-protocol-capture-html-pandoc-no-wrap-option))
175 |                           (error "Pandoc failed"))
176 |                         (save-excursion
177 |                           ;; Remove DOS CR/LF line endings
178 |                           (goto-char (point-min))
179 |                           (while (search-forward (string ?\C-m) nil t)
180 |                             (replace-match "")))
181 |                         ;; Demote page headings in capture buffer to below the
182 |                         ;; top-level Org heading and "Article" 2nd-level heading
183 |                         (save-excursion
184 |                           (goto-char (point-min))
185 |                           (while (re-search-forward (rx bol (1+ "*") (1+ space)) nil t)
186 |                             (beginning-of-line)
187 |                             (insert "**")
188 |                             (end-of-line)))
189 |                         (buffer-string)))
190 |              (orglink (org-make-link-string
191 |                        url (if (s-present? title) title url)))
192 |              ;; Avoid call to org-store-link
193 |              (org-capture-link-is-already-stored t))
194 | 
195 |         (setq org-stored-links
196 |               (cons (list url title) org-stored-links))
197 |         (kill-new orglink)
198 | 
199 |         (org-store-link-props :type type
200 |                               :annotation orglink
201 |                               :link url
202 |                               :description title
203 |                               :orglink orglink
204 |                               :initial content)
205 |         (org-protocol-capture-html--do-capture)
206 |         nil))
207 | 
208 |     (add-to-list 'org-protocol-protocol-alist
209 |                  '("capture-eww-readable"
210 |                    :protocol "capture-eww-readable"
211 |                    :function org-protocol-capture-html--capture-eww-readable
212 |                    :kill-client t))
213 | 
214 |     (defun org-protocol-capture-html--url-html (url)
215 |       "Return HTML from URL as string."
216 |       (let* ((response-buffer (url-retrieve-synchronously url nil t))
217 |              (encoded-html (with-current-buffer response-buffer
218 |                              (pop-to-buffer response-buffer)
219 |                              ;; Skip HTTP headers, using marker provided by url-http
220 |                              (delete-region (point-min) (1+ url-http-end-of-headers))
221 |                              (buffer-string))))
222 |         (kill-buffer response-buffer)     ; Not sure if necessary to avoid leaking buffer
223 |         (with-temp-buffer
224 |           ;; For some reason, running `decode-coding-region' in the
225 |           ;; response buffer has no effect, so we have to do it in a
226 |           ;; temp buffer.
227 |           (insert encoded-html)
228 |           (condition-case nil
229 |               ;; Fix undecoded text
230 |               (decode-coding-region (point-min) (point-max) 'utf-8)
231 |             (coding-system-error nil))
232 |           (buffer-string))))
233 | 
234 |     (defun org-protocol-capture-html--eww-readable (html)
235 |       "Return `eww-readable' part of HTML with title.
236 | Returns cons cell (HTML . TITLE)."
237 |       ;; Based on `eww-readable'
238 |       (let* ((html
239 |               ;; Convert "&nbsp;" in HTML to plain spaces.
240 |               ;; `libxml-parse-html-region' turns them into
241 |               ;; underlines.  The closest I can find to an explanation
242 |               ;; is at <http://www.perlmonks.org/?node_id=825188>.
243 |               (org-protocol-capture-html--nbsp-to-space html))
244 |              (dom (with-temp-buffer
245 |                     (insert html)
246 |                     (libxml-parse-html-region (point-min) (point-max))))
247 |              (title (cl-caddr (car (dom-by-tag dom 'title)))))
248 |         (eww-score-readability dom)
249 |         (cons (with-temp-buffer
250 |                 (shr-dom-print (eww-highest-readability dom))
251 |                 (buffer-string))
252 |               title)))))
253 | 
254 | ;;;; Helper functions
255 | 
256 | (defun org-protocol-capture-html--nbsp-to-space (s)
257 |   "Convert HTML non-breaking spaces to plain spaces in S."
258 |   ;; Not sure why sometimes these are in the HTML and Pandoc converts
259 |   ;; them to underlines instead of spaces, but this fixes it.
260 |   (replace-regexp-in-string (rx "&nbsp;") " " s t t))
261 | 
262 | (with-no-warnings
263 |   ;; Ignore warning about the dynamically scoped `template' variable.
264 |   (defun org-protocol-capture-html--do-capture ()
265 |     "Call `org-capture' and demote page headings in capture buffer."
266 |     (raise-frame)
267 |     (funcall 'org-capture nil template)
268 | 
269 |     ;; Demote page headings in capture buffer to below the
270 |     ;; top-level Org heading
271 |     (save-excursion
272 |       (goto-char (point-min))
273 |       (re-search-forward (rx bol "*" (1+ space)) nil t) ; Skip 1st heading
274 |       (while (re-search-forward (rx bol "*" (1+ space)) nil t)
275 |         (dotimes (n org-protocol-capture-html-demote-times)
276 |           (org-demote-subtree))))))
277 | 
278 | (provide 'org-protocol-capture-html)
279 | 
280 | ;;; org-protocol-capture-html.el ends here
281 | 


--------------------------------------------------------------------------------
/org-protocol-capture-html.sh:
--------------------------------------------------------------------------------
  1 | #!/bin/bash
  2 | 
  3 | # * Defaults
  4 | 
  5 | heading=" "
  6 | protocol="capture-html"
  7 | template="w"
  8 | 
  9 | # * Functions
 10 | 
 11 | function debug {
 12 |     if [[ -n $debug ]]
 13 |     then
 14 |         function debug {
 15 |             echo "DEBUG: $*" >&2
 16 |         }
 17 |         debug "$@"
 18 |     else
 19 |         function debug {
 20 |             true
 21 |         }
 22 |     fi
 23 | }
 24 | function die {
 25 |     echo "$@" >&2
 26 |     exit 1
 27 | }
 28 | function usage {
 29 |     cat <<EOF
 30 | $0 [OPTIONS] [HTML]
 31 | html | $0 [OPTIONS]
 32 | 
 33 | Send HTML to Emacs through org-protocol, passing it through Pandoc to
 34 | convert HTML to Org-mode.  HTML may be passed as an argument or
 35 | through STDIN.  If only URL is given, it will be downloaded and its
 36 | contents used.
 37 | 
 38 | Options:
 39 |     -h, --heading HEADING     Heading
 40 |     -r, --readability         Capture web page article with eww-readable
 41 |     -t, --template TEMPLATE   org-capture template key (default: w)
 42 |     -u, --url URL             URL
 43 | 
 44 |     -d, --debug  Print debug info
 45 |     --help   I need somebody!
 46 | EOF
 47 | }
 48 | 
 49 | function urlencode {
 50 |     python -c "
 51 | from __future__ import print_function
 52 | try:
 53 |     from urllib import quote  # Python 2
 54 | except ImportError:
 55 |     from urllib.parse import quote  # Python 3
 56 | import sys
 57 | 
 58 | print(quote(sys.stdin.read()[:-1], safe=''))"
 59 | }
 60 | 
 61 | # * Args
 62 | 
 63 | args=$(getopt -n "$0" -o dh:rt:u: -l debug,help,heading:,readability,template:,url: -- "$@") \
 64 |     || die "Unable to parse args.  Is getopt installed?"
 65 | eval set -- "$args"
 66 | 
 67 | while true
 68 | do
 69 |     case "$1" in
 70 |         -d|--debug)
 71 |             debug=true
 72 |             debug "Debugging on"
 73 |             ;;
 74 |         --help)
 75 |             usage
 76 |             exit
 77 |             ;;
 78 |         -h|--heading)
 79 |             shift
 80 |             heading="$1"
 81 |             ;;
 82 |         -r|--readability)
 83 |             protocol="capture-eww-readable"
 84 |             readability=true
 85 |             ;;
 86 |         -t|--template)
 87 |             shift
 88 |             template="$1"
 89 |             ;;
 90 |         -u|--url)
 91 |             shift
 92 |             url="$1"
 93 |             ;;
 94 |         --)
 95 |             # Remaining args
 96 |             shift
 97 |             rest=("$@")
 98 |             break
 99 |             ;;
100 |     esac
101 | 
102 |     shift
103 | done
104 | 
105 | debug "ARGS: $args"
106 | debug "Remaining args: ${rest[*]}"
107 | 
108 | # * Main
109 | 
110 | # ** Get HTML
111 | 
112 | if [[ -n $* ]]
113 | then
114 |     debug "HTML from args"
115 | 
116 |     html="$*"
117 | 
118 | elif ! [[ -t 0 ]]
119 | then
120 |     debug "HTML from STDIN"
121 | 
122 |     html=$(cat)
123 | 
124 | elif [[ -n $url && -z $readability ]]
125 | then
126 |     debug "Only URL given; downloading..."
127 | 
128 |     # Download URL
129 |     html=$(curl "$url") || die "Unable to download $url"
130 | 
131 |     # Get HTML title for heading
132 |     heading=$(sed -nr '/<title>/{s|.*<title>([^<]+)</title>.*|\1|i;p;q};' <<<"$html"); [[ -n $heading ]] || heading="A web page with no name"
133 | 
134 |     debug "Using heading: $heading"
135 | 
136 | elif [[ -n $readability ]]
137 | then
138 |     debug "Using readability"
139 | 
140 | else
141 |     usage
142 |     echo
143 |     die "I need somethin' ta go on, Cap'n!"
144 | fi
145 | 
146 | # ** Check URL
147 | # The URL shouldn't be empty
148 | 
149 | [[ -n $url ]] || url="http://example.com"
150 | 
151 | # ** URL-encode html
152 | 
153 | heading=$(urlencode <<<"$heading") || die "Unable to urlencode heading."
154 | url=$(urlencode <<<"$url") || die "Unable to urlencode URL."
155 | html=$(urlencode <<<"$html") || die "Unable to urlencode HTML."
156 | 
157 | # ** Send to Emacs
158 | 
159 | emacsclient "org-protocol://$protocol?template=$template&url=$url&title=$heading&body=$html"
160 | 


--------------------------------------------------------------------------------
/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/alphapapa/org-protocol-capture-html/a912aaefae8abdada2b2479aec0ad53fcf0b57bf/screenshot.png


--------------------------------------------------------------------------------