Demo Document

├── .gitignore ├── README.md ├── examples ├── README.md ├── blogpost.md ├── header-attrs.md ├── headers.org ├── html.md ├── link.md ├── mult-authors.md ├── nav.md ├── nav.org ├── nba2.org ├── no-headers.org ├── ol.md ├── paragraphs.md ├── paragraphs.org ├── plain-list.org ├── sample.md ├── test.md ├── worknotes.md └── worknotes.org ├── pandoc_opml └── __init__.py └── setup.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.egg-info 2 | *.pyc 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | pandoc-opml 2 | =========== 3 | 4 | pandoc-opml generates [OPML] files from [Markdown] with the help of [pandoc]. 5 | 6 | [OPML]: http://dev.opml.org/spec2.html 7 | [Markdown]: http://johnmacfarlane.net/pandoc/README.html#pandocs-markdown 8 | [pandoc]: http://johnmacfarlane.net/pandoc/ 9 | 10 | Demo 11 | ---- 12 | 13 | Imagine this Markdown document: 14 | 15 | ```markdown 16 | --- 17 | title: Demo Document 18 | author: Eric Davis 19 | --- 20 | 21 | # Hello World! 22 | 23 | This is a child of the "Hello World!" header. 24 | ``` 25 | 26 | After running it through `pandoc-opml`, you'd have this OPML document: 27 | 28 | ```xml 29 | 30 | 31 | 32 | 33 | Demo Document 34 | Eric Davis 35 | Tue, 13 Jan 2015 04:21:33 GMT 36 | https://github.com/edavis/pandoc-opml 37 | https://github.com/edavis/pandoc-opml#docs 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | ``` 46 | 47 | Alright, so I've taken the simplicity of Markdown and turned it into a 48 | jumble of XML. What's so great about this? 49 | 50 | Well, think of what an XML version of your Markdown now enables. 51 | 52 | Say you wanted to grab all level 1 and level 2 headlines from a 53 | Markdown document to put together a table of contents. 
54 | 55 | All the widely used Markdown libraries seem to focus primarily on 56 | transforming Markdown into HTML, so no help there. Beyond that, you 57 | could try writing a regex to extract the headers but [that path is 58 | brittle and full of pain][regex quote]. 59 | 60 | What if instead you could transform your Markdown into XML and gain 61 | with it all the tools and libraries that natively work with XML? Then 62 | your "grab all level 1 and level 2 headers" task would be a breeze. 63 | 64 | pandoc-opml is the tool to do just that. 65 | 66 | [regex quote]: http://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/ 67 | 68 | Installation 69 | ------------ 70 | 71 | I'll eventually toss this up on PyPI, but for now: 72 | 73 | ```bash 74 | $ pip install git+https://github.com/edavis/pandoc-opml 75 | ``` 76 | 77 | The only external requirement is [pandoc]. 78 | 79 | Running 80 | ------- 81 | 82 | ```bash 83 | $ pandoc-opml [-o <output>] <input> 84 | ``` 85 | 86 | If `-o/--output` is not provided, the output is written to stdout. 87 | 88 | Docs 89 | ---- 90 | 91 | pandoc-opml makes every effort to follow the [OPML v2.0][OPML] 92 | specification as closely as possible. 93 | 94 | However, Markdown is a rich format so some additional information 95 | about the source elements is stored as attributes. 96 | 97 | A good OPML parser should ignore anything it doesn't understand, so 98 | none of this should be a problem. Please file a bug report if any 99 | problems do arise. 100 | 101 | ### Headers 102 | 103 | The OPML of Markdown [headline elements][headlines] includes two 104 | attributes: `level` and `name`. 105 | 106 | The `level` attribute is the HTML level for the given header 107 | element. For example, `1` for h1, `2` for h2, etc. 108 | 109 | The `name` attribute is the unique identifier assigned according to 110 | [these rules][unique ids]. 
111 | 112 | [headlines]: http://johnmacfarlane.net/pandoc/README.html#headers 113 | [unique ids]: http://johnmacfarlane.net/pandoc/README.html#extension-auto_identifiers 114 | 115 | To override the `name` attribute, explicitly set the unique identifier: 116 | 117 | ```markdown 118 | # Hello World {#custom-id} 119 | ``` 120 | 121 | ```xml 122 | 123 | ``` 124 | 125 | ### Attributes 126 | 127 | If you specify [header attributes], pandoc-opml will include them in 128 | the resulting OPML: 129 | 130 | ```markdown 131 | # Hello World {#custom-id .draft category=demo} 132 | ``` 133 | 134 | ```xml 135 | 136 | ``` 137 | 138 | Class header attributes have the value of "true" while key/value 139 | header attributes are included as-is. 140 | 141 | Later attributes overwrite earlier ones. For example: 142 | 143 | ```markdown 144 | # Hello World {#unique-id .name name=example} 145 | ``` 146 | 147 | First, `name=unique-id`. Then, the class attribute sets 148 | `name=true`. Then, the key/value attribute sets `name=example`. In the 149 | resulting OPML, `name` will equal `example`. 150 | 151 | [header attributes]: http://johnmacfarlane.net/pandoc/README.html#extension-header_attributes 152 | 153 | ### Lists 154 | 155 | [Unordered list items][unordered lists] have a `list` attribute set to 156 | `unordered`. 157 | 158 | [Ordered list items][ordered lists] have a `list` attribute set to 159 | `ordered` and an `ordinal` attribute set to the ordinal number of the 160 | list item. 
161 | 162 | Example: 163 | 164 | ```markdown 165 | - Hello World 166 | - This is a test 167 | 168 | 1) Hello World 169 | 2) This is a test 170 | ``` 171 | 172 | ```xml 173 | 174 | 175 | 176 | 177 | 178 | ``` 179 | 180 | [list elements]: http://johnmacfarlane.net/pandoc/README.html#lists 181 | [unordered lists]: http://johnmacfarlane.net/pandoc/README.html#bullet-lists 182 | [ordered lists]: http://johnmacfarlane.net/pandoc/README.html#ordered-lists 183 | 184 | ### Metadata 185 | 186 | If `description` is included in the [metadata], it is included as a 187 | `<description>` element in the OPML's `<head>`. 188 | 189 | If `date` is included, it is included as the `<dateCreated>` element 190 | in the OPML's `<head>`. 191 | 192 | The `<dateModified>` element is the timestamp of when `pandoc-opml` 193 | created the OPML. 194 | 195 | All the other metadata (e.g., title, author, email) maps to the 196 | standard OPML `<head>` elements. 197 | 198 | If more than one author is provided, a single `<ownerName>` element is 199 | created with the names comma delimited. 200 | 201 | [metadata]: http://johnmacfarlane.net/pandoc/README.html#metadata-blocks 202 | 203 | ### HTML 204 | 205 | If the source Markdown contains formatting, the respective OPML `text` 206 | attribute will contain encoded HTML markup: 207 | 208 | ```markdown 209 | This paragraph contains *emphasis* and **strong** formatting along 210 | with `code` and H~2~O (subscripts) and 2^10^ (superscripts) and last, 211 | but not least, ~~deleted text~~. 212 | ``` 213 | 214 | ```xml 215 | 216 | ``` 217 | 218 | Background 219 | ---------- 220 | 221 | I've long been interested in OPML as a file format, but I was always 222 | more comfortable using a text editor than any of the available OPML 223 | editors. 224 | 225 | So I started toying with the idea of using a regular text editor and 226 | exporting plain text files to OPML instead of editing OPML 227 | directly. 228 | 229 | I knew the hardest part was going to be parsing the plain text input 230 | files. 
Looking for alternatives to writing that code myself, I found 231 | pandoc and was thrilled to see it provided access to the abstract 232 | syntax tree (AST) that represented the input file's headers, 233 | paragraphs, list items, etc. Plus, by using pandoc, I could write the 234 | input files in any of the [many file formats it understands][inputs]. 235 | 236 | [inputs]: http://johnmacfarlane.net/pandoc/README.html#description 237 | -------------------------------------------------------------------------------- /examples/README.md: -------------------------------------------------------------------------------- 1 | I often need little documents to test out certain bits of pandoc-opml 2 | functionality. These are those documents. 3 | 4 | I'll probably phase these out once I have unittests in place but for 5 | now it's better than nothing. 6 | -------------------------------------------------------------------------------- /examples/blogpost.md: -------------------------------------------------------------------------------- 1 | % Eric's Blog OPML 2 | % Eric Davis 3 | % 2015-01-15 4 | 5 | # NBA {type=include url=http://files.davising.com.s3.amazonaws.com/2015/01/11/nba.opml} 6 | # Google {type=link url=http://google.com/} 7 | -------------------------------------------------------------------------------- /examples/header-attrs.md: -------------------------------------------------------------------------------- 1 | # Header 1 {type=howto domain=opml.ericdavis.org} 2 | 3 | Hello world 4 | 5 | ## Header 2 {#custom-name type=worknote} 6 | 7 | - Hello 8 | -------------------------------------------------------------------------------- /examples/headers.org: -------------------------------------------------------------------------------- 1 | #+TITLE: Hello World 2 | #+DESCRIPTION: Test headers in the OPML head element 3 | #+AUTHOR: Eric Davis; Davis Eric 4 | #+EMAIL: edavis@eresources.com 5 | #+DATE: 2014-01-01 6 | 7 | - Item 1 8 | - Item 2 9 | - Item 2.1 10 | - Item 3 
11 | -------------------------------------------------------------------------------- /examples/html.md: -------------------------------------------------------------------------------- 1 | Emph: *emph* 2 | 3 | Strong: **strong** 4 | 5 | Code: `code 123` 6 | 7 | Sub and super: H~2~O is a liquid. 2^10^ is 1024. 8 | 9 | Strikeout: This is ~~deleted text~~. 10 | -------------------------------------------------------------------------------- /examples/link.md: -------------------------------------------------------------------------------- 1 | This is a paragraph with a link in it: [Hello World] 2 | 3 | This is an inline link <http://example.com/>. 4 | 5 | Same as above, but implicit: http://example.com/ 6 | 7 | (Have to add brackets or nothing happens, it looks like). 8 | 9 | And with a [title](http://example.com/ "example title"). 10 | 11 | [Hello World]: http://example.com/ 12 | -------------------------------------------------------------------------------- /examples/mult-authors.md: -------------------------------------------------------------------------------- 1 | % Testing multiple authors 2 | % Eric Davis 3 | 4 | Hello World! 
5 | -------------------------------------------------------------------------------- /examples/nav.md: -------------------------------------------------------------------------------- 1 | % Hello World 2 | % Eric Davis; Davis Eric 3 | % 2015-01-01 4 | 5 | # Header 1 6 | 7 | Paragraph below 1 in markdown 8 | 9 | ## Header 2 10 | 11 | Author info doesn't work for head in markdown 12 | -------------------------------------------------------------------------------- /examples/nav.org: -------------------------------------------------------------------------------- 1 | #+DESCRIPTION: See how pandoc-opml deals with org-mode navigation headers 2 | 3 | * Hello World 4 | Paragraph hw 5 | 6 | - Testing 7 | - Testing 123 8 | - Testing 321 9 | 10 | Last paragraph 11 | 12 | ** World Hello 13 | Paragraph 14 | 15 | - Item 1 16 | - Item 2 17 | - Item 2.1 18 | - Item 3 19 | 20 | Final graf 21 | -------------------------------------------------------------------------------- /examples/nba2.org: -------------------------------------------------------------------------------- 1 | #+TITLE: NBA Teams 2 | #+AUTHOR: Eric Davis 3 | #+EMAIL: eric@davising.com 4 | #+DESCRIPTION: List of all NBA teams 5 | 6 | * Eastern Conference 7 | ** Atlantic Division 8 | - Boston Celtics 9 | - Brooklyn Nets 10 | - New York Knicks 11 | - Philadelphia 76ers 12 | - Toronto Raptors 13 | ** Central Division 14 | - Chicago Bulls 15 | - Cleveland Cavaliers 16 | - Detroit Pistons 17 | - Indiana Pacers 18 | - Milwaukee Bucks 19 | ** Southeast Division 20 | - Atlanta Hawks 21 | - Charlotte Bobcats 22 | - Miami Heat 23 | - Orlando Magic 24 | - Washington Wizards 25 | * Western Conference 26 | ** Southwest Division 27 | - Dallas Mavericks 28 | - Houston Rockets 29 | - Memphis Grizzlies 30 | - New Orleans Pelicans 31 | - San Antonio Spurs 32 | ** Northwest Division 33 | - Denver Nuggets 34 | - Minnesota Timberwolves 35 | - Portland Trail Blazers 36 | - Oklahoma City Thunder 37 | - Utah Jazz 38 | ** Pacific Division 39 
| - Golden State Warriors 40 | - Los Angeles Clippers 41 | - Los Angeles Lakers 42 | - Phoenix Suns 43 | - Sacramento Kings 44 | -------------------------------------------------------------------------------- /examples/no-headers.org: -------------------------------------------------------------------------------- 1 | - Hello World 2 | -------------------------------------------------------------------------------- /examples/ol.md: -------------------------------------------------------------------------------- 1 | 1) Hello 2 | 2) World 3 | 1) Eric 4 | 2) James 5 | 3) Davis 6 | 3) Testing 7 | -------------------------------------------------------------------------------- /examples/paragraphs.md: -------------------------------------------------------------------------------- 1 | - Item 1 2 | 3 | - Test 4 | 5 | - Test 2 6 | 7 | - Test 3 8 | -------------------------------------------------------------------------------- /examples/paragraphs.org: -------------------------------------------------------------------------------- 1 | * Header 2 | 3 | Paragraph 1 4 | 5 | - Item 1 6 | - Item 1.1 7 | - Item 2 8 | - Item 3 9 | 10 | Paragraph 2 11 | -------------------------------------------------------------------------------- /examples/plain-list.org: -------------------------------------------------------------------------------- 1 | - NBA 2 | - Eastern Conference 3 | - Atlantic Division 4 | - Boston Celtics 5 | - Brooklyn Nets 6 | - New York Knicks 7 | - Philadelphia 76ers 8 | - Toronto Raptors 9 | - Central Division 10 | - Chicago Bulls 11 | - Cleveland Cavaliers 12 | - Detroit Pistons 13 | - Indiana Pacers 14 | - Milwaukee Bucks 15 | - Southeast Division 16 | - Atlanta Hawks 17 | - Charlotte Bobcats 18 | - Miami Heat 19 | - Orlando Magic 20 | - Washington Wizards 21 | - Western Conference 22 | - Southwest Division 23 | - Dallas Mavericks 24 | - Houston Rockets 25 | - Memphis Grizzlies 26 | - New Orleans Pelicans 27 | - San Antonio Spurs 28 | - Northwest Division 29 | 
- Denver Nuggets 30 | - Minnesota Timberwolves 31 | - Portland Trail Blazers 32 | - Oklahoma City Thunder 33 | - Utah Jazz 34 | - Pacific Division 35 | - Golden State Warriors 36 | - Los Angeles Clippers 37 | - Los Angeles Lakers 38 | - Phoenix Suns 39 | - Sacramento Kings 40 | -------------------------------------------------------------------------------- /examples/sample.md: -------------------------------------------------------------------------------- 1 | - Item 1 2 | - Item 1.1 3 | - Item 1.2 4 | - Item 2 5 | - Item 2.1 6 | - Item 2.1.1 7 | - Item 2.2 8 | - Item 3 9 | - Item 3.1 10 | - Item 3.2 11 | - Item 3.3 12 | - Item 3.3.1 13 | -------------------------------------------------------------------------------- /examples/test.md: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/edavis/pandoc-opml/94eb14b4df92fffa72e41e728dc9f4c599dbb7c3/examples/test.md -------------------------------------------------------------------------------- /examples/worknotes.md: -------------------------------------------------------------------------------- 1 | # 2015 2 | ## January 2015 3 | ### January 11, 2015 4 | 5 | Hello world! 
6 | -------------------------------------------------------------------------------- /examples/worknotes.org: -------------------------------------------------------------------------------- 1 | * 2015 2 | ** January 2015 3 | *** January 11, 2015 4 | 5 | Body text, yo 6 | -------------------------------------------------------------------------------- /pandoc_opml/__init__.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import json 3 | import time 4 | import itertools 5 | import subprocess 6 | from datetime import datetime 7 | from xml.etree import ElementTree as ET 8 | 9 | __version__ = '0.1' 10 | 11 | def gmt(s=None): 12 | utc = time.gmtime(s) 13 | d = datetime(*utc[:6]) 14 | return d.strftime('%a, %d %b %Y %H:%M:%S') + ' GMT' 15 | 16 | class Node(object): 17 | def __init__(self, text, attr=None): 18 | self.text = text 19 | self.attr = attr or {} 20 | self.children = [] 21 | 22 | def append(self, node): 23 | self.children.append(node) 24 | 25 | class PandocOPML(object): 26 | def __init__(self, json_ast=None): 27 | if json_ast is None: 28 | self.head, self.body = json.loads(sys.stdin.read()) 29 | else: 30 | self.head, self.body = json.loads(json_ast) 31 | self.head = self.head['unMeta'] 32 | self.depth = 0 33 | self.el = None 34 | self.nodes = self.parse() 35 | 36 | def parse(self): 37 | nodes = [] 38 | 39 | def add_node(node): 40 | try: 41 | nodes[self.depth].append(node) 42 | except IndexError: 43 | nodes.append([node]) 44 | 45 | if self.depth > 0: 46 | parent = nodes[self.depth - 1][-1] 47 | parent.append(node) 48 | 49 | def inner(content): 50 | for obj in content: 51 | if obj.get('t') in {'Para', 'Plain'}: 52 | node = Node(self.extract(obj.get('c'))) 53 | add_node(node) 54 | self.el = obj.get('t') 55 | 56 | elif obj.get('t') == 'OrderedList': 57 | info, contents = obj.get('c') 58 | counter = itertools.count(info[0]) 59 | if self.el in {'Header', 'Para'} or self.el is None: 60 | for element in contents: 61 
| inner(element) 62 | n = nodes[self.depth][-1] # most recently added node 63 | n.attr.update({ 64 | 'ordinal': str(next(counter)), 65 | 'list': 'ordered', 66 | }) 67 | else: 68 | self.depth += 1 69 | for element in contents: 70 | inner(element) 71 | n = nodes[self.depth][-1] 72 | n.attr.update({ 73 | 'ordinal': str(next(counter)), 74 | 'list': 'ordered', 75 | }) 76 | self.depth -= 1 77 | 78 | elif obj.get('t') == 'BulletList': 79 | if self.el in {'Header', 'Para'} or self.el is None: 80 | # Don't increase the depth when a BulletList 81 | # follows a Header or Para object. 82 | # 83 | # If the last object was Header, it has 84 | # already incremented the depth. 85 | for element in obj.get('c'): 86 | inner(element) 87 | n = nodes[self.depth][-1] 88 | n.attr['list'] = 'unordered' 89 | else: 90 | # But do increase the depth when a BulletList 91 | # follows anything else. 92 | # 93 | # This makes nested BulletLists work. 94 | self.depth += 1 95 | for element in obj.get('c'): 96 | inner(element) 97 | n = nodes[self.depth][-1] 98 | n.attr['list'] = 'unordered' 99 | self.depth -= 1 100 | 101 | elif obj.get('t') == 'Header': 102 | level, attr, content = obj.get('c') 103 | outline_attr = self.extract_header_attributes(attr) 104 | outline_attr['level'] = str(level) 105 | node = Node(self.extract(content), outline_attr) 106 | self.depth = level - 1 107 | 108 | add_node(node) 109 | 110 | # the next elements are children of this header 111 | self.depth += 1 112 | self.el = 'Header' 113 | 114 | inner(self.body) 115 | return nodes 116 | 117 | def write(self, output): 118 | def process(parent, node): 119 | for child in node.children: 120 | params = {'text': child.text} 121 | params.update(child.attr) 122 | el = ET.SubElement(parent, 'outline', **params) 123 | process(el, child) 124 | 125 | root = ET.Element('opml', version='2.0') 126 | head = ET.SubElement(root, 'head') 127 | body = ET.SubElement(root, 'body') 128 | now = gmt() 129 | 130 | def header(key, value): 131 | 
ET.SubElement(head, key).text = value 132 | 133 | if 'title' in self.head: 134 | header('title', self.extract(self.head['title']['c'])) 135 | 136 | if 'description' in self.head: 137 | header('description', self.extract(self.head['description']['c'])) 138 | 139 | if 'author' in self.head: 140 | # Markdown returns a MetaList of MetaInlines while org-mode returns MetaInlines. 141 | authors = [] 142 | if self.head['author'].get('t') == 'MetaList': 143 | for author in self.head['author']['c']: 144 | authors.append(self.extract(author['c'])) 145 | elif self.head['author'].get('t') == 'MetaInlines': 146 | authors = [self.extract(self.head['author']['c'])] 147 | header('ownerName', ', '.join(authors)) 148 | 149 | if 'email' in self.head: 150 | header('ownerEmail', self.extract(self.head['email']['c'])) 151 | 152 | if 'date' in self.head: 153 | header('dateCreated', self.extract(self.head['date']['c'])) 154 | 155 | header('dateModified', now) 156 | header('generator', 'https://github.com/edavis/pandoc-opml') 157 | header('docs', 'https://github.com/edavis/pandoc-opml#docs') 158 | 159 | generated = ET.Comment(' OPML generated by pandoc-opml v%s on %s ' % (__version__, now)) 160 | root.insert(0, generated) 161 | 162 | for summit in self.nodes.pop(0): 163 | params = {'text': summit.text} 164 | params.update(summit.attr) 165 | el = ET.SubElement(body, 'outline', **params) 166 | process(el, summit) 167 | 168 | content = ET.ElementTree(root) 169 | content.write( 170 | open(output, 'wb') if output else sys.stdout, 171 | encoding = 'UTF-8', 172 | xml_declaration = True, 173 | ) 174 | 175 | def extract_header_attributes(self, attr): 176 | outline_attr = {} 177 | name, args, kwargs = attr 178 | if name: 179 | outline_attr['name'] = name 180 | for arg in args: 181 | outline_attr[arg] = 'true' 182 | outline_attr.update(dict(kwargs)) 183 | return outline_attr 184 | 185 | def extract(self, contents): 186 | ret = [] 187 | html_map = { 188 | 'Emph': 'em', 189 | 'Strong': 'strong', 190 | 
'Subscript': 'sub', 191 | 'Superscript': 'sup', 192 | 'Strikeout': 'del', 193 | } 194 | 195 | for obj in contents: 196 | if obj.get('t') == 'Str': 197 | ret.append(obj.get('c')) 198 | elif obj.get('t') == 'Space': 199 | ret.append(' ') 200 | elif obj.get('t') == 'Link': 201 | content, (link_url, link_title) = obj.get('c') 202 | text = self.extract(content) 203 | if link_title: 204 | ret.append(r'<a href="%s" title="%s">%s</a>' % (link_url, link_title, text)) 205 | else: 206 | ret.append(r'<a href="%s">%s</a>' % (link_url, text)) 207 | elif obj.get('t') in html_map: 208 | tag = html_map[obj.get('t')] 209 | ret.append( 210 | r'<%s>%s</%s>' % (tag, self.extract(obj.get('c')), tag) 211 | ) 212 | elif obj.get('t') == 'Code': 213 | (_, code) = obj.get('c') 214 | ret.append(r'<code>%s</code>' % code) 215 | 216 | return ''.join(ret) 217 | 218 | def main(): 219 | import argparse 220 | parser = argparse.ArgumentParser() 221 | parser.add_argument('-o', '--output') 222 | parser.add_argument('input') 223 | args = parser.parse_args() 224 | 225 | json_ast = subprocess.check_output(['pandoc', '-t', 'json', args.input]) 226 | 227 | p = PandocOPML(json_ast) 228 | p.write(args.output) 229 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | from setuptools import setup, find_packages 2 | from pandoc_opml import __version__ 3 | 4 | setup( 5 | name = 'pandoc-opml', 6 | version = __version__, 7 | packages = find_packages(), 8 | author = 'Eric Davis', 9 | author_email = 'eric@davising.com', 10 | url = 'https://github.com/edavis/pandoc-opml', 11 | entry_points = { 12 | 'console_scripts': [ 13 | 'pandoc-opml = pandoc_opml:main', 14 | ], 15 | }, 16 | ) 17 | --------------------------------------------------------------------------------
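The README's motivating use case (grabbing all level 1 and level 2 headers for a table of contents) could look roughly like the sketch below. It assumes OPML shaped the way the Docs section describes, with header outlines carrying `level` and `name` attributes. The sample document and the `table_of_contents` helper are hypothetical and not part of pandoc-opml; only the standard library's `xml.etree.ElementTree` is used.

```python
from xml.etree import ElementTree as ET

# Hypothetical OPML for illustration only. Header outlines carry `level`
# and `name`; paragraph outlines carry neither.
SAMPLE_OPML = """<opml version="2.0">
  <head><title>Demo Document</title></head>
  <body>
    <outline text="Intro" level="1" name="intro">
      <outline text="A paragraph under Intro."/>
      <outline text="Details" level="2" name="details">
        <outline text="Fine print" level="3" name="fine-print"/>
      </outline>
    </outline>
    <outline text="Usage" level="1" name="usage"/>
  </body>
</opml>"""


def table_of_contents(opml_text, max_level=2):
    """Return (level, name, text) tuples for headers up to max_level."""
    root = ET.fromstring(opml_text)
    toc = []
    # iter() walks the tree in document order, so headers come back
    # in the order they appear in the source document.
    for outline in root.iter('outline'):
        level = outline.get('level')
        # Outlines without a `level` attribute are not headers; skip them.
        if level is not None and int(level) <= max_level:
            toc.append((int(level), outline.get('name'), outline.get('text')))
    return toc


# Print an indented table of contents, one header per line.
for level, name, text in table_of_contents(SAMPLE_OPML):
    print('%s- %s (#%s)' % ('  ' * (level - 1), text, name))
```

Filtering on the `level` attribute rather than on nesting depth keeps the sketch independent of how deeply a header sits in the outline. A header that sets its own `level` key/value attribute would overwrite the numeric value, so real input may need extra validation.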