├── .gitignore
├── LICENSE
├── README.md
├── hfeed2atom
    ├── __about__.py
    ├── __init__.py
    ├── feed_parser.py
    ├── hfeed2atom.py
    └── templates.py
├── requirements.txt
└── setup.py

/.gitignore:
--------------------------------------------------------------------------------
## important ignores here

## random ignores here

# ignore compiled files and eggs
*.pyc
*.egg-info

# ignore weird cache files

*~
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 Kartik Prabhu

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
hfeed2atom
===========

Convert h-feed pages to Atom 1.0 XML format for traditional feed readers. You can try it out at https://kartikprabhu.com/hfeed2atom

Installation
------------

To install hfeed2atom use pip as follows:

```
pip install git+https://github.com/kartikprabhu/hfeed2atom.git --process-dependency-links
```

This will install hfeed2atom with its dependencies from PyPI and mf2py from the experimental repo https://github.com/kartikprabhu/mf2py/tree/experimental.

Usage
-----

hfeed2atom takes as arguments one or more of the following:
* `doc`: a string, a Python File object or a BeautifulSoup document containing the contents of an HTML page
* `url`: the URL for the page to be parsed. It is recommended to always send a URL argument as it is used to convert all other URLs in the document to absolute URLs.
* `atom_url`: the URL of the page with the Atom file.
* `hfeed`: a Python dictionary of the microformats h-feed object. Use this if the document has already been parsed for microformats.

hfeed2atom returns the following as a tuple:
* Atom format of the h-feed, or None if there was an error.
* A string message: the reason for the error, or `up and Atom!` on success.


The easiest way to use hfeed2atom in your own Python code is to parse the feed at a URL such as `http://example.com`:

```
from hfeed2atom import hfeed2atom

atom, message = hfeed2atom(url = 'http://example.com')
```
With this code, hfeed2atom will do a GET request to `http://example.com`, find the first h-feed and return the Atom as a string.
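
Because the result is a tuple, it can help to check the first element before using it. The sketch below wraps that check in a hypothetical helper, `atom_or_raise`, which is not part of hfeed2atom. It takes the converter as an argument so that it can be tried without network access; `fake_convert` is a stand-in used only for illustration.

```python
def atom_or_raise(convert, **kwargs):
    """Call a converter such as hfeed2atom and return the Atom string.

    `convert` is assumed to return an (atom, message) tuple, as described
    above. A ValueError carrying the message is raised when atom is None.
    """
    atom, message = convert(**kwargs)
    if atom is None:
        raise ValueError('no Atom feed produced: %s' % message)
    return atom


# Stand-in converter for illustration only. In real use, pass hfeed2atom:
#     from hfeed2atom import hfeed2atom
#     atom = atom_or_raise(hfeed2atom, url='http://example.com')
def fake_convert(url=None):
    if url:
        return '<feed></feed>', 'up and Atom!'
    return None, 'h-feed not found'


print(atom_or_raise(fake_convert, url='http://example.com'))
```

Raising an exception is only one option; code that prefers to log the message and carry on can keep unpacking the tuple directly.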

If you already have the `contents` of the URL (by doing a GET request yourself, or if it is your own page on your server), then you can pass them as a `doc` argument as

```
atom, message = hfeed2atom(doc = contents, url = 'http://example.com')
```
`doc` can be a string, a Python File object or a BeautifulSoup document.

If you already have the h-feed microformats object of a page as a Python dictionary in the variable `hfeed` then use

```
atom, message = hfeed2atom(hfeed = hfeed, url = 'http://example.com')
```
Note, in this case hfeed2atom assumes that all the required properties for Atom are already in the `hfeed` variable and *will not* attempt to generate any fallback properties.

Features
--------

* Finds the first h-feed element to generate the Atom feed; if no h-feed is found, defaults to using the top-level h-entries for the feed.
* Generates fallbacks for required Atom properties of the feed. The fallbacks, in order, are:
    - title : h-feed `name` property else, `<title>` element of the page else, `Feed for URL`.
    - id : h-feed `uid` or `url` property else, URL argument given.
    - updated date : h-feed `updated` or `published` property else, `updated` or `published` property of the latest entry.
* Generates fallback categories for the h-feed from the `<meta name='keywords'>` element of the page.
* Generates fallback for required Atom properties of each entry in the h-feed. The fallbacks, in order, are:
    - title : h-entry `name` property if it is not same as `content>value` else, `content` property truncated to 50 characters else, `uid` or `url` of the entry.
    - id : h-entry `uid` property or `url` property else, error and skips that entry.
    - updated date : h-entry `updated` or `published` property else, error and skips that entry.

To Do
-----
* Author discovery if the h-feed does not have an author property. Note: Atom (RFC 4287) requires an `<author>` on the feed unless every entry carries its own.

Go forth
--------

Now [use this yourself](https://github.com/kartikprabhu/hfeed2atom) and [give feedback](https://github.com/kartikprabhu/hfeed2atom/issues).
--------------------------------------------------------------------------------
/hfeed2atom/__about__.py:
--------------------------------------------------------------------------------
# file containing common data about the code

NAME = 'hfeed2atom'

SUMMARY = 'Converter from h-feed microformats to Atom 1.0'

VERSION = (0, 2, 4, "")

AUTHOR = {'name' : 'Kartik Prabhu', 'email' : 'me@kartikprabhu.com'}

COPYRIGHT = 'Copyright (c) by ' + AUTHOR['name']

LICENSE = 'MIT'

URL = {'self' : 'https://kartikprabhu.com/hfeed2atom', 'github' : 'https://github.com/kartikprabhu/hfeed2atom'}
--------------------------------------------------------------------------------
/hfeed2atom/__init__.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from . import __about__

__author__ = __about__.AUTHOR['name']
__contact__ = __about__.AUTHOR['email']
__copyright__ = __about__.COPYRIGHT
__license__ = __about__.LICENSE
__version__ = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:])

# explicit relative import so the submodule resolves on Python 3 as well
from .hfeed2atom import hfeed2atom, hentry2atom
--------------------------------------------------------------------------------
/hfeed2atom/feed_parser.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup

import mf2py
import mf2util



def feed_parser(doc=None, url=None):
    """
    parser to get hfeed
    """

    if doc:
        if not isinstance(doc, BeautifulSoup):
            doc = BeautifulSoup(doc)

    if url:
        if doc is None:
            data = requests.get(url)

            # check for character encodings and use 'correct' data
            if 'charset' in data.headers.get('content-type', ''):
                doc = BeautifulSoup(data.text)
            else:
                doc = BeautifulSoup(data.content)

    # nothing to parse: neither a usable document nor a URL was given
    if not doc:
        return None

    # find first h-feed object if any or construct it

    hfeed = doc.find(class_="h-feed")

    if hfeed:
        hfeed = mf2py.Parser(hfeed, url).to_dict()['items'][0]
    else:
        hfeed = {'type': ['h-feed'], 'properties': {}, 'children': []}

        # parse whole document for microformats
        parsed = mf2py.Parser(doc, url).to_dict()

        # construct h-entries from top-level items
        hfeed['children'] = [x for x in parsed['items'] if 'h-entry' in x.get('type', [])]


    # construct fall back properties for hfeed

    props = hfeed['properties']

    # if no name or name is the content value, construct name from title or default from URL
    name = props.get('name')
    if name:
        name = name[0]

    content = props.get('content')
    if content:
        content = content[0]
        if isinstance(content, dict):
            content = content.get('value')

    if not name or not mf2util.is_name_a_title(name, content):
        feed_title = doc.find('title')
        if feed_title:
            hfeed['properties']['name'] = [feed_title.get_text()]
        elif url:
            hfeed['properties']['name'] = ['Feed for ' + url]

    # construct author from rep_hcard or meta-author

    # construct uid from url
    if 'uid' not in props and 'url' not in props:
        if url:
            hfeed['properties']['uid'] = [url]

    # construct categories from meta-keywords
    if 'category' not in props:
        keywords = doc.find('meta', attrs= {'name': 'keywords', 'content': True})
        if keywords:
            hfeed['properties']['category'] = keywords.get('content', '').split(',')


    return hfeed
--------------------------------------------------------------------------------
/hfeed2atom/hfeed2atom.py:
--------------------------------------------------------------------------------
from xml.sax.saxutils import escape

from . import templates, feed_parser

import mf2util

def _updated_or_published(mf):
    """
    get the updated date or the published date

    Args:
        mf: python dictionary of some microformats object

    Return: string containing the updated date if it exists, or the published date if it exists or None

    """

    props = mf['properties']

    # construct updated/published date of mf
    if 'updated' in props:
        return props['updated'][0]
    elif 'published' in props:
        return props['published'][0]
    else:
        return None

def _get_id(mf, url=None):
    """
    get the uid of the mf object

    Args:
        mf: python dictionary of some microformats object
        url: optional URL to use in case no uid or url in mf

    Return: string containing the id or None
    """

    props = mf['properties']

    if 'uid' in props:
        return props['uid'][0]
    elif 'url' in props:
        return props['url'][0]
    else:
        # fall back to the optional URL argument (None when not given)
        return url

def _response_context(mf):
    """
    get the response context of the mf object

    Args:
        mf: python dictionary of some microformats object

    Return: string containing the HTML reconstruction of the response context
    """

    # unfinished: this function currently always returns None

    props = mf['properties']

    # get replies
    responses = props.get('in-reply-to')
    if responses:
        response_type = 'in reply to'
        for response in responses:
            response = response[0]
            if isinstance(response, dict):
                # the following is not correct
                response = response.get('url')

            if response:
                # make and return string with type and list of URLs
                response = None

    return None

def hentry2atom(entry_mf):
    """
    convert microformats of a h-entry object to Atom 1.0

    Args:
        entry_mf: python dictionary of parsed microformats of a h-entry

    Return: an Atom 1.0 XML version of the microformats or None if error, and error message
    """

    # generate fallbacks or errors for missing required properties

    if 'properties' in entry_mf:
        props = entry_mf['properties']
    else:
        return None, 'properties of entry not found.'

    entry = {'title': '', 'subtitle': '', 'link': '', 'uid': '', 'published': '', 'updated': '', 'summary': '', 'content': '', 'categories': ''}

    ## required properties first

    # construct id of entry
    uid = _get_id(entry_mf)

    if uid:
        # construct id of entry -- required
        entry['uid'] = templates.ID.substitute(uid = escape(uid))
    else:
        return None, 'entry does not have a valid id'

    # construct title of entry -- required - add default
    # if no name or name is the content value, construct name from title or default from URL
    name = props.get('name')
    if name:
        name = name[0]

    content = props.get('content')
    if content:
        content = content[0]
        if isinstance(content, dict):
            content = content.get('value')

    if name:
        # if name is generated from content truncate
        if not mf2util.is_name_a_title(name, content):
            if len(name) > 50:
                name = name[:50] + '...'
    else:
        name = uid

    entry['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')

    # construct updated/published date of entry
    updated = _updated_or_published(entry_mf)

    # updated is -- required
    if updated:
        entry['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
    else:
        return None, 'entry does not have valid updated date'

    ## optional properties

    entry['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')

    # construct published date of entry
    if 'published' in props:
        entry['published'] = templates.DATE.substitute(date = escape(props['published'][0]), dt_type = 'published')

    # construct subtitle for entry
    if 'additional-name' in props:
        entry['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')

    # content processing
    if 'content' in props:
        if isinstance(props['content'][0], dict):
            content = props['content'][0]['html']
        else:
            content = props['content'][0]
    else:
        content = None

    if content:
        entry['content'] = templates.CONTENT.substitute(content = escape(content))

    # construct summary of entry
    if 'featured' in props:
        featured = templates.FEATURED.substitute(featured = escape(props['featured'][0]))
    else:
        featured = ''

    if 'summary' in props:
        summary = templates.POST_SUMMARY.substitute(post_summary = escape(props['summary'][0]))
    else:
        summary = ''

    # make morelink if content does not exist
    if not content:
        morelink = templates.MORELINK.substitute(url = escape(uid), name = escape(name))
    else:
        morelink = ''

    entry['summary'] = templates.SUMMARY.substitute(featured=featured, summary=summary, morelink=morelink)

    # construct category list of entry
    if 'category' in props:
        for category in props['category']:
            if isinstance(category, dict):
                if 'value' in category:
                    category = category['value']
                else:
                    continue

            entry['categories'] += templates.CATEGORY.substitute(category=escape(category))

    # construct atom of entry
    return templates.ENTRY.substitute(entry), 'up and Atom!'


def hfeed2atom(doc=None, url=None, atom_url=None, hfeed=None):
    """
    convert first h-feed object in a document to Atom 1.0

    Args:
        doc (file or string or BeautifulSoup doc): file handle, text of content
            to parse, or BeautifulSoup document to look for h-feed
        url: url of the document, used for making absolute URLs from url data, or for fetching the document
        atom_url: url of the page with the Atom file
        hfeed: python dictionary of an already parsed h-feed microformats object

    Return: an Atom 1.0 XML document version of the first h-feed in the document or None if no h-feed found, and string with reason for error
    """
    # if hfeed object given assume it is well formatted
    if hfeed:
        mf = hfeed
    else:
        # send to hfeed_parser to parse
        mf = feed_parser.feed_parser(doc, url)

        if not mf:
            return None, 'h-feed not found'

    feed = {'generator': '', 'title': '', 'subtitle': '', 'link': '', 'uid': '', 'updated': '', 'author': '', 'entries': ''}

    if 'properties' in mf:
        props = mf['properties']
    else:
        return None, 'h-feed properties not found.'

    ## required properties first

    uid = _get_id(mf) or url

    # id is -- required
    if uid:
        # construct id of feed -- required
        feed['uid'] = templates.ID.substitute(uid = escape(uid))
    else:
        return None, 'feed does not have a valid id'

    # construct title for feed -- required
    # fall back to the id when the name is missing or empty
    names = props.get('name')
    name = (names[0] if names else None) or uid

    feed['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')

    # entries
    if 'children' in mf:
        entries = [x for x in mf['children'] if 'h-entry' in x['type']]
    else:
        entries = []

    # construct updated/published date of feed.
    updated = _updated_or_published(mf)

    if not updated and entries:
        # skip entries without a date so that max() never compares None
        dates = [d for d in map(_updated_or_published, entries) if d]
        if dates:
            updated = max(dates)

    # updated is -- required
    if updated:
        feed['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
    else:
        return None, 'updated date for feed not found, and could not be constructed from entries.'

    ## optional properties

    # construct subtitle for feed
    if 'additional-name' in props:
        feed['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')

    feed['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')
    # the self link is only emitted when an atom_url was given
    feed['self'] = templates.LINK.substitute(url = escape(atom_url), rel='self') if atom_url else ''

    # construct author for feed
    if 'author' in props:
        feed['author'] = templates.AUTHOR.substitute(name = escape(props['author'][0]['properties']['name'][0]))

    # construct entries for feed
    for entry in entries:
        # construct entry template - skip entry if error
        entry_atom, message = hentry2atom(entry)
        if entry_atom:
            feed['entries'] += entry_atom

    feed['generator'] = templates.GENERATOR

    return templates.FEED.substitute(feed), 'up and Atom!'
--------------------------------------------------------------------------------
/hfeed2atom/templates.py:
--------------------------------------------------------------------------------
from string import Template
from . import __about__

GENERATOR = Template("""<generator uri="${uri}" version="${version}">${name}</generator>""").substitute(uri = __about__.URL['self'], version = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:]), name = __about__.NAME )

TITLE = Template("""<${t_type}>${title}</${t_type}>""")

LINK = Template("""<link href="${url}" rel="${rel}"></link>""")

DATE = Template("""<${dt_type}>${date}</${dt_type}>""")

ID = Template("""<id>${uid}</id>""")

AUTHOR = Template("""<author><name>${name}</name></author>""")

FEATURED = Template("""<img src="${featured}"/>""")

POST_SUMMARY = Template("""<p>${post_summary}</p>""")

MORELINK = Template("""<span>Full post: <a href="${url}">${name}</a></span>""")

SUMMARY = Template("""<summary type="html">${featured}${summary}${morelink}</summary>""")

CONTENT = Template("""<content type="html">${content}</content>""")

CATEGORY = Template("""<category term="${category}"></category>""")

ENTRY = Template("""<entry>${title}${subtitle}${link}${uid}${published}${updated}${summary}${content}
${categories}</entry>""")

FEED = Template("""<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">${generator}${title}${subtitle}${link}${self}${uid}${updated}${author}${entries}</feed>""")

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
mf2py==1.1.2
requests==2.19.1
BeautifulSoup4==4.6.0

mf2util==0.4.3
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from setuptools import setup, find_packages
import os.path

_ABOUT_ = {}
_PATH_ = os.path.dirname(__file__)

with open(os.path.join(_PATH_, 'hfeed2atom/__about__.py'))\
        as about_file:
    exec(about_file.read(), _ABOUT_)

# use requirements.txt for dependencies, skipping blank lines
with open(os.path.join(_PATH_, 'requirements.txt')) as f:
    required = [s.strip() for s in f if s.strip()]

with open(os.path.join(_PATH_, 'README.md')) as f:
    readme = f.read()

with open(os.path.join(_PATH_, 'LICENSE')) as f:
    license = f.read()

setup(
    name = _ABOUT_['NAME'],
    version = '.'.join(map(str, _ABOUT_['VERSION'][0:3])) + ''.join(_ABOUT_['VERSION'][3:]),
    description = _ABOUT_['SUMMARY'],
    long_description = readme,
    install_requires = required,
    dependency_links=[
        "https://github.com/kartikprabhu/mf2py/tarball/experimental#egg=mf2py-1.1.1"
    ],
    author = _ABOUT_['AUTHOR']['name'],
    author_email = _ABOUT_['AUTHOR']['email'],
    url = _ABOUT_['URL']['github'],
    license = license,
    packages = find_packages(exclude=('tests', 'docs'))
)
--------------------------------------------------------------------------------
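
The fallback rules for the required entry properties (id from `uid` or `url`, updated date from `updated` or `published`) can be tried in isolation with only the standard library. The following is a minimal sketch, not the package's own code: `first` and `required_entry_fields` are hypothetical names, the two templates are copied from `templates.py`, and, unlike `hentry2atom`, the sketch treats an empty property list as missing.

```python
from string import Template
from xml.sax.saxutils import escape

# same template strings as ID and DATE in hfeed2atom/templates.py
ID = Template("""<id>${uid}</id>""")
DATE = Template("""<${dt_type}>${date}</${dt_type}>""")


def first(props, *names):
    """Return the first value of the first non-empty property in `names`."""
    for name in names:
        values = props.get(name)
        if values:
            return values[0]
    return None


def required_entry_fields(entry_mf):
    """Return (xml, message) for the id and updated elements of an h-entry dict.

    Hypothetical helper: mirrors the error messages of hentry2atom but only
    builds the two required elements.
    """
    props = entry_mf.get('properties', {})

    uid = first(props, 'uid', 'url')
    if not uid:
        return None, 'entry does not have a valid id'

    updated = first(props, 'updated', 'published')
    if not updated:
        return None, 'entry does not have valid updated date'

    xml = ID.substitute(uid=escape(uid))
    xml += DATE.substitute(date=escape(updated), dt_type='updated')
    return xml, 'up and Atom!'


# illustrative h-entry with no uid and no updated property
sample = {
    'type': ['h-entry'],
    'properties': {
        'url': ['http://example.com/a?x=1&y=2'],
        'published': ['2014-01-01T00:00:00+00:00'],
    },
}
print(required_entry_fields(sample))
```

Here the `url` stands in for the missing `uid`, the `published` date stands in for `updated`, and `escape` turns the `&` in the URL into `&amp;` so the element stays well-formed XML.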