├── .gitignore
├── LICENSE
├── README.md
├── hfeed2atom
    ├── __about__.py
    ├── __init__.py
    ├── feed_parser.py
    ├── hfeed2atom.py
    └── templates.py
├── requirements.txt
└── setup.py


/.gitignore:
--------------------------------------------------------------------------------
 1 | ## important ignores here
 2 | 
 3 | ## random ignores here
 4 | 
 5 | # ignore compiled files and eggs
 6 | *.pyc
 7 | *.egg-info
 8 | 
 9 | # ignore weird cache files
10 | 
11 | *~
12 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2014 Kartik Prabhu
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
 6 | this software and associated documentation files (the "Software"), to deal in
 7 | the Software without restriction, including without limitation the rights to
 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
 9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | hfeed2atom
 2 | ===========
 3 | 
 4 | Convert h-feed pages to Atom 1.0 XML format for traditional feed readers. You can try it out at https://kartikprabhu.com/hfeed2atom
 5 | 
 6 | Installation
 7 | ------------
 8 | 
 9 | To install hfeed2atom use pip as follows:
10 | 
11 | ```
12 | pip install git+https://github.com/kartikprabhu/hfeed2atom.git --process-dependency-links
13 | 
14 | ```
15 | 
16 | This will install hfeed2atom with its dependencies from pypi and mf2py from the experimental repo https://github.com/kartikprabhu/mf2py/tree/experimental.
17 | 
18 | Usage
19 | -----
20 | 
21 | hfeed2atom takes as arguments one or more of the following:
22 | * `doc`: a string, a Python File object or a BeautifulSoup document containing the contents of an HTML page
23 | * `url`: the URL for the page to be parsed. It is recommended to always send a URL argument as it is used to convert all other URLs in the document to absolute URLs.
24 | * `atom_url`: the URL at which the generated Atom file will be served. It is used for the feed's `rel="self"` link.
25 | * `hfeed`: a Python dictionary of the microformats h-feed object. Use this if the document has already been parsed for microformats.
26 |  
27 | hfeed2atom returns the following as a tuple:
28 | * Atom format of the h-feed, or None if there was an error.
29 | * A string message of the error if any
30 | 
31 | 
32 | The easiest way to use hfeed2atom in your own Python code is to parse the feed at a URL such as `http://example.com`:
33 | 
34 | ```
35 | from hfeed2atom import hfeed2atom
36 | 
37 | atom, message = hfeed2atom(url = 'http://example.com')
38 | ```
39 | With this code, hfeed2atom will do a GET request to `http://example.com`, find the first h-feed and return the Atom as a string.
40 | 
41 | If you already have the `contents` of the URL (by doing a GET request yourself, or if it is your own page on your server), then you can pass them as the `doc` argument:
42 | 
43 | ```
44 | atom, message = hfeed2atom(doc = contents, url = 'http://example.com')
45 | ```
46 | `doc` can be a string, a Python File object or a BeautifulSoup document.
47 | 
48 | If you already have the h-feed microformats object of a page as a Python dictionary in the variable `hfeed`, then use:
49 | 
50 | ```
51 | atom, message = hfeed2atom(hfeed = hfeed, url = 'http://example.com')
52 | ```
53 | Note, in this case hfeed2atom assumes that all the required properties for Atom are already in the `hfeed` variable and *will not* attempt to generate any fallback properties.
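Because no fallbacks are generated for a pre-parsed `hfeed`, it can help to check the dictionary before passing it in. The following is only a sketch: `missing_required` is a hypothetical helper, not part of hfeed2atom, and it assumes the required Atom fields are the title, id and updated date listed under Features, and that no `url` argument is supplied as a backup id.

```python
# Hypothetical helper, not part of hfeed2atom: reports which required
# Atom fields cannot be built from a pre-parsed h-feed dictionary.

def missing_required(hfeed):
    """Return the names of required Atom fields that hfeed cannot supply."""
    props = hfeed.get('properties', {})
    entries = [c for c in hfeed.get('children', [])
               if 'h-entry' in c.get('type', [])]

    missing = []
    # id comes from the uid or url property
    if not (props.get('uid') or props.get('url')):
        missing.append('id')
    # title comes from the name property
    if not props.get('name'):
        missing.append('title')
    # updated comes from the feed itself or from any of its entries
    feed_date = props.get('updated') or props.get('published')
    entry_date = any(
        e.get('properties', {}).get('updated')
        or e.get('properties', {}).get('published')
        for e in entries
    )
    if not (feed_date or entry_date):
        missing.append('updated')
    return missing


# Example data in the shape a microformats2 parser produces:
# every property value is a list.
hfeed = {
    'type': ['h-feed'],
    'properties': {'name': ['Example feed'], 'url': ['http://example.com']},
    'children': [
        {'type': ['h-entry'],
         'properties': {'name': ['First post'],
                        'url': ['http://example.com/1'],
                        'published': ['2014-01-01T00:00:00+00:00']}},
    ],
}

print(missing_required(hfeed))
```

An empty list suggests the dictionary carries everything the converter needs; any names returned point at properties to add before calling `hfeed2atom(hfeed = hfeed, ...)`.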
54 | 
55 | Features
56 | --------
57 | 
58 | * Finds the first h-feed element to generate the Atom feed; if no h-feed is found, defaults to using the top-level h-entries for the feed.
59 | * Generates fallbacks for required Atom properties of the feed. The fallbacks, in order, are:
60 |   - title : h-feed `name` property else, `<title>` element of the page else, `Feed for URL`.
61 |   - id : h-feed `uid` or `url` property else, URL argument given.
62 |   - updated date : h-feed `updated` or `published` property else, `updated` or `published` property of the latest entry.
63 | * Generates fallback categories for the h-feed from the `<meta name='keywords'>` element of the page.
64 | * Generates fallback for required Atom properties of each entry in the h-feed. The fallbacks, in order, are:
65 |   - title : h-entry `name` property if it is not the same as `content>value` else, `content` property truncated to 50 characters else, `uid` or `url` of the entry.
66 |   - id : h-entry `uid` property or `url` property else, error and skips that entry.
67 |   - updated date : h-entry `updated` or `published` property else, error and skips that entry.
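The entry title fallback order above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the library's code: `entry_title` is a hypothetical name, and a plain string comparison stands in for the `mf2util.is_name_a_title` check that the package itself uses.

```python
# Hypothetical sketch of the entry title fallback; not the package's own code.

def entry_title(props, limit=50):
    """Pick a title from h-entry properties, following the listed fallbacks."""
    name = (props.get('name') or [None])[0]
    content = (props.get('content') or [None])[0]
    # e-content is parsed to a dict with 'value' and 'html' keys
    if isinstance(content, dict):
        content = content.get('value')

    # 1. a name that is not just a copy of the content
    #    (assumption: simple equality instead of mf2util.is_name_a_title)
    if name and (content is None or name.strip() != content.strip()):
        return name
    # 2. the content, truncated to `limit` characters
    if content:
        return content[:limit]
    # 3. the uid or url of the entry
    ident = props.get('uid') or props.get('url')
    return ident[0] if ident else None


print(entry_title({'name': ['Hello'],
                   'content': [{'value': 'Body text', 'html': '<p>Body text</p>'}]}))
print(entry_title({'url': ['http://example.com/1']}))
```

The first call keeps the explicit name, while the second has neither name nor content and so falls through to the entry URL.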
68 | 
69 | To Do
70 | -----
71 | * Author discovery if the h-feed does not have an author property. Note `<author>` is an optional tag in Atom!
72 | 
73 | Go forth
74 | --------
75 | 
76 | Now [use this yourself](https://github.com/kartikprabhu/hfeed2atom) and [give feedback](https://github.com/kartikprabhu/hfeed2atom/issues).
77 | 


--------------------------------------------------------------------------------
/hfeed2atom/__about__.py:
--------------------------------------------------------------------------------
 1 | # file containing common data about the code
 2 | 
 3 | NAME = 'hfeed2atom'
 4 | 
 5 | SUMMARY = 'Converter from h-feed microformats to Atom 1.0'
 6 | 
 7 | VERSION = (0, 2, 4, "")
 8 | 
 9 | AUTHOR = {'name' : 'Kartik Prabhu', 'email' : 'me@kartikprabhu.com'}
10 | 
11 | COPYRIGHT = 'Copyright (c) by ' + AUTHOR['name']
12 | 
13 | LICENSE = 'MIT'
14 | 
15 | URL = {'self' : 'https://kartikprabhu.com/hfeed2atom', 'github' : 'https://github.com/kartikprabhu/hfeed2atom'}
16 | 


--------------------------------------------------------------------------------
/hfeed2atom/__init__.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | 
 3 | from . import __about__
 4 | 
 5 | __author__    = __about__.AUTHOR['name']
 6 | __contact__   = __about__.AUTHOR['email']
 7 | __copyright__ = __about__.COPYRIGHT
 8 | __license__   = __about__.LICENSE
 9 | __version__   = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:])
10 | 
11 | from .hfeed2atom import hfeed2atom, hentry2atom
12 | 


--------------------------------------------------------------------------------
/hfeed2atom/feed_parser.py:
--------------------------------------------------------------------------------
 1 | import requests
 2 | from bs4 import BeautifulSoup
 3 | 
 4 | import mf2py
 5 | import mf2util
 6 | 
 7 | 
 8 | 
 9 | def feed_parser(doc=None, url=None):
10 |     """
11 |     parser to get hfeed
12 |     """
13 | 
14 |     if doc:
15 |         if not isinstance(doc, BeautifulSoup):
16 |             doc = BeautifulSoup(doc)
17 | 
18 |     if url:
19 |         if doc is None:
20 |             data = requests.get(url)
21 | 
22 |             # check for character encodings and use 'correct' data
23 |             if 'charset' in data.headers.get('content-type', ''):
24 |                 doc = BeautifulSoup(data.text)
25 |             else:
26 |                 doc = BeautifulSoup(data.content)
27 |     if doc is None:
28 |         return None  # no document given and none could be fetched
29 |     # find first h-feed object if any or construct it
30 |     hfeed = doc.find(class_="h-feed")
31 | 
32 |     if hfeed:
33 |         hfeed = mf2py.Parser(hfeed, url).to_dict()['items'][0]
34 |     else:
35 |         hfeed = {'type': ['h-feed'], 'properties': {}, 'children': []}
36 | 
37 |         # parse whole document for microformats
38 |         parsed = mf2py.Parser(doc, url).to_dict()
39 | 
40 |         # construct h-entries from top-level items
41 |         hfeed['children'] = [x for x in parsed['items'] if 'h-entry' in x.get('type', [])]
42 | 
43 | 
44 |     # construct fall back properties for hfeed
45 | 
46 |     props = hfeed['properties']
47 | 
48 |     # if no name or name is the content value, construct name from title or default from URL
49 |     name = props.get('name')
50 |     if name:
51 |         name = name[0]
52 | 
53 |     content = props.get('content')
54 |     if content:
55 |         content = content[0]
56 |         if isinstance(content, dict):
57 |             content = content.get('value')
58 | 
59 |     if not name or not mf2util.is_name_a_title(name, content):
60 |         feed_title = doc.find('title')
61 |         if feed_title:
62 |             hfeed['properties']['name'] = [feed_title.get_text()]
63 |         elif url:
64 |             hfeed['properties']['name'] = ['Feed for ' + url]
65 | 
66 |     # construct author from rep_hcard or meta-author
67 | 
68 |     # construct uid from url
69 |     if 'uid' not in props and 'url' not in props:
70 |         if url:
71 |             hfeed['properties']['uid'] = [url]
72 | 
73 |     # construct categories from meta-keywords
74 |     if 'category' not in props:
75 |         keywords = doc.find('meta', attrs= {'name': 'keywords', 'content': True})
76 |         if keywords:
77 |             hfeed['properties']['category'] = [k.strip() for k in keywords.get('content', '').split(',') if k.strip()]
78 | 
79 | 
80 |     return hfeed
81 | 


--------------------------------------------------------------------------------
/hfeed2atom/hfeed2atom.py:
--------------------------------------------------------------------------------
  1 | from xml.sax.saxutils import escape
  2 | 
  3 | from . import templates, feed_parser
  4 | 
  5 | import mf2util
  6 | 
  7 | def _updated_or_published(mf):
  8 | 	"""
  9 | 	get the updated date or the published date
 10 | 
 11 | 	Args:
 12 | 		mf: python dictionary of some microformats object
 13 | 
 14 | 	Return: string containing the updated date if it exists, or the published date if it exists or None
 15 | 
 16 | 	"""
 17 | 
 18 | 	props =  mf['properties']
 19 | 
 20 | 	# construct updated/published date of mf
 21 | 	if 'updated' in props:
 22 | 		return props['updated'][0]
 23 | 	elif 'published' in props:
 24 | 		return props['published'][0]
 25 | 	else:
 26 | 		return None
 27 | 
 28 | def _get_id(mf, url=None):
 29 | 	"""
 30 | 	get the uid of the mf object
 31 | 
 32 | 	Args:
 33 | 		mf: python dictionary of some microformats object
 34 | 		url: optional URL to use in case no uid or url in mf
 35 | 
 36 | 	Return: string containing the id or None
 37 | 	"""
 38 | 
 39 | 	props =  mf['properties']
 40 | 
 41 | 	if 'uid' in props:
 42 | 		return props['uid'][0]
 43 | 	elif 'url' in props:
 44 | 		return props['url'][0]
 45 | 	else:
 46 | 		return url
 47 | 
 48 | def _response_context(mf):
 49 | 	"""
 50 | 	get the response context of the mf object
 51 | 
 52 | 	Args:
 53 | 		mf: python dictionary of some microformats object
 54 | 
 55 | 	Return: string containing the HTML reconstruction of the response context
 56 | 	"""
 57 | 
 58 | 	props = mf['properties']
 59 | 
 60 | 	# collect the URLs this entry replies to
 61 | 	responses = props.get('in-reply-to')
 62 | 	if responses:
 63 | 		response_type = 'in reply to'
 64 | 		urls = []
 65 | 		for response in responses:
 66 | 			if isinstance(response, dict):
 67 | 				# embedded microformat: take its first url property, if any
 68 | 				response = (response.get('properties', {}).get('url') or [None])[0]
 69 | 
 70 | 			if response:
 71 | 				urls.append(response)
 72 | 		if urls:
 73 | 			return response_type + ': ' + ', '.join(urls)
 74 | 	return None
 75 | 
 76 | def hentry2atom(entry_mf):
 77 | 	"""
 78 | 	convert microformats of a h-entry object to Atom 1.0
 79 | 
 80 | 	Args:
 81 | 		entry_mf: python dictionary of parsed microformats of a h-entry
 82 | 
 83 | 	Return: an Atom 1.0 XML version of the microformats or None if error, and error message
 84 | 	"""
 85 | 
 86 | 	# generate fallbacks or errors for missing required properties.
 87 | 
 88 | 	if 'properties' in entry_mf:
 89 | 		props =  entry_mf['properties']
 90 | 	else:
 91 | 		return None, 'properties of entry not found.'
 92 | 
 93 | 	entry = {'title': '', 'subtitle': '', 'link': '', 'uid': '', 'published': '', 'updated': '', 'summary': '', 'content': '',  'categories': ''}
 94 | 
 95 | 	## required properties first
 96 | 
 97 | 	# construct id of entry
 98 | 	uid = _get_id(entry_mf)
 99 | 
100 | 	if uid:
101 | 		# construct id of entry -- required
102 | 		entry['uid'] = templates.ID.substitute(uid = escape(uid))
103 | 	else:
104 | 		return None, 'entry does not have a valid id'
105 | 
106 | 	# construct title of entry -- required - add default
107 | 	# if no name or name is the content value, construct name from title or default from URL
108 | 	name = props.get('name')
109 | 	if name:
110 | 		name = name[0]
111 | 
112 | 	content = props.get('content')
113 | 	if content:
114 | 		content = content[0]
115 | 		if isinstance(content, dict):
116 | 			content = content.get('value')
117 | 
118 | 	if name:
119 | 		# if name is generated from content truncate
120 | 		if not mf2util.is_name_a_title(name, content):
121 | 			if len(name) > 50:
122 | 				name = name[:50] + '...'
123 | 	else:
124 | 		name = uid
125 | 
126 | 	entry['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')
127 | 
128 | 	# construct updated/published date of entry
129 | 	updated = _updated_or_published(entry_mf)
130 | 
131 | 	# updated is  -- required
132 | 	if updated:
133 | 		entry['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
134 | 	else:
135 | 		return None, 'entry does not have valid updated date'
136 | 
137 | 	## optional properties
138 | 
139 | 	entry['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')
140 | 
141 | 	# construct published date of entry
142 | 	if 'published' in props:
143 | 		entry['published'] = templates.DATE.substitute(date = escape(props['published'][0]), dt_type = 'published')
144 | 
145 | 	# construct subtitle for entry
146 | 	if 'additional-name' in props:
147 | 		entry['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')
148 | 
149 | 	# content processing
150 | 	if 'content' in props:
151 | 		if isinstance(props['content'][0], dict):
152 | 			content = props['content'][0]['html']
153 | 		else:
154 | 			content = props['content'][0]
155 | 	else:
156 | 		content = None
157 | 
158 | 	if content:
159 | 		entry['content'] = templates.CONTENT.substitute(content = escape(content))
160 | 
161 | 	# construct summary of entry
162 | 	if 'featured' in props:
163 | 		featured = templates.FEATURED.substitute(featured = escape(props['featured'][0]))
164 | 	else:
165 | 		featured = ''
166 | 
167 | 	if 'summary' in props:
168 | 		summary = templates.POST_SUMMARY.substitute(post_summary = escape(props['summary'][0]))
169 | 	else:
170 | 		summary = ''
171 | 
172 | 	# make morelink if content does not exist
173 | 	if not content:
174 | 		morelink =  templates.MORELINK.substitute(url = escape(uid), name = escape(name))
175 | 	else:
176 | 		morelink = ''
177 | 
178 | 	entry['summary'] = templates.SUMMARY.substitute(featured=featured, summary=summary, morelink=morelink)
179 | 
180 | 	# construct category list of entry
181 | 	if 'category' in props:
182 | 		for category in props['category']:
183 | 			if isinstance(category, dict):
184 | 				if  'value' in category:
185 | 					category = category['value']
186 | 				else:
187 | 					continue
188 | 
189 | 			entry['categories'] += templates.CATEGORY.substitute(category=escape(category))
190 | 
191 | 	# construct atom of entry
192 | 	return templates.ENTRY.substitute(entry), 'up and Atom!'
193 | 
194 | 
195 | def hfeed2atom(doc=None, url=None, atom_url=None, hfeed=None):
196 | 	"""
197 | 	convert first h-feed object in a document to Atom 1.0
198 | 
199 | 	Args:
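200 | 		doc (file or string or BeautifulSoup doc): file handle, text of content to parse, or BeautifulSoup document to look for h-feed
201 | 		url: url of the document, used for making absolute URLs from url data, or for fetching the document
202 | 		atom_url: url of the resulting Atom file, used for the rel="self" link
203 | 		hfeed: python dictionary of an already parsed h-feed; if given, doc is not parsed and no fallbacks are generated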
204 | 	Return: an Atom 1.0 XML document version of the first h-feed in the document or None if no h-feed found, and string with reason for error
205 | 	"""
206 | 	# if hfeed object given assume it is well formatted
207 | 	if hfeed:
208 | 		mf = hfeed
209 | 	else:
210 | 		# send to hfeed_parser to parse
211 | 		mf = feed_parser.feed_parser(doc, url)
212 | 
213 | 		if not mf:
214 | 			return None, 'h-feed not found'
215 | 
216 | 	feed = {'generator': '', 'title': '', 'subtitle': '', 'link': '', 'uid': '', 'updated': '', 'author': '', 'entries': ''}
217 | 
218 | 	if 'properties' in mf:
219 | 		props = mf['properties']
220 | 	else:
221 | 		return None, 'h-feed properties not found.'
222 | 
223 | 	## required properties first
224 | 
225 | 	uid = _get_id(mf) or url
226 | 
227 | 	# id is -- required
228 | 	if uid:
229 | 		# construct id of feed -- required
230 | 		feed['uid'] = templates.ID.substitute(uid = escape(uid))
231 | 	else:
232 | 		return None, 'feed does not have a valid id'
233 | 
234 | 	# construct title for feed -- required
235 | 	# fall back to the id when the feed has no name property
236 | 	name = (props.get('name') or [None])[0] or uid
237 | 
238 | 	feed['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')
239 | 
240 | 	# entries
241 | 	if 'children' in mf:
242 | 		entries = [x for x in mf['children'] if 'h-entry' in x['type']]
243 | 	else:
244 | 		entries = []
245 | 
246 | 	# construct updated/published date of feed.
247 | 	updated = _updated_or_published(mf)
248 | 
249 | 	if not updated and entries:
250 | 		updated = max([d for d in map(_updated_or_published, entries) if d] or [None])
251 | 
252 | 	# updated is  -- required
253 | 	if updated:
254 | 		feed['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
255 | 	else:
256 | 		return None, 'updated date for feed not found, and could not be constructed from entries.'
257 | 
258 | 	## optional properties
259 | 
260 | 	# construct subtitle for feed
261 | 	if 'additional-name' in props:
262 | 		feed['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')
263 | 
264 | 	feed['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')
265 | 	feed['self'] = templates.LINK.substitute(url = escape(atom_url), rel='self') if atom_url else ''
266 | 
267 | 	# construct author for feed
268 | 	if 'author' in props:
269 | 		feed['author'] = templates.AUTHOR.substitute(name = escape(props['author'][0]['properties']['name'][0]))
270 | 
271 | 	# construct entries for feed
272 | 	for entry in entries:
273 | 		# construct entry template  - skip entry if error
274 | 		entry_atom, message = hentry2atom(entry)
275 | 		if entry_atom:
276 | 			feed['entries'] += entry_atom
277 | 
278 | 	feed['generator'] = templates.GENERATOR
279 | 
280 | 	return templates.FEED.substitute(feed), 'up and Atom!'
281 | 


--------------------------------------------------------------------------------
/hfeed2atom/templates.py:
--------------------------------------------------------------------------------
 1 | from string import Template
 2 | from . import __about__
 3 | 
 4 | GENERATOR = Template("""<generator uri="${uri}" version="${version}">${name}</generator>""").substitute(uri = __about__.URL['self'], version = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:]), name = __about__.NAME )
 5 | 
 6 | TITLE = Template("""<${t_type}>${title}</${t_type}>""")
 7 | 
 8 | LINK = Template("""<link href="${url}" rel="${rel}"></link>""")
 9 | 
10 | DATE = Template("""<${dt_type}>${date}</${dt_type}>""")
11 | 
12 | ID = Template("""<id>${uid}</id>""")
13 | 
14 | AUTHOR = Template("""<author><name>${name}</name></author>""")
15 | 
16 | FEATURED = Template("""&lt;img src="${featured}"/&gt;""")
17 | 
18 | POST_SUMMARY = Template("""&lt;p&gt;${post_summary}&lt;/p&gt;""")
19 | 
20 | MORELINK = Template("""&lt;span&gt;Full post: &lt;a href="${url}"&gt;${name}&lt;/a&gt;&lt;/span&gt;""")
21 | 
22 | SUMMARY = Template("""<summary type="html">${featured}${summary}${morelink}</summary>""")
23 | 
24 | CONTENT = Template("""<content type="html">${content}</content>""")
25 | 
26 | CATEGORY = Template("""<category term="${category}"></category>""")
27 | 
28 | ENTRY = Template("""<entry>${title}${subtitle}${link}${uid}${published}${updated}${summary}${content}
29 | ${categories}</entry>""")
30 | 
31 | FEED = Template("""<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-us">${generator}${title}${subtitle}${link}${self}${uid}${updated}${author}${entries}</feed>""")
32 | 
33 | 


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | mf2py==1.1.2
2 | requests==2.19.1
3 | BeautifulSoup4==4.6.0
4 | 
5 | mf2util==0.4.3
6 | 


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | #!/usr/bin/env python
 2 | # -*- coding: utf-8 -*-
 3 | 
 4 | from setuptools import setup, find_packages
 5 | import os.path
 6 | 
 7 | _ABOUT_ = {}
 8 | _PATH_ = os.path.dirname(__file__)
 9 | 
10 | with open(os.path.join(_PATH_, 'hfeed2atom/__about__.py'))\
11 |         as about_file:
12 |     exec(about_file.read(), _ABOUT_)
13 | 
14 | # use requirements.txt for dependencies
15 | with open(os.path.join(_PATH_, 'requirements.txt')) as f:
16 |     required = [line.strip() for line in f if line.strip()]
17 | 
18 | with open(os.path.join(_PATH_, 'README.md')) as f:
19 |     readme = f.read()
20 | 
21 | with open(os.path.join(_PATH_, 'LICENSE')) as f:
22 |     license = f.read()
23 | 
24 | setup(
25 |     name = _ABOUT_['NAME'],
26 |     version = '.'.join(map(str, _ABOUT_['VERSION'][0:3])) + ''.join(_ABOUT_['VERSION'][3:]),
27 |     description = _ABOUT_['SUMMARY'],
28 |     long_description = readme,
29 |     install_requires = required,
30 |     dependency_links=[
31 |         "https://github.com/kartikprabhu/mf2py/tarball/experimental#egg=mf2py-1.1.1"
32 |     ],
33 |     author = _ABOUT_['AUTHOR']['name'],
34 |     author_email = _ABOUT_['AUTHOR']['email'],
35 |     url = _ABOUT_['URL']['github'],
36 |     license = license,
37 |     packages = find_packages(exclude=('tests', 'docs'))
38 | )
39 | 


--------------------------------------------------------------------------------