├── .gitignore
├── LICENSE
├── README.md
├── hfeed2atom
│   ├── __about__.py
│   ├── __init__.py
│   ├── feed_parser.py
│   ├── hfeed2atom.py
│   └── templates.py
├── requirements.txt
└── setup.py
/.gitignore:
--------------------------------------------------------------------------------
## important ignores here

## random ignores here

# ignore compiled files and eggs
*.pyc
*.egg-info

# ignore weird cache files

*~
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
The MIT License (MIT)

Copyright (c) 2014 Kartik Prabhu

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
hfeed2atom
===========

Convert h-feed pages to Atom 1.0 XML format for traditional feed readers. You can try it out at https://kartikprabhu.com/hfeed2atom

Installation
------------

To install hfeed2atom use pip as follows:

```
pip install git+https://github.com/kartikprabhu/hfeed2atom.git --process-dependency-links
```

This will install hfeed2atom with its dependencies from PyPI and mf2py from the experimental repo https://github.com/kartikprabhu/mf2py/tree/experimental.

Usage
-----

hfeed2atom takes as arguments one or more of the following:
* `doc`: a string, a Python File object or a BeautifulSoup document containing the contents of an HTML page
* `url`: the URL for the page to be parsed. It is recommended to always send a URL argument as it is used to convert all other URLs in the document to absolute URLs.
* `atom_url`: the URL at which the generated Atom file is served. It is used for the feed's `rel="self"` link.
* `hfeed`: a Python dictionary of the microformats h-feed object. Use this if the document has already been parsed for microformats.

hfeed2atom returns the following as a tuple:
* Atom format of the h-feed, or None if there was an error.
* A string message: the reason for the error if there was one, or `up and Atom!` on success.


The easiest way to use hfeed2atom in your own Python code to parse the feed at a URL, say `http://example.com`, is:

```
from hfeed2atom import hfeed2atom

atom, message = hfeed2atom(url = 'http://example.com')
```
With this code, hfeed2atom will do a GET request to `http://example.com`, find the first h-feed and return the Atom as a string.
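
Because the call returns an `(atom, message)` tuple instead of raising, the calling code has to check for `None` itself. The following is a minimal sketch of one way to do that. `save_feed` and the output path are hypothetical names that are not part of hfeed2atom, and the tuple passed in is assumed to have the shape described above.

```python
def save_feed(result, path):
    """Write the Atom string from an (atom, message) tuple to `path`.

    `result` is assumed to look like what hfeed2atom returns:
    (atom_string_or_None, message_string). This helper is illustrative only.
    """
    atom, message = result
    if atom is None:
        # conversion failed: surface the reason instead of writing an empty file
        raise ValueError('could not build feed: ' + message)
    with open(path, 'w', encoding='utf-8') as feed_file:
        feed_file.write(atom)
    return path
```

With the package installed, this could be called as `save_feed(hfeed2atom(url = 'http://example.com'), 'feed.atom')`.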

If you already have the `contents` of the URL (by doing a GET request yourself, or if it is your own page on your server), then you can pass them as a `doc` argument as

```
atom, message = hfeed2atom(doc = contents, url = 'http://example.com')
```
`doc` can be a string, a Python File object or a BeautifulSoup document.

If you already have the h-feed microformats object of a page as a Python dictionary in the variable `hfeed` then use

```
atom, message = hfeed2atom(hfeed = hfeed, url = 'http://example.com')
```
Note, in this case hfeed2atom assumes that all the required properties for Atom are already in the `hfeed` variable and *will not* attempt to generate any fallback properties.
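
Since no fallbacks are generated for a pre-parsed `hfeed`, it can help to check the dictionary before passing it in. The sketch below shows the general shape of such a dictionary (`type`, `properties`, `children`) and a small check for the id and date properties. `missing_required` is a hypothetical helper, not part of hfeed2atom, and the example data is made up.

```python
def missing_required(mf):
    """Return the required Atom fields this microformats dict cannot supply."""
    props = mf.get('properties', {})
    missing = []
    # the Atom id comes from `uid` or `url`
    if not (props.get('uid') or props.get('url')):
        missing.append('id')
    # the Atom updated date comes from `updated` or `published`
    if not (props.get('updated') or props.get('published')):
        missing.append('updated')
    return missing


# illustrative h-feed object; property values are always lists
hfeed = {
    'type': ['h-feed'],
    'properties': {
        'name': ['Example feed'],
        'url': ['http://example.com/'],
        'updated': ['2014-05-02T09:00:00+00:00'],
    },
    'children': [
        {'type': ['h-entry'],
         'properties': {'name': ['First post'],
                        'url': ['http://example.com/first'],
                        'published': ['2014-05-01T10:00:00+00:00']}},
        {'type': ['h-entry'],
         'properties': {'name': ['Undated note'],
                        'url': ['http://example.com/note']}},
    ],
}

feed_problems = missing_required(hfeed)
entry_problems = [missing_required(child) for child in hfeed['children']]
```

Here the second entry has no `updated` or `published` property, so it would be skipped in the generated feed.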

Features
--------

* Finds the first h-feed element to generate the Atom feed. If no h-feed is found, defaults to using the top-level h-entries for the feed.
* Generates fallbacks for required Atom properties of the feed. The fallbacks, in order, are:
    - title : h-feed `name` property else, `<title>` element of the page else, `Feed for URL`.
    - id : h-feed `uid` or `url` property else, URL argument given.
    - updated date : h-feed `updated` or `published` property else, `updated` or `published` property of the latest entry.
* Generates fallback categories for the h-feed from the `<meta name='keywords'>` element of the page.
* Generates fallbacks for required Atom properties of each entry in the h-feed. The fallbacks, in order, are:
    - title : h-entry `name` property if it is not the same as `content>value` else, `content` property truncated to 50 characters else, `uid` or `url` of the entry.
    - id : h-entry `uid` property or `url` property else, reports an error and skips that entry.
    - updated date : h-entry `updated` or `published` property else, reports an error and skips that entry.

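The entry title fallback order listed above can be sketched roughly as follows. This is an illustration only: the package itself relies on `mf2util.is_name_a_title` to decide whether `name` is a real title, whereas this sketch assumes a plain whitespace-insensitive comparison with the content value. `entry_title` is a hypothetical name.

```python
def entry_title(props, limit=50):
    """Pick a title for an h-entry from its microformats `properties` dict."""
    name = (props.get('name') or [None])[0]
    content = (props.get('content') or [None])[0]
    if isinstance(content, dict):
        # e-content is parsed into {'html': ..., 'value': ...}
        content = content.get('value')

    def squash(text):
        # collapse runs of whitespace so formatting differences do not matter
        return ' '.join(text.split())

    # 1. an explicit name that is not just a copy of the content
    if name and not (content and squash(name) == squash(content)):
        return name
    # 2. the content, truncated to `limit` characters
    if content:
        text = squash(content)
        return text if len(text) <= limit else text[:limit] + '...'
    # 3. the uid or url of the entry
    for key in ('uid', 'url'):
        if props.get(key):
            return props[key][0]
    return None
```

For a short note whose `name` merely repeats its content, this yields the truncated content, while an article with a distinct `name` keeps that name as its title.
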
To Do
-----
* Author discovery if the h-feed does not have an author property. Note `<author>` is an optional tag in Atom!

Go forth
--------

Now [use this yourself](https://github.com/kartikprabhu/hfeed2atom) and [give feedback](https://github.com/kartikprabhu/hfeed2atom/issues).

--------------------------------------------------------------------------------
/hfeed2atom/__about__.py:
--------------------------------------------------------------------------------
# file containing common data about the code

NAME = 'hfeed2atom'

SUMMARY = 'Converter from h-feed microformats to Atom 1.0'

VERSION = (0, 2, 4, "")

AUTHOR = {'name' : 'Kartik Prabhu', 'email' : 'me@kartikprabhu.com'}

COPYRIGHT = 'Copyright (c) by ' + AUTHOR['name']

LICENSE = 'MIT'

URL = {'self' : 'https://kartikprabhu.com/hfeed2atom', 'github' : 'https://github.com/kartikprabhu/hfeed2atom'}

--------------------------------------------------------------------------------
/hfeed2atom/__init__.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

from . import __about__

__author__ = __about__.AUTHOR['name']
__contact__ = __about__.AUTHOR['email']
__copyright__ = __about__.COPYRIGHT
__license__ = __about__.LICENSE
__version__ = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:])

# explicit relative import so the submodule is found on Python 3 as well
from .hfeed2atom import hfeed2atom, hentry2atom

--------------------------------------------------------------------------------
/hfeed2atom/feed_parser.py:
--------------------------------------------------------------------------------
import requests
from bs4 import BeautifulSoup

import mf2py
import mf2util


def feed_parser(doc=None, url=None):
    """
    parser to get hfeed

    Args:
      doc: string, file or BeautifulSoup document to parse
      url: URL of the document; fetched if no doc is given

    Return: python dictionary of the h-feed, or None if there is nothing to parse
    """

    if doc:
        if not isinstance(doc, BeautifulSoup):
            doc = BeautifulSoup(doc)

    if url:
        if doc is None:
            data = requests.get(url)

            # check for character encodings and use 'correct' data
            if 'charset' in data.headers.get('content-type', ''):
                doc = BeautifulSoup(data.text)
            else:
                doc = BeautifulSoup(data.content)

    # nothing to parse: neither a usable document nor a URL was given
    if not isinstance(doc, BeautifulSoup):
        return None

    # find first h-feed object if any or construct it

    hfeed = doc.find(class_="h-feed")

    if hfeed:
        hfeed = mf2py.Parser(hfeed, url).to_dict()['items'][0]
    else:
        hfeed = {'type': ['h-feed'], 'properties': {}, 'children': []}

        # parse whole document for microformats
        parsed = mf2py.Parser(doc, url).to_dict()

        # construct h-entries from top-level items
        hfeed['children'] = [x for x in parsed['items'] if 'h-entry' in x.get('type', [])]

    # construct fall back properties for hfeed

    props = hfeed['properties']

    # if no name or name is the content value, construct name from title or default from URL
    name = props.get('name')
    if name:
        name = name[0]

    content = props.get('content')
    if content:
        content = content[0]
        if isinstance(content, dict):
            content = content.get('value')

    if not name or not mf2util.is_name_a_title(name, content):
        feed_title = doc.find('title')
        if feed_title:
            hfeed['properties']['name'] = [feed_title.get_text()]
        elif url:
            hfeed['properties']['name'] = ['Feed for ' + url]

    # construct author from rep_hcard or meta-author

    # construct uid from url
    if 'uid' not in props and 'url' not in props:
        if url:
            hfeed['properties']['uid'] = [url]

    # construct categories from meta-keywords
    if 'category' not in props:
        keywords = doc.find('meta', attrs={'name': 'keywords', 'content': True})
        if keywords:
            # split on commas and drop surrounding whitespace and empty items
            hfeed['properties']['category'] = [k.strip() for k in keywords.get('content', '').split(',') if k.strip()]

    return hfeed

--------------------------------------------------------------------------------
/hfeed2atom/hfeed2atom.py:
--------------------------------------------------------------------------------
from xml.sax.saxutils import escape

from . import templates, feed_parser

import mf2util

def _updated_or_published(mf):
    """
    get the updated date or the published date

    Args:
      mf: python dictionary of some microformats object

    Return: string containing the updated date if it exists, or the published date if it exists or None

    """

    props = mf['properties']

    # construct updated/published date of mf
    if 'updated' in props:
        return props['updated'][0]
    elif 'published' in props:
        return props['published'][0]
    else:
        return None

def _get_id(mf, url=None):
    """
    get the uid of the mf object

    Args:
      mf: python dictionary of some microformats object
      url: optional URL to use in case no uid or url in mf

    Return: string containing the id or None
    """

    props = mf['properties']

    if 'uid' in props:
        return props['uid'][0]
    elif 'url' in props:
        return props['url'][0]
    else:
        # fall back to the optional URL argument (None by default)
        return url

def _response_context(mf):
    """
    get the response context of the mf object

    Args:
      mf: python dictionary of some microformats object

    Return: string containing the HTML reconstruction of the response context

    Note: unfinished. The reply URLs are collected but no string is built yet,
    so this currently always returns None.
    """

    props = mf['properties']

    # get replies
    responses = props.get('in-reply-to')
    if responses:
        response_type = 'in reply to'
        urls = []
        for response in responses:
            if isinstance(response, dict):
                # embedded microformat: use its first url property if any
                response = (response.get('properties', {}).get('url') or [None])[0]

            if response:
                urls.append(response)

        # TODO: make and return string with response_type and list of URLs

    return None

def hentry2atom(entry_mf):
    """
    convert microformats of a h-entry object to Atom 1.0

    Args:
      entry_mf: python dictionary of parsed microformats of a h-entry

    Return: an Atom 1.0 XML version of the microformats or None if error, and error message
    """

    # generate fallbacks or errors for the required properties that do not exist.

    if 'properties' in entry_mf:
        props = entry_mf['properties']
    else:
        return None, 'properties of entry not found.'

    entry = {'title': '', 'subtitle': '', 'link': '', 'uid': '', 'published': '', 'updated': '', 'summary': '', 'content': '', 'categories': ''}

    ## required properties first

    # construct id of entry
    uid = _get_id(entry_mf)

    if uid:
        # construct id of entry -- required
        entry['uid'] = templates.ID.substitute(uid = escape(uid))
    else:
        return None, 'entry does not have a valid id'

    # construct title of entry -- required - add default
    # if no name or name is the content value, construct name from title or default from URL
    name = props.get('name')
    if name:
        name = name[0]

    content = props.get('content')
    if content:
        content = content[0]
        if isinstance(content, dict):
            content = content.get('value')

    if name:
        # if name is generated from content truncate
        if not mf2util.is_name_a_title(name, content):
            if len(name) > 50:
                name = name[:50] + '...'
    else:
        name = uid

    entry['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')

    # construct updated/published date of entry
    updated = _updated_or_published(entry_mf)

    # updated is -- required
    if updated:
        entry['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
    else:
        return None, 'entry does not have valid updated date'

    ## optional properties

    entry['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')

    # construct published date of entry
    if 'published' in props:
        entry['published'] = templates.DATE.substitute(date = escape(props['published'][0]), dt_type = 'published')

    # construct subtitle for entry (stored on the entry, not on a feed)
    if 'additional-name' in props:
        entry['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')

    # content processing
    if 'content' in props:
        if isinstance(props['content'][0], dict):
            content = props['content'][0]['html']
        else:
            content = props['content'][0]
    else:
        content = None

    if content:
        entry['content'] = templates.CONTENT.substitute(content = escape(content))

    # construct summary of entry
    if 'featured' in props:
        featured = templates.FEATURED.substitute(featured = escape(props['featured'][0]))
    else:
        featured = ''

    if 'summary' in props:
        summary = templates.POST_SUMMARY.substitute(post_summary = escape(props['summary'][0]))
    else:
        summary = ''

    # make morelink if content does not exist
    if not content:
        morelink = templates.MORELINK.substitute(url = escape(uid), name = escape(name))
    else:
        morelink = ''

    entry['summary'] = templates.SUMMARY.substitute(featured=featured, summary=summary, morelink=morelink)

    # construct category list of entry
    if 'category' in props:
        for category in props['category']:
            if isinstance(category, dict):
                if 'value' in category:
                    category = category['value']
                else:
                    continue

            entry['categories'] += templates.CATEGORY.substitute(category=escape(category))

    # construct atom of entry
    return templates.ENTRY.substitute(entry), 'up and Atom!'


def hfeed2atom(doc=None, url=None, atom_url=None, hfeed=None):
    """
    convert first h-feed object in a document to Atom 1.0

    Args:
      doc (file or string or BeautifulSoup doc): file handle, text of content
        to parse, or BeautifulSoup document to look for h-feed
      url: url of the document, used for making absolute URLs from url data, or for fetching the document
      atom_url: url of the Atom file, used for the rel="self" link
      hfeed: python dictionary of an already parsed h-feed; no fallbacks are generated for it

    Return: an Atom 1.0 XML document version of the first h-feed in the document or None if no h-feed found, and string with reason for error
    """
    # if hfeed object given assume it is well formatted
    if hfeed:
        mf = hfeed
    else:
        # send to feed_parser to parse
        mf = feed_parser.feed_parser(doc, url)

    if not mf:
        return None, 'h-feed not found'

    feed = {'generator': '', 'title': '', 'subtitle': '', 'link': '', 'self': '', 'uid': '', 'updated': '', 'author': '', 'entries': ''}

    if 'properties' in mf:
        props = mf['properties']
    else:
        return None, 'h-feed properties not found.'

    ## required properties first

    uid = _get_id(mf) or url

    # id is -- required
    if uid:
        # construct id of feed -- required
        feed['uid'] = templates.ID.substitute(uid = escape(uid))
    else:
        return None, 'feed does not have a valid id'

    # construct title for feed -- required; fall back to the id if there is no name
    if props.get('name'):
        name = props['name'][0] or uid
    else:
        name = uid

    feed['title'] = templates.TITLE.substitute(title = escape(name), t_type='title')

    # entries
    if 'children' in mf:
        entries = [x for x in mf['children'] if 'h-entry' in x['type']]
    else:
        entries = []

    # construct updated/published date of feed.
    updated = _updated_or_published(mf)

    if not updated and entries:
        # skip undated entries: max() cannot compare None with strings on Python 3
        dates = [d for d in (_updated_or_published(x) for x in entries) if d]
        updated = max(dates) if dates else None

    # updated is -- required
    if updated:
        feed['updated'] = templates.DATE.substitute(date = escape(updated), dt_type = 'updated')
    else:
        return None, 'updated date for feed not found, and could not be constructed from entries.'

    ## optional properties

    # construct subtitle for feed
    if 'additional-name' in props:
        feed['subtitle'] = templates.TITLE.substitute(title = escape(props['additional-name'][0]), t_type='subtitle')

    feed['link'] = templates.LINK.substitute(url = escape(uid), rel='alternate')

    # the self link is only possible when the Atom URL is known
    if atom_url:
        feed['self'] = templates.LINK.substitute(url = escape(atom_url), rel='self')

    # construct author for feed
    if 'author' in props:
        author = props['author'][0]
        # author may be an embedded h-card or a plain string
        if isinstance(author, dict):
            author = (author.get('properties', {}).get('name') or [''])[0]
        if author:
            feed['author'] = templates.AUTHOR.substitute(name = escape(author))

    # construct entries for feed
    for entry in entries:
        # construct entry template - skip entry if error
        entry_atom, message = hentry2atom(entry)
        if entry_atom:
            feed['entries'] += entry_atom

    feed['generator'] = templates.GENERATOR

    return templates.FEED.substitute(feed), 'up and Atom!'

--------------------------------------------------------------------------------
/hfeed2atom/templates.py:
--------------------------------------------------------------------------------
from string import Template
from . import __about__

# element names and attributes follow Atom 1.0 (RFC 4287)

GENERATOR = Template("""<generator uri="${uri}" version="${version}">${name}</generator>""").substitute(uri = __about__.URL['self'], version = '.'.join(map(str, __about__.VERSION[0:3])) + ''.join(__about__.VERSION[3:]), name = __about__.NAME )

TITLE = Template("""<${t_type}>${title}</${t_type}>""")

LINK = Template("""<link href="${url}" rel="${rel}"/>""")

DATE = Template("""<${dt_type}>${date}</${dt_type}>""")

ID = Template("""<id>${uid}</id>""")

AUTHOR = Template("""<author><name>${name}</name></author>""")

FEATURED = Template("""&lt;img src="${featured}"/&gt;""")

POST_SUMMARY = Template("""&lt;p&gt;${post_summary}&lt;/p&gt;""")

MORELINK = Template("""&lt;span&gt;Full post: &lt;a href="${url}"&gt;${name}&lt;/a&gt;&lt;/span&gt;""")

SUMMARY = Template("""<summary type="html">${featured}${summary}${morelink}</summary>""")

CONTENT = Template("""<content type="html">${content}</content>""")

CATEGORY = Template("""<category term="${category}"/>""")

ENTRY = Template("""<entry>${title}${subtitle}${link}${uid}${published}${updated}${summary}${content}
${categories}</entry>""")

FEED = Template("""<feed xmlns="http://www.w3.org/2005/Atom">${generator}${title}${subtitle}${link}${self}${uid}${updated}${author}${entries}</feed>""")


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
mf2py==1.1.2
requests==2.19.1
BeautifulSoup4==4.6.0

mf2util==0.4.3

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-

from setuptools import setup, find_packages
import os.path

_ABOUT_ = {}
_PATH_ = os.path.dirname(__file__)

with open(os.path.join(_PATH_, 'hfeed2atom/__about__.py')) as about_file:
    exec(about_file.read(), _ABOUT_)

# use requirements.txt for dependencies, skipping blank lines
with open(os.path.join(_PATH_, 'requirements.txt')) as f:
    required = [line.strip() for line in f if line.strip()]

with open(os.path.join(_PATH_, 'README.md')) as f:
    readme = f.read()

with open(os.path.join(_PATH_, 'LICENSE')) as f:
    license = f.read()

setup(
    name = _ABOUT_['NAME'],
    version = '.'.join(map(str, _ABOUT_['VERSION'][0:3])) + ''.join(_ABOUT_['VERSION'][3:]),
    description = _ABOUT_['SUMMARY'],
    long_description = readme,
    install_requires = required,
    dependency_links=[
        "https://github.com/kartikprabhu/mf2py/tarball/experimental#egg=mf2py-1.1.1"
    ],
    author = _ABOUT_['AUTHOR']['name'],
    author_email = _ABOUT_['AUTHOR']['email'],
    url = _ABOUT_['URL']['github'],
    license = license,
    packages = find_packages(exclude=('tests', 'docs'))
)

--------------------------------------------------------------------------------