├── .github └── FUNDING.yml ├── .gitignore ├── CHANGELOG ├── README.md ├── config.py.example ├── feedparser.py ├── html2text.py ├── r2e ├── r2e.bat ├── readme.html ├── rss2email.py └── summarize.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: [rcarmo] 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | 21 | # Installer logs 22 | pip-log.txt 23 | 24 | # Unit test / coverage reports 25 | .coverage 26 | .tox 27 | nosetests.xml 28 | 29 | # Translations 30 | *.mo 31 | 32 | # Mr Developer 33 | .mr.developer.cfg 34 | .project 35 | .pydevproject 36 | -------------------------------------------------------------------------------- /CHANGELOG: -------------------------------------------------------------------------------- 1 | v2.72-rcarmo (2015-03) 2 | * Flush read messages after one day instead of on every run 3 | 4 | v2.71-rcarmo (2013-03) 5 | * Add IMAP support 6 | * Changed default CSS and config options 7 | * Added threading headers 8 | * Added experimental data:URI support for inline images 9 | * Bundled feedparser fixes 10 | 11 | v2.71 (2011-03-04) 12 | * Potentially safer method for writing feeds.dat on UNIX 13 | * Handle via links with no title attribute 14 | * Handle attributes more cleanly with OVERRIDE_EMAIL and DEFAULT_EMAIL 15 | 16 | v2.70 (2010-12-21) 17 | * Improved handling of given feed email addresses to prevent mail servers rejecting poorly formed Froms 18 | * Added X-RSS-TAGS header that lists any tags provided by an entry, which will be helpful in filtering incoming messages 19 | 20 | v2.69 (2010-11-12) 21 | * Added support for connecting to the SMTP server via SSL; see the SMTP_SSL option 22 | * Improved backwards compatibility by fixing issue with listing feeds when run with older Python versions 23 | * Added selective feed email overrides through OVERRIDE_EMAIL and DEFAULT_EMAIL options 24 | * Added NO_FRIENDLY_NAME to use the from address only, without the friendly name 25 | * Added X-RSS-URL header in each message with the link to the original item 26 | 27 | v2.68 (2010-10-01) 28 | * Added ability to pause/resume checking of individual feeds through pause and unpause commands 29 | * Added ability to import and export OPML feed lists through importopml and exportopml commands 30 | 31 | v2.67 (2010-09-21) 32 | * Fixed entries with a blank id (i.e., an empty string) being resent 33 | * Fixed some entries not being sent by email because they had bad From headers 34 | * Fixed From headers with HTML entities encoded twice 35 | * Compatibility changes to support most recent development versions of feedparser 36 | * Compatibility changes to support Google Reader feeds 37 | 38 | v2.66 (2009-12-21) 39 | * Complete packaging of all necessary source files (rss2email, html2text, feedparser, r2e, etc.)
into one bundle 40 | o Included a more complete config.py with all options 41 | o Default to HTML mail and CSS results 42 | * Added 'reset' command to erase history of already seen entries 43 | * Changed project email to 'lindsey@allthingsrss.com' and project homepage to 'http://www.allthingsrss.com/rss2email/' 44 | * Made exception and error output text more useful 45 | * Added X-RSS-Feed and X-RSS-ID headers to each email for easier filtering 46 | * Improved enclosure handling 47 | * Fixed MacOS compatibility issues 48 | 49 | v2.65 (2009-01-05) 50 | 51 | * Fixed warnings caused by Python v2.6 (using hashlib, removing mimify, etc.) 52 | * Deprecated QP_REQUIRED option as this is more than likely no longer needed and part of what triggered Python warnings 53 | * Fixed unicode errors in certain post headers 54 | * Attempted to incorporate Debian/Ubuntu patches into the mainstream release 55 | * Support img type enclosures 56 | * No file locking for SunOS 57 | 58 | v2.64 (2008-10-21) 59 | * Bug-fix version 60 | o Gracefully handle missing charsets 61 | o Friendlier and more useful message if sendmail isn't installed 62 | o SunOS locking fix 63 | 64 | v2.63 (2008-06-13) 65 | * Bug-fix version and license change: 66 | o Licensed under GPL 2 & 3 now 67 | o Display feed number in warning and error message lines 68 | o Fix for unicode handling problem with certain entry titles 69 | 70 | v2.62 (2008-01-14) 71 | * Bug-fix version: 72 | o Simplified SunOS fix 73 | o Local feeds (/home/user/file.xml) should work 74 | 75 | v2.61 (2007-12-07) 76 | * Bug-fix version: 77 | o Now really compatible with SunOS 78 | o Don't wrap long subject headers 79 | o New parameter CHARSET_LIST to override or supplement the order in which charsets are tried against an entry 80 | o Don't use blank content to generate id 81 | o Using GMail as mail server should work 82 | 83 | v2.60 (2006-08-25) 84 | * Small bug-fix version: 85 | o Now compatible with SunOS 86 | o Correctly handle international character sets in email From 87 | 88 | v2.59 (2006-06-09) 89 | * Finally added oft-requested support for enclosures. Any enclosures, such as a podcast MP3, will be listed under the entry URL 90 | * Made feed timeout compatible with Python versions 2.2 and higher, instead of v2.4 only 91 | * Added optional, configurable CSS styling to HTML mail. Set USE_CSS_STYLING=1 in your config.py to enable this. If you want to tweak the look, modify STYLE_SHEET. 92 | * Improved empty feed checking 93 | * Improved invalid feed messages 94 | * Unfortunately, rss2email is no longer compatible with Python v2.1. Two of the most serious lingering issues with rss2email were waiting forever for non-responsive feeds and its inability to properly handle feeds with international characters. To properly fix these once and for all, rss2email now depends on functionality that was not available until Python v2.2. Hopefully this does not unduly inconvenience anyone who has not yet upgraded to a more current version of Python.
95 | 96 | v2.58 (2006-05-11) 97 | * Total rewrite of email code that should fix encoding problems 98 | * Added configurable timeout for nonresponsive feeds 99 | * Fixed incorrectly using text summary_detail instead of html content 100 | * Fixed bug with deleting feed 0 if no default email was set 101 | * Print name of feed that is being deleted 102 | 103 | v2.57 (2006-04-07) 104 | * Integrated Joey Hess's patches 105 | o First, a patch that makes delete more reliable, so it no longer allows you to remove the default email address ('feed' 0) and thereby hose your feed file, or 'remove' entries that don't exist without warning; and so it only says IDs have changed when they really have. Originally from http://bugs.debian.org/313101 106 | o Next, a patch that avoids a backtrace if there's no email address defined, and outputs a less scary error message. 107 | o Next, a simple change to the usage; since the "email" subcommand always needs a parameter, don't mark it as optional. 108 | o And, avoid a backtrace if the email subcommand does get run w/o a parameter. 109 | o And also avoid backtraces if delete is run w/o a parameter. Also adds support for --help. 110 | o Simple change, make a comment match reality (/usr/sbin/sendmail) 111 | o This avoids another backtrace, this time if there's no feed file yet. [load()] 112 | o Add a handler for the AttributeError exception, which feedparser can throw. Beats crashing. 113 | o Next, four hunks that make it more robust if no default email address is set and feeds are added w/o an email address. This patch originally comes from http://bugs.debian.org/310485 which has some examples. 114 | o Finally, this works around a bug in mimify that causes it to add a newline to the subject header if it contains very long words. Details at http://bugs.debian.org/320185. Note that Tatsuya Kinoshita has a larger patch toward the end of that bug report that deals with some other problems in this area; Aaron has seen that patch before and said it "looks pretty reasonable". 115 | * add() catches error case on first feed add and no email address is set 116 | * Made "emailaddress" consistent param label throughout 117 | * Error message improvements 118 | * Deleted problematic "if title" line 119 | * Deleted space in front of SMTP_USER 120 | * Only logs into SMTP server once 121 | * Added exception handling around SMTP server connect and login attempt 122 | * Broke contributors across multiple lines 123 | 124 | v2.56 (2006-04-04) 125 | * SMTP AUTH support added 126 | * Windows support 127 | * Fixed bug with HTML in titles 128 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | rss2imap 2 | ======== 3 | 4 | An adaptation of rss2email that uses IMAP directly. 5 | 6 | # What does it look like? 7 | 8 | Well, with the shipping CSS in `config.py`, it looks like this: 9 | 10 | 11 | 12 | ## What about mobile? 13 | 14 | Well, it works fine with the Gmail app on both Android and iOS, as well as the native IMAP clients: 15 | 16 | 17 | 18 | As long as you sync, all the text will be available off-line (images are cached at the whim of the MUA). 19 | 20 | The Gmail app ignores CSS and may have weird behaviors with long bits of text, though.
21 | 22 | # Main Features: 23 | 24 | * *NEW:* Automatically files away read messages after one day instead of on every run 25 | * Optional (naive) summarization of news items at the top of each item (see `SUMMARIZE` setting) 26 | * E-mail is injected directly via IMAP (so no delays or hassles with spam filters) 27 | * Feeds can be grouped into IMAP folders -- no inbox clutter! 28 | * Generates E-mail headers for threading, so a post that references another post (or that includes the same link) will show up as a thread on decent MUAs (posts from the same feed will also be part of the same thread) 29 | * Can (optionally) include images inline (as `data:` URIs for now -- which only works properly on iOS/Mac -- soon as MIME attachments) 30 | * Can (optionally) remove read (but not flagged) items automatically 31 | 32 | # Project Status 33 | 34 | Given that I've only had to tweak _one thing_ after two years of continued use, I'd say this is more than stable. I've gone off and built a multi-threaded app with a SQLite feed store called [bottle-fever](https://github.com/rcarmo/bottle-fever), but there's only so much free time, and even though this code is crammed with hideous legacy idioms, it works as is. 35 | 36 | Come 2016, I switched to Feedly because the user experience on the iPad using [Reeder](http://reederapp.com) was a little better. 37 | 38 | ## Similar Projects 39 | 40 | Other projects I've come across that traveled this path in other languages: 41 | 42 | * [greghendershott/feeds2gmail](https://github.com/greghendershott/feeds2gmail), using [Racket](https://www.racket-lang.org) 43 | * [Gonzih/feeds2imap.clj](https://github.com/Gonzih/feeds2imap.clj), using [Clojure](https://clojure.org) 44 | * [rcarmo/go-rss2imap](https://github.com/rcarmo/go-rss2imap), my attempt at tweaking a [Go](http://golang.org) version 45 | * [Riduidel/rrss2imap](https://github.com/Riduidel/rrss2imap), using [Rust](https://www.rust-lang.org/) 46 | 47 | ## Exercises For The Reader 48 | 49 | * Test nested folders (I am only using single folders, not a nested hierarchy, so this might break) 50 | * Automatic message categorization using Bayesian filtering and NLTK 51 | * Better reference tracking to identify 'hot' items 52 | * Figure out a nice way to do favicons (X-Face is obsolete, and so is X-Image-URL) 53 | 54 | # Here Be Dragons 55 | 56 | Be aware that this works and is easy to hack, but uses old Python idioms and could do with some refactoring (PEP-8 zealots are sure to cringe as they read through the code -- I know I find it hideous, but it was a quick hack and has been working reliably for me for over two years now). 57 | -------------------------------------------------------------------------------- /config.py.example: -------------------------------------------------------------------------------- 1 | ### Options for configuring rss2email ### 2 | 3 | # The email address messages are from by default: 4 | DEFAULT_FROM = "bozo@dev.null.invalid" 5 | 6 | # 1: Send text/html messages when possible. 7 | # 0: Convert HTML to plain text. 8 | HTML_MAIL = 1 9 | 10 | # 1: Only use the DEFAULT_FROM address. 11 | # 0: Use the email address specified by the feed, when possible. 12 | FORCE_FROM = 0 13 | 14 | # 1: Receive one email per post. 15 | # 0: Receive an email every time a post changes. 16 | TRUST_GUID = 1 17 | 18 | # 1: Generate Date header based on item's date, when possible. 19 | # 0: Generate Date header based on time sent.
20 | DATE_HEADER = 1 21 | 22 | # A tuple consisting of some combination of 23 | # ('issued', 'created', 'modified', 'expired') 24 | # expressing ordered list of preference in dates 25 | # to use for the Date header of the email. 26 | DATE_HEADER_ORDER = ('modified', 'issued', 'created') 27 | 28 | # 1: Apply Q-P conversion (required for some MUAs). 29 | # 0: Send message in 8-bits. 30 | # http://cr.yp.to/smtp/8bitmime.html 31 | #DEPRECATED 32 | QP_REQUIRED = 0 33 | #DEPRECATED 34 | 35 | # 1: Name feeds as they're being processed. 36 | # 0: Keep quiet. 37 | VERBOSE = 0 38 | 39 | # 1: Use the publisher's email if you can't find the author's. 40 | # 0: Just use the DEFAULT_FROM email instead. 41 | USE_PUBLISHER_EMAIL = 1 42 | 43 | # 1: Use SMTP_SERVER to send mail. 44 | # 0: Call /usr/sbin/sendmail to send mail. 45 | SMTP_SEND = 1 46 | 47 | SMTP_SERVER = "smtp.yourisp.net:25" 48 | AUTHREQUIRED = 0 # if you need to use SMTP AUTH set to 1 49 | SMTP_USER = 'username' # for SMTP AUTH, set SMTP username here 50 | SMTP_PASS = 'password' # for SMTP AUTH, set SMTP password here 51 | 52 | # Connect to the SMTP server using SSL 53 | 54 | SMTP_SSL = 0 55 | 56 | # 1: Use IMAP_SERVER to deliver mail and ignore SMTP settings. 57 | # 0: Use SMTP 58 | 59 | IMAP_SEND = 1 60 | IMAP_SERVER = "imap.yourisp.net:143" 61 | IMAP_USER = 'username' # set IMAP username here 62 | IMAP_PASS = 'password' # set IMAP password here 63 | 64 | # Connect to the IMAP server using SSL 65 | IMAP_SSL = 1 66 | 67 | # Synthesise From: addresses based on feed name and domain 68 | IMAP_MUNGE_FROM = 1 69 | IMAP_OVERRIDE_TO = 'Me ' 70 | 71 | # Generate References: headers based on item tags and/or URLs in feed content 72 | THREAD_ON_TAGS = 1 73 | THREAD_ON_LINKS = 1 74 | 75 | # Include inline images as a data: URI (not supported in MUAs like the mobile Gmail app, but works great on iOS and Mac) 76 | INLINE_IMAGES_DATA_URI = 0 77 | 78 | # Move read messages to a specific folder 79 | IMAP_MOVE_READ_TO = False 80 | 81 | # Mark all messages as read in the following folders 82 | IMAP_MARK_AS_READ = None # ['Trash'] 83 | 84 | # Set this to add a bonus header to all emails (start with '\n'). 85 | BONUS_HEADER = '' 86 | # Example: BONUS_HEADER = '\nApproved: joe@bob.org' 87 | 88 | # Set this to override From addresses. Keys are feed URLs, values are new titles. 89 | OVERRIDE_FROM = {} 90 | 91 | # Set this to override From email addresses. Keys are feed URLs, values are new emails.
92 | OVERRIDE_EMAIL = {} 93 | 94 | DEFAULT_EMAIL = {} 95 | 96 | # Only use the email from address rather than friendly name plus email address 97 | NO_FRIENDLY_NAME = 0 98 | 99 | # Set this to override the timeout (in seconds) for feed server response 100 | FEED_TIMEOUT = 60 101 | 102 | # Optional CSS styling 103 | USE_CSS_STYLING = 1 104 | STYLE_SHEET='img {max-width: 100% !important; height: auto;} body, #body {font-size: 12pt; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; font-family: Georgia, Times New Roman, Times, serif;} a:link {color: #0000cc} h1.header a {font-weight: normal; text-decoration: none; color: black;} .summary {font-size: 80%; font-style: italic;}' 105 | 106 | # If you have an HTTP Proxy set this in the format 'http://your.proxy.here:8080/' 107 | PROXY="" 108 | 109 | # To most correctly encode emails with international characters, we iterate through the list below and use the first character set that works 110 | # Eventually (and theoretically) ISO-8859-1 and UTF-8 are our catch-all failsafes 111 | CHARSET_LIST='US-ASCII', 'BIG5', 'ISO-2022-JP', 'ISO-8859-1', 'UTF-8' 112 | 113 | # Use Unicode characters instead of their ascii pseudo-replacements 114 | UNICODE_SNOB = 0 115 | 116 | # Put the links after each paragraph instead of at the end. 117 | LINKS_EACH_PARAGRAPH = 0 118 | 119 | # Wrap long lines at position. 0 for no wrapping. (Requires Python 2.3.) 120 | BODY_WIDTH = 0 121 | 122 | # Change the default imap folder 123 | DEFAULT_IMAP_FOLDER = "INBOX" 124 | 125 | # Whether to summarize the body or not. Set it to the number of sentences you require. 126 | SUMMARIZE = 0 127 | 128 | # When set to "1", during OPML import, the title attribute of the parent "outline" element is used as the folder path 129 | USE_OPML_TITLE_AS_FOLDER = 0 -------------------------------------------------------------------------------- /html2text.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """html2text: Turn HTML into equivalent Markdown-structured text.""" 3 | __version__ = "3.200.3" 4 | __author__ = "Aaron Swartz (me@aaronsw.com)" 5 | __copyright__ = "(C) 2004-2008 Aaron Swartz. GNU GPL 3." 6 | __contributors__ = ["Martin 'Joey' Schulze", "Ricardo Reyes", "Kevin Jay North"] 7 | 8 | # TODO: 9 | # Support decoded entities with unifiable. 10 | 11 | try: 12 | True 13 | except NameError: 14 | setattr(__builtins__, 'True', 1) 15 | setattr(__builtins__, 'False', 0) 16 | 17 | def has_key(x, y): 18 | if hasattr(x, 'has_key'): return x.has_key(y) 19 | else: return y in x 20 | 21 | try: 22 | import htmlentitydefs 23 | import urlparse 24 | import HTMLParser 25 | except ImportError: #Python3 26 | import html.entities as htmlentitydefs 27 | import urllib.parse as urlparse 28 | import html.parser as HTMLParser 29 | try: #Python3 30 | import urllib.request as urllib 31 | except: 32 | import urllib 33 | import optparse, re, sys, codecs, types 34 | 35 | try: from textwrap import wrap 36 | except: pass 37 | 38 | # Use Unicode characters instead of their ascii pseudo-replacements 39 | UNICODE_SNOB = 0 40 | 41 | # Escape all special characters. Output is less readable, but avoids corner case formatting issues. 42 | ESCAPE_SNOB = 0 43 | 44 | # Put the links after each paragraph instead of at the end. 45 | LINKS_EACH_PARAGRAPH = 0 46 | 47 | # Wrap long lines at position. 0 for no wrapping. (Requires Python 2.3.)
48 | BODY_WIDTH = 78 49 | 50 | # Don't show internal links (href="#local-anchor") -- corresponding link targets 51 | # won't be visible in the plain text file anyway. 52 | SKIP_INTERNAL_LINKS = True 53 | 54 | # Use inline, rather than reference, formatting for images and links 55 | INLINE_LINKS = False 56 | 57 | # Number of pixels Google indents nested lists 58 | GOOGLE_LIST_INDENT = 36 59 | 60 | IGNORE_ANCHORS = False 61 | IGNORE_IMAGES = False 62 | IGNORE_EMPHASIS = False 63 | 64 | ### Entity Nonsense ### 65 | 66 | def name2cp(k): 67 | if k == 'apos': return ord("'") 68 | if hasattr(htmlentitydefs, "name2codepoint"): # requires Python 2.3 69 | return htmlentitydefs.name2codepoint[k] 70 | else: 71 | k = htmlentitydefs.entitydefs[k] 72 | if k.startswith("&#") and k.endswith(";"): return int(k[2:-1]) # not in latin-1 73 | return ord(codecs.latin_1_decode(k)[0]) 74 | 75 | unifiable = {'rsquo':"'", 'lsquo':"'", 'rdquo':'"', 'ldquo':'"', 76 | 'copy':'(C)', 'mdash':'--', 'nbsp':' ', 'rarr':'->', 'larr':'<-', 'middot':'*', 77 | 'ndash':'-', 'oelig':'oe', 'aelig':'ae', 78 | 'agrave':'a', 'aacute':'a', 'acirc':'a', 'atilde':'a', 'auml':'a', 'aring':'a', 79 | 'egrave':'e', 'eacute':'e', 'ecirc':'e', 'euml':'e', 80 | 'igrave':'i', 'iacute':'i', 'icirc':'i', 'iuml':'i', 81 | 'ograve':'o', 'oacute':'o', 'ocirc':'o', 'otilde':'o', 'ouml':'o', 82 | 'ugrave':'u', 'uacute':'u', 'ucirc':'u', 'uuml':'u', 83 | 'lrm':'', 'rlm':''} 84 | 85 | unifiable_n = {} 86 | 87 | for k in unifiable.keys(): 88 | unifiable_n[name2cp(k)] = unifiable[k] 89 | 90 | ### End Entity Nonsense ### 91 | 92 | def onlywhite(line): 93 | """Return true if the line consists only of whitespace characters.""" 94 | for c in line: 95 | if c is not ' ' and c is not ' ': 96 | return c is ' ' 97 | return line 98 | 99 | def hn(tag): 100 | if tag[0] == 'h' and len(tag) == 2: 101 | try: 102 | n = int(tag[1]) 103 | if n in range(1, 10): return n 104 | except ValueError: return 0 105 | 106 | def dumb_property_dict(style): 107 | """returns a hash of css attributes""" 108 | return dict([(x.strip(), y.strip()) for x, y in [z.split(':', 1) for z in style.split(';') if ':' in z]]) 109 | 110 | def dumb_css_parser(data): 111 | """returns a hash of css selectors, each of which contains a hash of css attributes""" 112 | # remove @import sentences 113 | data += ';' 114 | importIndex = data.find('@import') 115 | while importIndex != -1: 116 | data = data[0:importIndex] + data[data.find(';', importIndex) + 1:] 117 | importIndex = data.find('@import') 118 | 119 | # parse the css. reverted from dictionary comprehension in order to support older pythons 120 | elements = [x.split('{') for x in data.split('}') if '{' in x.strip()] 121 | try: 122 | elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements]) 123 | except ValueError: 124 | elements = {} # not that important 125 | 126 | return elements 127 | 128 | def element_style(attrs, style_def, parent_style): 129 | """returns a hash of the 'final' style attributes of the element""" 130 | style = parent_style.copy() 131 | if 'class' in attrs: 132 | for css_class in attrs['class'].split(): 133 | css_style = style_def['.
+ css_class] 134 | style.update(css_style) 135 | if 'style' in attrs: 136 | immediate_style = dumb_property_dict(attrs['style']) 137 | style.update(immediate_style) 138 | return style 139 | 140 | def google_list_style(style): 141 | """finds out whether this is an ordered or unordered list""" 142 | if 'list-style-type' in style: 143 | list_style = style['list-style-type'] 144 | if list_style in ['disc', 'circle', 'square', 'none']: 145 | return 'ul' 146 | return 'ol' 147 | 148 | def google_has_height(style): 149 | """check if the style of the element has the 'height' attribute explicitly defined""" 150 | if 'height' in style: 151 | return True 152 | return False 153 | 154 | def google_text_emphasis(style): 155 | """return a list of all emphasis modifiers of the element""" 156 | emphasis = [] 157 | if 'text-decoration' in style: 158 | emphasis.append(style['text-decoration']) 159 | if 'font-style' in style: 160 | emphasis.append(style['font-style']) 161 | if 'font-weight' in style: 162 | emphasis.append(style['font-weight']) 163 | return emphasis 164 | 165 | def google_fixed_width_font(style): 166 | """check if the css of the current element defines a fixed width font""" 167 | font_family = '' 168 | if 'font-family' in style: 169 | font_family = style['font-family'] 170 | if 'Courier New' == font_family or 'Consolas' == font_family: 171 | return True 172 | return False 173 | 174 | def list_numbering_start(attrs): 175 | """extract numbering from list element attributes""" 176 | if 'start' in attrs: 177 | return int(attrs['start']) - 1 178 | else: 179 | return 0 180 | 181 | class HTML2Text(HTMLParser.HTMLParser): 182 | def __init__(self, out=None, baseurl=''): 183 | HTMLParser.HTMLParser.__init__(self) 184 | 185 | # Config options 186 | self.unicode_snob = UNICODE_SNOB 187 | self.escape_snob = ESCAPE_SNOB 188 | self.links_each_paragraph = LINKS_EACH_PARAGRAPH 189 | self.body_width = BODY_WIDTH 190 | self.skip_internal_links = SKIP_INTERNAL_LINKS 191 | self.inline_links = INLINE_LINKS 192 | self.google_list_indent = GOOGLE_LIST_INDENT 193 | self.ignore_links = IGNORE_ANCHORS 194 | self.ignore_images = IGNORE_IMAGES 195 | self.ignore_emphasis = IGNORE_EMPHASIS 196 | self.google_doc = False 197 | self.ul_item_mark = '*' 198 | self.emphasis_mark = '_' 199 | self.strong_mark = '**' 200 | 201 | if out is None: 202 | self.out = self.outtextf 203 | else: 204 | self.out = out 205 | 206 | self.outtextlist = [] # empty list to store output characters before they are "joined" 207 | 208 | try: 209 | self.outtext = unicode() 210 | except NameError: # Python3 211 | self.outtext = str() 212 | 213 | self.quiet = 0 214 | self.p_p = 0 # number of newline character to print before next output 215 | self.outcount = 0 216 | self.start = 1 217 | self.space = 0 218 | self.a = [] 219 | self.astack = [] 220 | self.maybe_automatic_link = None 221 | self.absolute_url_matcher = re.compile(r'^[a-zA-Z+]+://') 222 | self.acount = 0 223 | self.list = [] 224 | self.blockquote = 0 225 | self.pre = 0 226 | self.startpre = 0 227 | self.code = False 228 | self.br_toggle = '' 229 | self.lastWasNL = 0 230 | self.lastWasList = False 231 | self.style = 0 232 | self.style_def = {} 233 | self.tag_stack = [] 234 | self.emphasis = 0 235 | self.drop_white_space = 0 236 | self.inheader = False 237 | self.abbr_title = None # current abbreviation definition 238 | self.abbr_data = None # last inner HTML (for abbr being defined) 239 | self.abbr_list = {} # stack of abbreviations to write later 240 | self.baseurl = baseurl 241 | 242 | try: del 
unifiable_n[name2cp('nbsp')] 243 | except KeyError: pass 244 | unifiable['nbsp'] = '&nbsp_place_holder;' 245 | 246 | 247 | def feed(self, data): 248 | data = data.replace("</' + 'script>", "</ignore>") 249 | HTMLParser.HTMLParser.feed(self, data) 250 | 251 | def handle(self, data): 252 | self.feed(data) 253 | self.feed("") 254 | return self.optwrap(self.close()) 255 | 256 | def outtextf(self, s): 257 | self.outtextlist.append(s) 258 | if s: self.lastWasNL = s[-1] == '\n' 259 | 260 | def close(self): 261 | HTMLParser.HTMLParser.close(self) 262 | 263 | self.pbr() 264 | self.o('', 0, 'end') 265 | 266 | self.outtext = self.outtext.join(self.outtextlist) 267 | if self.unicode_snob: 268 | nbsp = unichr(name2cp('nbsp')) 269 | else: 270 | nbsp = u' ' 271 | self.outtext = self.outtext.replace(u'&nbsp_place_holder;', nbsp) 272 | 273 | return self.outtext 274 | 275 | def handle_charref(self, c): 276 | self.o(self.charref(c), 1) 277 | 278 | def handle_entityref(self, c): 279 | self.o(self.entityref(c), 1) 280 | 281 | def handle_starttag(self, tag, attrs): 282 | self.handle_tag(tag, attrs, 1) 283 | 284 | def handle_endtag(self, tag): 285 | self.handle_tag(tag, None, 0) 286 | 287 | def previousIndex(self, attrs): 288 | """ returns the index of certain set of attributes (of a link) in the 289 | self.a list 290 | 291 | If the set of attributes is not found, returns None 292 | """ 293 | if not has_key(attrs, 'href'): return None 294 | 295 | i = -1 296 | for a in self.a: 297 | i += 1 298 | match = 0 299 | 300 | if has_key(a, 'href') and a['href'] == attrs['href']: 301 | if has_key(a, 'title') or has_key(attrs, 'title'): 302 | if (has_key(a, 'title') and has_key(attrs, 'title') and 303 | a['title'] == attrs['title']): 304 | match = True 305 | else: 306 | match = True 307 | 308 | if match: return i 309 | 310 | def drop_last(self, nLetters): 311 | if not self.quiet: 312 | self.outtext = self.outtext[:-nLetters] 313 | 314 | def handle_emphasis(self, start, tag_style, parent_style): 315 | """handles various text emphases""" 316 | tag_emphasis = google_text_emphasis(tag_style) 317 | parent_emphasis = google_text_emphasis(parent_style) 318 | 319 | # handle Google's text emphasis 320 | strikethrough = 'line-through' in tag_emphasis and self.hide_strikethrough 321 | bold = 'bold' in tag_emphasis and not 'bold' in parent_emphasis 322 | italic = 'italic' in tag_emphasis and not 'italic' in parent_emphasis 323 | fixed = google_fixed_width_font(tag_style) and not \ 324 | google_fixed_width_font(parent_style) and not self.pre 325 | 326 | if start: 327 | # crossed-out text must be handled before other attributes 328 | # in order not to output qualifiers unnecessarily 329 | if bold or italic or fixed: 330 | self.emphasis += 1 331 | if strikethrough: 332 | self.quiet += 1 333 | if italic: 334 | self.o(self.emphasis_mark) 335 | self.drop_white_space += 1 336 | if bold: 337 | self.o(self.strong_mark) 338 | self.drop_white_space += 1 339 | if fixed: 340 | self.o('`') 341 | self.drop_white_space += 1 342 | self.code = True 343 | else: 344 | if bold or italic or fixed: 345 | # there must not be whitespace before closing emphasis mark 346 | self.emphasis -= 1 347 | self.space = 0 348 | self.outtext = self.outtext.rstrip() 349 | if fixed: 350 | if self.drop_white_space: 351 | # empty emphasis, drop it 352 | self.drop_last(1) 353 | self.drop_white_space -= 1 354 | else: 355 | self.o('`') 356 | self.code = False 357 | if bold: 358 | if self.drop_white_space: 359 | # empty emphasis, drop it 360 | self.drop_last(2) 361 | self.drop_white_space -= 1 362 | else: 363 |
self.o(self.strong_mark) 364 | if italic: 365 | if self.drop_white_space: 366 | # empty emphasis, drop it 367 | self.drop_last(1) 368 | self.drop_white_space -= 1 369 | else: 370 | self.o(self.emphasis_mark) 371 | # space is only allowed after *all* emphasis marks 372 | if (bold or italic) and not self.emphasis: 373 | self.o(" ") 374 | if strikethrough: 375 | self.quiet -= 1 376 | 377 | def handle_tag(self, tag, attrs, start): 378 | #attrs = fixattrs(attrs) 379 | if attrs is None: 380 | attrs = {} 381 | else: 382 | attrs = dict(attrs) 383 | 384 | if self.google_doc: 385 | # the attrs parameter is empty for a closing tag. in addition, we 386 | # need the attributes of the parent nodes in order to get a 387 | # complete style description for the current element. we assume 388 | # that google docs export well formed html. 389 | parent_style = {} 390 | if start: 391 | if self.tag_stack: 392 | parent_style = self.tag_stack[-1][2] 393 | tag_style = element_style(attrs, self.style_def, parent_style) 394 | self.tag_stack.append((tag, attrs, tag_style)) 395 | else: 396 | dummy, attrs, tag_style = self.tag_stack.pop() 397 | if self.tag_stack: 398 | parent_style = self.tag_stack[-1][2] 399 | 400 | if hn(tag): 401 | self.p() 402 | if start: 403 | self.inheader = True 404 | self.o(hn(tag)*"#" + ' ') 405 | else: 406 | self.inheader = False 407 | return # prevent redundant emphasis marks on headers 408 | 409 | if tag in ['p', 'div']: 410 | if self.google_doc: 411 | if start and google_has_height(tag_style): 412 | self.p() 413 | else: 414 | self.soft_br() 415 | else: 416 | self.p() 417 | 418 | if tag == "br" and start: self.o(" \n") 419 | 420 | if tag == "hr" and start: 421 | self.p() 422 | self.o("* * *") 423 | self.p() 424 | 425 | if tag in ["head", "style", 'script']: 426 | if start: self.quiet += 1 427 | else: self.quiet -= 1 428 | 429 | if tag == "style": 430 | if start: self.style += 1 431 | else: self.style -= 1 432 | 433 | if tag in ["body"]: 434 | self.quiet = 0 # sites like 9rules.com never close 435 | 436 | if tag == "blockquote": 437 | if start: 438 | self.p(); self.o('> ', 0, 1); self.start = 1 439 | self.blockquote += 1 440 | else: 441 | self.blockquote -= 1 442 | self.p() 443 | 444 | if tag in ['em', 'i', 'u'] and not self.ignore_emphasis: self.o(self.emphasis_mark) 445 | if tag in ['strong', 'b'] and not self.ignore_emphasis: self.o(self.strong_mark) 446 | if tag in ['del', 'strike', 's']: 447 | if start: 448 | self.o("<"+tag+">") 449 | else: 450 | self.o("") 451 | 452 | if self.google_doc: 453 | if not self.inheader: 454 | # handle some font attributes, but leave headers clean 455 | self.handle_emphasis(start, tag_style, parent_style) 456 | 457 | if tag in ["code", "tt"] and not self.pre: self.o('`') #TODO: `` `this` `` 458 | if tag == "abbr": 459 | if start: 460 | self.abbr_title = None 461 | self.abbr_data = '' 462 | if has_key(attrs, 'title'): 463 | self.abbr_title = attrs['title'] 464 | else: 465 | if self.abbr_title != None: 466 | self.abbr_list[self.abbr_data] = self.abbr_title 467 | self.abbr_title = None 468 | self.abbr_data = '' 469 | 470 | if tag == "a" and not self.ignore_links: 471 | if start: 472 | if has_key(attrs, 'href') and not (self.skip_internal_links and attrs['href'].startswith('#')): 473 | self.astack.append(attrs) 474 | self.maybe_automatic_link = attrs['href'] 475 | else: 476 | self.astack.append(None) 477 | else: 478 | if self.astack: 479 | a = self.astack.pop() 480 | if self.maybe_automatic_link: 481 | self.maybe_automatic_link = None 482 | elif a: 483 | if 
self.inline_links: 484 | self.o("](" + escape_md(a['href']) + ")") 485 | else: 486 | i = self.previousIndex(a) 487 | if i is not None: 488 | a = self.a[i] 489 | else: 490 | self.acount += 1 491 | a['count'] = self.acount 492 | a['outcount'] = self.outcount 493 | self.a.append(a) 494 | self.o("][" + str(a['count']) + "]") 495 | 496 | if tag == "img" and start and not self.ignore_images: 497 | if has_key(attrs, 'src'): 498 | attrs['href'] = attrs['src'] 499 | alt = attrs.get('alt', '') 500 | self.o("![" + escape_md(alt) + "]") 501 | 502 | if self.inline_links: 503 | self.o("(" + escape_md(attrs['href']) + ")") 504 | else: 505 | i = self.previousIndex(attrs) 506 | if i is not None: 507 | attrs = self.a[i] 508 | else: 509 | self.acount += 1 510 | attrs['count'] = self.acount 511 | attrs['outcount'] = self.outcount 512 | self.a.append(attrs) 513 | self.o("[" + str(attrs['count']) + "]") 514 | 515 | if tag == 'dl' and start: self.p() 516 | if tag == 'dt' and not start: self.pbr() 517 | if tag == 'dd' and start: self.o(' ') 518 | if tag == 'dd' and not start: self.pbr() 519 | 520 | if tag in ["ol", "ul"]: 521 | # Google Docs create sub lists as top level lists 522 | if (not self.list) and (not self.lastWasList): 523 | self.p() 524 | if start: 525 | if self.google_doc: 526 | list_style = google_list_style(tag_style) 527 | else: 528 | list_style = tag 529 | numbering_start = list_numbering_start(attrs) 530 | self.list.append({'name':list_style, 'num':numbering_start}) 531 | else: 532 | if self.list: self.list.pop() 533 | self.lastWasList = True 534 | else: 535 | self.lastWasList = False 536 | 537 | if tag == 'li': 538 | self.pbr() 539 | if start: 540 | if self.list: li = self.list[-1] 541 | else: li = {'name':'ul', 'num':0} 542 | if self.google_doc: 543 | nest_count = self.google_nest_count(tag_style) 544 | else: 545 | nest_count = len(self.list) 546 | self.o(" " * nest_count) #TODO: line up
  1. s > 9 correctly. 547 | if li['name'] == "ul": self.o(self.ul_item_mark + " ") 548 | elif li['name'] == "ol": 549 | li['num'] += 1 550 | self.o(str(li['num'])+". ") 551 | self.start = 1 552 | 553 | if tag in ["table", "tr"] and start: self.p() 554 | if tag == 'td': self.pbr() 555 | 556 | if tag == "pre": 557 | if start: 558 | self.startpre = 1 559 | self.pre = 1 560 | else: 561 | self.pre = 0 562 | self.p() 563 | 564 | def pbr(self): 565 | if self.p_p == 0: 566 | self.p_p = 1 567 | 568 | def p(self): 569 | self.p_p = 2 570 | 571 | def soft_br(self): 572 | self.pbr() 573 | self.br_toggle = ' ' 574 | 575 | def o(self, data, puredata=0, force=0): 576 | if self.abbr_data is not None: 577 | self.abbr_data += data 578 | 579 | if not self.quiet: 580 | if self.google_doc: 581 | # prevent white space immediately after 'begin emphasis' marks ('**' and '_') 582 | lstripped_data = data.lstrip() 583 | if self.drop_white_space and not (self.pre or self.code): 584 | data = lstripped_data 585 | if lstripped_data != '': 586 | self.drop_white_space = 0 587 | 588 | if puredata and not self.pre: 589 | data = re.sub('\s+', ' ', data) 590 | if data and data[0] == ' ': 591 | self.space = 1 592 | data = data[1:] 593 | if not data and not force: return 594 | 595 | if self.startpre: 596 | #self.out(" :") #TODO: not output when already one there 597 | if not data.startswith("\n"): #
<pre> stuff...
    598 |                     data = "\n" + data
    599 | 
    600 |             bq = (">" * self.blockquote)
    601 |             if not (force and data and data[0] == ">") and self.blockquote: bq += " "
    602 | 
    603 |             if self.pre:
    604 |                 if not self.list:
    605 |                     bq += "    "
    606 |                 #else: list content is already partially indented
    607 |                 for i in xrange(len(self.list)):
    608 |                     bq += "    "
    609 |                 data = data.replace("\n", "\n"+bq)
    610 | 
    611 |             if self.startpre:
    612 |                 self.startpre = 0
    613 |                 if self.list:
    614 |                     data = data.lstrip("\n") # use existing initial indentation
    615 | 
    616 |             if self.start:
    617 |                 self.space = 0
    618 |                 self.p_p = 0
    619 |                 self.start = 0
    620 | 
    621 |             if force == 'end':
    622 |                 # It's the end.
    623 |                 self.p_p = 0
    624 |                 self.out("\n")
    625 |                 self.space = 0
    626 | 
    627 |             if self.p_p:
    628 |                 self.out((self.br_toggle+'\n'+bq)*self.p_p)
    629 |                 self.space = 0
    630 |                 self.br_toggle = ''
    631 | 
    632 |             if self.space:
    633 |                 if not self.lastWasNL: self.out(' ')
    634 |                 self.space = 0
    635 | 
    636 |             if self.a and ((self.p_p == 2 and self.links_each_paragraph) or force == "end"):
    637 |                 if force == "end": self.out("\n")
    638 | 
    639 |                 newa = []
    640 |                 for link in self.a:
    641 |                     if self.outcount > link['outcount']:
    642 |                         self.out("   ["+ str(link['count']) +"]: " + urlparse.urljoin(self.baseurl, link['href']))
    643 |                         if has_key(link, 'title'): self.out(" ("+link['title']+")")
    644 |                         self.out("\n")
    645 |                     else:
    646 |                         newa.append(link)
    647 | 
    648 |                 if self.a != newa: self.out("\n") # Don't need an extra line when nothing was done.
    649 | 
    650 |                 self.a = newa
    651 | 
    652 |             if self.abbr_list and force == "end":
    653 |                 for abbr, definition in self.abbr_list.items():
    654 |                     self.out("  *[" + abbr + "]: " + definition + "\n")
    655 | 
    656 |             self.p_p = 0
    657 |             self.out(data)
    658 |             self.outcount += 1
    659 | 
    660 |     def handle_data(self, data):
    661 |         if r'\/script>' in data: self.quiet -= 1
    662 | 
    663 |         if self.style:
    664 |             self.style_def.update(dumb_css_parser(data))
    665 | 
    666 |         if self.maybe_automatic_link is not None:
    667 |             href = self.maybe_automatic_link
    668 |             if href == data and self.absolute_url_matcher.match(href):
    669 |                 self.o("<" + data + ">")
    670 |                 return
    671 |             else:
    672 |                 self.o("[")
    673 |                 self.maybe_automatic_link = None
    674 | 
    675 |         if not self.code and not self.pre:
    676 |             data = escape_md_section(data, snob=self.escape_snob)
    677 |         self.o(data, 1)
    678 | 
    679 |     def unknown_decl(self, data): pass
    680 | 
    681 |     def charref(self, name):
    682 |         if name[0] in ['x','X']:
    683 |             c = int(name[1:], 16)
    684 |         else:
    685 |             c = int(name)
    686 | 
    687 |         if not self.unicode_snob and c in unifiable_n.keys():
    688 |             return unifiable_n[c]
    689 |         else:
    690 |             try:
    691 |                 return unichr(c)
    692 |             except NameError: #Python3
    693 |                 return chr(c)
    694 | 
    695 |     def entityref(self, c):
    696 |         if not self.unicode_snob and c in unifiable.keys():
    697 |             return unifiable[c]
    698 |         else:
    699 |             try: name2cp(c)
    700 |             except KeyError: return "&" + c + ';'
    701 |             else:
    702 |                 try:
    703 |                     return unichr(name2cp(c))
    704 |                 except NameError: #Python3
    705 |                     return chr(name2cp(c))
    706 | 
    707 |     def replaceEntities(self, s):
    708 |         s = s.group(1)
    709 |         if s[0] == "#":
    710 |             return self.charref(s[1:])
    711 |         else: return self.entityref(s)
    712 | 
    713 |     r_unescape = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")
    714 |     def unescape(self, s):
    715 |         return self.r_unescape.sub(self.replaceEntities, s)
    716 | 
    717 |     def google_nest_count(self, style):
    718 |         """calculate the nesting count of google doc lists"""
    719 |         nest_count = 0
    720 |         if 'margin-left' in style:
    721 |             nest_count = int(style['margin-left'][:-2]) / self.google_list_indent
    722 |         return nest_count
    723 | 
    724 | 
    725 |     def optwrap(self, text):
    726 |         """Wrap all paragraphs in the provided text."""
    727 |         if not self.body_width:
    728 |             return text
    729 | 
    730 |         assert wrap, "Requires Python 2.3."
    731 |         result = ''
    732 |         newlines = 0
    733 |         for para in text.split("\n"):
    734 |             if len(para) > 0:
    735 |                 if not skipwrap(para):
    736 |                     result += "\n".join(wrap(para, self.body_width))
    737 |                     if para.endswith('  '):
    738 |                         result += "  \n"
    739 |                         newlines = 1
    740 |                     else:
    741 |                         result += "\n\n"
    742 |                         newlines = 2
    743 |                 else:
    744 |                     if not onlywhite(para):
    745 |                         result += para + "\n"
    746 |                         newlines = 1
    747 |             else:
    748 |                 if newlines < 2:
    749 |                     result += "\n"
    750 |                     newlines += 1
    751 |         return result
    752 | 
    753 | ordered_list_matcher = re.compile(r'\d+\.\s')
    754 | unordered_list_matcher = re.compile(r'[-\*\+]\s')
    755 | md_chars_matcher = re.compile(r"([\\\[\]\(\)])")
    756 | md_chars_matcher_all = re.compile(r"([`\*_{}\[\]\(\)#!])")
    757 | md_dot_matcher = re.compile(r"""
    758 |     ^             # start of line
    759 |     (\s*\d+)      # optional whitespace and a number
    760 |     (\.)          # dot
    761 |     (?=\s)        # lookahead assert whitespace
    762 |     """, re.MULTILINE | re.VERBOSE)
    763 | md_plus_matcher = re.compile(r"""
    764 |     ^
    765 |     (\s*)
    766 |     (\+)
    767 |     (?=\s)
    768 |     """, flags=re.MULTILINE | re.VERBOSE)
    769 | md_dash_matcher = re.compile(r"""
    770 |     ^
    771 |     (\s*)
    772 |     (-)
    773 |     (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
    774 |                   # or another dash (header or hr)
    775 |     """, flags=re.MULTILINE | re.VERBOSE)
    776 | slash_chars = r'\`*_{}[]()#+-.!'
    777 | md_backslash_matcher = re.compile(r'''
    778 |     (\\)          # match one slash
    779 |     (?=[%s])      # followed by a char that requires escaping
    780 |     ''' % re.escape(slash_chars),
    781 |     flags=re.VERBOSE)
    782 | 
    783 | def skipwrap(para):
    784 |     # If the text begins with four spaces or one tab, it's a code block; don't wrap
    785 |     if para[0:4] == '    ' or para[0] == '\t':
    786 |         return True
    787 |     # If the text begins with only two "--", possibly preceded by whitespace, that's
    788 |     # an emdash; so wrap.
    789 |     stripped = para.lstrip()
    790 |     if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
    791 |         return False
    792 |     # I'm not sure what this is for; I thought it was to detect lists, but there's
    793 |     # a <br>
-inside- case in one of the tests that also depends upon it. 794 | if stripped[0:1] == '-' or stripped[0:1] == '*': 795 | return True 796 | # If the text begins with a single -, *, or +, followed by a space, or an integer, 797 | # followed by a ., followed by a space (in either case optionally preceded by 798 | # whitespace), it's a list; don't wrap. 799 | if ordered_list_matcher.match(stripped) or unordered_list_matcher.match(stripped): 800 | return True 801 | return False 802 | 803 | def wrapwrite(text): 804 | text = text.encode('utf-8') 805 | try: #Python3 806 | sys.stdout.buffer.write(text) 807 | except AttributeError: 808 | sys.stdout.write(text) 809 | 810 | def html2text(html, baseurl='', plaintext=False): 811 | h = HTML2Text(baseurl=baseurl) 812 | h.ignore_links = plaintext 813 | h.ignore_emphasis = plaintext 814 | h.ignore_images = plaintext 815 | return h.handle(html) 816 | 817 | def unescape(s, unicode_snob=False): 818 | h = HTML2Text() 819 | h.unicode_snob = unicode_snob 820 | return h.unescape(s) 821 | 822 | def escape_md(text): 823 | """Escapes markdown-sensitive characters within other markdown constructs.""" 824 | return md_chars_matcher.sub(r"\\\1", text) 825 | 826 | def escape_md_section(text, snob=False): 827 | """Escapes markdown-sensitive characters across whole document sections.""" 828 | text = md_backslash_matcher.sub(r"\\\1", text) 829 | if snob: 830 | text = md_chars_matcher_all.sub(r"\\\1", text) 831 | text = md_dot_matcher.sub(r"\1\\\2", text) 832 | text = md_plus_matcher.sub(r"\1\\\2", text) 833 | text = md_dash_matcher.sub(r"\1\\\2", text) 834 | return text 835 | 836 | 837 | def main(): 838 | baseurl = '' 839 | 840 | p = optparse.OptionParser('%prog [(filename|url) [encoding]]', 841 | version='%prog ' + __version__) 842 | p.add_option("--ignore-emphasis", dest="ignore_emphasis", action="store_true", 843 | default=IGNORE_EMPHASIS, help="don't include any formatting for emphasis") 844 | p.add_option("--ignore-links", dest="ignore_links", action="store_true", 845 | default=IGNORE_ANCHORS, help="don't include any formatting for links") 846 | p.add_option("--ignore-images", dest="ignore_images", action="store_true", 847 | default=IGNORE_IMAGES, help="don't include any formatting for images") 848 | p.add_option("-g", "--google-doc", action="store_true", dest="google_doc", 849 | default=False, help="convert an html-exported Google Document") 850 | p.add_option("-d", "--dash-unordered-list", action="store_true", dest="ul_style_dash", 851 | default=False, help="use a dash rather than a star for unordered list items") 852 | p.add_option("-e", "--asterisk-emphasis", action="store_true", dest="em_style_asterisk", 853 | default=False, help="use an asterisk rather than an underscore for emphasized text") 854 | p.add_option("-b", "--body-width", dest="body_width", action="store", type="int", 855 | default=BODY_WIDTH, help="number of characters per output line, 0 for no wrap") 856 | p.add_option("-i", "--google-list-indent", dest="list_indent", action="store", type="int", 857 | default=GOOGLE_LIST_INDENT, help="number of pixels Google indents nested lists") 858 | p.add_option("-s", "--hide-strikethrough", action="store_true", dest="hide_strikethrough", 859 | default=False, help="hide strike-through text. only relevant when -g is specified as well") 860 | p.add_option("--escape-all", action="store_true", dest="escape_snob", 861 | default=False, help="Escape all special characters.
Output is less readable, but avoids corner case formatting issues.") 862 | (options, args) = p.parse_args() 863 | 864 | # process input 865 | encoding = "utf-8" 866 | if len(args) > 0: 867 | file_ = args[0] 868 | if len(args) == 2: 869 | encoding = args[1] 870 | if len(args) > 2: 871 | p.error('Too many arguments') 872 | 873 | if file_.startswith('http://') or file_.startswith('https://'): 874 | baseurl = file_ 875 | j = urllib.urlopen(baseurl) 876 | data = j.read() 877 | if encoding is None: 878 | try: 879 | from feedparser import _getCharacterEncoding as enc 880 | except ImportError: 881 | enc = lambda x, y: ('utf-8', 1) 882 | encoding = enc(j.headers, data)[0] 883 | if encoding == 'us-ascii': 884 | encoding = 'utf-8' 885 | else: 886 | data = open(file_, 'rb').read() 887 | if encoding is None: 888 | try: 889 | from chardet import detect 890 | except ImportError: 891 | detect = lambda x: {'encoding': 'utf-8'} 892 | encoding = detect(data)['encoding'] 893 | else: 894 | data = sys.stdin.read() 895 | 896 | data = data.decode(encoding) 897 | h = HTML2Text(baseurl=baseurl) 898 | # handle options 899 | if options.ul_style_dash: h.ul_item_mark = '-' 900 | if options.em_style_asterisk: 901 | h.emphasis_mark = '*' 902 | h.strong_mark = '__' 903 | 904 | h.body_width = options.body_width 905 | h.list_indent = options.list_indent 906 | h.ignore_emphasis = options.ignore_emphasis 907 | h.ignore_links = options.ignore_links 908 | h.ignore_images = options.ignore_images 909 | h.google_doc = options.google_doc 910 | h.hide_strikethrough = options.hide_strikethrough 911 | h.escape_snob = options.escape_snob 912 | 913 | wrapwrite(h.handle(data)) 914 | 915 | 916 | if __name__ == "__main__": 917 | main() 918 | -------------------------------------------------------------------------------- /r2e: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -e 4 | 5 | FEEDS=feeds.dat 6 | 7 | # Look for feeds.dat in the current directory and, if found, use that 8 | # as configuration data. Otherwise, use ~/.rss2email as a directory to 9 | # store the data. 10 | 11 | if [ -f "${FEEDS}" ]; then 12 | CFDIR=. 13 | else 14 | CFDIR="${HOME}/.rss2email" 15 | fi 16 | 17 | if [ ! -d "${CFDIR}" ]; then 18 | mkdir "${CFDIR}" 19 | fi 20 | 21 | # Add $CFDIR to $PYTHONPATH so that config.py can be found. 22 | PYTHONPATH="${CFDIR}:${PYTHONPATH}" python rss2email.py "${CFDIR}/${FEEDS}" "$@" 23 | -------------------------------------------------------------------------------- /r2e.bat: -------------------------------------------------------------------------------- 1 | @python rss2email.py feeds.dat %1 %2 %3 %4 %5 %6 %7 %8 %9 2 | -------------------------------------------------------------------------------- /readme.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | Getting Started With rss2email 4 | 5 | 6 |

    Getting Started With rss2email

    7 | 8 |

    We highly recommend that you subscribe to the rss2email project feed so you can keep up to date with the latest version, bugfixes and features: http://feeds.feedburner.com/allthingsrss/hJBr

    9 |

    Instructions for Windows Users
    10 | Instructions for UNIX Users
    11 | Customizing rss2email

    12 | 13 | 14 |

    Instructions for Windows Users

    15 | 16 |

    Requirements

    17 | 18 |

Before you install rss2email, you'll need to make sure that a few things are in place. First, you need a version of Python 2.x installed. Second, determine your outgoing email server's address. That should be all you need.

    19 | 20 |

    Download

    21 | 22 |
      23 |
1. Create a new folder
2. 24 | Download the latest rss2email .ZIP file
3. Unzip it to the new folder 25 |
    26 | 27 |

    Configure

    28 | 29 |

Edit the config.py file and fill in your outgoing email server's details. If your server requires you to log in, change "AUTHREQUIRED = 0" to "AUTHREQUIRED = 1" and enter your email username and password.
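For example, a minimal config.py for a mail server that requires login might look like this (the server name and credentials below are the placeholders from config.py.example; substitute your own):

SMTP_SERVER = "smtp.yourisp.net:25"
AUTHREQUIRED = 1
SMTP_USER = 'username'
SMTP_PASS = 'password'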

    30 | 31 |

    Install

    32 | 33 |

    From the command line, change to the folder you created. Now create a new feed database to send updates to your email address:

    34 | 35 |
    36 |

    r2e new you@yourdomain.com

    37 |
    38 | 39 |

    Subscribe to some feeds:

    40 | 41 |
    42 |

    r2e add http://feeds.feedburner.com/allthingsrss/hJBr

    43 |
    44 | 45 |

    That's the feed to be notified when there's a new version of rss2email. Repeat this for each feed you want to subscribe to.
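If you want to review or prune your subscriptions later, rss2email also has list, delete, pause and unpause commands (the feed numbers below are only examples; use the numbers printed by list):

r2e list
r2e delete 2
r2e pause 3
r2e unpause 3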

    46 | 47 |

    When you run rss2email, it emails you about every story it hasn't seen before. But the first time you run it, that will be every story. To avoid this, you can ask rss2email not to send you any stories the first time you run it:

    48 | 49 |
    50 |

    r2e run --no-send

    51 |
    52 | 53 |

    Then later, you can ask it to email you new stories:

    54 | 55 |
    56 |

    r2e run

    57 | 58 |
    59 | 60 |

    If you get an error message "Sender domain must exist", add a line to config.py like this:

    61 | 62 |
    63 |

DEFAULT_FROM = "rss2email@yoursite.com"

    64 |
    65 | 66 |

    You can make the email address whatever you want, but your mail server requires that the yoursite.com part actually exists.

    67 | 68 |

    Automating rss2email

    69 | 70 |

More than likely you will want rss2email to run automatically at a regular interval. Under Windows this can be easily accomplished using the Windows Task Scheduler. This site has a nice tutorial on it. Just select r2e.bat as the program to run. Once you've created the task, double-click on it in the task list and change the Run entry so that "run" comes after r2e.bat. For example, if you installed rss2email in the C:\rss2email folder, then you would change the Run entry from "C:\rss2email\r2e.bat" to "C:\rss2email\r2e.bat run".
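If you prefer the command line, an equivalent scheduled task can be created with the schtasks tool (the task name, hourly interval and install folder below are only examples):

schtasks /Create /SC HOURLY /TN rss2email /TR "C:\rss2email\r2e.bat run"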

    71 | 72 |

    Now jump down to the section on customizing rss2email to your needs.

    73 | 74 |

    Upgrading to a new version

    75 |

Simply copy all of the files from the .ZIP package into your install directory, EXCEPT config.py

    76 | 77 |

    Instructions for UNIX/Linux Users

    78 | 79 |

    Requirements

    80 | 81 |

Before you install rss2email, you'll need to make sure that a few things are in place. First, you need a version of Python 2.x installed. Second, check whether you have sendmail (or a compatible replacement like postfix) installed. If sendmail isn't installed, determine your outgoing email server's address. That should be all you need.

    82 | 83 |

    Download

    84 | 85 |

A quick way to get rss2email going is to use pre-made packages. Here are releases for Debian Linux, Ubuntu Linux and NetBSD.

    86 | 87 |

    If you are unable to use these packages or you want the latest and greatest version, here's what you do:

    88 | 89 |
    90 | Unarchive (probably 'tar -xzf') the rss2email .tar.gz package to [folder where you want rss2email files to live]
    91 | cd [yourfolder]
    92 | chmod +x r2e 93 |
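Concretely, that might look like this (the archive name and install folder are only examples):

mkdir ~/rss2email
tar -xzf rss2email.tar.gz -C ~/rss2email
cd ~/rss2email
chmod +x r2e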
    94 | 95 |

    Install

    96 | 97 |

    Create a new feed database with your target email address:

    98 | 99 |
    100 |

    ./r2e new you@yourdomain.com

    101 |
    102 | 103 |

    Subscribe to some feeds:

    104 | 105 |
    106 |

    ./r2e add http://feeds.feedburner.com/allthingsrss/hJBr

    107 | 108 |
    109 | 110 |

    That's the feed to be notified when there's a new version of rss2email. Repeat this for each feed you want to subscribe to.
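Since this fork delivers over IMAP, the add command also accepts an optional email address and an IMAP folder after the feed URL (per the usage text in rss2email.py), so a feed can be filed outside INBOX. The URL, address and folder here are only examples:

./r2e add http://example.com/feed.xml you@yourdomain.com News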

    111 | 112 |

    When you run rss2email, it emails you about every story it hasn't seen before. But the first time you run it, that will be every story. To avoid this, you can ask rss2email not to send you any stories the first time you run it:

    113 | 114 |
    115 |

    ./r2e run --no-send

    116 |
    117 | 118 |

    Then later, you can ask it to email you new stories:

    119 | 120 |
    121 |

    ./r2e run

    122 |
    123 | 124 |

    You probably want to set things up so that this command is run repeatedly. (One good way is via a cron job.)
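For example, a crontab entry like this one checks your feeds every half hour (the interval and install path are examples; note that the r2e wrapper expects to find rss2email.py in the current directory):

*/30 * * * * cd $HOME/rss2email && ./r2e run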

    125 | 126 |

    If you get an error message "Sender domain must exist", add a line to config.py like this:

    127 | 128 |
    129 |

DEFAULT_FROM = "rss2email@yoursite.com"

    130 |
    131 | 132 |

    You can make the email address whatever you want, but your mail server requires that the yoursite.com part actually exists.

    133 | 134 |

    Upgrading to a new version

    135 |

Simply copy all of the files from the .tar.gz package into your install directory, EXCEPT config.py

    136 | 137 | 138 | 139 |

    Customize rss2email

    140 | 141 |

There are a number of options, described in full at the top of the rss2email.py file, to customize the way rss2email behaves. If you want to change something, edit the config.py file. If you're not using rss2email under Windows, you'll have to create this file if it doesn't already exist.

    For example, if you want to receive HTML mail, instead of having entries converted to plain text:

    HTML_MAIL = 1

    To be notified every time a post changes, instead of just when it's first posted:

    TRUST_GUID = 0

    And to make the emails look as if they were sent when the item was posted:

    DATE_HEADER = 1
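
    Putting these together, a minimal config.py might look like the sketch below. The option names are the ones used above and in rss2email.py; the values are illustrative placeholders:

    # config.py -- example settings, adjust to taste
    HTML_MAIL = 1                             # send HTML mail instead of plain text
    TRUST_GUID = 0                            # resend entries whenever their content changes
    DATE_HEADER = 1                           # date emails by the item's publication time
    DEFAULT_FROM = "rss2email@yoursite.com"   # fallback sender address
    SMTP_SEND = 0                             # 0: deliver via /usr/sbin/sendmail, 1: via SMTP_SERVER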

    160 | 161 | 162 | 163 | -------------------------------------------------------------------------------- /rss2email.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | """rss2email: get RSS feeds emailed to you 3 | http://rss2email.infogami.com 4 | 5 | Usage: 6 | new [emailaddress] (create new feedfile) 7 | email newemailaddress (update default email) 8 | run [--no-send] [num], 9 | add feedurl [emailaddress] [folder] 10 | list 11 | reset 12 | delete n 13 | pause n 14 | unpause n 15 | opmlexport 16 | opmlimport filename 17 | """ 18 | __version__ = "2.72-rcarmo" 19 | __author__ = "Lindsey Smith (lindsey@allthingsrss.com)" 20 | __copyright__ = "(C) 2004 Aaron Swartz. GNU GPL 2 or 3." 21 | ___contributors__ = ["Dean Jackson", "Brian Lalor", "Joey Hess", 22 | "Matej Cepl", "Martin 'Joey' Schulze", 23 | "Marcel Ackermann (http://www.DreamFlasher.de)", 24 | "Rui Carmo (http://taoofmac.com)", 25 | "Lindsey Smith (maintainer)", "Erik Hetzner", 26 | "Aaron Swartz (original author)" ] 27 | 28 | ### Import Modules ### 29 | 30 | import os, sys, re, time 31 | from datetime import datetime, timedelta 32 | import socket, urllib2, urlparse, imaplib, smtplib 33 | urllib2.install_opener(urllib2.build_opener()) 34 | import string, csv, StringIO 35 | import hashlib, base64 36 | import traceback, types 37 | from types import * 38 | import threading, subprocess 39 | import cPickle as pickle 40 | 41 | from email.MIMEText import MIMEText 42 | from email.Header import Header 43 | from email.Utils import parseaddr, formataddr 44 | 45 | import feedparser 46 | feedparser.USER_AGENT = "rss2email/"+__version__+ " +https://github.com/rcarmo/rss2imap" 47 | 48 | import html2text as h2t 49 | from summarize import summarize 50 | 51 | DEFAULT_IMAP_FOLDER = "INBOX" 52 | 53 | # Read options from config file if present. 54 | sys.path.insert(0,".") 55 | try: 56 | from config import * 57 | except: 58 | pass 59 | 60 | h2t.UNICODE_SNOB = UNICODE_SNOB 61 | h2t.LINKS_EACH_PARAGRAPH = LINKS_EACH_PARAGRAPH 62 | h2t.BODY_WIDTH = BODY_WIDTH 63 | html2text = h2t.html2text 64 | 65 | 66 | def send(sender, recipient, subject, body, contenttype, when, extraheaders=None, mailserver=None, folder=None): 67 | """Send an email. 68 | 69 | All arguments should be Unicode strings (plain ASCII works as well). 70 | 71 | Only the real name part of sender and recipient addresses may contain 72 | non-ASCII characters. 73 | 74 | The email will be properly MIME encoded and delivered though SMTP to 75 | localhost port 25. This is easy to change if you want something different. 76 | 77 | The charset of the email will be the first one out of the list 78 | that can represent all the characters occurring in the email. 79 | """ 80 | 81 | # Header class is smart enough to try US-ASCII, then the charset we 82 | # provide, then fall back to UTF-8. 83 | header_charset = 'ISO-8859-1' 84 | 85 | # We must choose the body charset manually 86 | for body_charset in CHARSET_LIST: 87 | try: 88 | body.encode(body_charset) 89 | except (UnicodeError, LookupError): 90 | pass 91 | else: 92 | break 93 | 94 | # Split real name (which is optional) and email address parts 95 | sender_name, sender_addr = parseaddr(sender) 96 | recipient_name, recipient_addr = parseaddr(recipient) 97 | 98 | # We must always pass Unicode strings to Header, otherwise it will 99 | # use RFC 2047 encoding even on plain ASCII strings. 
100 | sender_name = str(Header(unicode(sender_name), header_charset)) 101 | recipient_name = str(Header(unicode(recipient_name), header_charset)) 102 | 103 | # Make sure email addresses do not contain non-ASCII characters 104 | sender_addr = sender_addr.encode('ascii') 105 | recipient_addr = recipient_addr.encode('ascii') 106 | 107 | # Create the message ('plain' stands for Content-Type: text/plain) 108 | msg = MIMEText(body.encode(body_charset), contenttype, body_charset) 109 | if IMAP_OVERRIDE_TO: 110 | msg['To'] = IMAP_OVERRIDE_TO 111 | else: 112 | msg['To'] = formataddr((recipient_name, recipient_addr)) 113 | msg['Subject'] = Header(unicode(subject), header_charset) 114 | for hdr in extraheaders.keys(): 115 | try: 116 | msg[hdr] = Header(unicode(extraheaders[hdr], header_charset)) 117 | except: 118 | msg[hdr] = Header(extraheaders[hdr]) 119 | 120 | fromhdr = formataddr((sender_name, sender_addr)) 121 | if IMAP_MUNGE_FROM: 122 | msg['From'] = extraheaders['X-MUNGED-FROM'] 123 | else: 124 | msg['From'] = fromhdr 125 | 126 | msg_as_string = msg.as_string() 127 | 128 | if IMAP_SEND: 129 | if not mailserver: 130 | try: 131 | (host,port) = IMAP_SERVER.split(':',1) 132 | except ValueError: 133 | host = IMAP_SERVER 134 | port = 993 if IMAP_SSL else 143 135 | try: 136 | if IMAP_SSL: 137 | mailserver = imaplib.IMAP4_SSL(host, port) 138 | else: 139 | mailserver = imaplib.IMAP4(host, port) 140 | # speed up interactions on TCP connections using small packets 141 | mailserver.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) 142 | mailserver.login(IMAP_USER, IMAP_PASS) 143 | except KeyboardInterrupt: 144 | raise 145 | except Exception, e: 146 | print >>warn, "" 147 | print >>warn, ('Fatal error: could not connect to mail server "%s"' % IMAP_SERVER) 148 | print >>warn, ('Check your config.py file to confirm that IMAP_SERVER and other mail server settings are configured properly') 149 | if hasattr(e, 'reason'): 150 | print >>warn, "Reason:", e.reason 151 | sys.exit(1) 152 | if not folder: 153 | folder = DEFAULT_IMAP_FOLDER 154 | #mailserver.debug = 4 155 | if mailserver.select(folder)[0] == 'NO': 156 | print >>warn, ("%s does not exist, creating" % folder) 157 | mailserver.create(folder) 158 | mailserver.subscribe(folder) 159 | mailserver.append(folder,'',imaplib.Time2Internaldate(when), msg_as_string) 160 | return mailserver 161 | 162 | elif SMTP_SEND: 163 | if not mailserver: 164 | 165 | try: 166 | if SMTP_SSL: 167 | mailserver = smtplib.SMTP_SSL() 168 | else: 169 | mailserver = smtplib.SMTP() 170 | mailserver.connect(SMTP_SERVER) 171 | except KeyboardInterrupt: 172 | raise 173 | except Exception, e: 174 | print >>warn, "" 175 | print >>warn, ('Fatal error: could not connect to mail server "%s"' % SMTP_SERVER) 176 | print >>warn, ('Check your config.py file to confirm that SMTP_SERVER and other mail server settings are configured properly') 177 | if hasattr(e, 'reason'): 178 | print >>warn, "Reason:", e.reason 179 | sys.exit(1) 180 | 181 | if AUTHREQUIRED: 182 | try: 183 | mailserver.ehlo() 184 | if not SMTP_SSL: mailserver.starttls() 185 | mailserver.ehlo() 186 | mailserver.login(SMTP_USER, SMTP_PASS) 187 | except KeyboardInterrupt: 188 | raise 189 | except Exception, e: 190 | print >>warn, "" 191 | print >>warn, ('Fatal error: could not authenticate with mail server "%s" as user "%s"' % (SMTP_SERVER, SMTP_USER)) 192 | print >>warn, ('Check your config.py file to confirm that SMTP_SERVER and other mail server settings are configured properly') 193 | if hasattr(e, 'reason'): 194 | print 
>>warn, "Reason:", e.reason 195 | sys.exit(1) 196 | 197 | mailserver.sendmail(sender, recipient, msg_as_string) 198 | return mailserver 199 | 200 | else: 201 | try: 202 | p = subprocess.Popen(["/usr/sbin/sendmail", recipient], stdin=subprocess.PIPE, stdout=subprocess.PIPE) 203 | p.communicate(msg_as_string) 204 | status = p.returncode 205 | assert status != None, "just a sanity check" 206 | if status != 0: 207 | print >>warn, "" 208 | print >>warn, ('Fatal error: sendmail exited with code %s' % status) 209 | sys.exit(1) 210 | except: 211 | print '''Error attempting to send email via sendmail. Possibly you need to configure your config.py to use a SMTP server? Please refer to the rss2email documentation or website (http://rss2email.infogami.com) for complete documentation of config.py. The options below may suffice for configuring email: 212 | # 1: Use SMTP_SERVER to send mail. 213 | # 0: Call /usr/sbin/sendmail to send mail. 214 | SMTP_SEND = 0 215 | 216 | SMTP_SERVER = "smtp.yourisp.net:25" 217 | AUTHREQUIRED = 0 # if you need to use SMTP AUTH set to 1 218 | SMTP_USER = 'username' # for SMTP AUTH, set SMTP username here 219 | SMTP_PASS = 'password' # for SMTP AUTH, set SMTP password here 220 | ''' 221 | sys.exit(1) 222 | return None 223 | 224 | 225 | warn = sys.stderr 226 | 227 | if QP_REQUIRED: 228 | print >>warn, "QP_REQUIRED has been deprecated in rss2email." 229 | 230 | 231 | unix = 0 232 | try: 233 | import fcntl 234 | # A pox on SunOS file locking methods 235 | if (sys.platform.find('sunos') == -1): 236 | unix = 1 237 | except: 238 | pass 239 | 240 | socket_errors = [] 241 | for e in ['error', 'gaierror']: 242 | if hasattr(socket, e): socket_errors.append(getattr(socket, e)) 243 | 244 | ### Utility Functions ### 245 | 246 | import threading 247 | class TimeoutError(Exception): pass 248 | 249 | class InputError(Exception): pass 250 | 251 | def timelimit(timeout, function): 252 | # def internal(function): 253 | def internal2(*args, **kw): 254 | """ 255 | from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/473878 256 | """ 257 | class Calculator(threading.Thread): 258 | def __init__(self): 259 | threading.Thread.__init__(self) 260 | self.result = None 261 | self.error = None 262 | 263 | def run(self): 264 | try: 265 | self.result = function(*args, **kw) 266 | except: 267 | self.error = sys.exc_info() 268 | 269 | c = Calculator() 270 | c.setDaemon(True) # don't hold up exiting 271 | c.start() 272 | c.join(timeout) 273 | if c.isAlive(): 274 | raise TimeoutError 275 | if c.error: 276 | raise c.error[0], c.error[1] 277 | return c.result 278 | return internal2 279 | # return internal 280 | 281 | 282 | def isstr(f): return isinstance(f, type('')) or isinstance(f, type(u'')) 283 | def ishtml(t): return type(t) is type(()) 284 | def contains(a,b): return a.find(b) != -1 285 | def unu(s): # I / freakin' hate / that unicode 286 | if type(s) is types.UnicodeType: return s.encode('utf-8') 287 | else: return s 288 | 289 | ### Parsing Utilities ### 290 | 291 | def getContent(entry, HTMLOK=0): 292 | """Select the best content from an entry, deHTMLizing if necessary. 293 | If raw HTML is best, an ('HTML', best) tuple is returned. """ 294 | 295 | # How this works: 296 | # * We have a bunch of potential contents. 297 | # * We go thru looking for our first choice. 298 | # (HTML or text, depending on HTMLOK) 299 | # * If that doesn't work, we go thru looking for our second choice. 300 | # * If that still doesn't work, we just take the first one. 
301 | # 302 | # Possible future improvement: 303 | # * Instead of just taking the first one 304 | # pick the one in the "best" language. 305 | # * HACK: hardcoded HTMLOK, should take a tuple of media types 306 | 307 | conts = entry.get('content', []) 308 | 309 | if entry.get('summary_detail', {}): 310 | conts += [entry.summary_detail] 311 | 312 | if conts: 313 | if HTMLOK: 314 | for c in conts: 315 | if contains(c.type, 'html'): return ('HTML', c.value) 316 | 317 | if not HTMLOK: # Only need to convert to text if HTML isn't OK 318 | for c in conts: 319 | if contains(c.type, 'html'): 320 | return html2text(c.value) 321 | 322 | for c in conts: 323 | if c.type == 'text/plain': return c.value 324 | 325 | return conts[0].value 326 | 327 | return "" 328 | 329 | def getID(entry): 330 | """Get best ID from an entry.""" 331 | if TRUST_GUID: 332 | if 'id' in entry and entry.id: 333 | # Newer versions of feedparser could return a dictionary 334 | if type(entry.id) is DictType: 335 | return entry.id.values()[0] 336 | 337 | return entry.id 338 | 339 | content = getContent(entry) 340 | if content and content != "\n": return hashlib.sha1(unu(content)).hexdigest() 341 | if 'link' in entry: return entry.link 342 | if 'title' in entry: return hashlib.sha1(unu(entry.title)).hexdigest() 343 | 344 | def getName(r, entry): 345 | """Get the best name.""" 346 | 347 | if NO_FRIENDLY_NAME: return '' 348 | 349 | feed = r.feed 350 | if hasattr(r, "url") and r.url in OVERRIDE_FROM.keys(): 351 | return OVERRIDE_FROM[r.url] 352 | 353 | name = feed.get('title', '') 354 | 355 | if 'name' in entry.get('author_detail', []): # normally {} but py2.1 356 | if entry.author_detail.name: 357 | if name: name += ": " 358 | det=entry.author_detail.name 359 | try: 360 | name += entry.author_detail.name 361 | except UnicodeDecodeError: 362 | name += unicode(entry.author_detail.name, 'utf-8') 363 | 364 | elif 'name' in feed.get('author_detail', []): 365 | if feed.author_detail.name: 366 | if name: name += ", " 367 | name += feed.author_detail.name 368 | 369 | return name 370 | 371 | 372 | def getMungedFrom(r): 373 | """Generate a better From.""" 374 | 375 | feed = r.feed 376 | if hasattr(r, "url") and r.url in OVERRIDE_FROM.keys(): 377 | return OVERRIDE_FROM[r.url] 378 | 379 | name = feed.get('title', 'unknown').lower() 380 | pattern = re.compile('[\W_]+',re.UNICODE) 381 | re.sub(pattern, '', name) 382 | name = "%s <%s@%s>" % (feed.get('title','Unnamed Feed'), name.replace(' ','_'), urlparse.urlparse(r.url).netloc) 383 | return name 384 | 385 | 386 | def validateEmail(email, planb): 387 | """Do a basic quality check on email address, but return planb if email doesn't appear to be well-formed""" 388 | email_parts = email.split('@') 389 | if len(email_parts) != 2: 390 | return planb 391 | return email 392 | 393 | def getEmail(r, entry): 394 | """Get the best email_address. 
If the best guess isn't well-formed (something@somthing.com), use DEFAULT_FROM instead""" 395 | 396 | feed = r.feed 397 | 398 | if FORCE_FROM: return DEFAULT_FROM 399 | 400 | if hasattr(r, "url") and r.url in OVERRIDE_EMAIL.keys(): 401 | return validateEmail(OVERRIDE_EMAIL[r.url], DEFAULT_FROM) 402 | 403 | if 'email' in entry.get('author_detail', []): 404 | return validateEmail(entry.author_detail.email, DEFAULT_FROM) 405 | 406 | if 'email' in feed.get('author_detail', []): 407 | return validateEmail(feed.author_detail.email, DEFAULT_FROM) 408 | 409 | if USE_PUBLISHER_EMAIL: 410 | if 'email' in feed.get('publisher_detail', []): 411 | return validateEmail(feed.publisher_detail.email, DEFAULT_FROM) 412 | 413 | if feed.get("errorreportsto", ''): 414 | return validateEmail(feed.errorreportsto, DEFAULT_FROM) 415 | 416 | if hasattr(r, "url") and r.url in DEFAULT_EMAIL.keys(): 417 | return DEFAULT_EMAIL[r.url] 418 | return DEFAULT_FROM 419 | 420 | ### Simple Database of Feeds ### 421 | 422 | class Feed: 423 | def __init__(self, url, to, folder=None): 424 | self.url, self.etag, self.modified, self.seen = url, None, None, {} 425 | self.active = True 426 | self.to = to 427 | self.folder = folder 428 | 429 | def load(lock=1): 430 | if not os.path.exists(feedfile): 431 | print 'Feedfile "%s" does not exist. If you\'re using r2e for the first time, you' % feedfile 432 | print "have to run 'r2e new' first." 433 | sys.exit(1) 434 | try: 435 | feedfileObject = open(feedfile, 'r') 436 | except IOError, e: 437 | print "Feedfile could not be opened: %s" % e 438 | sys.exit(1) 439 | feeds = pickle.load(feedfileObject) 440 | 441 | if lock: 442 | locktype = 0 443 | if unix: 444 | locktype = fcntl.LOCK_EX 445 | fcntl.flock(feedfileObject.fileno(), locktype) 446 | #HACK: to deal with lock caching 447 | feedfileObject = open(feedfile, 'r') 448 | feeds = pickle.load(feedfileObject) 449 | if unix: 450 | fcntl.flock(feedfileObject.fileno(), locktype) 451 | if feeds: 452 | for feed in feeds[1:]: 453 | if not hasattr(feed, 'active'): 454 | feed.active = True 455 | 456 | return feeds, feedfileObject 457 | 458 | def unlock(feeds, feedfileObject): 459 | if not unix: 460 | pickle.dump(feeds, open(feedfile, 'w')) 461 | else: 462 | fd = open(feedfile+'.tmp', 'w') 463 | pickle.dump(feeds, fd) 464 | fd.flush() 465 | os.fsync(fd.fileno()) 466 | fd.close() 467 | os.rename(feedfile+'.tmp', feedfile) 468 | fcntl.flock(feedfileObject.fileno(), fcntl.LOCK_UN) 469 | 470 | #@timelimit(FEED_TIMEOUT) 471 | def parse(url, etag, modified): 472 | if PROXY == '': 473 | return feedparser.parse(url, etag, modified) 474 | else: 475 | proxy = urllib2.ProxyHandler( {"http":PROXY} ) 476 | return feedparser.parse(url, etag, modified, handlers = [proxy]) 477 | 478 | 479 | ### Program Functions ### 480 | 481 | def add(*args): 482 | if len(args) == 2 and contains(args[1], '@') and not contains(args[1], '://'): 483 | urls, to = [args[0]], args[1] 484 | folder = None 485 | elif len(args) >= 2: 486 | urls, to, folder = [args[0]], None, ' '.join(args[1:]) 487 | else: 488 | urls, to, folder = args, None, None 489 | 490 | feeds, feedfileObject = load() 491 | if (feeds and not isstr(feeds[0]) and to is None) or (not len(feeds) and to is None): 492 | print "No email address has been defined. Please run 'r2e email emailaddress' or" 493 | print "'r2e add url emailaddress'." 
494 | sys.exit(1) 495 | for url in urls: feeds.append(Feed(url, to, folder)) 496 | unlock(feeds, feedfileObject) 497 | 498 | 499 | ### HTML Parser for grabbing links and images ### 500 | 501 | from HTMLParser import HTMLParser 502 | class Parser(HTMLParser): 503 | def __init__(self, tag = 'a', attr = 'href'): 504 | HTMLParser.__init__(self) 505 | self.tag = tag 506 | self.attr = attr 507 | self.attrs = [] 508 | def handle_starttag(self, tag, attrs): 509 | if tag == self.tag: 510 | attrs = dict(attrs) 511 | if self.attr in attrs: 512 | self.attrs.append(attrs[self.attr]) 513 | 514 | 515 | ### CSV dialect for parsing IMAP responses 516 | class mailboxlist(csv.excel): 517 | delimiter = ' ' 518 | 519 | 520 | csv.register_dialect('mailboxlist',mailboxlist) 521 | 522 | 523 | def uid(data): 524 | m = re.match('\d+ \(UID (?P\d+)\)', data) 525 | return m.group('uid') 526 | 527 | 528 | def run(num=None): 529 | feeds, feedfileObject = load() 530 | mailserver = None 531 | try: 532 | # We store the default to address as the first item in the feeds list. 533 | # Here we take it out and save it for later. 534 | default_to = "" 535 | if feeds and isstr(feeds[0]): default_to = feeds[0]; ifeeds = feeds[1:] 536 | else: ifeeds = feeds 537 | 538 | if num: ifeeds = [feeds[num]] 539 | feednum = 0 540 | 541 | for f in ifeeds: 542 | try: 543 | feednum += 1 544 | if not f.active: continue 545 | 546 | if VERBOSE: print >>warn, 'I: Processing [%d] "%s"' % (feednum, f.url) 547 | r = {} 548 | try: 549 | r = timelimit(FEED_TIMEOUT, parse)(f.url, f.etag, f.modified) 550 | except TimeoutError: 551 | print >>warn, 'W: feed [%d] "%s" timed out' % (feednum, f.url) 552 | continue 553 | 554 | # Handle various status conditions, as required 555 | if 'status' in r: 556 | if r.status == 301: f.url = r['url'] 557 | elif r.status == 410: 558 | print >>warn, "W: feed gone; deleting", f.url 559 | feeds.remove(f) 560 | continue 561 | 562 | http_status = r.get('status', 200) 563 | if VERBOSE > 1: print >>warn, "I: http status", http_status 564 | http_headers = r.get('headers', { 565 | 'content-type': 'application/rss+xml', 566 | 'content-length':'1'}) 567 | exc_type = r.get("bozo_exception", Exception()).__class__ 568 | if http_status != 304 and not r.entries and not r.get('version', ''): 569 | if http_status not in [200, 302]: 570 | print >>warn, "W: error %d [%d] %s" % (http_status, feednum, f.url) 571 | 572 | elif contains(http_headers.get('content-type', 'rss'), 'html'): 573 | print >>warn, "W: looks like HTML [%d] %s" % (feednum, f.url) 574 | 575 | elif http_headers.get('content-length', '1') == '0': 576 | print >>warn, "W: empty page [%d] %s" % (feednum, f.url) 577 | 578 | elif hasattr(socket, 'timeout') and exc_type == socket.timeout: 579 | print >>warn, "W: timed out on [%d] %s" % (feednum, f.url) 580 | 581 | elif exc_type == IOError: 582 | print >>warn, 'W: "%s" [%d] %s' % (r.bozo_exception, feednum, f.url) 583 | 584 | elif hasattr(feedparser, 'zlib') and exc_type == feedparser.zlib.error: 585 | print >>warn, "W: broken compression [%d] %s" % (feednum, f.url) 586 | 587 | elif exc_type in socket_errors: 588 | exc_reason = r.bozo_exception.args[1] 589 | print >>warn, "W: %s [%d] %s" % (exc_reason, feednum, f.url) 590 | 591 | elif exc_type == urllib2.URLError: 592 | if r.bozo_exception.reason.__class__ in socket_errors: 593 | exc_reason = r.bozo_exception.reason.args[1] 594 | else: 595 | exc_reason = r.bozo_exception.reason 596 | print >>warn, "W: %s [%d] %s" % (exc_reason, feednum, f.url) 597 | 598 | elif exc_type == 
AttributeError: 599 | print >>warn, "W: %s [%d] %s" % (r.bozo_exception, feednum, f.url) 600 | 601 | elif exc_type == KeyboardInterrupt: 602 | raise r.bozo_exception 603 | 604 | elif r.bozo: 605 | print >>warn, 'E: error in [%d] "%s" feed (%s)' % (feednum, f.url, r.get("bozo_exception", "can't process")) 606 | 607 | else: 608 | print >>warn, "=== rss2email encountered a problem with this feed ===" 609 | print >>warn, "=== See the rss2email FAQ at http://www.allthingsrss.com/rss2email/ for assistance ===" 610 | print >>warn, "=== If this occurs repeatedly, send this to lindsey@allthingsrss.com ===" 611 | print >>warn, "E:", r.get("bozo_exception", "can't process"), f.url 612 | print >>warn, r 613 | print >>warn, "rss2email", __version__ 614 | print >>warn, "feedparser", feedparser.__version__ 615 | print >>warn, "html2text", h2t.__version__ 616 | print >>warn, "Python", sys.version 617 | print >>warn, "=== END HERE ===" 618 | continue 619 | 620 | r.entries.reverse() 621 | 622 | for entry in r.entries: 623 | id = getID(entry) 624 | 625 | # If TRUST_GUID isn't set, we get back hashes of the content. 626 | # Instead of letting these run wild, we put them in context 627 | # by associating them with the actual ID (if it exists). 628 | 629 | frameid = entry.get('id') 630 | if not(frameid): frameid = id 631 | if type(frameid) is DictType: 632 | frameid = frameid.values()[0] 633 | 634 | # If this item's ID is in our database 635 | # then it's already been sent 636 | # and we don't need to do anything more. 637 | 638 | if frameid in f.seen: 639 | if f.seen[frameid] == id: continue 640 | 641 | if not (f.to or default_to): 642 | print "No default email address defined. Please run 'r2e email emailaddress'" 643 | print "Ignoring feed %s" % f.url 644 | break 645 | 646 | if 'title_detail' in entry and entry.title_detail: 647 | title = entry.title_detail.value 648 | if contains(entry.title_detail.type, 'html'): 649 | title = html2text(title) 650 | else: 651 | title = getContent(entry)[:70] 652 | 653 | title = title.replace("\n", " ").strip() 654 | 655 | when = time.gmtime() 656 | 657 | if DATE_HEADER: 658 | for datetype in DATE_HEADER_ORDER: 659 | kind = datetype+"_parsed" 660 | if kind in entry and entry[kind]: when = entry[kind] 661 | 662 | link = entry.get('link', "") 663 | 664 | from_addr = getEmail(r, entry) 665 | 666 | name = h2t.unescape(getName(r, entry)) 667 | fromhdr = formataddr((name, from_addr,)) 668 | tohdr = (f.to or default_to) 669 | subjecthdr = title 670 | datehdr = time.strftime("%a, %d %b %Y %H:%M:%S -0000", when) 671 | useragenthdr = "rss2email" 672 | 673 | # Add post tags, if available 674 | tagline = "" 675 | if 'tags' in entry: 676 | tags = entry.get('tags') 677 | taglist = [] 678 | if tags: 679 | for tag in tags: 680 | taglist.append(tag['term']) 681 | if taglist: 682 | tagline = ",".join(taglist) 683 | 684 | extraheaders = {'Date': datehdr, 'User-Agent': useragenthdr, 'X-RSS-Feed': f.url, 'Message-ID': '<%s>' % hashlib.sha1(id.encode('utf-8')).hexdigest(), 'X-RSS-ID': id, 'X-RSS-URL': link, 'X-RSS-TAGS' : tagline, 'X-MUNGED-FROM': getMungedFrom(r), 'References': ''} 685 | if BONUS_HEADER != '': 686 | for hdr in BONUS_HEADER.strip().splitlines(): 687 | pos = hdr.strip().find(':') 688 | if pos > 0: 689 | extraheaders[hdr[:pos]] = hdr[pos+1:].strip() 690 | else: 691 | print >>warn, "W: malformed BONUS HEADER", BONUS_HEADER 692 | 693 | entrycontent = getContent(entry, HTMLOK=HTML_MAIL) 694 | contenttype = 'plain' 695 | content = '' 696 | if THREAD_ON_TAGS and len(tagline): 697 | 
extraheaders['References'] += ''.join([' <%s>' % hashlib.sha1(t.strip().encode('utf-8')).hexdigest() for t in tagline.split(',')]) 698 | if USE_CSS_STYLING and HTML_MAIL: 699 | contenttype = 'html' 700 | content = "\n" 701 | content += '\n' 702 | content += '\n' 703 | content += '
    \n' 704 | content += '

    '+subjecthdr+'

    \n' 706 | if ishtml(entrycontent): 707 | body = entrycontent[1].strip() 708 | if SUMMARIZE: 709 | content += '
    %s
    ' % (summarize(html2text(body, plaintext=True), SUMMARIZE) + "
    ") 710 | else: 711 | body = entrycontent.strip() 712 | if SUMMARIZE: 713 | content += '
    %s
    ' % (summarize(body, SUMMARIZE) + "
    ") 714 | if THREAD_ON_LINKS: 715 | parser = Parser() 716 | parser.feed(body) 717 | extraheaders['References'] += ''.join([' <%s>' % hashlib.sha1(h.strip().encode('utf-8')).hexdigest() for h in parser.attrs]) 718 | if INLINE_IMAGES_DATA_URI: 719 | parser = Parser(tag='img', attr='src') 720 | parser.feed(body) 721 | for src in parser.attrs: 722 | try: 723 | img = feedparser._open_resource(src, None, None, feedparser.USER_AGENT, link, [], {}) 724 | data = img.read() 725 | if hasattr(img, 'headers'): 726 | headers = dict((k.lower(), v) for k, v in dict(img.headers).items()) 727 | ctype = headers.get('content-type', None) 728 | if ctype and INLINE_IMAGES_DATA_URI: 729 | body = body.replace(src,'data:%s;base64,%s' % (ctype, base64.b64encode(data))) 730 | except: 731 | print >>warn, "Could not load image: %s" % src 732 | pass 733 | if body != '': 734 | content += '
    \n' + body + '
    \n' 735 | content += '\n
    \n' 752 | content += "\n\n" 753 | else: 754 | if ishtml(entrycontent): 755 | contenttype = 'html' 756 | content = "\n" 757 | content = ("\n\n" + 758 | '

    '+subjecthdr+'

    \n\n' + 759 | entrycontent[1].strip() + # drop type tag (HACK: bad abstraction) 760 | '

    URL: '+link+'

    ' ) 761 | 762 | if hasattr(entry,'enclosures'): 763 | for enclosure in entry.enclosures: 764 | if enclosure.url != "": 765 | content += ('Enclosure: '+enclosure.url+"
    \n") 766 | if 'links' in entry: 767 | for extralink in entry.links: 768 | if ('rel' in extralink) and extralink['rel'] == u'via': 769 | content += 'Via: '+extralink['title']+'
    \n' 770 | 771 | content += ("\n") 772 | else: 773 | content = entrycontent.strip() + "\n\nURL: "+link 774 | if hasattr(entry,'enclosures'): 775 | for enclosure in entry.enclosures: 776 | if enclosure.url != "": 777 | content += ('\nEnclosure: ' + enclosure.url + "\n") 778 | if 'links' in entry: 779 | for extralink in entry.links: 780 | if ('rel' in extralink) and extralink['rel'] == u'via': 781 | content += 'Via: '+extralink['title']+'\n' 782 | 783 | mailserver = send(fromhdr, tohdr, subjecthdr, content, contenttype, when, extraheaders, mailserver, f.folder) 784 | 785 | f.seen[frameid] = id 786 | 787 | f.etag, f.modified = r.get('etag', None), r.get('modified', None) 788 | except (KeyboardInterrupt, SystemExit): 789 | raise 790 | except: 791 | print >>warn, "=== rss2email encountered a problem with this feed ===" 792 | print >>warn, "=== See the rss2email FAQ at http://www.allthingsrss.com/rss2email/ for assistance ===" 793 | print >>warn, "=== If this occurs repeatedly, send this to lindsey@allthingsrss.com ===" 794 | print >>warn, "E: could not parse", f.url 795 | traceback.print_exc(file=warn) 796 | print >>warn, "rss2email", __version__ 797 | print >>warn, "feedparser", feedparser.__version__ 798 | print >>warn, "html2text", h2t.__version__ 799 | print >>warn, "Python", sys.version 800 | print >>warn, "=== END HERE ===" 801 | continue 802 | 803 | finally: 804 | unlock(feeds, feedfileObject) 805 | if mailserver: 806 | if IMAP_MARK_AS_READ: 807 | for folder in IMAP_MARK_AS_READ: 808 | mailserver.select(folder) 809 | res, data = mailserver.search(None, '(UNSEEN UNFLAGGED)') 810 | if res == 'OK': 811 | items = data[0].split() 812 | for i in items: 813 | res, data = mailserver.fetch(i, "(UID)") 814 | if data[0]: 815 | u = uid(data[0]) 816 | res, data = mailserver.uid('STORE', u, '+FLAGS', '(\Seen)') 817 | if IMAP_MOVE_READ_TO: 818 | typ, data = mailserver.list(pattern='*') 819 | # Parse folder listing as a CSV dialect (automatically removes quotes) 820 | reader = csv.reader(StringIO.StringIO('\n'.join(data)),dialect='mailboxlist') 821 | # Iterate over each folder 822 | for row in reader: 823 | folder = row[-1:][0] 824 | if folder == IMAP_MOVE_READ_TO or '\Noselect' in row[0]: 825 | continue 826 | mailserver.select(folder) 827 | yesterday = (datetime.now() - timedelta(days=1)).strftime("%d-%b-%Y") 828 | res, data = mailserver.search(None, '(SEEN BEFORE %s UNFLAGGED)' % yesterday) 829 | if res == 'OK': 830 | items = data[0].split() 831 | for i in items: 832 | res, data = mailserver.fetch(i, "(UID)") 833 | if data[0]: 834 | u = uid(data[0]) 835 | res, data = mailserver.uid('COPY', u, IMAP_MOVE_READ_TO) 836 | if res == 'OK': 837 | res, data = mailserver.uid('STORE', u, '+FLAGS', '(\Deleted)') 838 | mailserver.expunge() 839 | try: 840 | mailserver.quit() 841 | except: 842 | mailserver.logout() 843 | 844 | def list(): 845 | feeds, feedfileObject = load(lock=0) 846 | default_to = "" 847 | default_folder = DEFAULT_IMAP_FOLDER 848 | 849 | if feeds and isstr(feeds[0]): 850 | default_to = feeds[0]; ifeeds = feeds[1:]; i=1 851 | print "default email:", default_to 852 | else: ifeeds = feeds; i = 0 853 | for f in ifeeds: 854 | active = ('[ ]', '[*]')[f.active] 855 | print `i`+':',active, f.url, '(to: '+(f.to or (default_to+' (default)'))+', ' + 'folder: '+(f.folder or (default_folder+' (default))')) 856 | if not (f.to or default_to): 857 | print " W: Please define a default address with 'r2e email emailaddress'" 858 | i+= 1 859 | 860 | def opmlexport(): 861 | import xml.sax.saxutils 862 | feeds, 
feedfileObject = load(lock=0) 863 | 864 | if feeds: 865 | print '\n\n\nrss2email OPML export\n\n' 866 | exportableFeeds = {} 867 | if USE_OPML_TITLE_AS_FOLDER: 868 | for f in feeds[1:]: 869 | if not hasattr(exportableFeeds, f.folder): 870 | exportableFeeds[f.folder] = {} 871 | exportableFeeds[f.folder].append(f) 872 | 873 | for folder in exportableFeeds: 874 | print '\n\t' % (folder, folder) 875 | for f in exportableFeeds[folder]: 876 | url = xml.sax.saxutils.escape(f.url) 877 | print '\n\t\t' % (url, url) 878 | print '\n\t' 879 | else: 880 | for f in feeds[1:]: 881 | url = xml.sax.saxutils.escape(f.url) 882 | print '' % (url, url) 883 | print '\n\n' 884 | 885 | def opmlimport(importfile): 886 | importfileObject = None 887 | print 'Importing feeds from', importfile 888 | if not os.path.exists(importfile): 889 | print 'OPML import file "%s" does not exist.' % feedfile 890 | try: 891 | importfileObject = open(importfile, 'r') 892 | except IOError, e: 893 | print "OPML import file could not be opened: %s" % e 894 | sys.exit(1) 895 | try: 896 | import xml.dom.minidom 897 | dom = xml.dom.minidom.parse(importfileObject) 898 | newfeeds = dom.getElementsByTagName('outline') 899 | except: 900 | print 'E: Unable to parse OPML file' 901 | sys.exit(1) 902 | 903 | feeds, feedfileObject = load(lock=1) 904 | 905 | import xml.sax.saxutils 906 | 907 | for f in newfeeds: 908 | if f.hasAttribute('xmlUrl'): 909 | category = f.parentNode 910 | folder = None 911 | if USE_OPML_TITLE_AS_FOLDER: 912 | if category.hasAttribute("title")!=None: 913 | folder = category.getAttribute("title") 914 | feedurl = f.getAttribute('xmlUrl') 915 | print 'Adding %s' % xml.sax.saxutils.unescape(feedurl) 916 | feeds.append(Feed(feedurl, None, folder)) 917 | 918 | unlock(feeds, feedfileObject) 919 | 920 | def delete(n): 921 | feeds, feedfileObject = load() 922 | if (n == 0) and (feeds and isstr(feeds[0])): 923 | print >>warn, "W: ID has to be equal to or higher than 1" 924 | elif n >= len(feeds): 925 | print >>warn, "W: no such feed" 926 | else: 927 | print >>warn, "W: deleting feed %s" % feeds[n].url 928 | feeds = feeds[:n] + feeds[n+1:] 929 | if n != len(feeds): 930 | print >>warn, "W: feed IDs have changed, list before deleting again" 931 | unlock(feeds, feedfileObject) 932 | 933 | def toggleactive(n, active): 934 | feeds, feedfileObject = load() 935 | if (n == 0) and (feeds and isstr(feeds[0])): 936 | print >>warn, "W: ID has to be equal to or higher than 1" 937 | elif n >= len(feeds): 938 | print >>warn, "W: no such feed" 939 | else: 940 | action = ('Pausing', 'Unpausing')[active] 941 | print >>warn, "%s feed %s" % (action, feeds[n].url) 942 | feeds[n].active = active 943 | unlock(feeds, feedfileObject) 944 | 945 | def reset(): 946 | feeds, feedfileObject = load() 947 | if feeds and isstr(feeds[0]): 948 | ifeeds = feeds[1:] 949 | else: ifeeds = feeds 950 | for f in ifeeds: 951 | if VERBOSE: print "Resetting %d already seen items" % len(f.seen) 952 | f.seen = {} 953 | f.etag = None 954 | f.modified = None 955 | 956 | unlock(feeds, feedfileObject) 957 | 958 | def email(addr): 959 | feeds, feedfileObject = load() 960 | if feeds and isstr(feeds[0]): feeds[0] = addr 961 | else: feeds = [addr] + feeds 962 | unlock(feeds, feedfileObject) 963 | 964 | if __name__ == '__main__': 965 | args = sys.argv 966 | try: 967 | if len(args) < 3: raise InputError, "insufficient args" 968 | feedfile, action, args = args[1], args[2], args[3:] 969 | 970 | if action == "run": 971 | if args and args[0] == "--no-send": 972 | def send(sender, recipient, 
subject, body, contenttype, when, extraheaders=None, mailserver=None, folder=None): 973 | if VERBOSE: print 'Not sending:', unu(subject) 974 | 975 | if args and args[-1].isdigit(): run(int(args[-1])) 976 | else: run() 977 | 978 | elif action == "email": 979 | if not args: 980 | raise InputError, "Action '%s' requires an argument" % action 981 | else: 982 | email(args[0]) 983 | 984 | elif action == "add": add(*args) 985 | 986 | elif action == "new": 987 | if len(args) == 1: d = [args[0]] 988 | else: d = [] 989 | pickle.dump(d, open(feedfile, 'w')) 990 | 991 | elif action == "list": list() 992 | 993 | elif action in ("help", "--help", "-h"): print __doc__ 994 | 995 | elif action == "delete": 996 | if not args: 997 | raise InputError, "Action '%s' requires an argument" % action 998 | elif args[0].isdigit(): 999 | delete(int(args[0])) 1000 | else: 1001 | raise InputError, "Action '%s' requires a number as its argument" % action 1002 | 1003 | elif action in ("pause", "unpause"): 1004 | if not args: 1005 | raise InputError, "Action '%s' requires an argument" % action 1006 | elif args[0].isdigit(): 1007 | active = (action == "unpause") 1008 | toggleactive(int(args[0]), active) 1009 | else: 1010 | raise InputError, "Action '%s' requires a number as its argument" % action 1011 | 1012 | elif action == "reset": reset() 1013 | 1014 | elif action == "opmlexport": opmlexport() 1015 | 1016 | elif action == "opmlimport": 1017 | if not args: 1018 | raise InputError, "OPML import '%s' requires a filename argument" % action 1019 | opmlimport(args[0]) 1020 | 1021 | else: 1022 | raise InputError, "Invalid action" 1023 | 1024 | except InputError, e: 1025 | print "E:", e 1026 | print 1027 | print __doc__ 1028 | 1029 | -------------------------------------------------------------------------------- /summarize.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Naive text summarizer""" 3 | 4 | from collections import defaultdict 5 | import re 6 | 7 | def sentences(text): 8 | start = 0 9 | for match in re.finditer('(\s*[.!?]\s*)|(\n{2,})', text): 10 | yield text[start:match.end()].strip() 11 | start = match.end() 12 | 13 | if start < len(text): 14 | yield text[start:].strip() 15 | 16 | 17 | def frequency(text): 18 | counts = defaultdict(int) 19 | for token in text.split(): # simplest tokenizer ever 20 | counts[token] += 1 21 | return counts 22 | 23 | 24 | def score(sentence, frequencies): 25 | return sum((frequencies[token] for token in sentence.split())) 26 | 27 | 28 | def reorder(sentences, text): 29 | sentences.sort(lambda a, b: text.find(a) - text.find(b)) 30 | return sentences 31 | 32 | 33 | def summarize(text, limit=3): 34 | items = [s for s in sentences(text)] 35 | items.sort(key=lambda s: score(s, frequency(text)), reverse=1) 36 | return '\n'.join(["
    <p>%s</p>" % s for s in reorder(items[:limit], text)]) 37 | --------------------------------------------------------------------------------