├── .github └── FUNDING.yml ├── .gitignore ├── CHANGELOG ├── README.md ├── config.py.example ├── feedparser.py ├── html2text.py ├── r2e ├── r2e.bat ├── readme.html ├── rss2email.py └── summarize.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: [rcarmo] 2 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | 21 | # Installer logs 22 | pip-log.txt 23 | 24 | # Unit test / coverage reports 25 | .coverage 26 | .tox 27 | nosetests.xml 28 | 29 | # Translations 30 | *.mo 31 | 32 | # Mr Developer 33 | .mr.developer.cfg 34 | .project 35 | .pydevproject 36 | -------------------------------------------------------------------------------- /CHANGELOG: -------------------------------------------------------------------------------- 1 | v2.72-rcarmo (2015-03) 2 | * Flush read messages after one day instead of on every run 3 | 4 | v2.71-rcarmo (2013-03) 5 | * Add IMAP support 6 | * Changed default CSS and config options 7 | * Added threading headers 8 | * Added experimental data:URI support for inline images 9 | * Bundled feedparser fixes 10 | 11 | v2.71 (2011-03-04) 12 | * Potentially safer method for writing feeds.dat on UNIX 13 | * Handle via links with no title attribute 14 | * Handle attributes more cleanly with OVERRIDE_EMAIL and DEFAULT_EMAIL 15 | 16 | v2.70 (2010-12-21) 17 | * Improved handling of given feed email addresses to prevent mail servers rejecting poorly formed Froms 18 | * Added X-RSS-TAGS header that lists any tags provided by an entry, which will be helpful in filtering incoming messages 19 | 20 | v2.69 (2010-11-12) 21 | * Added support for connecting to the SMTP server via SSL; see the SMTP_SSL option 22 | * Improved backwards compatibility by fixing issue with listing feeds when run with older Python versions 23 | * Added selective feed email overrides through OVERRIDE_EMAIL and DEFAULT_EMAIL options 24 | * Added NO_FRIENDLY_NAME to use the from address only, without the friendly name 25 | * Added X-RSS-URL header in each message with the link to the original item 26 | 27 | v2.68 (2010-10-01) 28 | * Added ability to pause/resume checking of individual feeds through pause and unpause commands 29 | * Added ability to import and export OPML feed lists through importopml and exportopml commands 30 | 31 | v2.67 (2010-09-21) 32 | * Fixed entries with a blank id (i.e., an empty string) being resent 33 | * Fixed some entries not being sent by email because they had bad From headers 34 | * Fixed From headers with HTML entities encoded twice 35 | * Compatibility changes to support most recent development versions of feedparser 36 | * Compatibility changes to support Google Reader feeds 37 | 38 | v2.66 (2009-12-21) 39 | * Complete packaging of all necessary source files (rss2email, html2text, feedparser, r2e, etc.)
into one bundle 40 | o Included a more complete config.py with all options 41 | o Default to HTML mail and CSS results 42 | * Added 'reset' command to erase history of already seen entries 43 | * Changed project email to 'lindsey@allthingsrss.com' and project homepage to 'http://www.allthingsrss.com/rss2email/' 44 | * Made exception and error output text more useful 45 | * Added X-RSS-Feed and X-RSS-ID headers to each email for easier filtering 46 | * Improved enclosure handling 47 | * Fixed MacOS compatibility issues 48 | 49 | v2.65 (2009-01-05) 50 | 51 | * Fixed warnings caused by Python v2.6 (using hashlib, removing mimify, etc.) 52 | * Deprecated QP_REQUIRED option as this is more than likely no longer needed and part of what triggered Python warnings 53 | * Fixed unicode errors in certain post headers 54 | * Attempted to incorporate Debian/Ubuntu patches into the mainstream release 55 | * Support img type enclosures 56 | * No file locking for SunOS 57 | 58 | v2.64 (2008-10-21) 59 | * Bug-fix version 60 | o Gracefully handle missing charsets 61 | o Friendlier and more useful message if sendmail isn't installed 62 | o SunOS locking fix 63 | 64 | v2.63 (2008-06-13) 65 | * Bug-fix version and license change: 66 | o Licensed under GPL 2 & 3 now 67 | o Display feed number in warning and error message lines 68 | o Fix for unicode handling problem with certain entry titles 69 | 70 | v2.62 (2008-01-14) 71 | * Bug-fix version: 72 | o Simplified SunOS fix 73 | o Local feeds (/home/user/file.xml) should work 74 | 75 | v2.61 (2007-12-07) 76 | * Bug-fix version: 77 | o Now really compatible with SunOS 78 | o Don't wrap long subject headers 79 | o New parameter CHARSET_LIST to override or supplement the order in which charsets are tried against an entry 80 | o Don't use blank content to generate id 81 | o Using GMail as mail server should work 82 | 83 | v2.60 (2006-08-25) 84 | * Small bug-fix version: 85 | o Now compatible with SunOS 86 | o Correctly handle international character sets in email From 87 | 88 | v2.59 (2006-06-09) 89 | * Finally added oft-requested support for enclosures. Any enclosures, such as a podcast MP3, will be listed under the entry URL 90 | * Made feed timeout compatible with Python versions 2.2 and higher, instead of v2.4 only 91 | * Added optional, configurable CSS styling to HTML mail. Set USE_CSS_STYLING=1 in your config.py to enable this. If you want to tweak the look, modify STYLE_SHEET. 92 | * Improved empty feed checking 93 | * Improved invalid feed messages 94 | * Unfortunately, rss2email is no longer compatible with Python v2.1. Two of the most serious lingering issues with rss2email were waiting forever for non-responsive feeds and its inability to properly handle feeds with international characters. To properly fix these once and for all, rss2email now depends on functionality that was not available until Python v2.2. Hopefully this does not unduly inconvenience anyone who has not yet upgraded to a more current version of Python.
95 | 96 | v2.58 (2006-05-11) 97 | * Total rewrite of email code that should fix encoding problems 98 | * Added configurable timeout for nonresponsive feeds 99 | * Fixed incorrectly using text summary_detail instead of html content 100 | * Fixed bug with deleting feed 0 if no default email was set 101 | * Print name of feed that is being deleted 102 | 103 | v2.57 (2006-04-07) 104 | * Integrated Joey Hess's patches 105 | o First, a patch that makes delete more reliable, so it no longer allows you to remove the default email address ('feed' 0) and thereby hose your feed file, or 'remove' entries that don't exist without warning; and so it only says IDs have changed when they really have. Originally from http://bugs.debian.org/313101 106 | o Next, a patch that avoids a backtrace if there's no email address defined, and outputs a less scary error message. 107 | o Next, a simple change to the usage; since the "email" subcommand always needs a parameter, don't mark it as optional. 108 | o And, avoid a backtrace if the email subcommand does get run w/o a parameter. 109 | o And also avoid backtraces if delete is run w/o a parameter. Also adds support for --help. 110 | o Simple change, make a comment match reality (/usr/sbin/sendmail) 111 | o This avoids another backtrace, this time if there's no feed file yet. [load()] 112 | o Add a handler for the AttributeError exception, which feedparser can throw. Beats crashing. 113 | o Next, four hunks that make it more robust if no default email address is set and feeds are added w/o an email address. This patch originally comes from http://bugs.debian.org/310485 which has some examples. 114 | o Finally, this works around a bug in mimify that causes it to add a newline to the subject header if it contains very long words. Details at http://bugs.debian.org/320185. Note that Tatsuya Kinoshita has a larger patch toward the end of that bug report that deals with some other problems in this area; Aaron has seen that patch before and said it "looks pretty reasonable". 115 | * add() catches error case on first feed add and no email address is set 116 | * Made "emailaddress" consistent param label throughout 117 | * Error message improvements 118 | * Deleted problematic "if title" line 119 | * Deleted space in front of SMTP_USER 120 | * Only logs into SMTP server once 121 | * Added exception handling around SMTP server connect and login attempt 122 | * Broke contributors across multiple lines 123 | 124 | v2.56 (2006-04-04) 125 | * SMTP AUTH support added 126 | * Windows support 127 | * Fixed bug with HTML in titles 128 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | rss2imap 2 | ======== 3 | 4 | An adaptation of rss2email that uses IMAP directly. 5 | 6 | # What does it look like? 7 | 8 | Well, with the shipping CSS in `config.py`, it looks like this: 9 | 10 | 11 | 12 | ## What about mobile? 13 | 14 | Well, it works fine with the Gmail app on both Android and iOS, as well as the native IMAP clients: 15 | 16 | 17 | 18 | As long as you sync, all the text will be available off-line (images are cached at the whim of the MUA). 19 | 20 | The Gmail app ignores CSS and may have weird behaviors with long bits of text, though.
21 | 22 | # Main Features: 23 | 24 | * *NEW:* Automatically files away read messages after one day instead of on every run 25 | * Optional (naive) summarization of news items at the top of each item (see `SUMMARIZE` setting) 26 | * E-mail is injected directly via IMAP (so no delays or hassles with spam filters) 27 | * Feeds can be grouped into IMAP folders -- no inbox clutter! 28 | * Generates E-mail headers for threading, so a post that references another post (or that includes the same link) will show up as a thread on decent MUAs (posts from the same feed will also be part of the same thread) 29 | * Can (optionally) include images inline (as `data:` URIs for now -- which only works properly on iOS/Mac -- soon as MIME attachments) 30 | * Can (optionally) remove read (but not flagged) items automatically 31 | 32 | # Project Status 33 | 34 | Given that I've only had to tweak _one thing_ after two years of continued use, I'd say this is more than stable. I've gone off and built a multi-threaded app with a SQLite feed store called [bottle-fever](https://github.com/rcarmo/bottle-fever), but there's only so much free time, and even though this code is crammed with hideous legacy idioms, it works as is. 35 | 36 | Come 2016, I switched to Feedly because the user experience on the iPad using [Reeder](http://reederapp.com) was a little better. 37 | 38 | ## Similar Projects 39 | 40 | Other projects I've come across that traveled this path in other languages: 41 | 42 | * [greghendershott/feeds2gmail](https://github.com/greghendershott/feeds2gmail), using [Racket](https://www.racket-lang.org) 43 | * [Gonzih/feeds2imap.clj](https://github.com/Gonzih/feeds2imap.clj), using [Clojure](https://clojure.org) 44 | * [rcarmo/go-rss2imap](https://github.com/rcarmo/go-rss2imap), my attempt at tweaking a [Go](http://golang.org) version 45 | * [Riduidel/rrss2imap](https://github.com/Riduidel/rrss2imap), using [Rust](https://www.rust-lang.org/) 46 | 47 | ## Exercises For The Reader 48 | 49 | * Test nested folders (I am only using single folders, not a nested hierarchy, so this might break) 50 | * Automatic message categorization using Bayesian filtering and NLTK 51 | * Better reference tracking to identify 'hot' items 52 | * Figure out a nice way to do favicons (X-Face is obsolete, and so is X-Image-URL) 53 | 54 | # Here Be Dragons 55 | 56 | Be aware that this works and is easy to hack, but uses old Python idioms and could do with some refactoring (PEP-8 zealots are sure to cringe as they read through the code -- I know I find it hideous, but it was a quick hack and has been working reliably for me for over two years now). 57 | -------------------------------------------------------------------------------- /config.py.example: -------------------------------------------------------------------------------- 1 | ### Options for configuring rss2email ### 2 | 3 | # The email address messages are from by default: 4 | DEFAULT_FROM = "bozo@dev.null.invalid" 5 | 6 | # 1: Send text/html messages when possible. 7 | # 0: Convert HTML to plain text. 8 | HTML_MAIL = 1 9 | 10 | # 1: Only use the DEFAULT_FROM address. 11 | # 0: Use the email address specified by the feed, when possible. 12 | FORCE_FROM = 0 13 | 14 | # 1: Receive one email per post. 15 | # 0: Receive an email every time a post changes. 16 | TRUST_GUID = 1 17 | 18 | # 1: Generate Date header based on item's date, when possible. 19 | # 0: Generate Date header based on time sent.
20 | DATE_HEADER = 1 21 | 22 | # A tuple consisting of some combination of 23 | # ('issued', 'created', 'modified', 'expired') 24 | # expressing ordered list of preference in dates 25 | # to use for the Date header of the email. 26 | DATE_HEADER_ORDER = ('modified', 'issued', 'created') 27 | 28 | # 1: Apply Q-P conversion (required for some MUAs). 29 | # 0: Send message in 8-bits. 30 | # http://cr.yp.to/smtp/8bitmime.html 31 | #DEPRECATED 32 | QP_REQUIRED = 0 33 | #DEPRECATED 34 | 35 | # 1: Name feeds as they're being processed. 36 | # 0: Keep quiet. 37 | VERBOSE = 0 38 | 39 | # 1: Use the publisher's email if you can't find the author's. 40 | # 0: Just use the DEFAULT_FROM email instead. 41 | USE_PUBLISHER_EMAIL = 1 42 | 43 | # 1: Use SMTP_SERVER to send mail. 44 | # 0: Call /usr/sbin/sendmail to send mail. 45 | SMTP_SEND = 1 46 | 47 | SMTP_SERVER = "smtp.yourisp.net:25" 48 | AUTHREQUIRED = 0 # if you need to use SMTP AUTH set to 1 49 | SMTP_USER = 'username' # for SMTP AUTH, set SMTP username here 50 | SMTP_PASS = 'password' # for SMTP AUTH, set SMTP password here 51 | 52 | # Connect to the SMTP server using SSL 53 | 54 | SMTP_SSL = 0 55 | 56 | # 1: Use IMAP_SERVER to deliver mail and ignore SMTP settings. 57 | # 0: Use SMTP 58 | 59 | IMAP_SEND = 1 60 | IMAP_SERVER = "imap.yourisp.net:143" 61 | IMAP_USER = 'username' # set IMAP username here 62 | IMAP_PASS = 'password' # set IMAP password here 63 | 64 | # Connect to the IMAP server using SSL 65 | IMAP_SSL = 1 66 | 67 | # Synthesise From: addresses based on feed name and domain 68 | IMAP_MUNGE_FROM = 1 69 | IMAP_OVERRIDE_TO = 'Me ' 70 | 71 | # Generate References: headers based on item tags and/or URLs in feed content 72 | THREAD_ON_TAGS = 1 73 | THREAD_ON_LINKS = 1 74 | 75 | # Include inline images as a data: URI (not supported in MUAs like the mobile Gmail app, but works great on iOS and Mac) 76 | INLINE_IMAGES_DATA_URI = 0 77 | 78 | # Move read messages to a specific folder 79 | IMAP_MOVE_READ_TO = False 80 | 81 | # Mark all messages as read in the following folders 82 | IMAP_MARK_AS_READ = None # ['Trash'] 83 | 84 | # Set this to add a bonus header to all emails (start with '\n'). 85 | BONUS_HEADER = '' 86 | # Example: BONUS_HEADER = '\nApproved: joe@bob.org' 87 | 88 | # Set this to override From addresses. Keys are feed URLs, values are new titles. 89 | OVERRIDE_FROM = {} 90 | 91 | # Set this to override From email addresses. Keys are feed URLs, values are new emails.
92 | OVERRIDE_EMAIL = {} 93 | 94 | DEFAULT_EMAIL = {} 95 | 96 | # Only use the email from address rather than friendly name plus email address 97 | NO_FRIENDLY_NAME = 0 98 | 99 | # Set this to override the timeout (in seconds) for feed server response 100 | FEED_TIMEOUT = 60 101 | 102 | # Optional CSS styling 103 | USE_CSS_STYLING = 1 104 | STYLE_SHEET='img {max-width: 100% !important; height: auto;} body, #body {font-size: 12pt; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; font-family: Georgia, Times New Roman, Times, serif;} a:link {color: #0000cc} h1.header a {font-weight: normal; text-decoration: none; color: black;} .summary {font-size: 80%; font-style: italic;}' 105 | 106 | # If you have an HTTP Proxy set this in the format 'http://your.proxy.here:8080/' 107 | PROXY="" 108 | 109 | # To most correctly encode emails with international characters, we iterate through the list below and use the first character set that works 110 | # Eventually (and theoretically) ISO-8859-1 and UTF-8 are our catch-all failsafes 111 | CHARSET_LIST='US-ASCII', 'BIG5', 'ISO-2022-JP', 'ISO-8859-1', 'UTF-8' 112 | 113 | # Use Unicode characters instead of their ascii pseudo-replacements 114 | UNICODE_SNOB = 0 115 | 116 | # Put the links after each paragraph instead of at the end. 117 | LINKS_EACH_PARAGRAPH = 0 118 | 119 | # Wrap long lines at position. 0 for no wrapping. (Requires Python 2.3.) 120 | BODY_WIDTH = 0 121 | 122 | # Change the default imap folder 123 | DEFAULT_IMAP_FOLDER = "INBOX" 124 | 125 | # Whether to summarize the body or not. Set it to the number of sentences you require. 126 | SUMMARIZE = 0 127 | 128 | # When set to "1", during OPML import, the title attribute of the parent "outline" element is used as the folder path 129 | USE_OPML_TITLE_AS_FOLDER = 0 -------------------------------------------------------------------------------- /html2text.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """html2text: Turn HTML into equivalent Markdown-structured text.""" 3 | __version__ = "3.200.3" 4 | __author__ = "Aaron Swartz (me@aaronsw.com)" 5 | __copyright__ = "(C) 2004-2008 Aaron Swartz. GNU GPL 3." 6 | __contributors__ = ["Martin 'Joey' Schulze", "Ricardo Reyes", "Kevin Jay North"] 7 | 8 | # TODO: 9 | # Support decoded entities with unifiable. 10 | 11 | try: 12 | True 13 | except NameError: 14 | setattr(__builtins__, 'True', 1) 15 | setattr(__builtins__, 'False', 0) 16 | 17 | def has_key(x, y): 18 | if hasattr(x, 'has_key'): return x.has_key(y) 19 | else: return y in x 20 | 21 | try: 22 | import htmlentitydefs 23 | import urlparse 24 | import HTMLParser 25 | except ImportError: #Python3 26 | import html.entities as htmlentitydefs 27 | import urllib.parse as urlparse 28 | import html.parser as HTMLParser 29 | try: #Python3 30 | import urllib.request as urllib 31 | except: 32 | import urllib 33 | import optparse, re, sys, codecs, types 34 | 35 | try: from textwrap import wrap 36 | except: pass 37 | 38 | # Use Unicode characters instead of their ascii pseudo-replacements 39 | UNICODE_SNOB = 0 40 | 41 | # Escape all special characters. Output is less readable, but avoids corner case formatting issues. 42 | ESCAPE_SNOB = 0 43 | 44 | # Put the links after each paragraph instead of at the end. 45 | LINKS_EACH_PARAGRAPH = 0 46 | 47 | # Wrap long lines at position. 0 for no wrapping. (Requires Python 2.3.)
48 | BODY_WIDTH = 78 49 | 50 | # Don't show internal links (href="#local-anchor") -- corresponding link targets 51 | # won't be visible in the plain text file anyway. 52 | SKIP_INTERNAL_LINKS = True 53 | 54 | # Use inline, rather than reference, formatting for images and links 55 | INLINE_LINKS = False 56 | 57 | # Number of pixels Google indents nested lists 58 | GOOGLE_LIST_INDENT = 36 59 | 60 | IGNORE_ANCHORS = False 61 | IGNORE_IMAGES = False 62 | IGNORE_EMPHASIS = False 63 | 64 | ### Entity Nonsense ### 65 | 66 | def name2cp(k): 67 | if k == 'apos': return ord("'") 68 | if hasattr(htmlentitydefs, "name2codepoint"): # requires Python 2.3 69 | return htmlentitydefs.name2codepoint[k] 70 | else: 71 | k = htmlentitydefs.entitydefs[k] 72 | if k.startswith("&#") and k.endswith(";"): return int(k[2:-1]) # not in latin-1 73 | return ord(codecs.latin_1_decode(k)[0]) 74 | 75 | unifiable = {'rsquo':"'", 'lsquo':"'", 'rdquo':'"', 'ldquo':'"', 76 | 'copy':'(C)', 'mdash':'--', 'nbsp':' ', 'rarr':'->', 'larr':'<-', 'middot':'*', 77 | 'ndash':'-', 'oelig':'oe', 'aelig':'ae', 78 | 'agrave':'a', 'aacute':'a', 'acirc':'a', 'atilde':'a', 'auml':'a', 'aring':'a', 79 | 'egrave':'e', 'eacute':'e', 'ecirc':'e', 'euml':'e', 80 | 'igrave':'i', 'iacute':'i', 'icirc':'i', 'iuml':'i', 81 | 'ograve':'o', 'oacute':'o', 'ocirc':'o', 'otilde':'o', 'ouml':'o', 82 | 'ugrave':'u', 'uacute':'u', 'ucirc':'u', 'uuml':'u', 83 | 'lrm':'', 'rlm':''} 84 | 85 | unifiable_n = {} 86 | 87 | for k in unifiable.keys(): 88 | unifiable_n[name2cp(k)] = unifiable[k] 89 | 90 | ### End Entity Nonsense ### 91 | 92 | def onlywhite(line): 93 | """Return true if the line consists only of whitespace characters.""" 94 | for c in line: 95 | if c is not ' ' and c is not ' ': 96 | return c is ' ' 97 | return line 98 | 99 | def hn(tag): 100 | if tag[0] == 'h' and len(tag) == 2: 101 | try: 102 | n = int(tag[1]) 103 | if n in range(1, 10): return n 104 | except ValueError: return 0 105 | 106 | def dumb_property_dict(style): 107 | """returns a hash of css attributes""" 108 | return dict([(x.strip(), y.strip()) for x, y in [z.split(':', 1) for z in style.split(';') if ':' in z]]) 109 | 110 | def dumb_css_parser(data): 111 | """returns a hash of css selectors, each of which contains a hash of css attributes""" 112 | # remove @import sentences 113 | data += ';' 114 | importIndex = data.find('@import') 115 | while importIndex != -1: 116 | data = data[0:importIndex] + data[data.find(';', importIndex) + 1:] 117 | importIndex = data.find('@import') 118 | 119 | # parse the css. reverted from dictionary comprehension in order to support older pythons 120 | elements = [x.split('{') for x in data.split('}') if '{' in x.strip()] 121 | try: 122 | elements = dict([(a.strip(), dumb_property_dict(b)) for a, b in elements]) 123 | except ValueError: 124 | elements = {} # not that important 125 | 126 | return elements 127 | 128 | def element_style(attrs, style_def, parent_style): 129 | """returns a hash of the 'final' style attributes of the element""" 130 | style = parent_style.copy() 131 | if 'class' in attrs: 132 | for css_class in attrs['class'].split(): 133 | css_style = style_def['.
+ css_class] 134 | style.update(css_style) 135 | if 'style' in attrs: 136 | immediate_style = dumb_property_dict(attrs['style']) 137 | style.update(immediate_style) 138 | return style 139 | 140 | def google_list_style(style): 141 | """finds out whether this is an ordered or unordered list""" 142 | if 'list-style-type' in style: 143 | list_style = style['list-style-type'] 144 | if list_style in ['disc', 'circle', 'square', 'none']: 145 | return 'ul' 146 | return 'ol' 147 | 148 | def google_has_height(style): 149 | """check if the style of the element has the 'height' attribute explicitly defined""" 150 | if 'height' in style: 151 | return True 152 | return False 153 | 154 | def google_text_emphasis(style): 155 | """return a list of all emphasis modifiers of the element""" 156 | emphasis = [] 157 | if 'text-decoration' in style: 158 | emphasis.append(style['text-decoration']) 159 | if 'font-style' in style: 160 | emphasis.append(style['font-style']) 161 | if 'font-weight' in style: 162 | emphasis.append(style['font-weight']) 163 | return emphasis 164 | 165 | def google_fixed_width_font(style): 166 | """check if the css of the current element defines a fixed width font""" 167 | font_family = '' 168 | if 'font-family' in style: 169 | font_family = style['font-family'] 170 | if 'Courier New' == font_family or 'Consolas' == font_family: 171 | return True 172 | return False 173 | 174 | def list_numbering_start(attrs): 175 | """extract numbering from list element attributes""" 176 | if 'start' in attrs: 177 | return int(attrs['start']) - 1 178 | else: 179 | return 0 180 | 181 | class HTML2Text(HTMLParser.HTMLParser): 182 | def __init__(self, out=None, baseurl=''): 183 | HTMLParser.HTMLParser.__init__(self) 184 | 185 | # Config options 186 | self.unicode_snob = UNICODE_SNOB 187 | self.escape_snob = ESCAPE_SNOB 188 | self.links_each_paragraph = LINKS_EACH_PARAGRAPH 189 | self.body_width = BODY_WIDTH 190 | self.skip_internal_links = SKIP_INTERNAL_LINKS 191 | self.inline_links = INLINE_LINKS 192 | self.google_list_indent = GOOGLE_LIST_INDENT 193 | self.ignore_links = IGNORE_ANCHORS 194 | self.ignore_images = IGNORE_IMAGES 195 | self.ignore_emphasis = IGNORE_EMPHASIS 196 | self.google_doc = False 197 | self.ul_item_mark = '*' 198 | self.emphasis_mark = '_' 199 | self.strong_mark = '**' 200 | 201 | if out is None: 202 | self.out = self.outtextf 203 | else: 204 | self.out = out 205 | 206 | self.outtextlist = [] # empty list to store output characters before they are "joined" 207 | 208 | try: 209 | self.outtext = unicode() 210 | except NameError: # Python3 211 | self.outtext = str() 212 | 213 | self.quiet = 0 214 | self.p_p = 0 # number of newline character to print before next output 215 | self.outcount = 0 216 | self.start = 1 217 | self.space = 0 218 | self.a = [] 219 | self.astack = [] 220 | self.maybe_automatic_link = None 221 | self.absolute_url_matcher = re.compile(r'^[a-zA-Z+]+://') 222 | self.acount = 0 223 | self.list = [] 224 | self.blockquote = 0 225 | self.pre = 0 226 | self.startpre = 0 227 | self.code = False 228 | self.br_toggle = '' 229 | self.lastWasNL = 0 230 | self.lastWasList = False 231 | self.style = 0 232 | self.style_def = {} 233 | self.tag_stack = [] 234 | self.emphasis = 0 235 | self.drop_white_space = 0 236 | self.inheader = False 237 | self.abbr_title = None # current abbreviation definition 238 | self.abbr_data = None # last inner HTML (for abbr being defined) 239 | self.abbr_list = {} # stack of abbreviations to write later 240 | self.baseurl = baseurl 241 | 242 | try: del 
unifiable_n[name2cp('nbsp')] 243 | except KeyError: pass 244 | unifiable['nbsp'] = '&nbsp_place_holder;' 245 | 246 | 247 | def feed(self, data): 248 | data = data.replace("</' + 'script>", "</ignore>") 249 | HTMLParser.HTMLParser.feed(self, data) 250 | 251 | def handle(self, data): 252 | self.feed(data) 253 | self.feed("") 254 | return self.optwrap(self.close()) 255 | 256 | def outtextf(self, s): 257 | self.outtextlist.append(s) 258 | if s: self.lastWasNL = s[-1] == '\n' 259 | 260 | def close(self): 261 | HTMLParser.HTMLParser.close(self) 262 | 263 | self.pbr() 264 | self.o('', 0, 'end') 265 | 266 | self.outtext = self.outtext.join(self.outtextlist) 267 | if self.unicode_snob: 268 | nbsp = unichr(name2cp('nbsp')) 269 | else: 270 | nbsp = u' ' 271 | self.outtext = self.outtext.replace(u'&nbsp_place_holder;', nbsp) 272 | 273 | return self.outtext 274 | 275 | def handle_charref(self, c): 276 | self.o(self.charref(c), 1) 277 | 278 | def handle_entityref(self, c): 279 | self.o(self.entityref(c), 1) 280 | 281 | def handle_starttag(self, tag, attrs): 282 | self.handle_tag(tag, attrs, 1) 283 | 284 | def handle_endtag(self, tag): 285 | self.handle_tag(tag, None, 0) 286 | 287 | def previousIndex(self, attrs): 288 | """ returns the index of certain set of attributes (of a link) in the 289 | self.a list 290 | 291 | If the set of attributes is not found, returns None 292 | """ 293 | if not has_key(attrs, 'href'): return None 294 | 295 | i = -1 296 | for a in self.a: 297 | i += 1 298 | match = 0 299 | 300 | if has_key(a, 'href') and a['href'] == attrs['href']: 301 | if has_key(a, 'title') or has_key(attrs, 'title'): 302 | if (has_key(a, 'title') and has_key(attrs, 'title') and 303 | a['title'] == attrs['title']): 304 | match = True 305 | else: 306 | match = True 307 | 308 | if match: return i 309 | 310 | def drop_last(self, nLetters): 311 | if not self.quiet: 312 | self.outtext = self.outtext[:-nLetters] 313 | 314 | def handle_emphasis(self, start, tag_style, parent_style): 315 | """handles various text emphases""" 316 | tag_emphasis = google_text_emphasis(tag_style) 317 | parent_emphasis = google_text_emphasis(parent_style) 318 | 319 | # handle Google's text emphasis 320 | strikethrough = 'line-through' in tag_emphasis and self.hide_strikethrough 321 | bold = 'bold' in tag_emphasis and not 'bold' in parent_emphasis 322 | italic = 'italic' in tag_emphasis and not 'italic' in parent_emphasis 323 | fixed = google_fixed_width_font(tag_style) and not \ 324 | google_fixed_width_font(parent_style) and not self.pre 325 | 326 | if start: 327 | # crossed-out text must be handled before other attributes 328 | # in order not to output qualifiers unnecessarily 329 | if bold or italic or fixed: 330 | self.emphasis += 1 331 | if strikethrough: 332 | self.quiet += 1 333 | if italic: 334 | self.o(self.emphasis_mark) 335 | self.drop_white_space += 1 336 | if bold: 337 | self.o(self.strong_mark) 338 | self.drop_white_space += 1 339 | if fixed: 340 | self.o('`') 341 | self.drop_white_space += 1 342 | self.code = True 343 | else: 344 | if bold or italic or fixed: 345 | # there must not be whitespace before closing emphasis mark 346 | self.emphasis -= 1 347 | self.space = 0 348 | self.outtext = self.outtext.rstrip() 349 | if fixed: 350 | if self.drop_white_space: 351 | # empty emphasis, drop it 352 | self.drop_last(1) 353 | self.drop_white_space -= 1 354 | else: 355 | self.o('`') 356 | self.code = False 357 | if bold: 358 | if self.drop_white_space: 359 | # empty emphasis, drop it 360 | self.drop_last(2) 361 | self.drop_white_space -= 1 362 | else: 363 |
self.o(self.strong_mark) 364 | if italic: 365 | if self.drop_white_space: 366 | # empty emphasis, drop it 367 | self.drop_last(1) 368 | self.drop_white_space -= 1 369 | else: 370 | self.o(self.emphasis_mark) 371 | # space is only allowed after *all* emphasis marks 372 | if (bold or italic) and not self.emphasis: 373 | self.o(" ") 374 | if strikethrough: 375 | self.quiet -= 1 376 | 377 | def handle_tag(self, tag, attrs, start): 378 | #attrs = fixattrs(attrs) 379 | if attrs is None: 380 | attrs = {} 381 | else: 382 | attrs = dict(attrs) 383 | 384 | if self.google_doc: 385 | # the attrs parameter is empty for a closing tag. in addition, we 386 | # need the attributes of the parent nodes in order to get a 387 | # complete style description for the current element. we assume 388 | # that google docs export well formed html. 389 | parent_style = {} 390 | if start: 391 | if self.tag_stack: 392 | parent_style = self.tag_stack[-1][2] 393 | tag_style = element_style(attrs, self.style_def, parent_style) 394 | self.tag_stack.append((tag, attrs, tag_style)) 395 | else: 396 | dummy, attrs, tag_style = self.tag_stack.pop() 397 | if self.tag_stack: 398 | parent_style = self.tag_stack[-1][2] 399 | 400 | if hn(tag): 401 | self.p() 402 | if start: 403 | self.inheader = True 404 | self.o(hn(tag)*"#" + ' ') 405 | else: 406 | self.inheader = False 407 | return # prevent redundant emphasis marks on headers 408 | 409 | if tag in ['p', 'div']: 410 | if self.google_doc: 411 | if start and google_has_height(tag_style): 412 | self.p() 413 | else: 414 | self.soft_br() 415 | else: 416 | self.p() 417 | 418 | if tag == "br" and start: self.o(" \n") 419 | 420 | if tag == "hr" and start: 421 | self.p() 422 | self.o("* * *") 423 | self.p() 424 | 425 | if tag in ["head", "style", 'script']: 426 | if start: self.quiet += 1 427 | else: self.quiet -= 1 428 | 429 | if tag == "style": 430 | if start: self.style += 1 431 | else: self.style -= 1 432 | 433 | if tag in ["body"]: 434 | self.quiet = 0 # sites like 9rules.com never close 435 | 436 | if tag == "blockquote": 437 | if start: 438 | self.p(); self.o('> ', 0, 1); self.start = 1 439 | self.blockquote += 1 440 | else: 441 | self.blockquote -= 1 442 | self.p() 443 | 444 | if tag in ['em', 'i', 'u'] and not self.ignore_emphasis: self.o(self.emphasis_mark) 445 | if tag in ['strong', 'b'] and not self.ignore_emphasis: self.o(self.strong_mark) 446 | if tag in ['del', 'strike', 's']: 447 | if start: 448 | self.o("<"+tag+">") 449 | else: 450 | self.o("") 451 | 452 | if self.google_doc: 453 | if not self.inheader: 454 | # handle some font attributes, but leave headers clean 455 | self.handle_emphasis(start, tag_style, parent_style) 456 | 457 | if tag in ["code", "tt"] and not self.pre: self.o('`') #TODO: `` `this` `` 458 | if tag == "abbr": 459 | if start: 460 | self.abbr_title = None 461 | self.abbr_data = '' 462 | if has_key(attrs, 'title'): 463 | self.abbr_title = attrs['title'] 464 | else: 465 | if self.abbr_title != None: 466 | self.abbr_list[self.abbr_data] = self.abbr_title 467 | self.abbr_title = None 468 | self.abbr_data = '' 469 | 470 | if tag == "a" and not self.ignore_links: 471 | if start: 472 | if has_key(attrs, 'href') and not (self.skip_internal_links and attrs['href'].startswith('#')): 473 | self.astack.append(attrs) 474 | self.maybe_automatic_link = attrs['href'] 475 | else: 476 | self.astack.append(None) 477 | else: 478 | if self.astack: 479 | a = self.astack.pop() 480 | if self.maybe_automatic_link: 481 | self.maybe_automatic_link = None 482 | elif a: 483 | if 
self.inline_links: 484 | self.o("](" + escape_md(a['href']) + ")") 485 | else: 486 | i = self.previousIndex(a) 487 | if i is not None: 488 | a = self.a[i] 489 | else: 490 | self.acount += 1 491 | a['count'] = self.acount 492 | a['outcount'] = self.outcount 493 | self.a.append(a) 494 | self.o("][" + str(a['count']) + "]") 495 | 496 | if tag == "img" and start and not self.ignore_images: 497 | if has_key(attrs, 'src'): 498 | attrs['href'] = attrs['src'] 499 | alt = attrs.get('alt', '') 500 | self.o("![" + escape_md(alt) + "]") 501 | 502 | if self.inline_links: 503 | self.o("(" + escape_md(attrs['href']) + ")") 504 | else: 505 | i = self.previousIndex(attrs) 506 | if i is not None: 507 | attrs = self.a[i] 508 | else: 509 | self.acount += 1 510 | attrs['count'] = self.acount 511 | attrs['outcount'] = self.outcount 512 | self.a.append(attrs) 513 | self.o("[" + str(attrs['count']) + "]") 514 | 515 | if tag == 'dl' and start: self.p() 516 | if tag == 'dt' and not start: self.pbr() 517 | if tag == 'dd' and start: self.o(' ') 518 | if tag == 'dd' and not start: self.pbr() 519 | 520 | if tag in ["ol", "ul"]: 521 | # Google Docs create sub lists as top level lists 522 | if (not self.list) and (not self.lastWasList): 523 | self.p() 524 | if start: 525 | if self.google_doc: 526 | list_style = google_list_style(tag_style) 527 | else: 528 | list_style = tag 529 | numbering_start = list_numbering_start(attrs) 530 | self.list.append({'name':list_style, 'num':numbering_start}) 531 | else: 532 | if self.list: self.list.pop() 533 | self.lastWasList = True 534 | else: 535 | self.lastWasList = False 536 | 537 | if tag == 'li': 538 | self.pbr() 539 | if start: 540 | if self.list: li = self.list[-1] 541 | else: li = {'name':'ul', 'num':0} 542 | if self.google_doc: 543 | nest_count = self.google_nest_count(tag_style) 544 | else: 545 | nest_count = len(self.list) 546 | self.o(" " * nest_count) #TODO: line up
  1. s > 9 correctly. 547 | if li['name'] == "ul": self.o(self.ul_item_mark + " ") 548 | elif li['name'] == "ol": 549 | li['num'] += 1 550 | self.o(str(li['num'])+". ") 551 | self.start = 1 552 | 553 | if tag in ["table", "tr"] and start: self.p() 554 | if tag == 'td': self.pbr() 555 | 556 | if tag == "pre": 557 | if start: 558 | self.startpre = 1 559 | self.pre = 1 560 | else: 561 | self.pre = 0 562 | self.p() 563 | 564 | def pbr(self): 565 | if self.p_p == 0: 566 | self.p_p = 1 567 | 568 | def p(self): 569 | self.p_p = 2 570 | 571 | def soft_br(self): 572 | self.pbr() 573 | self.br_toggle = ' ' 574 | 575 | def o(self, data, puredata=0, force=0): 576 | if self.abbr_data is not None: 577 | self.abbr_data += data 578 | 579 | if not self.quiet: 580 | if self.google_doc: 581 | # prevent white space immediately after 'begin emphasis' marks ('**' and '_') 582 | lstripped_data = data.lstrip() 583 | if self.drop_white_space and not (self.pre or self.code): 584 | data = lstripped_data 585 | if lstripped_data != '': 586 | self.drop_white_space = 0 587 | 588 | if puredata and not self.pre: 589 | data = re.sub('\s+', ' ', data) 590 | if data and data[0] == ' ': 591 | self.space = 1 592 | data = data[1:] 593 | if not data and not force: return 594 | 595 | if self.startpre: 596 | #self.out(" :") #TODO: not output when already one there 597 | if not data.startswith("\n"): #
<pre> stuff...
    598 |                     data = "\n" + data
    599 | 
    600 |             bq = (">" * self.blockquote)
    601 |             if not (force and data and data[0] == ">") and self.blockquote: bq += " "
    602 | 
    603 |             if self.pre:
    604 |                 if not self.list:
    605 |                     bq += "    "
    606 |                 #else: list content is already partially indented
    607 |                 for i in xrange(len(self.list)):
    608 |                     bq += "    "
    609 |                 data = data.replace("\n", "\n"+bq)
    610 | 
    611 |             if self.startpre:
    612 |                 self.startpre = 0
    613 |                 if self.list:
    614 |                     data = data.lstrip("\n") # use existing initial indentation
    615 | 
    616 |             if self.start:
    617 |                 self.space = 0
    618 |                 self.p_p = 0
    619 |                 self.start = 0
    620 | 
    621 |             if force == 'end':
    622 |                 # It's the end.
    623 |                 self.p_p = 0
    624 |                 self.out("\n")
    625 |                 self.space = 0
    626 | 
    627 |             if self.p_p:
    628 |                 self.out((self.br_toggle+'\n'+bq)*self.p_p)
    629 |                 self.space = 0
    630 |                 self.br_toggle = ''
    631 | 
    632 |             if self.space:
    633 |                 if not self.lastWasNL: self.out(' ')
    634 |                 self.space = 0
    635 | 
    636 |             if self.a and ((self.p_p == 2 and self.links_each_paragraph) or force == "end"):
    637 |                 if force == "end": self.out("\n")
    638 | 
    639 |                 newa = []
    640 |                 for link in self.a:
    641 |                     if self.outcount > link['outcount']:
    642 |                         self.out("   ["+ str(link['count']) +"]: " + urlparse.urljoin(self.baseurl, link['href']))
    643 |                         if has_key(link, 'title'): self.out(" ("+link['title']+")")
    644 |                         self.out("\n")
    645 |                     else:
    646 |                         newa.append(link)
    647 | 
    648 |                 if self.a != newa: self.out("\n") # Don't need an extra line when nothing was done.
    649 | 
    650 |                 self.a = newa
    651 | 
    652 |             if self.abbr_list and force == "end":
    653 |                 for abbr, definition in self.abbr_list.items():
    654 |                     self.out("  *[" + abbr + "]: " + definition + "\n")
    655 | 
    656 |             self.p_p = 0
    657 |             self.out(data)
    658 |             self.outcount += 1
    659 | 
    660 |     def handle_data(self, data):
    661 |         if r'\/script>' in data: self.quiet -= 1
    662 | 
    663 |         if self.style:
    664 |             self.style_def.update(dumb_css_parser(data))
    665 | 
    666 |         if self.maybe_automatic_link is not None:
    667 |             href = self.maybe_automatic_link
    668 |             if href == data and self.absolute_url_matcher.match(href):
    669 |                 self.o("<" + data + ">")
    670 |                 return
    671 |             else:
    672 |                 self.o("[")
    673 |                 self.maybe_automatic_link = None
    674 | 
    675 |         if not self.code and not self.pre:
    676 |             data = escape_md_section(data, snob=self.escape_snob)
    677 |         self.o(data, 1)
    678 | 
    679 |     def unknown_decl(self, data): pass
    680 | 
    681 |     def charref(self, name):
    682 |         if name[0] in ['x','X']:
    683 |             c = int(name[1:], 16)
    684 |         else:
    685 |             c = int(name)
    686 | 
    687 |         if not self.unicode_snob and c in unifiable_n.keys():
    688 |             return unifiable_n[c]
    689 |         else:
    690 |             try:
    691 |                 return unichr(c)
    692 |             except NameError: #Python3
    693 |                 return chr(c)
    694 | 
    695 |     def entityref(self, c):
    696 |         if not self.unicode_snob and c in unifiable.keys():
    697 |             return unifiable[c]
    698 |         else:
    699 |             try: name2cp(c)
    700 |             except KeyError: return "&" + c + ';'
    701 |             else:
    702 |                 try:
    703 |                     return unichr(name2cp(c))
    704 |                 except NameError: #Python3
    705 |                     return chr(name2cp(c))
    706 | 
    707 |     def replaceEntities(self, s):
    708 |         s = s.group(1)
    709 |         if s[0] == "#":
    710 |             return self.charref(s[1:])
    711 |         else: return self.entityref(s)
    712 | 
    713 |     r_unescape = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")
    714 |     def unescape(self, s):
    715 |         return self.r_unescape.sub(self.replaceEntities, s)
    716 | 
    717 |     def google_nest_count(self, style):
    718 |         """calculate the nesting count of google doc lists"""
    719 |         nest_count = 0
    720 |         if 'margin-left' in style:
    721 |             nest_count = int(style['margin-left'][:-2]) / self.google_list_indent
    722 |         return nest_count
    723 | 
    724 | 
    725 |     def optwrap(self, text):
    726 |         """Wrap all paragraphs in the provided text."""
    727 |         if not self.body_width:
    728 |             return text
    729 | 
    730 |         assert wrap, "Requires Python 2.3."
    731 |         result = ''
    732 |         newlines = 0
    733 |         for para in text.split("\n"):
    734 |             if len(para) > 0:
    735 |                 if not skipwrap(para):
    736 |                     result += "\n".join(wrap(para, self.body_width))
    737 |                     if para.endswith('  '):
    738 |                         result += "  \n"
    739 |                         newlines = 1
    740 |                     else:
    741 |                         result += "\n\n"
    742 |                         newlines = 2
    743 |                 else:
    744 |                     if not onlywhite(para):
    745 |                         result += para + "\n"
    746 |                         newlines = 1
    747 |             else:
    748 |                 if newlines < 2:
    749 |                     result += "\n"
    750 |                     newlines += 1
    751 |         return result
    752 | 
    753 | ordered_list_matcher = re.compile(r'\d+\.\s')
    754 | unordered_list_matcher = re.compile(r'[-\*\+]\s')
    755 | md_chars_matcher = re.compile(r"([\\\[\]\(\)])")
    756 | md_chars_matcher_all = re.compile(r"([`\*_{}\[\]\(\)#!])")
    757 | md_dot_matcher = re.compile(r"""
    758 |     ^             # start of line
    759 |     (\s*\d+)      # optional whitespace and a number
    760 |     (\.)          # dot
    761 |     (?=\s)        # lookahead assert whitespace
    762 |     """, re.MULTILINE | re.VERBOSE)
    763 | md_plus_matcher = re.compile(r"""
    764 |     ^
    765 |     (\s*)
    766 |     (\+)
    767 |     (?=\s)
    768 |     """, flags=re.MULTILINE | re.VERBOSE)
    769 | md_dash_matcher = re.compile(r"""
    770 |     ^
    771 |     (\s*)
    772 |     (-)
    773 |     (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
    774 |                   # or another dash (header or hr)
    775 |     """, flags=re.MULTILINE | re.VERBOSE)
    776 | slash_chars = r'\`*_{}[]()#+-.!'
    777 | md_backslash_matcher = re.compile(r'''
    778 |     (\\)          # match one slash
    779 |     (?=[%s])      # followed by a char that requires escaping
    780 |     ''' % re.escape(slash_chars),
    781 |     flags=re.VERBOSE)
    782 | 
    783 | def skipwrap(para):
    784 |     # If the text begins with four spaces or one tab, it's a code block; don't wrap
    785 |     if para[0:4] == '    ' or para[0] == '\t':
    786 |         return True
    787 |     # If the text begins with only two "--", possibly preceded by whitespace, that's
    788 |     # an emdash; so wrap.
    789 |     stripped = para.lstrip()
    790 |     if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
    791 |         return False
    792 |     # I'm not sure what this is for; I thought it was to detect lists, but there's
    793 |     # a <br>
-inside- case in one of the tests that also depends upon it. 794 | if stripped[0:1] == '-' or stripped[0:1] == '*': 795 | return True 796 | # If the text begins with a single -, *, or +, followed by a space, or an integer, 797 | # followed by a ., followed by a space (in either case optionally preceded by 798 | # whitespace), it's a list; don't wrap. 799 | if ordered_list_matcher.match(stripped) or unordered_list_matcher.match(stripped): 800 | return True 801 | return False 802 | 803 | def wrapwrite(text): 804 | text = text.encode('utf-8') 805 | try: #Python3 806 | sys.stdout.buffer.write(text) 807 | except AttributeError: 808 | sys.stdout.write(text) 809 | 810 | def html2text(html, baseurl='', plaintext=False): 811 | h = HTML2Text(baseurl=baseurl) 812 | h.ignore_links = plaintext 813 | h.ignore_emphasis = plaintext 814 | h.ignore_images = plaintext 815 | return h.handle(html) 816 | 817 | def unescape(s, unicode_snob=False): 818 | h = HTML2Text() 819 | h.unicode_snob = unicode_snob 820 | return h.unescape(s) 821 | 822 | def escape_md(text): 823 | """Escapes markdown-sensitive characters within other markdown constructs.""" 824 | return md_chars_matcher.sub(r"\\\1", text) 825 | 826 | def escape_md_section(text, snob=False): 827 | """Escapes markdown-sensitive characters across whole document sections.""" 828 | text = md_backslash_matcher.sub(r"\\\1", text) 829 | if snob: 830 | text = md_chars_matcher_all.sub(r"\\\1", text) 831 | text = md_dot_matcher.sub(r"\1\\\2", text) 832 | text = md_plus_matcher.sub(r"\1\\\2", text) 833 | text = md_dash_matcher.sub(r"\1\\\2", text) 834 | return text 835 | 836 | 837 | def main(): 838 | baseurl = '' 839 | 840 | p = optparse.OptionParser('%prog [(filename|url) [encoding]]', 841 | version='%prog ' + __version__) 842 | p.add_option("--ignore-emphasis", dest="ignore_emphasis", action="store_true", 843 | default=IGNORE_EMPHASIS, help="don't include any formatting for emphasis") 844 | p.add_option("--ignore-links", dest="ignore_links", action="store_true", 845 | default=IGNORE_ANCHORS, help="don't include any formatting for links") 846 | p.add_option("--ignore-images", dest="ignore_images", action="store_true", 847 | default=IGNORE_IMAGES, help="don't include any formatting for images") 848 | p.add_option("-g", "--google-doc", action="store_true", dest="google_doc", 849 | default=False, help="convert an html-exported Google Document") 850 | p.add_option("-d", "--dash-unordered-list", action="store_true", dest="ul_style_dash", 851 | default=False, help="use a dash rather than a star for unordered list items") 852 | p.add_option("-e", "--asterisk-emphasis", action="store_true", dest="em_style_asterisk", 853 | default=False, help="use an asterisk rather than an underscore for emphasized text") 854 | p.add_option("-b", "--body-width", dest="body_width", action="store", type="int", 855 | default=BODY_WIDTH, help="number of characters per output line, 0 for no wrap") 856 | p.add_option("-i", "--google-list-indent", dest="list_indent", action="store", type="int", 857 | default=GOOGLE_LIST_INDENT, help="number of pixels Google indents nested lists") 858 | p.add_option("-s", "--hide-strikethrough", action="store_true", dest="hide_strikethrough", 859 | default=False, help="hide strike-through text. only relevant when -g is specified as well") 860 | p.add_option("--escape-all", action="store_true", dest="escape_snob", 861 | default=False, help="Escape all special characters.
Output is less readable, but avoids corner case formatting issues.") 862 | (options, args) = p.parse_args() 863 | 864 | # process input 865 | encoding = "utf-8" 866 | if len(args) > 0: 867 | file_ = args[0] 868 | if len(args) == 2: 869 | encoding = args[1] 870 | if len(args) > 2: 871 | p.error('Too many arguments') 872 | 873 | if file_.startswith('http://') or file_.startswith('https://'): 874 | baseurl = file_ 875 | j = urllib.urlopen(baseurl) 876 | data = j.read() 877 | if encoding is None: 878 | try: 879 | from feedparser import _getCharacterEncoding as enc 880 | except ImportError: 881 | enc = lambda x, y: ('utf-8', 1) 882 | encoding = enc(j.headers, data)[0] 883 | if encoding == 'us-ascii': 884 | encoding = 'utf-8' 885 | else: 886 | data = open(file_, 'rb').read() 887 | if encoding is None: 888 | try: 889 | from chardet import detect 890 | except ImportError: 891 | detect = lambda x: {'encoding': 'utf-8'} 892 | encoding = detect(data)['encoding'] 893 | else: 894 | data = sys.stdin.read() 895 | 896 | data = data.decode(encoding) 897 | h = HTML2Text(baseurl=baseurl) 898 | # handle options 899 | if options.ul_style_dash: h.ul_item_mark = '-' 900 | if options.em_style_asterisk: 901 | h.emphasis_mark = '*' 902 | h.strong_mark = '__' 903 | 904 | h.body_width = options.body_width 905 | h.list_indent = options.list_indent 906 | h.ignore_emphasis = options.ignore_emphasis 907 | h.ignore_links = options.ignore_links 908 | h.ignore_images = options.ignore_images 909 | h.google_doc = options.google_doc 910 | h.hide_strikethrough = options.hide_strikethrough 911 | h.escape_snob = options.escape_snob 912 | 913 | wrapwrite(h.handle(data)) 914 | 915 | 916 | if __name__ == "__main__": 917 | main() 918 | -------------------------------------------------------------------------------- /r2e: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | set -e 4 | 5 | FEEDS=feeds.dat 6 | 7 | # Look for feeds.dat in the current directory and, if found, use that 8 | # as configuration data. Otherwise, use ~/.rss2email as a directory to 9 | # store the data. 10 | 11 | if [ -f "${FEEDS}" ]; then 12 | CFDIR=. 13 | else 14 | CFDIR="${HOME}/.rss2email" 15 | fi 16 | 17 | if [ ! -d "${CFDIR}" ]; then 18 | mkdir "${CFDIR}" 19 | fi 20 | 21 | # Add $CFDIR to $PYTHONPATH so that config.py can be found. 22 | PYTHONPATH="${CFDIR}:${PYTHONPATH}" python rss2email.py "${CFDIR}/${FEEDS}" "$@" 23 | -------------------------------------------------------------------------------- /r2e.bat: -------------------------------------------------------------------------------- 1 | @python rss2email.py feeds.dat %1 %2 %3 %4 %5 %6 %7 %8 %9 2 | -------------------------------------------------------------------------------- /readme.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | Getting Started With rss2email 4 | 5 | 6 |

    Getting Started With rss2email

    7 | 8 |

    We highly recommend that you subscribe to the rss2email project feed so you can keep up to date with the latest version, bugfixes and features: http://feeds.feedburner.com/allthingsrss/hJBr

    9 |

    Instructions for Windows Users
    10 | Instructions for UNIX Users
    11 | Customizing rss2email

    12 | 13 | 14 |

    Instructions for Windows Users

    15 | 16 |

    Requirements

    17 | 18 |

Before you install rss2email, you'll need to make sure that a few things are in place. First, you need a version of Python 2.x installed. Second, determine your outgoing email server's address. That should be all you need.

    19 | 20 |

    Download

    21 | 22 |
      23 |
1. Create a new folder
2. 24 | Download the latest rss2email .ZIP file
3. Unzip it to the new folder 25 |
    26 | 27 |

    Configure

    28 | 29 |

Edit the config.py file and fill in your outgoing email server's details. If your server requires you to log in, change "AUTHREQUIRED = 0" to "AUTHREQUIRED = 1" and enter your email username and password.
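For example, a minimal config.py for a mail server that requires login might look like this (the server name and credentials below are the placeholders from config.py.example; substitute your own):

SMTP_SERVER = "smtp.yourisp.net:25"
AUTHREQUIRED = 1
SMTP_USER = 'username'
SMTP_PASS = 'password'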

    30 | 31 |

    Install

    32 | 33 |

    From the command line, change to the folder you created. Now create a new feed database to send updates to your email address:

    34 | 35 |
    36 |

    r2e new you@yourdomain.com

    37 |
    38 | 39 |

    Subscribe to some feeds:

    40 | 41 |
    42 |

    r2e add http://feeds.feedburner.com/allthingsrss/hJBr

    43 |
    44 | 45 |

    That's the feed to be notified when there's a new version of rss2email. Repeat this for each feed you want to subscribe to.
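If you want to review or prune your subscriptions later, rss2email also has list, delete, pause and unpause commands (the feed numbers below are only examples; use the numbers printed by list):

r2e list
r2e delete 2
r2e pause 3
r2e unpause 3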

    46 | 47 |

    When you run rss2email, it emails you about every story it hasn't seen before. But the first time you run it, that will be every story. To avoid this, you can ask rss2email not to send you any stories the first time you run it:

    48 | 49 |
    50 |

    r2e run --no-send

    51 |
    52 | 53 |

    Then later, you can ask it to email you new stories:

    54 | 55 |
    56 |

    r2e run

    57 | 58 |
    59 | 60 |

    If you get an error message "Sender domain must exist", add a line to config.py like this:

    61 | 62 |
    63 |

DEFAULT_FROM = "rss2email@yoursite.com"

    64 |
    65 | 66 |

    You can make the email address whatever you want, but your mail server requires that the yoursite.com part actually exists.

    67 | 68 |

    Automating rss2email

    69 | 70 |

More than likely you will want rss2email to run automatically at a regular interval. Under Windows this can be easily accomplished using the Windows Task Scheduler. This site has a nice tutorial on it. Just select r2e.bat as the program to run. Once you've created the task, double-click on it in the task list and change the Run entry so that "run" comes after r2e.bat. For example, if you installed rss2email in the C:\rss2email folder, then you would change the Run entry from "C:\rss2email\r2e.bat" to "C:\rss2email\r2e.bat run".
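If you prefer the command line, an equivalent scheduled task can be created with the schtasks tool (the task name, hourly interval and install folder below are only examples):

schtasks /Create /SC HOURLY /TN rss2email /TR "C:\rss2email\r2e.bat run"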

    71 | 72 |

    Now jump down to the section on customizing rss2email to your needs.

    73 | 74 |

    Upgrading to a new version

    75 |

Simply copy all of the files from the .ZIP package into your install directory, EXCEPT config.py

    76 | 77 |

    Instructions for UNIX/Linux Users

    78 | 79 |

    Requirements

    80 | 81 |

Before you install rss2email, you'll need to make sure that a few things are in place. First, you need a version of Python 2.x installed. Second, check whether you have sendmail (or a compatible replacement like postfix) installed. If sendmail isn't installed, determine your outgoing email server's address. That should be all you need.

    82 | 83 |

    Download

    84 | 85 |

A quick way to get rss2email going is to use pre-made packages. Here are releases for Debian Linux, Ubuntu Linux and NetBSD.

    86 | 87 |

    If you are unable to use these packages or you want the latest and greatest version, here's what you do:

    88 | 89 |
    90 | Unarchive (probably 'tar -xzf') the rss2email .tar.gz package to [folder where you want rss2email files to live]
    91 | cd [yourfolder]
    92 | chmod +x r2e 93 |
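Concretely, that might look like this (the archive name and install folder are only examples):

mkdir ~/rss2email
tar -xzf rss2email.tar.gz -C ~/rss2email
cd ~/rss2email
chmod +x r2e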
    94 | 95 |

    Install

    96 | 97 |

    Create a new feed database with your target email address:

    98 | 99 |
    100 |

    ./r2e new you@yourdomain.com

    101 |
    102 | 103 |

    Subscribe to some feeds:

    104 | 105 |
    106 |

    ./r2e add http://feeds.feedburner.com/allthingsrss/hJBr

    107 | 108 |
    109 | 110 |

    That's the feed to be notified when there's a new version of rss2email. Repeat this for each feed you want to subscribe to.
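Since this fork delivers over IMAP, the add command also accepts an optional email address and an IMAP folder after the feed URL (per the usage text in rss2email.py), so a feed can be filed outside INBOX. The URL, address and folder here are only examples:

./r2e add http://example.com/feed.xml you@yourdomain.com News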

    111 | 112 |

    When you run rss2email, it emails you about every story it hasn't seen before. But the first time you run it, that will be every story. To avoid this, you can ask rss2email not to send you any stories the first time you run it:

    113 | 114 |
    115 |

    ./r2e run --no-send

    116 |
    117 | 118 |

    Then later, you can ask it to email you new stories:

    119 | 120 |
    121 |

    ./r2e run

    122 |
    123 | 124 |

    You probably want to set things up so that this command is run repeatedly. (One good way is via a cron job.)
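For example, a crontab entry like this one checks your feeds every half hour (the interval and install path are examples; note that the r2e wrapper expects to find rss2email.py in the current directory):

*/30 * * * * cd $HOME/rss2email && ./r2e run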

    125 | 126 |

    If you get an error message "Sender domain must exist", add a line to config.py like this:

    127 | 128 |
    129 |

DEFAULT_FROM = "rss2email@yoursite.com"

    130 |
    131 | 132 |

    You can make the email address whatever you want, but your mail server requires that the yoursite.com part actually exists.

    133 | 134 |

    Upgrading to a new version

    135 |

Simply copy all of the files from the .tar.gz package into your install directory, EXCEPT config.py

    136 | 137 | 138 | 139 |

    Customize rss2email

    140 | 141 |

There are a number of options, described in full at the top of the rss2email.py file, to customize the way rss2email behaves. If you want to change something, edit the config.py file. If you're not using rss2email under Windows, you'll have to create this file if it doesn't already exist.

    For example, if you want to receive HTML mail, instead of having entries converted to plain text:

    HTML_MAIL = 1

    To be notified every time a post changes, instead of just when it's first posted:

    TRUST_GUID = 0

    And to make the emails look as if they were sent when the item was posted:

    DATE_HEADER = 1
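
    Putting these together, a minimal config.py might look like the sketch below. The option names are the ones used above and in rss2email.py; the values are illustrative placeholders:

    # config.py -- example settings, adjust to taste
    HTML_MAIL = 1                             # send HTML mail instead of plain text
    TRUST_GUID = 0                            # resend entries whenever their content changes
    DATE_HEADER = 1                           # date emails by the item's publication time
    DEFAULT_FROM = "rss2email@yoursite.com"   # fallback sender address
    SMTP_SEND = 0                             # 0: deliver via /usr/sbin/sendmail, 1: via SMTP_SERVER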

    160 | 161 | 162 | 163 | -------------------------------------------------------------------------------- /rss2email.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | """rss2email: get RSS feeds emailed to you 3 | http://rss2email.infogami.com 4 | 5 | Usage: 6 | new [emailaddress] (create new feedfile) 7 | email newemailaddress (update default email) 8 | run [--no-send] [num], 9 | add feedurl [emailaddress] [folder] 10 | list 11 | reset 12 | delete n 13 | pause n 14 | unpause n 15 | opmlexport 16 | opmlimport filename 17 | """ 18 | __version__ = "2.72-rcarmo" 19 | __author__ = "Lindsey Smith (lindsey@allthingsrss.com)" 20 | __copyright__ = "(C) 2004 Aaron Swartz. GNU GPL 2 or 3." 21 | ___contributors__ = ["Dean Jackson", "Brian Lalor", "Joey Hess", 22 | "Matej Cepl", "Martin 'Joey' Schulze", 23 | "Marcel Ackermann (http://www.DreamFlasher.de)", 24 | "Rui Carmo (http://taoofmac.com)", 25 | "Lindsey Smith (maintainer)", "Erik Hetzner", 26 | "Aaron Swartz (original author)" ] 27 | 28 | ### Import Modules ### 29 | 30 | import os, sys, re, time 31 | from datetime import datetime, timedelta 32 | import socket, urllib2, urlparse, imaplib, smtplib 33 | urllib2.install_opener(urllib2.build_opener()) 34 | import string, csv, StringIO 35 | import hashlib, base64 36 | import traceback, types 37 | from types import * 38 | import threading, subprocess 39 | import cPickle as pickle 40 | 41 | from email.MIMEText import MIMEText 42 | from email.Header import Header 43 | from email.Utils import parseaddr, formataddr 44 | 45 | import feedparser 46 | feedparser.USER_AGENT = "rss2email/"+__version__+ " +https://github.com/rcarmo/rss2imap" 47 | 48 | import html2text as h2t 49 | from summarize import summarize 50 | 51 | DEFAULT_IMAP_FOLDER = "INBOX" 52 | 53 | # Read options from config file if present. 54 | sys.path.insert(0,".") 55 | try: 56 | from config import * 57 | except: 58 | pass 59 | 60 | h2t.UNICODE_SNOB = UNICODE_SNOB 61 | h2t.LINKS_EACH_PARAGRAPH = LINKS_EACH_PARAGRAPH 62 | h2t.BODY_WIDTH = BODY_WIDTH 63 | html2text = h2t.html2text 64 | 65 | 66 | def send(sender, recipient, subject, body, contenttype, when, extraheaders=None, mailserver=None, folder=None): 67 | """Send an email. 68 | 69 | All arguments should be Unicode strings (plain ASCII works as well). 70 | 71 | Only the real name part of sender and recipient addresses may contain 72 | non-ASCII characters. 73 | 74 | The email will be properly MIME encoded and delivered though SMTP to 75 | localhost port 25. This is easy to change if you want something different. 76 | 77 | The charset of the email will be the first one out of the list 78 | that can represent all the characters occurring in the email. 79 | """ 80 | 81 | # Header class is smart enough to try US-ASCII, then the charset we 82 | # provide, then fall back to UTF-8. 83 | header_charset = 'ISO-8859-1' 84 | 85 | # We must choose the body charset manually 86 | for body_charset in CHARSET_LIST: 87 | try: 88 | body.encode(body_charset) 89 | except (UnicodeError, LookupError): 90 | pass 91 | else: 92 | break 93 | 94 | # Split real name (which is optional) and email address parts 95 | sender_name, sender_addr = parseaddr(sender) 96 | recipient_name, recipient_addr = parseaddr(recipient) 97 | 98 | # We must always pass Unicode strings to Header, otherwise it will 99 | # use RFC 2047 encoding even on plain ASCII strings. 
100 | sender_name = str(Header(unicode(sender_name), header_charset)) 101 | recipient_name = str(Header(unicode(recipient_name), header_charset)) 102 | 103 | # Make sure email addresses do not contain non-ASCII characters 104 | sender_addr = sender_addr.encode('ascii') 105 | recipient_addr = recipient_addr.encode('ascii') 106 | 107 | # Create the message ('plain' stands for Content-Type: text/plain) 108 | msg = MIMEText(body.encode(body_charset), contenttype, body_charset) 109 | if IMAP_OVERRIDE_TO: 110 | msg['To'] = IMAP_OVERRIDE_TO 111 | else: 112 | msg['To'] = formataddr((recipient_name, recipient_addr)) 113 | msg['Subject'] = Header(unicode(subject), header_charset) 114 | for hdr in extraheaders.keys(): 115 | try: 116 | msg[hdr] = Header(unicode(extraheaders[hdr], header_charset)) 117 | except: 118 | msg[hdr] = Header(extraheaders[hdr]) 119 | 120 | fromhdr = formataddr((sender_name, sender_addr)) 121 | if IMAP_MUNGE_FROM: 122 | msg['From'] = extraheaders['X-MUNGED-FROM'] 123 | else: 124 | msg['From'] = fromhdr 125 | 126 | msg_as_string = msg.as_string() 127 | 128 | if IMAP_SEND: 129 | if not mailserver: 130 | try: 131 | (host,port) = IMAP_SERVER.split(':',1) 132 | except ValueError: 133 | host = IMAP_SERVER 134 | port = 993 if IMAP_SSL else 143 135 | try: 136 | if IMAP_SSL: 137 | mailserver = imaplib.IMAP4_SSL(host, port) 138 | else: 139 | mailserver = imaplib.IMAP4(host, port) 140 | # speed up interactions on TCP connections using small packets 141 | mailserver.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1) 142 | mailserver.login(IMAP_USER, IMAP_PASS) 143 | except KeyboardInterrupt: 144 | raise 145 | except Exception, e: 146 | print >>warn, "" 147 | print >>warn, ('Fatal error: could not connect to mail server "%s"' % IMAP_SERVER) 148 | print >>warn, ('Check your config.py file to confirm that IMAP_SERVER and other mail server settings are configured properly') 149 | if hasattr(e, 'reason'): 150 | print >>warn, "Reason:", e.reason 151 | sys.exit(1) 152 | if not folder: 153 | folder = DEFAULT_IMAP_FOLDER 154 | #mailserver.debug = 4 155 | if mailserver.select(folder)[0] == 'NO': 156 | print >>warn, ("%s does not exist, creating" % folder) 157 | mailserver.create(folder) 158 | mailserver.subscribe(folder) 159 | mailserver.append(folder,'',imaplib.Time2Internaldate(when), msg_as_string) 160 | return mailserver 161 | 162 | elif SMTP_SEND: 163 | if not mailserver: 164 | 165 | try: 166 | if SMTP_SSL: 167 | mailserver = smtplib.SMTP_SSL() 168 | else: 169 | mailserver = smtplib.SMTP() 170 | mailserver.connect(SMTP_SERVER) 171 | except KeyboardInterrupt: 172 | raise 173 | except Exception, e: 174 | print >>warn, "" 175 | print >>warn, ('Fatal error: could not connect to mail server "%s"' % SMTP_SERVER) 176 | print >>warn, ('Check your config.py file to confirm that SMTP_SERVER and other mail server settings are configured properly') 177 | if hasattr(e, 'reason'): 178 | print >>warn, "Reason:", e.reason 179 | sys.exit(1) 180 | 181 | if AUTHREQUIRED: 182 | try: 183 | mailserver.ehlo() 184 | if not SMTP_SSL: mailserver.starttls() 185 | mailserver.ehlo() 186 | mailserver.login(SMTP_USER, SMTP_PASS) 187 | except KeyboardInterrupt: 188 | raise 189 | except Exception, e: 190 | print >>warn, "" 191 | print >>warn, ('Fatal error: could not authenticate with mail server "%s" as user "%s"' % (SMTP_SERVER, SMTP_USER)) 192 | print >>warn, ('Check your config.py file to confirm that SMTP_SERVER and other mail server settings are configured properly') 193 | if hasattr(e, 'reason'): 194 | print 
>>warn, "Reason:", e.reason 195 | sys.exit(1) 196 | 197 | mailserver.sendmail(sender, recipient, msg_as_string) 198 | return mailserver 199 | 200 | else: 201 | try: 202 | p = subprocess.Popen(["/usr/sbin/sendmail", recipient], stdin=subprocess.PIPE, stdout=subprocess.PIPE) 203 | p.communicate(msg_as_string) 204 | status = p.returncode 205 | assert status != None, "just a sanity check" 206 | if status != 0: 207 | print >>warn, "" 208 | print >>warn, ('Fatal error: sendmail exited with code %s' % status) 209 | sys.exit(1) 210 | except: 211 | print '''Error attempting to send email via sendmail. Possibly you need to configure your config.py to use a SMTP server? Please refer to the rss2email documentation or website (http://rss2email.infogami.com) for complete documentation of config.py. The options below may suffice for configuring email: 212 | # 1: Use SMTP_SERVER to send mail. 213 | # 0: Call /usr/sbin/sendmail to send mail. 214 | SMTP_SEND = 0 215 | 216 | SMTP_SERVER = "smtp.yourisp.net:25" 217 | AUTHREQUIRED = 0 # if you need to use SMTP AUTH set to 1 218 | SMTP_USER = 'username' # for SMTP AUTH, set SMTP username here 219 | SMTP_PASS = 'password' # for SMTP AUTH, set SMTP password here 220 | ''' 221 | sys.exit(1) 222 | return None 223 | 224 | 225 | warn = sys.stderr 226 | 227 | if QP_REQUIRED: 228 | print >>warn, "QP_REQUIRED has been deprecated in rss2email." 229 | 230 | 231 | unix = 0 232 | try: 233 | import fcntl 234 | # A pox on SunOS file locking methods 235 | if (sys.platform.find('sunos') == -1): 236 | unix = 1 237 | except: 238 | pass 239 | 240 | socket_errors = [] 241 | for e in ['error', 'gaierror']: 242 | if hasattr(socket, e): socket_errors.append(getattr(socket, e)) 243 | 244 | ### Utility Functions ### 245 | 246 | import threading 247 | class TimeoutError(Exception): pass 248 | 249 | class InputError(Exception): pass 250 | 251 | def timelimit(timeout, function): 252 | # def internal(function): 253 | def internal2(*args, **kw): 254 | """ 255 | from http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/473878 256 | """ 257 | class Calculator(threading.Thread): 258 | def __init__(self): 259 | threading.Thread.__init__(self) 260 | self.result = None 261 | self.error = None 262 | 263 | def run(self): 264 | try: 265 | self.result = function(*args, **kw) 266 | except: 267 | self.error = sys.exc_info() 268 | 269 | c = Calculator() 270 | c.setDaemon(True) # don't hold up exiting 271 | c.start() 272 | c.join(timeout) 273 | if c.isAlive(): 274 | raise TimeoutError 275 | if c.error: 276 | raise c.error[0], c.error[1] 277 | return c.result 278 | return internal2 279 | # return internal 280 | 281 | 282 | def isstr(f): return isinstance(f, type('')) or isinstance(f, type(u'')) 283 | def ishtml(t): return type(t) is type(()) 284 | def contains(a,b): return a.find(b) != -1 285 | def unu(s): # I / freakin' hate / that unicode 286 | if type(s) is types.UnicodeType: return s.encode('utf-8') 287 | else: return s 288 | 289 | ### Parsing Utilities ### 290 | 291 | def getContent(entry, HTMLOK=0): 292 | """Select the best content from an entry, deHTMLizing if necessary. 293 | If raw HTML is best, an ('HTML', best) tuple is returned. """ 294 | 295 | # How this works: 296 | # * We have a bunch of potential contents. 297 | # * We go thru looking for our first choice. 298 | # (HTML or text, depending on HTMLOK) 299 | # * If that doesn't work, we go thru looking for our second choice. 300 | # * If that still doesn't work, we just take the first one. 
301 | # 302 | # Possible future improvement: 303 | # * Instead of just taking the first one 304 | # pick the one in the "best" language. 305 | # * HACK: hardcoded HTMLOK, should take a tuple of media types 306 | 307 | conts = entry.get('content', []) 308 | 309 | if entry.get('summary_detail', {}): 310 | conts += [entry.summary_detail] 311 | 312 | if conts: 313 | if HTMLOK: 314 | for c in conts: 315 | if contains(c.type, 'html'): return ('HTML', c.value) 316 | 317 | if not HTMLOK: # Only need to convert to text if HTML isn't OK 318 | for c in conts: 319 | if contains(c.type, 'html'): 320 | return html2text(c.value) 321 | 322 | for c in conts: 323 | if c.type == 'text/plain': return c.value 324 | 325 | return conts[0].value 326 | 327 | return "" 328 | 329 | def getID(entry): 330 | """Get best ID from an entry.""" 331 | if TRUST_GUID: 332 | if 'id' in entry and entry.id: 333 | # Newer versions of feedparser could return a dictionary 334 | if type(entry.id) is DictType: 335 | return entry.id.values()[0] 336 | 337 | return entry.id 338 | 339 | content = getContent(entry) 340 | if content and content != "\n": return hashlib.sha1(unu(content)).hexdigest() 341 | if 'link' in entry: return entry.link 342 | if 'title' in entry: return hashlib.sha1(unu(entry.title)).hexdigest() 343 | 344 | def getName(r, entry): 345 | """Get the best name.""" 346 | 347 | if NO_FRIENDLY_NAME: return '' 348 | 349 | feed = r.feed 350 | if hasattr(r, "url") and r.url in OVERRIDE_FROM.keys(): 351 | return OVERRIDE_FROM[r.url] 352 | 353 | name = feed.get('title', '') 354 | 355 | if 'name' in entry.get('author_detail', []): # normally {} but py2.1 356 | if entry.author_detail.name: 357 | if name: name += ": " 358 | det=entry.author_detail.name 359 | try: 360 | name += entry.author_detail.name 361 | except UnicodeDecodeError: 362 | name += unicode(entry.author_detail.name, 'utf-8') 363 | 364 | elif 'name' in feed.get('author_detail', []): 365 | if feed.author_detail.name: 366 | if name: name += ", " 367 | name += feed.author_detail.name 368 | 369 | return name 370 | 371 | 372 | def getMungedFrom(r): 373 | """Generate a better From.""" 374 | 375 | feed = r.feed 376 | if hasattr(r, "url") and r.url in OVERRIDE_FROM.keys(): 377 | return OVERRIDE_FROM[r.url] 378 | 379 | name = feed.get('title', 'unknown').lower() 380 | pattern = re.compile('[\W_]+',re.UNICODE) 381 | re.sub(pattern, '', name) 382 | name = "%s <%s@%s>" % (feed.get('title','Unnamed Feed'), name.replace(' ','_'), urlparse.urlparse(r.url).netloc) 383 | return name 384 | 385 | 386 | def validateEmail(email, planb): 387 | """Do a basic quality check on email address, but return planb if email doesn't appear to be well-formed""" 388 | email_parts = email.split('@') 389 | if len(email_parts) != 2: 390 | return planb 391 | return email 392 | 393 | def getEmail(r, entry): 394 | """Get the best email_address. 
If the best guess isn't well-formed (something@somthing.com), use DEFAULT_FROM instead""" 395 | 396 | feed = r.feed 397 | 398 | if FORCE_FROM: return DEFAULT_FROM 399 | 400 | if hasattr(r, "url") and r.url in OVERRIDE_EMAIL.keys(): 401 | return validateEmail(OVERRIDE_EMAIL[r.url], DEFAULT_FROM) 402 | 403 | if 'email' in entry.get('author_detail', []): 404 | return validateEmail(entry.author_detail.email, DEFAULT_FROM) 405 | 406 | if 'email' in feed.get('author_detail', []): 407 | return validateEmail(feed.author_detail.email, DEFAULT_FROM) 408 | 409 | if USE_PUBLISHER_EMAIL: 410 | if 'email' in feed.get('publisher_detail', []): 411 | return validateEmail(feed.publisher_detail.email, DEFAULT_FROM) 412 | 413 | if feed.get("errorreportsto", ''): 414 | return validateEmail(feed.errorreportsto, DEFAULT_FROM) 415 | 416 | if hasattr(r, "url") and r.url in DEFAULT_EMAIL.keys(): 417 | return DEFAULT_EMAIL[r.url] 418 | return DEFAULT_FROM 419 | 420 | ### Simple Database of Feeds ### 421 | 422 | class Feed: 423 | def __init__(self, url, to, folder=None): 424 | self.url, self.etag, self.modified, self.seen = url, None, None, {} 425 | self.active = True 426 | self.to = to 427 | self.folder = folder 428 | 429 | def load(lock=1): 430 | if not os.path.exists(feedfile): 431 | print 'Feedfile "%s" does not exist. If you\'re using r2e for the first time, you' % feedfile 432 | print "have to run 'r2e new' first." 433 | sys.exit(1) 434 | try: 435 | feedfileObject = open(feedfile, 'r') 436 | except IOError, e: 437 | print "Feedfile could not be opened: %s" % e 438 | sys.exit(1) 439 | feeds = pickle.load(feedfileObject) 440 | 441 | if lock: 442 | locktype = 0 443 | if unix: 444 | locktype = fcntl.LOCK_EX 445 | fcntl.flock(feedfileObject.fileno(), locktype) 446 | #HACK: to deal with lock caching 447 | feedfileObject = open(feedfile, 'r') 448 | feeds = pickle.load(feedfileObject) 449 | if unix: 450 | fcntl.flock(feedfileObject.fileno(), locktype) 451 | if feeds: 452 | for feed in feeds[1:]: 453 | if not hasattr(feed, 'active'): 454 | feed.active = True 455 | 456 | return feeds, feedfileObject 457 | 458 | def unlock(feeds, feedfileObject): 459 | if not unix: 460 | pickle.dump(feeds, open(feedfile, 'w')) 461 | else: 462 | fd = open(feedfile+'.tmp', 'w') 463 | pickle.dump(feeds, fd) 464 | fd.flush() 465 | os.fsync(fd.fileno()) 466 | fd.close() 467 | os.rename(feedfile+'.tmp', feedfile) 468 | fcntl.flock(feedfileObject.fileno(), fcntl.LOCK_UN) 469 | 470 | #@timelimit(FEED_TIMEOUT) 471 | def parse(url, etag, modified): 472 | if PROXY == '': 473 | return feedparser.parse(url, etag, modified) 474 | else: 475 | proxy = urllib2.ProxyHandler( {"http":PROXY} ) 476 | return feedparser.parse(url, etag, modified, handlers = [proxy]) 477 | 478 | 479 | ### Program Functions ### 480 | 481 | def add(*args): 482 | if len(args) == 2 and contains(args[1], '@') and not contains(args[1], '://'): 483 | urls, to = [args[0]], args[1] 484 | folder = None 485 | elif len(args) >= 2: 486 | urls, to, folder = [args[0]], None, ' '.join(args[1:]) 487 | else: 488 | urls, to, folder = args, None, None 489 | 490 | feeds, feedfileObject = load() 491 | if (feeds and not isstr(feeds[0]) and to is None) or (not len(feeds) and to is None): 492 | print "No email address has been defined. Please run 'r2e email emailaddress' or" 493 | print "'r2e add url emailaddress'." 
494 | sys.exit(1) 495 | for url in urls: feeds.append(Feed(url, to, folder)) 496 | unlock(feeds, feedfileObject) 497 | 498 | 499 | ### HTML Parser for grabbing links and images ### 500 | 501 | from HTMLParser import HTMLParser 502 | class Parser(HTMLParser): 503 | def __init__(self, tag = 'a', attr = 'href'): 504 | HTMLParser.__init__(self) 505 | self.tag = tag 506 | self.attr = attr 507 | self.attrs = [] 508 | def handle_starttag(self, tag, attrs): 509 | if tag == self.tag: 510 | attrs = dict(attrs) 511 | if self.attr in attrs: 512 | self.attrs.append(attrs[self.attr]) 513 | 514 | 515 | ### CSV dialect for parsing IMAP responses 516 | class mailboxlist(csv.excel): 517 | delimiter = ' ' 518 | 519 | 520 | csv.register_dialect('mailboxlist',mailboxlist) 521 | 522 | 523 | def uid(data): 524 | m = re.match('\d+ \(UID (?P\d+)\)', data) 525 | return m.group('uid') 526 | 527 | 528 | def run(num=None): 529 | feeds, feedfileObject = load() 530 | mailserver = None 531 | try: 532 | # We store the default to address as the first item in the feeds list. 533 | # Here we take it out and save it for later. 534 | default_to = "" 535 | if feeds and isstr(feeds[0]): default_to = feeds[0]; ifeeds = feeds[1:] 536 | else: ifeeds = feeds 537 | 538 | if num: ifeeds = [feeds[num]] 539 | feednum = 0 540 | 541 | for f in ifeeds: 542 | try: 543 | feednum += 1 544 | if not f.active: continue 545 | 546 | if VERBOSE: print >>warn, 'I: Processing [%d] "%s"' % (feednum, f.url) 547 | r = {} 548 | try: 549 | r = timelimit(FEED_TIMEOUT, parse)(f.url, f.etag, f.modified) 550 | except TimeoutError: 551 | print >>warn, 'W: feed [%d] "%s" timed out' % (feednum, f.url) 552 | continue 553 | 554 | # Handle various status conditions, as required 555 | if 'status' in r: 556 | if r.status == 301: f.url = r['url'] 557 | elif r.status == 410: 558 | print >>warn, "W: feed gone; deleting", f.url 559 | feeds.remove(f) 560 | continue 561 | 562 | http_status = r.get('status', 200) 563 | if VERBOSE > 1: print >>warn, "I: http status", http_status 564 | http_headers = r.get('headers', { 565 | 'content-type': 'application/rss+xml', 566 | 'content-length':'1'}) 567 | exc_type = r.get("bozo_exception", Exception()).__class__ 568 | if http_status != 304 and not r.entries and not r.get('version', ''): 569 | if http_status not in [200, 302]: 570 | print >>warn, "W: error %d [%d] %s" % (http_status, feednum, f.url) 571 | 572 | elif contains(http_headers.get('content-type', 'rss'), 'html'): 573 | print >>warn, "W: looks like HTML [%d] %s" % (feednum, f.url) 574 | 575 | elif http_headers.get('content-length', '1') == '0': 576 | print >>warn, "W: empty page [%d] %s" % (feednum, f.url) 577 | 578 | elif hasattr(socket, 'timeout') and exc_type == socket.timeout: 579 | print >>warn, "W: timed out on [%d] %s" % (feednum, f.url) 580 | 581 | elif exc_type == IOError: 582 | print >>warn, 'W: "%s" [%d] %s' % (r.bozo_exception, feednum, f.url) 583 | 584 | elif hasattr(feedparser, 'zlib') and exc_type == feedparser.zlib.error: 585 | print >>warn, "W: broken compression [%d] %s" % (feednum, f.url) 586 | 587 | elif exc_type in socket_errors: 588 | exc_reason = r.bozo_exception.args[1] 589 | print >>warn, "W: %s [%d] %s" % (exc_reason, feednum, f.url) 590 | 591 | elif exc_type == urllib2.URLError: 592 | if r.bozo_exception.reason.__class__ in socket_errors: 593 | exc_reason = r.bozo_exception.reason.args[1] 594 | else: 595 | exc_reason = r.bozo_exception.reason 596 | print >>warn, "W: %s [%d] %s" % (exc_reason, feednum, f.url) 597 | 598 | elif exc_type == 
AttributeError: 599 | print >>warn, "W: %s [%d] %s" % (r.bozo_exception, feednum, f.url) 600 | 601 | elif exc_type == KeyboardInterrupt: 602 | raise r.bozo_exception 603 | 604 | elif r.bozo: 605 | print >>warn, 'E: error in [%d] "%s" feed (%s)' % (feednum, f.url, r.get("bozo_exception", "can't process")) 606 | 607 | else: 608 | print >>warn, "=== rss2email encountered a problem with this feed ===" 609 | print >>warn, "=== See the rss2email FAQ at http://www.allthingsrss.com/rss2email/ for assistance ===" 610 | print >>warn, "=== If this occurs repeatedly, send this to lindsey@allthingsrss.com ===" 611 | print >>warn, "E:", r.get("bozo_exception", "can't process"), f.url 612 | print >>warn, r 613 | print >>warn, "rss2email", __version__ 614 | print >>warn, "feedparser", feedparser.__version__ 615 | print >>warn, "html2text", h2t.__version__ 616 | print >>warn, "Python", sys.version 617 | print >>warn, "=== END HERE ===" 618 | continue 619 | 620 | r.entries.reverse() 621 | 622 | for entry in r.entries: 623 | id = getID(entry) 624 | 625 | # If TRUST_GUID isn't set, we get back hashes of the content. 626 | # Instead of letting these run wild, we put them in context 627 | # by associating them with the actual ID (if it exists). 628 | 629 | frameid = entry.get('id') 630 | if not(frameid): frameid = id 631 | if type(frameid) is DictType: 632 | frameid = frameid.values()[0] 633 | 634 | # If this item's ID is in our database 635 | # then it's already been sent 636 | # and we don't need to do anything more. 637 | 638 | if frameid in f.seen: 639 | if f.seen[frameid] == id: continue 640 | 641 | if not (f.to or default_to): 642 | print "No default email address defined. Please run 'r2e email emailaddress'" 643 | print "Ignoring feed %s" % f.url 644 | break 645 | 646 | if 'title_detail' in entry and entry.title_detail: 647 | title = entry.title_detail.value 648 | if contains(entry.title_detail.type, 'html'): 649 | title = html2text(title) 650 | else: 651 | title = getContent(entry)[:70] 652 | 653 | title = title.replace("\n", " ").strip() 654 | 655 | when = time.gmtime() 656 | 657 | if DATE_HEADER: 658 | for datetype in DATE_HEADER_ORDER: 659 | kind = datetype+"_parsed" 660 | if kind in entry and entry[kind]: when = entry[kind] 661 | 662 | link = entry.get('link', "") 663 | 664 | from_addr = getEmail(r, entry) 665 | 666 | name = h2t.unescape(getName(r, entry)) 667 | fromhdr = formataddr((name, from_addr,)) 668 | tohdr = (f.to or default_to) 669 | subjecthdr = title 670 | datehdr = time.strftime("%a, %d %b %Y %H:%M:%S -0000", when) 671 | useragenthdr = "rss2email" 672 | 673 | # Add post tags, if available 674 | tagline = "" 675 | if 'tags' in entry: 676 | tags = entry.get('tags') 677 | taglist = [] 678 | if tags: 679 | for tag in tags: 680 | taglist.append(tag['term']) 681 | if taglist: 682 | tagline = ",".join(taglist) 683 | 684 | extraheaders = {'Date': datehdr, 'User-Agent': useragenthdr, 'X-RSS-Feed': f.url, 'Message-ID': '<%s>' % hashlib.sha1(id.encode('utf-8')).hexdigest(), 'X-RSS-ID': id, 'X-RSS-URL': link, 'X-RSS-TAGS' : tagline, 'X-MUNGED-FROM': getMungedFrom(r), 'References': ''} 685 | if BONUS_HEADER != '': 686 | for hdr in BONUS_HEADER.strip().splitlines(): 687 | pos = hdr.strip().find(':') 688 | if pos > 0: 689 | extraheaders[hdr[:pos]] = hdr[pos+1:].strip() 690 | else: 691 | print >>warn, "W: malformed BONUS HEADER", BONUS_HEADER 692 | 693 | entrycontent = getContent(entry, HTMLOK=HTML_MAIL) 694 | contenttype = 'plain' 695 | content = '' 696 | if THREAD_ON_TAGS and len(tagline): 697 | 
extraheaders['References'] += ''.join([' <%s>' % hashlib.sha1(t.strip().encode('utf-8')).hexdigest() for t in tagline.split(',')]) 698 | if USE_CSS_STYLING and HTML_MAIL: 699 | contenttype = 'html' 700 | content = "\n" 701 | content += '\n' 702 | content += '\n' 703 | content += '
    \n' 704 | content += '

    '+subjecthdr+'

    \n' 706 | if ishtml(entrycontent): 707 | body = entrycontent[1].strip() 708 | if SUMMARIZE: 709 | content += '
    %s
    ' % (summarize(html2text(body, plaintext=True), SUMMARIZE) + "
    ") 710 | else: 711 | body = entrycontent.strip() 712 | if SUMMARIZE: 713 | content += '
    %s
    ' % (summarize(body, SUMMARIZE) + "
    ") 714 | if THREAD_ON_LINKS: 715 | parser = Parser() 716 | parser.feed(body) 717 | extraheaders['References'] += ''.join([' <%s>' % hashlib.sha1(h.strip().encode('utf-8')).hexdigest() for h in parser.attrs]) 718 | if INLINE_IMAGES_DATA_URI: 719 | parser = Parser(tag='img', attr='src') 720 | parser.feed(body) 721 | for src in parser.attrs: 722 | try: 723 | img = feedparser._open_resource(src, None, None, feedparser.USER_AGENT, link, [], {}) 724 | data = img.read() 725 | if hasattr(img, 'headers'): 726 | headers = dict((k.lower(), v) for k, v in dict(img.headers).items()) 727 | ctype = headers.get('content-type', None) 728 | if ctype and INLINE_IMAGES_DATA_URI: 729 | body = body.replace(src,'data:%s;base64,%s' % (ctype, base64.b64encode(data))) 730 | except: 731 | print >>warn, "Could not load image: %s" % src 732 | pass 733 | if body != '': 734 | content += '
    \n' + body + '
    \n' 735 | content += '\n
    \n' 752 | content += "\n\n" 753 | else: 754 | if ishtml(entrycontent): 755 | contenttype = 'html' 756 | content = "\n" 757 | content = ("\n\n" + 758 | '

    '+subjecthdr+'

    \n\n' + 759 | entrycontent[1].strip() + # drop type tag (HACK: bad abstraction) 760 | '

    URL: '+link+'

    ' ) 761 | 762 | if hasattr(entry,'enclosures'): 763 | for enclosure in entry.enclosures: 764 | if enclosure.url != "": 765 | content += ('Enclosure: '+enclosure.url+"
    \n") 766 | if 'links' in entry: 767 | for extralink in entry.links: 768 | if ('rel' in extralink) and extralink['rel'] == u'via': 769 | content += 'Via: '+extralink['title']+'
    \n' 770 | 771 | content += ("\n") 772 | else: 773 | content = entrycontent.strip() + "\n\nURL: "+link 774 | if hasattr(entry,'enclosures'): 775 | for enclosure in entry.enclosures: 776 | if enclosure.url != "": 777 | content += ('\nEnclosure: ' + enclosure.url + "\n") 778 | if 'links' in entry: 779 | for extralink in entry.links: 780 | if ('rel' in extralink) and extralink['rel'] == u'via': 781 | content += 'Via: '+extralink['title']+'\n' 782 | 783 | mailserver = send(fromhdr, tohdr, subjecthdr, content, contenttype, when, extraheaders, mailserver, f.folder) 784 | 785 | f.seen[frameid] = id 786 | 787 | f.etag, f.modified = r.get('etag', None), r.get('modified', None) 788 | except (KeyboardInterrupt, SystemExit): 789 | raise 790 | except: 791 | print >>warn, "=== rss2email encountered a problem with this feed ===" 792 | print >>warn, "=== See the rss2email FAQ at http://www.allthingsrss.com/rss2email/ for assistance ===" 793 | print >>warn, "=== If this occurs repeatedly, send this to lindsey@allthingsrss.com ===" 794 | print >>warn, "E: could not parse", f.url 795 | traceback.print_exc(file=warn) 796 | print >>warn, "rss2email", __version__ 797 | print >>warn, "feedparser", feedparser.__version__ 798 | print >>warn, "html2text", h2t.__version__ 799 | print >>warn, "Python", sys.version 800 | print >>warn, "=== END HERE ===" 801 | continue 802 | 803 | finally: 804 | unlock(feeds, feedfileObject) 805 | if mailserver: 806 | if IMAP_MARK_AS_READ: 807 | for folder in IMAP_MARK_AS_READ: 808 | mailserver.select(folder) 809 | res, data = mailserver.search(None, '(UNSEEN UNFLAGGED)') 810 | if res == 'OK': 811 | items = data[0].split() 812 | for i in items: 813 | res, data = mailserver.fetch(i, "(UID)") 814 | if data[0]: 815 | u = uid(data[0]) 816 | res, data = mailserver.uid('STORE', u, '+FLAGS', '(\Seen)') 817 | if IMAP_MOVE_READ_TO: 818 | typ, data = mailserver.list(pattern='*') 819 | # Parse folder listing as a CSV dialect (automatically removes quotes) 820 | reader = csv.reader(StringIO.StringIO('\n'.join(data)),dialect='mailboxlist') 821 | # Iterate over each folder 822 | for row in reader: 823 | folder = row[-1:][0] 824 | if folder == IMAP_MOVE_READ_TO or '\Noselect' in row[0]: 825 | continue 826 | mailserver.select(folder) 827 | yesterday = (datetime.now() - timedelta(days=1)).strftime("%d-%b-%Y") 828 | res, data = mailserver.search(None, '(SEEN BEFORE %s UNFLAGGED)' % yesterday) 829 | if res == 'OK': 830 | items = data[0].split() 831 | for i in items: 832 | res, data = mailserver.fetch(i, "(UID)") 833 | if data[0]: 834 | u = uid(data[0]) 835 | res, data = mailserver.uid('COPY', u, IMAP_MOVE_READ_TO) 836 | if res == 'OK': 837 | res, data = mailserver.uid('STORE', u, '+FLAGS', '(\Deleted)') 838 | mailserver.expunge() 839 | try: 840 | mailserver.quit() 841 | except: 842 | mailserver.logout() 843 | 844 | def list(): 845 | feeds, feedfileObject = load(lock=0) 846 | default_to = "" 847 | default_folder = DEFAULT_IMAP_FOLDER 848 | 849 | if feeds and isstr(feeds[0]): 850 | default_to = feeds[0]; ifeeds = feeds[1:]; i=1 851 | print "default email:", default_to 852 | else: ifeeds = feeds; i = 0 853 | for f in ifeeds: 854 | active = ('[ ]', '[*]')[f.active] 855 | print `i`+':',active, f.url, '(to: '+(f.to or (default_to+' (default)'))+', ' + 'folder: '+(f.folder or (default_folder+' (default))')) 856 | if not (f.to or default_to): 857 | print " W: Please define a default address with 'r2e email emailaddress'" 858 | i+= 1 859 | 860 | def opmlexport(): 861 | import xml.sax.saxutils 862 | feeds, 
feedfileObject = load(lock=0) 863 | 864 | if feeds: 865 | print '\n\n\nrss2email OPML export\n\n' 866 | exportableFeeds = {} 867 | if USE_OPML_TITLE_AS_FOLDER: 868 | for f in feeds[1:]: 869 | if not hasattr(exportableFeeds, f.folder): 870 | exportableFeeds[f.folder] = {} 871 | exportableFeeds[f.folder].append(f) 872 | 873 | for folder in exportableFeeds: 874 | print '\n\t' % (folder, folder) 875 | for f in exportableFeeds[folder]: 876 | url = xml.sax.saxutils.escape(f.url) 877 | print '\n\t\t' % (url, url) 878 | print '\n\t' 879 | else: 880 | for f in feeds[1:]: 881 | url = xml.sax.saxutils.escape(f.url) 882 | print '' % (url, url) 883 | print '\n\n' 884 | 885 | def opmlimport(importfile): 886 | importfileObject = None 887 | print 'Importing feeds from', importfile 888 | if not os.path.exists(importfile): 889 | print 'OPML import file "%s" does not exist.' % feedfile 890 | try: 891 | importfileObject = open(importfile, 'r') 892 | except IOError, e: 893 | print "OPML import file could not be opened: %s" % e 894 | sys.exit(1) 895 | try: 896 | import xml.dom.minidom 897 | dom = xml.dom.minidom.parse(importfileObject) 898 | newfeeds = dom.getElementsByTagName('outline') 899 | except: 900 | print 'E: Unable to parse OPML file' 901 | sys.exit(1) 902 | 903 | feeds, feedfileObject = load(lock=1) 904 | 905 | import xml.sax.saxutils 906 | 907 | for f in newfeeds: 908 | if f.hasAttribute('xmlUrl'): 909 | category = f.parentNode 910 | folder = None 911 | if USE_OPML_TITLE_AS_FOLDER: 912 | if category.hasAttribute("title")!=None: 913 | folder = category.getAttribute("title") 914 | feedurl = f.getAttribute('xmlUrl') 915 | print 'Adding %s' % xml.sax.saxutils.unescape(feedurl) 916 | feeds.append(Feed(feedurl, None, folder)) 917 | 918 | unlock(feeds, feedfileObject) 919 | 920 | def delete(n): 921 | feeds, feedfileObject = load() 922 | if (n == 0) and (feeds and isstr(feeds[0])): 923 | print >>warn, "W: ID has to be equal to or higher than 1" 924 | elif n >= len(feeds): 925 | print >>warn, "W: no such feed" 926 | else: 927 | print >>warn, "W: deleting feed %s" % feeds[n].url 928 | feeds = feeds[:n] + feeds[n+1:] 929 | if n != len(feeds): 930 | print >>warn, "W: feed IDs have changed, list before deleting again" 931 | unlock(feeds, feedfileObject) 932 | 933 | def toggleactive(n, active): 934 | feeds, feedfileObject = load() 935 | if (n == 0) and (feeds and isstr(feeds[0])): 936 | print >>warn, "W: ID has to be equal to or higher than 1" 937 | elif n >= len(feeds): 938 | print >>warn, "W: no such feed" 939 | else: 940 | action = ('Pausing', 'Unpausing')[active] 941 | print >>warn, "%s feed %s" % (action, feeds[n].url) 942 | feeds[n].active = active 943 | unlock(feeds, feedfileObject) 944 | 945 | def reset(): 946 | feeds, feedfileObject = load() 947 | if feeds and isstr(feeds[0]): 948 | ifeeds = feeds[1:] 949 | else: ifeeds = feeds 950 | for f in ifeeds: 951 | if VERBOSE: print "Resetting %d already seen items" % len(f.seen) 952 | f.seen = {} 953 | f.etag = None 954 | f.modified = None 955 | 956 | unlock(feeds, feedfileObject) 957 | 958 | def email(addr): 959 | feeds, feedfileObject = load() 960 | if feeds and isstr(feeds[0]): feeds[0] = addr 961 | else: feeds = [addr] + feeds 962 | unlock(feeds, feedfileObject) 963 | 964 | if __name__ == '__main__': 965 | args = sys.argv 966 | try: 967 | if len(args) < 3: raise InputError, "insufficient args" 968 | feedfile, action, args = args[1], args[2], args[3:] 969 | 970 | if action == "run": 971 | if args and args[0] == "--no-send": 972 | def send(sender, recipient, 
subject, body, contenttype, when, extraheaders=None, mailserver=None, folder=None): 973 | if VERBOSE: print 'Not sending:', unu(subject) 974 | 975 | if args and args[-1].isdigit(): run(int(args[-1])) 976 | else: run() 977 | 978 | elif action == "email": 979 | if not args: 980 | raise InputError, "Action '%s' requires an argument" % action 981 | else: 982 | email(args[0]) 983 | 984 | elif action == "add": add(*args) 985 | 986 | elif action == "new": 987 | if len(args) == 1: d = [args[0]] 988 | else: d = [] 989 | pickle.dump(d, open(feedfile, 'w')) 990 | 991 | elif action == "list": list() 992 | 993 | elif action in ("help", "--help", "-h"): print __doc__ 994 | 995 | elif action == "delete": 996 | if not args: 997 | raise InputError, "Action '%s' requires an argument" % action 998 | elif args[0].isdigit(): 999 | delete(int(args[0])) 1000 | else: 1001 | raise InputError, "Action '%s' requires a number as its argument" % action 1002 | 1003 | elif action in ("pause", "unpause"): 1004 | if not args: 1005 | raise InputError, "Action '%s' requires an argument" % action 1006 | elif args[0].isdigit(): 1007 | active = (action == "unpause") 1008 | toggleactive(int(args[0]), active) 1009 | else: 1010 | raise InputError, "Action '%s' requires a number as its argument" % action 1011 | 1012 | elif action == "reset": reset() 1013 | 1014 | elif action == "opmlexport": opmlexport() 1015 | 1016 | elif action == "opmlimport": 1017 | if not args: 1018 | raise InputError, "OPML import '%s' requires a filename argument" % action 1019 | opmlimport(args[0]) 1020 | 1021 | else: 1022 | raise InputError, "Invalid action" 1023 | 1024 | except InputError, e: 1025 | print "E:", e 1026 | print 1027 | print __doc__ 1028 | 1029 | -------------------------------------------------------------------------------- /summarize.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """Naive text summarizer""" 3 | 4 | from collections import defaultdict 5 | import re 6 | 7 | def sentences(text): 8 | start = 0 9 | for match in re.finditer('(\s*[.!?]\s*)|(\n{2,})', text): 10 | yield text[start:match.end()].strip() 11 | start = match.end() 12 | 13 | if start < len(text): 14 | yield text[start:].strip() 15 | 16 | 17 | def frequency(text): 18 | counts = defaultdict(int) 19 | for token in text.split(): # simplest tokenizer ever 20 | counts[token] += 1 21 | return counts 22 | 23 | 24 | def score(sentence, frequencies): 25 | return sum((frequencies[token] for token in sentence.split())) 26 | 27 | 28 | def reorder(sentences, text): 29 | sentences.sort(lambda a, b: text.find(a) - text.find(b)) 30 | return sentences 31 | 32 | 33 | def summarize(text, limit=3): 34 | items = [s for s in sentences(text)] 35 | items.sort(key=lambda s: score(s, frequency(text)), reverse=1) 36 | return '\n'.join(["
    <p>%s</p>" % s for s in reorder(items[:limit], text)]) 37 | --------------------------------------------------------------------------------