├── README.md
├── deliciousmonitor.py
└── deliciousapi.py
/README.md:
--------------------------------------------------------------------------------
1 | DeliciousAPI
2 | ============
3 |
4 | Unofficial Python API for retrieving data from Delicious.com
5 |
6 | Features
7 | --------
8 |
9 | This Python module provides the following features plus some more:
10 |
11 | * Retrieving a URL's full public bookmarking history, i.e. the users who bookmarked the URL, the tags used for each bookmark, and the creation time of each bookmark (up to YYYY-MM-DD granularity)
12 | * Retrieving a URL's top tags (up to a maximum of 10) including tag counts, its title as stored on Delicious.com, and its total number of bookmarks/users on Delicious.com
13 | * Retrieving a user's full bookmark collection, including any private bookmarks if you know the corresponding password
14 | * Retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
15 | * Retrieving a user's network information (network members and network fans)
16 | * HTTP proxy support
17 |
18 | The official Delicious.com API and the JSON/RSS feeds do not provide all the functionality mentioned above, and in such cases this module will query the Delicious.com website directly and extract the required information by parsing the HTML code of the resulting Web pages (a kind of poor man's web mining). The module is able to detect IP throttling, which is employed by Delicious.com to temporarily block abusive HTTP request behavior, and will raise a custom Python error to indicate that. Please be a nice netizen and do not stress the Delicious.com service more than necessary.
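When Delicious.com does throttle your IP address, DeliciousAPI raises DeliciousThrottleError, which you can catch and handle gracefully, e.g. by backing off for a while. A minimal sketch (the URL below is only a placeholder):

    import time
    import deliciousapi

    api = deliciousapi.DeliciousAPI()
    try:
        document = api.get_url("http://www.example.com/")
    except deliciousapi.DeliciousThrottleError:
        # our IP address has been blocked temporarily; back off and retry later
        time.sleep(600)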
19 |
20 | Installation
21 | ------------
22 |
23 | You can download and install DeliciousAPI (the package contains only deliciousapi.py) from the Python Package Index (aka Python Cheese Shop) via setuptools/easy_install. Just run
24 |
25 | $ easy_install DeliciousAPI
26 |
27 | After installation, a simple `import deliciousapi` in your Python scripts will do the trick.
28 |
29 | An alternative installation method is downloading the code straight from the git repository.
30 |
31 | Updates
32 | -------
33 |
34 | If you used setuptools/easy_install for installation, you can update DeliciousAPI via
35 |
36 | $ easy_install -U DeliciousAPI
37 |
38 | Alternatively, if you downloaded the code from the git repository, simply pull the latest changes.
39 |
40 | Usage
41 | -----
42 |
43 | For now, please refer to the documentation available at [http://www.michael-noll.com/projects/delicious-python-api/](http://www.michael-noll.com/projects/delicious-python-api/).
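In the meantime, here is a small sketch of typical usage (the URL and username below are placeholders):

    import deliciousapi

    api = deliciousapi.DeliciousAPI()

    # bookmarking information about a URL
    document = api.get_url("http://www.example.com/")
    print document.title, document.total_bookmarks
    for tag, count in document.top_tags:
        print "%s (%d)" % (tag, count)

    # a user's public bookmarks and tagging vocabulary
    user = api.get_user("someuser", max_bookmarks=100)
    print user.tags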
44 |
45 | Important
46 | ---------
47 |
48 | It is strongly advised that you read the Delicious.com Terms of Use prior to using this Python module. In particular, read section 5 'Intellectual Property'.
49 |
50 | License
51 | -------
52 |
53 | The code is licensed to you under version 2 of the GNU General Public License.
54 |
55 | Copyright
56 | ---------
57 |
58 | Copyright 2006-2010 Michael G. Noll
59 |
60 |
--------------------------------------------------------------------------------
/deliciousmonitor.py:
--------------------------------------------------------------------------------
1 | """
2 | A module to monitor a delicious.com bookmark RSS feed and store it with some additional metadata to file.
3 |
4 | (c) 2006-2008 Michael G. Noll
5 |
6 | """
7 | import codecs
8 | import datetime
9 | import os
10 | import sys
11 | import time
12 |
13 | try:
14 | import deliciousapi
15 | except ImportError:
16 | print "ERROR: could not import DeliciousAPI module"
17 | print
18 | print "You can download DeliciousAPI from the Python Cheese Shop at"
19 | print "http://pypi.python.org/pypi/DeliciousAPI"
20 | print
21 |     raise
22 | try:
23 | import feedparser
24 | except ImportError:
25 | print "ERROR: could not import Universal Feed Parser module"
26 | print
27 | print "You can download Universal Feed Parser from the Python Cheese Shop at"
28 | print "http://pypi.python.org/pypi/FeedParser"
29 | print
30 | raise
31 |
32 |
33 | class DeliciousMonitor(object):
34 | """Monitors a delicious.com bookmark RSS feed, retrieves metadata for each bookmark and stores it to file.
35 |
36 | By default, the delicious.com hotlist (i.e. the front page) is monitored.
37 | Whenever the monitor discovers a new URL in a bookmark, it retrieves
38 | some metadata for it from delicious.com (currently, common tags and number
39 | of bookmarks) and stores this information to file.
40 |
41 | Note that URLs which have been processed in previous runs will NOT be
42 | processed again, i.e. delicious.com metadata will NOT be updated.
43 |
44 | """
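    # Construction sketch (the file names and interval below are arbitrary
    # example values):
    #
    #   monitor = DeliciousMonitor(filename="hotlist.xml",
    #                              log_filename="hotlist.log",
    #                              interval=60)
    #   monitor.run()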
45 |
46 | def __init__(self, rss_url="http://feeds.delicious.com/v2/rss", filename="delicious-monitor.xml", log_filename="delicious-monitor.log", interval=30, verbose=True):
47 | """
48 | Parameters:
49 | rss_url (optional, default: "http://feeds.delicious.com/v2/rss")
50 | The URL of the RSS feed to monitor.
51 |
52 | filename (optional, default: "delicious-monitor.xml")
53 | The name of the file to which metadata about the RSS feed will be stored.
54 |
55 | log_filename (optional, default: "delicious-monitor.log")
56 | The name of the log file, which is used to identify "new" entries in the RSS feed.
57 |
58 | interval (optional, default: 30)
59 | Time between monitor runs in minutes.
60 |
61 | verbose (optional, default: True)
62 | Whether to print non-critical processing information to STDOUT or not.
63 |
64 | """
65 | self.rss_url = rss_url
66 | self._delicious = deliciousapi.DeliciousAPI()
67 | self.filename = filename
68 | self.log_filename = log_filename
69 | self.interval = interval
70 | self.verbose = verbose
71 | self.urls = []
72 | # ensure that the name of the output file and log file is not None etc.
73 | assert self.filename
74 | assert self.log_filename
75 |
76 | def run(self):
77 | """Start the monitor."""
78 | while True:
79 | time_before_run = datetime.datetime.now()
80 |
81 | # do the actual monitoring work
82 | if self.verbose:
83 | print "[MONITOR] Starting monitor run - %s" % time_before_run.strftime("%Y-%m-%d @ %H:%M:%S")
84 | self.monitor()
85 | time_after_run = datetime.datetime.now()
86 |
87 | # calculate the number of seconds to wait until the next run
88 | interval = datetime.timedelta(seconds=60*self.interval)
89 | next_run_time = time_before_run + interval
90 | elapsed = time_after_run - time_before_run
91 | if interval >= elapsed:
92 | wait_seconds = (interval - elapsed).seconds
93 | else:
94 | # the run took longer than our interval time between runs;
95 | # in this case, we continue immediately but will still wait
96 | # three seconds in order not to stress delicious.com too much
97 | wait_seconds = 3
98 | next_run_time = datetime.datetime.now() + datetime.timedelta(seconds=wait_seconds)
99 |
100 | # sleep until the next run
101 | if self.verbose:
102 | print "[MONITOR] Next monitor run on %s (sleeping for %s seconds)" % (next_run_time.strftime("%Y-%m-%d @ %H:%M:%S"), wait_seconds)
103 | time.sleep(wait_seconds)
104 |
105 | def monitor(self):
106 | """Monitors an RSS feed."""
107 |
108 | # download and parse RSS feed
109 | f = feedparser.parse(self.rss_url)
110 |
111 | output_file = codecs.open(self.filename, "a", "utf8")
112 | log_file = None
113 |
114 | if os.access(self.log_filename, os.F_OK):
115 | if self.verbose:
116 | print "[MONITOR] Log file found. Trying to resume...",
117 | try:
118 | # read in previous log data for resuming
119 | log_file = open(self.log_filename, 'r')
120 | # remove leading and trailing whitespace if any (incl. newlines)
121 | self.urls = [line.strip() for line in log_file.readlines()]
122 | log_file.close()
123 | if self.verbose:
124 | print "done"
125 | except IOError:
126 | # most probably, the log file does not exist (yet)
127 | if self.verbose:
128 | print "failed"
129 | else:
130 | # log file does not exist, so there isn't any resume data
131 | # to read in
132 | pass
133 |
134 | try:
135 | # now open it for writing (i.e., appending) and logging
136 | if self.verbose:
137 | print "[MONITOR] Open log file for appending...",
138 | log_file = open(self.log_filename, 'a')
139 | if self.verbose:
140 | print "done"
141 | except IOError:
142 | if self.verbose:
143 | print "failed"
144 | print "[MONITOR] ERROR: could not open log file for appending"
145 |             output_file.close()
146 | return
147 |
148 | # get only new entries
149 |         # keep only entries whose URLs have not been processed in a previous run
150 |         new_entries = [entry for entry in f.entries
151 |                        if entry.link not in self.urls]
152 |
153 | if self.verbose:
154 | print "[MONITOR] Found %s new entries" % len(new_entries)
155 |
156 | # query metadata about each entry from delicious.com
157 | for index, entry in enumerate(new_entries):
158 | url = entry.link
159 |
160 | if self.verbose:
161 | print "[MONITOR] Processing entry #%s: '%s'" % (index + 1, url),
162 | try:
163 | time.sleep(1) # be nice and wait 1 sec between connects to delicious.com
164 | document = self._delicious.get_url(url)
165 | except (deliciousapi.DeliciousError,), error_string:
166 | if self.verbose:
167 | print "failed"
168 | print "[MONITOR] ERROR: %s" % error_string
169 | # clean up
170 | output_file.close()
171 | log_file.close()
172 | return
173 |
174 | if self.verbose:
175 | print "done"
176 |
177 | # update log file
178 | log_file.write("%s\n" % url)
179 |             # update output file (one XML record per URL, one child element per top tag)
180 |             output_file.write('<url href="%s" total_bookmarks="%d" top_tags="%d">\n' % (url, document.total_bookmarks, len(document.top_tags)))
181 |             for tag, count in document.top_tags:
182 |                 output_file.write('    <tag name="%s" count="%d" />\n' % (tag, count))
183 |             output_file.write('</url>\n')
184 | output_file.flush()
185 |
186 | # clean up
187 | output_file.close()
188 | log_file.close()
189 |
190 |
191 | if __name__ == "__main__":
192 | monitor = DeliciousMonitor(interval=30)
193 | monitor.run()
194 |
--------------------------------------------------------------------------------
/deliciousapi.py:
--------------------------------------------------------------------------------
1 | """
2 | Unofficial Python API for retrieving data from Delicious.com.
3 |
4 | This module provides the following features plus some more:
5 |
6 | * retrieving a URL's full public bookmarking history including
7 | * users who bookmarked the URL including tags used for such bookmarks
8 | and the creation time of the bookmark (up to YYYY-MM-DD granularity)
9 | * top tags (up to a maximum of 10) including tag count
10 | * title as stored on Delicious.com
11 | * total number of bookmarks/users for this URL at Delicious.com
12 | * retrieving a user's full bookmark collection, including any private bookmarks
13 | if you know the corresponding password
14 | * retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
15 | * retrieving a user's network information (network members and network fans)
16 | * HTTP proxy support
17 | * updated to support Delicious.com "version 2" (mini-relaunch as of August 2008)
18 |
19 | The official Delicious.com API and the JSON/RSS feeds do not provide all
20 | the functionality mentioned above, and in such cases this module will query
21 | the Delicious.com *website* directly and extract the required information
22 | by parsing the HTML code of the resulting Web pages (a kind of poor man's
23 | web mining). The module is able to detect IP throttling, which is employed
24 | by Delicious.com to temporarily block abusive HTTP request behavior, and
25 | will raise a custom Python error to indicate that. Please be a nice netizen
26 | and do not stress the Delicious.com service more than necessary.
27 |
28 | It is strongly advised that you read the Delicious.com Terms of Use
29 | before using this Python module. In particular, read section 5
30 | 'Intellectual Property'.
31 |
32 | The code is licensed to you under version 2 of the GNU General Public
33 | License.
34 |
35 | More information about this module can be found at
36 | http://www.michael-noll.com/wiki/Del.icio.us_Python_API
37 |
38 | Changelog is available at
39 | http://code.michael-noll.com/?p=deliciousapi;a=log
40 |
41 | Copyright 2006-2010 Michael G. Noll
42 |
43 | """
44 |
45 | __author__ = "Michael G. Noll"
46 | __copyright__ = "(c) 2006-2010 Michael G. Noll"
47 | __description__ = "Unofficial Python API for retrieving data from Delicious.com"
48 | __email__ = "coding[AT]michael-REMOVEME-noll[DOT]com"
49 | __license__ = "GPLv2"
50 | __maintainer__ = "Michael G. Noll"
51 | __status__ = "Development"
52 | __url__ = "http://www.michael-noll.com/"
53 | __version__ = "1.6.7"
54 |
55 | import cgi
56 | import datetime
57 | import hashlib
58 | from operator import itemgetter
59 | import re
60 | import socket
61 | import time
62 | import urllib2
63 |
64 | try:
65 | from BeautifulSoup import BeautifulSoup
66 | except ImportError:
67 | print "ERROR: could not import BeautifulSoup Python module"
68 | print
69 | print "You can download BeautifulSoup from the Python Cheese Shop at"
70 | print "http://cheeseshop.python.org/pypi/BeautifulSoup/"
71 | print "or directly from http://www.crummy.com/software/BeautifulSoup/"
72 | print
73 | raise
74 |
75 | try:
76 | import simplejson
77 | except ImportError:
78 | print "ERROR: could not import simplejson module"
79 | print
80 | print "Since version 1.5.0, DeliciousAPI requires the simplejson module."
81 | print "You can download simplejson from the Python Cheese Shop at"
82 | print "http://pypi.python.org/pypi/simplejson"
83 | print
84 | raise
85 |
86 |
87 | class DeliciousUser(object):
88 | """This class wraps all available information about a user into one object.
89 |
90 | Variables:
91 | bookmarks:
92 | A list of (url, tags, title, comment, timestamp) tuples representing
93 | a user's bookmark collection.
94 |
95 | url is a 'unicode'
96 | tags is a 'list' of 'unicode' ([] if no tags)
97 | title is a 'unicode'
98 | comment is a 'unicode' (u"" if no comment)
99 | timestamp is a 'datetime.datetime'
100 |
101 | tags (read-only property):
102 | A list of (tag, tag_count) tuples, aggregated over all a user's
103 | retrieved bookmarks. The tags represent a user's tagging vocabulary.
104 |
105 | username:
106 | The Delicious.com account name of the user.
107 |
108 | """
109 |
110 | def __init__(self, username, bookmarks=None):
111 | assert username
112 | self.username = username
113 | self.bookmarks = bookmarks or []
114 |
115 | def __str__(self):
116 | total_tag_count = 0
117 | total_tags = set()
118 | for url, tags, title, comment, timestamp in self.bookmarks:
119 | if tags:
120 | total_tag_count += len(tags)
121 | for tag in tags:
122 | total_tags.add(tag)
123 | return "[%s] %d bookmarks, %d tags (%d unique)" % \
124 | (self.username, len(self.bookmarks), total_tag_count, len(total_tags))
125 |
126 | def __repr__(self):
127 | return self.username
128 |
129 | def get_tags(self):
130 | """Returns a dictionary mapping tags to their tag count.
131 |
132 | For example, if the tag count of tag 'foo' is 23, then
133 | 23 bookmarks were annotated with 'foo'. A different way
134 | to put it is that 23 users used the tag 'foo' when
135 | bookmarking the URL.
136 |
137 | """
138 | total_tags = {}
139 | for url, tags, title, comment, timestamp in self.bookmarks:
140 | for tag in tags:
141 | total_tags[tag] = total_tags.get(tag, 0) + 1
142 | return total_tags
143 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
144 |
145 |
146 | class DeliciousURL(object):
147 | """This class wraps all available information about a web document into one object.
148 |
149 | Variables:
150 | bookmarks:
151 | A list of (user, tags, comment, timestamp) tuples, representing a
152 | document's bookmark history. Generally, this variable is populated
153 | via get_url(), so the number of bookmarks available in this variable
154 | depends on the parameters of get_url(). See get_url() for more
155 | information.
156 |
157 | user is a 'unicode'
158 | tags is a 'list' of 'unicode's ([] if no tags)
159 | comment is a 'unicode' (u"" if no comment)
160 | timestamp is a 'datetime.datetime' (granularity: creation *day*,
161 | i.e. the day but not the time of day)
162 |
163 | tags (read-only property):
164 | A list of (tag, tag_count) tuples, aggregated over all a document's
165 | retrieved bookmarks.
166 |
167 | top_tags:
168 | A list of (tag, tag_count) tuples, representing a document's so-called
169 | "top tags", i.e. the up to 10 most popular tags for this document.
170 |
171 | url:
172 | The URL of the document.
173 |
174 | hash (read-only property):
175 | The MD5 hash of the URL.
176 |
177 | title:
178 | The document's title.
179 |
180 | total_bookmarks:
181 | The number of total bookmarks (posts) of the document.
182 | Note that the value of total_bookmarks can be greater than the
183 | length of "bookmarks" depending on how much (detailed) bookmark
184 | data could be retrieved from Delicious.com.
185 |
186 | Here's some more background information:
187 | The value of total_bookmarks is the "real" number of bookmarks of
188 | URL "url" stored at Delicious.com as reported by Delicious.com
189 | itself (so it's the "ground truth"). On the other hand, the length
190 | of "bookmarks" depends on iteratively scraped bookmarking data.
191 |             Since scraping Delicious.com's Web pages has its limits in practice,
192 |             this means that DeliciousAPI will most likely not be able to retrieve
193 |             all available bookmarks. In such a case, the value reported by
194 |             total_bookmarks is greater than the length of "bookmarks".
195 |
196 | """
197 |
198 | def __init__(self, url, top_tags=None, bookmarks=None, title=u"", total_bookmarks=0):
199 | assert url
200 | self.url = url
201 | self.top_tags = top_tags or []
202 | self.bookmarks = bookmarks or []
203 | self.title = title
204 | self.total_bookmarks = total_bookmarks
205 |
206 | def __str__(self):
207 | total_tag_count = 0
208 | total_tags = set()
209 | for user, tags, comment, timestamp in self.bookmarks:
210 | if tags:
211 | total_tag_count += len(tags)
212 | for tag in tags:
213 | total_tags.add(tag)
214 | return "[%s] %d total bookmarks (= users), %d tags (%d unique), %d out of 10 max 'top' tags" % \
215 | (self.url, self.total_bookmarks, total_tag_count, \
216 | len(total_tags), len(self.top_tags))
217 |
218 | def __repr__(self):
219 | return self.url
220 |
221 | def get_tags(self):
222 | """Returns a dictionary mapping tags to their tag count.
223 |
224 | For example, if the tag count of tag 'foo' is 23, then
225 | 23 bookmarks were annotated with 'foo'. A different way
226 | to put it is that 23 users used the tag 'foo' when
227 | bookmarking the URL.
228 |
229 | @return: Dictionary mapping tags to their tag count.
230 |
231 | """
232 | total_tags = {}
233 | for user, tags, comment, timestamp in self.bookmarks:
234 | for tag in tags:
235 | total_tags[tag] = total_tags.get(tag, 0) + 1
236 | return total_tags
237 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
238 |
239 | def get_hash(self):
240 | m = hashlib.md5()
241 | m.update(self.url)
242 | return m.hexdigest()
243 | hash = property(fget=get_hash, doc="Returns the MD5 hash of the URL of this document")
244 |
245 |
246 | class DeliciousAPI(object):
247 | """
248 | This class provides a custom, unofficial API to the Delicious.com service.
249 |
250 | Instead of using just the functionality provided by the official
251 | Delicious.com API (which has limited features), this class retrieves
252 | information from the Delicious.com website directly and extracts data from
253 | the Web pages.
254 |
255 | Note that Delicious.com will block clients with too many queries in a
256 | certain time frame (similar to their API throttling). So be a nice citizen
257 | and don't stress their website.
258 |
259 | """
260 |
261 | def __init__(self,
262 | http_proxy="",
263 | tries=3,
264 | wait_seconds=3,
265 | user_agent="DeliciousAPI/%s (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)" % __version__,
266 | timeout=30,
267 | ):
268 | """Set up the API module.
269 |
270 | @param http_proxy: Optional, default: "".
271 | Use an HTTP proxy for HTTP connections. Proxy support for
272 | HTTPS is not available yet.
273 | Format: "hostname:port" (e.g., "localhost:8080")
274 | @type http_proxy: str
275 |
276 | @param tries: Optional, default: 3.
277 | Try the specified number of times when downloading a monitored
278 | document fails. tries must be >= 1. See also wait_seconds.
279 | @type tries: int
280 |
281 | @param wait_seconds: Optional, default: 3.
282 | Wait the specified number of seconds before re-trying to
283 | download a monitored document. wait_seconds must be >= 0.
284 | See also tries.
285 | @type wait_seconds: int
286 |
287 |         @param user_agent: Optional, default: "DeliciousAPI/<version>
288 |             (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)".
289 |             The User-Agent HTTP header to use when querying Delicious.com.
290 | @type user_agent: str
291 |
292 | @param timeout: Optional, default: 30.
293 | Set network timeout. timeout must be >= 0.
294 | @type timeout: int
295 |
296 | """
297 | assert tries >= 1
298 | assert wait_seconds >= 0
299 | assert timeout >= 0
300 | self.http_proxy = http_proxy
301 | self.tries = tries
302 | self.wait_seconds = wait_seconds
303 | self.user_agent = user_agent
304 | self.timeout = timeout
305 | socket.setdefaulttimeout(self.timeout)
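    # Construction sketch (the proxy host/port below are placeholders; the
    # format follows the http_proxy parameter documented above):
    #
    #   api = DeliciousAPI(http_proxy="localhost:8080", tries=5, wait_seconds=5)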
306 |
307 |
308 | def _query(self, path, host="delicious.com", user=None, password=None, use_ssl=False):
309 | """Queries Delicious.com for information, specified by (query) path.
310 |
311 | @param path: The HTTP query path.
312 | @type path: str
313 |
314 | @param host: The host to query, default: "delicious.com".
315 | @type host: str
316 |
317 | @param user: The Delicious.com username if any, default: None.
318 | @type user: str
319 |
320 | @param password: The Delicious.com password of user, default: None.
321 | @type password: unicode/str
322 |
323 | @param use_ssl: Whether to use SSL encryption or not, default: False.
324 | @type use_ssl: bool
325 |
326 | @return: None on errors (i.e. on all HTTP status other than 200).
327 | On success, returns the content of the HTML response.
328 |
329 | """
330 | opener = None
331 | handlers = []
332 |
333 | # add HTTP Basic authentication if available
334 | if user and password:
335 | pwd_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
336 | pwd_mgr.add_password(None, host, user, password)
337 | basic_auth_handler = urllib2.HTTPBasicAuthHandler(pwd_mgr)
338 | handlers.append(basic_auth_handler)
339 |
340 | # add proxy support if requested
341 | if self.http_proxy:
342 | proxy_handler = urllib2.ProxyHandler({'http': 'http://%s' % self.http_proxy})
343 | handlers.append(proxy_handler)
344 |
345 | if handlers:
346 | opener = urllib2.build_opener(*handlers)
347 | else:
348 | opener = urllib2.build_opener()
349 | opener.addheaders = [('User-agent', self.user_agent)]
350 |
351 | data = None
352 | tries = self.tries
353 |
354 | if use_ssl:
355 | protocol = "https"
356 | else:
357 | protocol = "http"
358 | url = "%s://%s%s" % (protocol, host, path)
359 |
360 | while tries > 0:
361 | try:
362 | f = opener.open(url)
363 | data = f.read()
364 | f.close()
365 | break
366 | except urllib2.HTTPError, e:
367 | if e.code == 301:
368 | raise DeliciousMovedPermanentlyWarning, "Delicious.com status %s - url moved permanently" % e.code
369 |                     elif e.code == 302:
370 | raise DeliciousMovedTemporarilyWarning, "Delicious.com status %s - url moved temporarily" % e.code
371 | elif e.code == 401:
372 | raise DeliciousUnauthorizedError, "Delicious.com error %s - unauthorized (authentication failed?)" % e.code
373 | elif e.code == 403:
374 | raise DeliciousForbiddenError, "Delicious.com error %s - forbidden" % e.code
375 | elif e.code == 404:
376 | raise DeliciousNotFoundError, "Delicious.com error %s - url not found" % e.code
377 | elif e.code == 500:
378 | raise Delicious500Error, "Delicious.com error %s - server problem" % e.code
379 | elif e.code == 503 or e.code == 999:
380 | raise DeliciousThrottleError, "Delicious.com error %s - unable to process request (your IP address has been throttled/blocked)" % e.code
381 | else:
382 | raise DeliciousUnknownError, "Delicious.com error %s - unknown error" % e.code
383 | break
384 | except urllib2.URLError, e:
385 | time.sleep(self.wait_seconds)
386 | except socket.error, msg:
387 | # sometimes we get a "Connection Refused" error
388 | # wait a bit and then try again
389 | time.sleep(self.wait_seconds)
390 | #finally:
391 | # f.close()
392 | tries -= 1
393 | return data
394 |
395 |
396 | def get_url(self, url, max_bookmarks=50, sleep_seconds=1):
397 | """
398 | Returns a DeliciousURL instance representing the Delicious.com history of url.
399 |
400 | Generally, this method is what you want for getting title, bookmark, tag,
401 | and user information about a URL.
402 |
403 | Delicious only returns up to 50 bookmarks per URL. This means that
404 | we have to do subsequent queries plus parsing if we want to retrieve
405 | more than 50. Roughly speaking, the processing time of get_url()
406 | increases linearly with the number of 50-bookmarks-chunks; i.e.
407 | it will take 10 times longer to retrieve 500 bookmarks than 50.
408 |
409 | @param url: The URL of the web document to be queried for.
410 | @type url: str
411 |
412 | @param max_bookmarks: Optional, default: 50.
413 | See the documentation of get_bookmarks() for more information
414 | as get_url() uses get_bookmarks() to retrieve a url's
415 | bookmarking history.
416 | @type max_bookmarks: int
417 |
418 | @param sleep_seconds: Optional, default: 1.
419 | See the documentation of get_bookmarks() for more information
420 | as get_url() uses get_bookmarks() to retrieve a url's
421 | bookmarking history. sleep_seconds must be >= 1 to comply with
422 | Delicious.com's Terms of Use.
423 | @type sleep_seconds: int
424 |
425 | @return: DeliciousURL instance representing the Delicious.com history
426 | of url.
427 |
428 | """
429 | # we must wait at least 1 second between subsequent queries to
430 | # comply with Delicious.com's Terms of Use
431 | assert sleep_seconds >= 1
432 |
433 | document = DeliciousURL(url)
434 |
435 | m = hashlib.md5()
436 | m.update(url)
437 | hash = m.hexdigest()
438 |
439 | path = "/v2/json/urlinfo/%s" % hash
440 | data = self._query(path, host="feeds.delicious.com")
441 | if data:
442 | urlinfo = {}
443 | try:
444 | urlinfo = simplejson.loads(data)
445 | if urlinfo:
446 | urlinfo = urlinfo[0]
447 | else:
448 | urlinfo = {}
449 | except TypeError:
450 | pass
451 | try:
452 | document.title = urlinfo['title'] or u""
453 | except KeyError:
454 | pass
455 | try:
456 | top_tags = urlinfo['top_tags'] or {}
457 | if top_tags:
458 | document.top_tags = sorted(top_tags.iteritems(), key=itemgetter(1), reverse=True)
459 | else:
460 | document.top_tags = []
461 | except KeyError:
462 | pass
463 | try:
464 | document.total_bookmarks = int(urlinfo['total_posts'])
465 | except (KeyError, ValueError):
466 | pass
467 | document.bookmarks = self.get_bookmarks(url=url, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
468 |
469 |
470 | return document
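    # Usage sketch for get_url() (the URL below is only a placeholder):
    #
    #   api = DeliciousAPI()
    #   document = api.get_url("http://www.example.com/", max_bookmarks=50)
    #   print document.title, document.total_bookmarks
    #   for tag, count in document.top_tags:
    #       print "%s (%d)" % (tag, count)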
471 |
472 | def get_network(self, username):
473 | """
474 | Returns the user's list of followees and followers.
475 |
476 | Followees are users in his Delicious "network", i.e. those users whose
477 | bookmark streams he's subscribed to. Followers are his Delicious.com
478 | "fans", i.e. those users who have subscribed to the given user's
479 |         bookmark stream.
480 |
481 | Example:
482 |
483 |         A -------->   --------> C
484 |         D --------> B --------> E
485 |         F -------->   --------> F
486 |
487 |           followers      followees
488 |             of B           of B
489 |
490 | Arrows from user A to user B denote that A has subscribed to B's
491 | bookmark stream, i.e. A is "following" or "tracking" B.
492 |
493 | Note that user F is both a followee and a follower of B, i.e. F tracks
494 | B and vice versa. In Delicious.com terms, F is called a "mutual fan"
495 | of B.
496 |
497 | Comparing this network concept to information retrieval, one could say
498 | that followers are incoming links and followees outgoing links of B.
499 |
500 |         @param username: Delicious.com username for which network information is
501 | retrieved.
502 | @type username: unicode/str
503 |
504 |         @return: Tuple of two lists (followees, followers), where each list
505 | contains tuples of (username, tracking_since_timestamp).
506 | If a network is set as private, i.e. hidden from public view,
507 | (None, None) is returned.
508 | If a network is public but empty, ([], []) is returned.
509 |
510 | """
511 | assert username
512 | followees = followers = None
513 |
514 | # followees (network members)
515 | path = "/v2/json/networkmembers/%s" % username
516 | data = None
517 | try:
518 | data = self._query(path, host="feeds.delicious.com")
519 | except DeliciousForbiddenError:
520 | pass
521 | if data:
522 | followees = []
523 |
524 | users = []
525 | try:
526 | users = simplejson.loads(data)
527 | except TypeError:
528 | pass
529 |
530 | uname = tracking_since = None
531 |
532 | for user in users:
533 | # followee's username
534 | try:
535 | uname = user['user']
536 | except KeyError:
537 | pass
538 | # try to convert uname to Unicode
539 | if uname:
540 | try:
541 | # we assume UTF-8 encoding
542 | uname = uname.decode('utf-8')
543 | except UnicodeDecodeError:
544 | pass
545 | # time when the given user started tracking this user
546 | try:
547 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
548 | except KeyError:
549 | pass
550 | if uname:
551 | followees.append( (uname, tracking_since) )
552 |
553 | # followers (network fans)
554 | path = "/v2/json/networkfans/%s" % username
555 | data = None
556 | try:
557 | data = self._query(path, host="feeds.delicious.com")
558 | except DeliciousForbiddenError:
559 | pass
560 | if data:
561 | followers = []
562 |
563 | users = []
564 | try:
565 | users = simplejson.loads(data)
566 | except TypeError:
567 | pass
568 |
569 | uname = tracking_since = None
570 |
571 | for user in users:
572 | # fan's username
573 | try:
574 | uname = user['user']
575 | except KeyError:
576 | pass
577 | # try to convert uname to Unicode
578 | if uname:
579 | try:
580 | # we assume UTF-8 encoding
581 | uname = uname.decode('utf-8')
582 | except UnicodeDecodeError:
583 | pass
584 | # time when fan started tracking the given user
585 | try:
586 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
587 | except KeyError:
588 | pass
589 | if uname:
590 | followers.append( (uname, tracking_since) )
591 | return ( followees, followers )
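    # Usage sketch for get_network() ("someuser" is a placeholder username):
    #
    #   followees, followers = DeliciousAPI().get_network("someuser")
    #   if followees is not None:
    #       for uname, tracking_since in followees:
    #           print uname, tracking_since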
592 |
593 | def get_bookmarks(self, url=None, username=None, max_bookmarks=50, sleep_seconds=1):
594 | """
595 | Returns the bookmarks of url or user, respectively.
596 |
597 | Delicious.com only returns up to 50 bookmarks per URL on its website.
598 | This means that we have to do subsequent queries plus parsing if
599 | we want to retrieve more than 50. Roughly speaking, the processing
600 | time of get_bookmarks() increases linearly with the number of
601 | 50-bookmarks-chunks; i.e. it will take 10 times longer to retrieve
602 | 500 bookmarks than 50.
603 |
604 | @param url: The URL of the web document to be queried for.
605 | Cannot be used together with 'username'.
606 | @type url: str
607 |
608 | @param username: The Delicious.com username to be queried for.
609 | Cannot be used together with 'url'.
610 | @type username: str
611 |
612 | @param max_bookmarks: Optional, default: 50.
613 | Maximum number of bookmarks to retrieve. Set to 0 to disable
614 | this limitation/the maximum and retrieve all available
615 | bookmarks of the given url.
616 |
617 | Bookmarks are sorted so that newer bookmarks are first.
618 | Setting max_bookmarks to 50 means that get_bookmarks() will retrieve
619 | the 50 most recent bookmarks of the given url.
620 |
621 | In the case of getting bookmarks of a URL (url is set),
622 | get_bookmarks() will take *considerably* longer to run
623 | for pages with lots of bookmarks when setting max_bookmarks
624 | to a high number or when you completely disable the limit.
625 | Delicious returns only up to 50 bookmarks per result page,
626 | so for example retrieving 250 bookmarks requires 5 HTTP
627 | connections and parsing 5 HTML pages plus wait time between
628 | queries (to comply with delicious' Terms of Use; see
629 | also parameter 'sleep_seconds').
630 |
631 | In the case of getting bookmarks of a user (username is set),
632 | the same restrictions as for a URL apply with the exception
633 | that we can retrieve up to 100 bookmarks per HTTP query
634 | (instead of only up to 50 per HTTP query for a URL).
635 | @type max_bookmarks: int
636 |
637 | @param sleep_seconds: Optional, default: 1.
638 | Wait the specified number of seconds between subsequent
639 | queries in case that there are multiple pages of bookmarks
640 | for the given url. sleep_seconds must be >= 1 to comply with
641 | Delicious.com's Terms of Use.
642 | See also parameter 'max_bookmarks'.
643 | @type sleep_seconds: int
644 |
645 | @return: Returns the bookmarks of url or user, respectively.
646 | For urls, it returns a list of (user, tags, comment, timestamp)
647 | tuples.
648 | For users, it returns a list of (url, tags, title, comment,
649 | timestamp) tuples.
650 |
651 | Bookmarks are sorted "descendingly" by creation time, i.e. newer
652 | bookmarks come first.
653 |
654 | """
655 | # we must wait at least 1 second between subsequent queries to
656 | # comply with delicious' Terms of Use
657 | assert sleep_seconds >= 1
658 |
659 | # url XOR username
660 | assert bool(username) is not bool(url)
661 |
662 | # maximum number of urls/posts Delicious.com will display
663 | # per page on its website
664 | max_html_count = 100
665 | # maximum number of pages that Delicious.com will display;
666 | # currently, the maximum number of pages is 20. Delicious.com
667 |             # allows you to go beyond page 20 via pagination, but page N (for
668 | # N > 20) will always display the same content as page 20.
669 | max_html_pages = 20
670 |
671 | path = None
672 | if url:
673 | m = hashlib.md5()
674 | m.update(url)
675 | hash = m.hexdigest()
676 |
677 |             # path will change later on if there are multiple pages of bookmarks
678 | # for the given url
679 | path = "/url/%s" % hash
680 | elif username:
681 |             # path will change later on if there are multiple pages of bookmarks
682 | # for the given username
683 | path = "/%s?setcount=%d" % (username, max_html_count)
684 | else:
685 | raise Exception('You must specify either url or user.')
686 |
687 | page_index = 1
688 | bookmarks = []
689 | while path and page_index <= max_html_pages:
690 | data = self._query(path)
691 | path = None
692 | if data:
693 | # extract bookmarks from current page
694 | if url:
695 | bookmarks.extend(self._extract_bookmarks_from_url_history(data))
696 | else:
697 | bookmarks.extend(self._extract_bookmarks_from_user_history(data))
698 |
699 | # stop scraping if we already have as many bookmarks as we want
700 | if (len(bookmarks) >= max_bookmarks) and max_bookmarks != 0:
701 | break
702 | else:
703 | # check if there are multiple pages of bookmarks for this
704 | # url on Delicious.com
705 | soup = BeautifulSoup(data)
706 | paginations = soup.findAll("div", id="pagination")
707 | if paginations:
708 | # find next path
709 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" })
710 | if nexts and (max_bookmarks == 0 or len(bookmarks) < max_bookmarks) and len(bookmarks) > 0:
711 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2
712 | path = nexts[0]['href']
713 | if username:
714 | path += "&setcount=%d" % max_html_count
715 | page_index += 1
716 | # wait one second between queries to be compliant with
717 | # delicious' Terms of Use
718 | time.sleep(sleep_seconds)
719 | if max_bookmarks > 0:
720 | return bookmarks[:max_bookmarks]
721 | else:
722 | return bookmarks
723 |
724 |
725 | def _extract_bookmarks_from_url_history(self, data):
726 | """
727 | Extracts user bookmarks from a URL's history page on Delicious.com.
728 |
729 | The Python library BeautifulSoup is used to parse the HTML page.
730 |
731 | @param data: The HTML source of a URL history Web page on Delicious.com.
732 | @type data: str
733 |
734 | @return: list of user bookmarks of the corresponding URL
735 |
736 | """
737 | bookmarks = []
738 | soup = BeautifulSoup(data)
739 |
740 | bookmark_elements = soup.findAll("div", attrs={"class": re.compile("^bookmark\s*")})
741 | timestamp = None
742 | for bookmark_element in bookmark_elements:
743 |
744 | # extract bookmark creation time
745 | #
746 | # this timestamp has to "persist" until a new timestamp is
747 | # found (delicious only provides the creation time data for the
748 | # first bookmark in the list of bookmarks for a given day
749 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
750 | if dategroups:
751 | spans = dategroups[0].findAll('span')
752 | if spans:
753 | date_str = spans[0].contents[0].strip()
754 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
755 |
756 | # extract comments
757 | comment = u""
758 | datas = bookmark_element.findAll("div", attrs={"class": "data"})
759 | if datas:
760 | divs = datas[0].findAll("div", attrs={"class": "description"})
761 | if divs:
762 | comment = divs[0].contents[0].strip()
763 |
764 | # extract tags
765 | user_tags = []
766 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
767 | if tagdisplays:
768 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
769 | for a in aset:
770 | tag = a.contents[0]
771 | user_tags.append(tag)
772 |
773 |             user = u""  # extract user information; may be missing (see workaround below)
774 | metas = bookmark_element.findAll("div", attrs={"class": "meta"})
775 | if metas:
776 | links = metas[0].findAll("a", attrs={"class": "user user-tag"})
777 | if links:
778 | try:
779 | user = links[0]['href'][1:]
780 | except IndexError:
781 | # WORKAROUND: it seems there is a bug on Delicious.com where
782 | # sometimes a bookmark is shown in a URL history without any
783 | # associated Delicious username (username is empty); this could
784 | # be caused by special characters in the username or other things
785 | #
786 | # this problem of Delicious is very rare, so we just skip such
787 | # entries until they find a fix
788 | pass
789 |             if user:
790 |                 bookmarks.append( (user, user_tags, comment, timestamp) )
790 |
791 | return bookmarks
792 |
793 | def _extract_bookmarks_from_user_history(self, data):
794 | """
795 | Extracts a user's bookmarks from his user page on Delicious.com.
796 |
797 | The Python library BeautifulSoup is used to parse the HTML page.
798 |
799 | @param data: The HTML source of a user page on Delicious.com.
800 | @type data: str
801 |
802 | @return: list of bookmarks of the corresponding user
803 |
804 | """
805 | bookmarks = []
806 | soup = BeautifulSoup(data)
807 |
808 | ul = soup.find("ul", id="bookmarklist")
809 | if ul:
810 | bookmark_elements = ul.findAll("div", attrs={"class": re.compile("^bookmark\s*")})
811 | timestamp = None
812 | for bookmark_element in bookmark_elements:
813 |
814 | # extract bookmark creation time
815 | #
816 | # this timestamp has to "persist" until a new timestamp is
817 | # found (delicious only provides the creation time data for the
818 | # first bookmark in the list of bookmarks for a given day
819 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
820 | if dategroups:
821 | spans = dategroups[0].findAll('span')
822 | if spans:
823 | date_str = spans[0].contents[0].strip()
824 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
825 |
826 | # extract url, title and comments
827 | url = u""
828 | title = u""
829 | comment = u""
830 | datas = bookmark_element.findAll("div", attrs={"class": "data"})
831 | if datas:
832 | links = datas[0].findAll("a", attrs={"class": re.compile("^taggedlink\s*")})
833 | if links and links[0].contents:
834 | title = links[0].contents[0].strip()
835 | url = links[0]['href']
836 | divs = datas[0].findAll("div", attrs={"class": "description"})
837 | if divs:
838 | comment = divs[0].contents[0].strip()
839 |
840 | # extract tags
841 | url_tags = []
842 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
843 | if tagdisplays:
844 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
845 | for a in aset:
846 | tag = a.contents[0]
847 | url_tags.append(tag)
848 |
849 | bookmarks.append( (url, url_tags, title, comment, timestamp) )
850 |
851 | return bookmarks
852 |
853 |
854 | def get_user(self, username, password=None, max_bookmarks=50, sleep_seconds=1):
855 | """Retrieves a user's bookmarks from Delicious.com.
856 |
857 | If a correct username AND password are supplied, a user's *full*
858 | bookmark collection (which also includes private bookmarks) is
859 | retrieved. Data communication is encrypted using SSL in this case.
860 |
861 | If no password is supplied, only the *public* bookmarks of the user
862 | are retrieved. Here, the parameter 'max_bookmarks' specifies how
863 | many public bookmarks will be retrieved (default: 50). Set the
864 | parameter to 0 to retrieve all public bookmarks.
865 |
866 | This function can be used to backup all of a user's bookmarks if
867 | called with a username and password.
868 |
869 | @param username: The Delicious.com username.
870 | @type username: str
871 |
872 | @param password: Optional, default: None.
873 | The user's Delicious.com password. If password is set,
874 | all communication with Delicious.com is SSL-encrypted.
875 | @type password: unicode/str
876 |
877 | @param max_bookmarks: Optional, default: 50.
878 | See the documentation of get_bookmarks() for more
879 |             information as get_user() uses get_bookmarks() to
880 |             retrieve a user's bookmarking history.
881 | The parameter is NOT used when a password is specified
882 | because in this case the *full* bookmark collection of
883 | a user will be retrieved.
884 | @type max_bookmarks: int
885 |
886 | @param sleep_seconds: Optional, default: 1.
887 | See the documentation of get_bookmarks() for more information as
888 |             get_user() uses get_bookmarks() to retrieve a user's bookmarking
889 |             history. sleep_seconds must be >= 1 to comply with Delicious.com's
890 | Terms of Use.
891 | @type sleep_seconds: int
892 |
893 | @return: DeliciousUser instance
894 |
895 | """
896 | assert username
897 | user = DeliciousUser(username)
898 | bookmarks = []
899 | if password:
900 | # We have username AND password, so we call
901 | # the official Delicious.com API.
902 | path = "/v1/posts/all"
903 | data = self._query(path, host="api.del.icio.us", use_ssl=True, user=username, password=password)
904 | if data:
905 | soup = BeautifulSoup(data)
906 | elements = soup.findAll("post")
907 | for element in elements:
908 | url = element["href"]
909 | title = element["description"] or u""
910 | comment = element["extended"] or u""
911 | tags = []
912 | if element["tag"]:
913 | tags = element["tag"].split()
914 | timestamp = datetime.datetime.strptime(element["time"], "%Y-%m-%dT%H:%M:%SZ")
915 | bookmarks.append( (url, tags, title, comment, timestamp) )
916 | user.bookmarks = bookmarks
917 | else:
918 | # We have only the username, so we extract data from
919 | # the user's JSON feed. However, the feed is restricted
920 |             # to the user's most recent public bookmarks (about 100 at most).
921 |             # So if we need more than 100, we start scraping the
922 |             # Delicious.com website directly.
923 | if max_bookmarks > 0 and max_bookmarks <= 100:
924 | path = "/v2/json/%s?count=100" % username
925 | data = self._query(path, host="feeds.delicious.com", user=username)
926 | if data:
927 | posts = []
928 | try:
929 | posts = simplejson.loads(data)
930 | except TypeError:
931 | pass
932 |
933 | url = timestamp = None
934 | title = comment = u""
935 | tags = []
936 |
937 | for post in posts:
938 | # url
939 | try:
940 | url = post['u']
941 | except KeyError:
942 | pass
943 | # title
944 | try:
945 | title = post['d']
946 | except KeyError:
947 | pass
948 | # tags
949 | try:
950 | tags = post['t']
951 | except KeyError:
952 | pass
953 | if not tags:
954 | tags = [u"system:unfiled"]
955 | # comment / notes
956 | try:
957 | comment = post['n']
958 | except KeyError:
959 | pass
960 | # bookmark creation time
961 | try:
962 | timestamp = datetime.datetime.strptime(post['dt'], "%Y-%m-%dT%H:%M:%SZ")
963 | except KeyError:
964 | pass
965 | bookmarks.append( (url, tags, title, comment, timestamp) )
966 | user.bookmarks = bookmarks[:max_bookmarks]
967 | else:
968 | # TODO: retrieve the first 100 bookmarks via JSON before
969 |                 # falling back to scraping the delicious.com website
970 | user.bookmarks = self.get_bookmarks(username=username, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
971 | return user
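    # Usage sketch for get_user() (username and password are placeholders;
    # with a password the full collection is fetched via the official API
    # over SSL, without one only public bookmarks are returned):
    #
    #   api = DeliciousAPI()
    #   public_user = api.get_user("someuser", max_bookmarks=100)
    #   full_user = api.get_user("someuser", password="secret")
    #   for url, tags, title, comment, timestamp in full_user.bookmarks:
    #       print url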
972 |
973 | def get_urls(self, tag=None, popular=True, max_urls=100, sleep_seconds=1):
974 | """
975 | Returns the list of recent URLs (of web documents) tagged with a given tag.
976 |
977 | This is very similar to parsing Delicious' RSS/JSON feeds directly,
978 | but this function will return up to 2,000 links compared to a maximum
979 | of 100 links when using the official feeds (with query parameter
980 | count=100).
981 |
982 | The return list of links will be sorted by recency in descending order,
983 | i.e. newest items first.
984 |
985 | Note that even when setting max_urls, get_urls() cannot guarantee that
986 | it can retrieve *at least* this many URLs. It is really just an upper
987 | bound.
988 |
989 | @param tag: Retrieve links which have been tagged with the given tag.
990 | If tag is not set (default), links will be retrieved from the
991 | Delicious.com front page (aka "delicious hotlist").
992 | @type tag: unicode/str
993 |
994 |         @param popular: If true (default), retrieve only popular links (i.e.
995 |             /popular/<tag>). Otherwise, the most recent links tagged with
996 |             the given tag will be retrieved (i.e. /tag/<tag>).
997 |
998 | As of January 2009, it seems that Delicious.com modified the list
999 | of popular tags to contain only up to a maximum of 15 URLs.
1000 | This also means that setting max_urls to values larger than 15
1001 | will not change the results of get_urls().
1002 | So if you are interested in more URLs, set the "popular" parameter
1003 | to false.
1004 |
1005 | Note that if you set popular to False, the returned list of URLs
1006 | might contain duplicate items. This is due to the way Delicious.com
1007 |             creates its /tag/<tag> Web pages. So if you need a certain
1008 | number of unique URLs, you have to take care of that in your
1009 | own code.
1010 | @type popular: bool
1011 |
1012 | @param max_urls: Retrieve at most max_urls links. The default is 100,
1013 | which is the maximum number of links that can be retrieved by
1014 | parsing the official JSON feeds. The maximum value of max_urls
1015 | in practice is 2000 (currently). If it is set higher, Delicious
1016 | will return the same links over and over again, giving lots of
1017 | duplicate items.
1018 | @type max_urls: int
1019 |
1020 | @param sleep_seconds: Optional, default: 1.
1021 | Wait the specified number of seconds between subsequent queries in
1022 | case that there are multiple pages of bookmarks for the given url.
1023 | Must be greater than or equal to 1 to comply with Delicious.com's
1024 | Terms of Use.
1025 | See also parameter 'max_urls'.
1026 | @type sleep_seconds: int
1027 |
1028 | @return: The list of recent URLs (of web documents) tagged with a given tag.
1029 |
1030 | """
1031 | assert sleep_seconds >= 1
1032 | urls = []
1033 | path = None
1034 | if tag is None or (tag is not None and max_urls > 0 and max_urls <= 100):
1035 | # use official JSON feeds
1036 | max_json_count = 100
1037 | if tag:
1038 | # tag-specific JSON feed
1039 | if popular:
1040 | path = "/v2/json/popular/%s?count=%d" % (tag, max_json_count)
1041 | else:
1042 | path = "/v2/json/tag/%s?count=%d" % (tag, max_json_count)
1043 | else:
1044 | # Delicious.com hotlist
1045 | path = "/v2/json/?count=%d" % (max_json_count)
1046 | data = self._query(path, host="feeds.delicious.com")
1047 | if data:
1048 | posts = []
1049 | try:
1050 | posts = simplejson.loads(data)
1051 | except TypeError:
1052 | pass
1053 |
1054 | for post in posts:
1055 | # url
1056 | try:
1057 | url = post['u']
1058 | if url:
1059 | urls.append(url)
1060 | except KeyError:
1061 | pass
1062 | else:
1063 | # maximum number of urls/posts Delicious.com will display
1064 | # per page on its website
1065 | max_html_count = 100
1066 | # maximum number of pages that Delicious.com will display;
1067 | # currently, the maximum number of pages is 20. Delicious.com
1068 |             # allows you to go beyond page 20 via pagination, but page N (for
1069 | # N > 20) will always display the same content as page 20.
1070 | max_html_pages = 20
1071 |
1072 | if popular:
1073 | path = "/popular/%s?setcount=%d" % (tag, max_html_count)
1074 | else:
1075 | path = "/tag/%s?setcount=%d" % (tag, max_html_count)
1076 |
1077 | page_index = 1
1078 | urls = []
1079 | while path and page_index <= max_html_pages:
1080 | data = self._query(path)
1081 | path = None
1082 | if data:
1083 | # extract urls from current page
1084 | soup = BeautifulSoup(data)
1085 | links = soup.findAll("a", attrs={"class": re.compile("^taggedlink\s*")})
1086 | for link in links:
1087 | try:
1088 | url = link['href']
1089 | if url:
1090 | urls.append(url)
1091 | except KeyError:
1092 | pass
1093 |
1094 |                     # check if there are multiple pages of urls
1095 | soup = BeautifulSoup(data)
1096 | paginations = soup.findAll("div", id="pagination")
1097 | if paginations:
1098 | # find next path
1099 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" })
1100 | if nexts and (max_urls == 0 or len(urls) < max_urls) and len(urls) > 0:
1101 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2
1102 | path = nexts[0]['href']
1103 | path += "&setcount=%d" % max_html_count
1104 | page_index += 1
1105 | # wait between queries to Delicious.com to be
1106 | # compliant with its Terms of Use
1107 | time.sleep(sleep_seconds)
1108 | if max_urls > 0:
1109 | return urls[:max_urls]
1110 | else:
1111 | return urls
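    # Usage sketch for get_urls() (the tag "python" is just an example):
    #
    #   api = DeliciousAPI()
    #   recent = api.get_urls(tag="python", popular=False, max_urls=200)
    #   hotlist = api.get_urls()   # front page a.k.a. "hotlist"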
1112 |
1113 |
1114 | def get_tags_of_user(self, username):
1115 | """
1116 | Retrieves user's public tags and their tag counts from Delicious.com.
1117 | The tags represent a user's full public tagging vocabulary.
1118 |
1119 | DeliciousAPI uses the official JSON feed of the user. We could use
1120 | RSS here, but the JSON feed has proven to be faster in practice.
1121 |
1122 | @param username: The Delicious.com username.
1123 | @type username: str
1124 |
1125 | @return: Dictionary mapping tags to their tag counts.
1126 |
1127 | """
1128 | tags = {}
1129 | path = "/v2/json/tags/%s" % username
1130 | data = self._query(path, host="feeds.delicious.com")
1131 | if data:
1132 | try:
1133 | tags = simplejson.loads(data)
1134 | except TypeError:
1135 | pass
1136 | return tags
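    # Usage sketch for get_tags_of_user() ("someuser" is a placeholder):
    #
    #   tags = DeliciousAPI().get_tags_of_user("someuser")
    #   for tag, count in sorted(tags.iteritems(), key=lambda item: item[1], reverse=True):
    #       print "%s: %d" % (tag, count)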
1137 |
1138 | def get_number_of_users(self, url):
1139 | """get_number_of_users() is obsolete and has been removed. Please use get_url() instead."""
1140 | reason = "get_number_of_users() is obsolete and has been removed. Please use get_url() instead."
1141 | raise Exception(reason)
1142 |
1143 | def get_common_tags_of_url(self, url):
1144 | """get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead."""
1145 | reason = "get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead."
1146 | raise Exception(reason)
1147 |
1148 | def _html_escape(self, s):
1149 | """HTML-escape a string or object.
1150 |
1151 | This converts any non-string objects passed into it to strings
1152 | (actually, using unicode()). All values returned are
1153 |         non-unicode strings (using "&#num;" entities for all non-ASCII
1154 | characters).
1155 |
1156 | None is treated specially, and returns the empty string.
1157 |
1158 | @param s: The string that needs to be escaped.
1159 | @type s: str
1160 |
1161 | @return: The escaped string.
1162 |
1163 | """
1164 | if s is None:
1165 | return ''
1166 | if not isinstance(s, basestring):
1167 | if hasattr(s, '__unicode__'):
1168 | s = unicode(s)
1169 | else:
1170 | s = str(s)
1171 | s = cgi.escape(s, True)
1172 | if isinstance(s, unicode):
1173 | s = s.encode('ascii', 'xmlcharrefreplace')
1174 | return s
1175 |
1176 |
1177 | class DeliciousError(Exception):
1178 | """Used to indicate that an error occurred when trying to access Delicious.com via its API."""
1179 |
1180 | class DeliciousWarning(Exception):
1181 | """Used to indicate a warning when trying to access Delicious.com via its API.
1182 |
1183 | Warnings are raised when it is useful to alert the user of some condition
1184 | where that condition doesn't warrant raising an exception and terminating
1185 |     the program. For example, we issue a warning when Delicious.com returns an
1186 | HTTP status code for redirections (3xx).
1187 | """
1188 |
1189 | class DeliciousThrottleError(DeliciousError):
1190 | """Used to indicate that the client computer (i.e. its IP address) has been temporarily blocked by Delicious.com."""
1191 | pass
1192 |
1193 | class DeliciousUnknownError(DeliciousError):
1194 | """Used to indicate that Delicious.com returned an (HTTP) error which we don't know how to handle yet."""
1195 | pass
1196 |
1197 | class DeliciousUnauthorizedError(DeliciousError):
1198 | """Used to indicate that Delicious.com returned a 401 Unauthorized error.
1199 |
1200 | Most of the time, the user credentials for accessing restricted functions
1201 | of the official Delicious.com API are incorrect.
1202 |
1203 | """
1204 | pass
1205 |
1206 | class DeliciousForbiddenError(DeliciousError):
1207 | """Used to indicate that Delicious.com returned a 403 Forbidden error.
1208 | """
1209 | pass
1210 |
1211 |
1212 | class DeliciousNotFoundError(DeliciousError):
1213 | """Used to indicate that Delicious.com returned a 404 Not Found error.
1214 |
1215 | Most of the time, retrying some seconds later fixes the problem
1216 | (because we only query existing pages with this API).
1217 |
1218 | """
1219 | pass
1220 |
1221 | class Delicious500Error(DeliciousError):
1222 | """Used to indicate that Delicious.com returned a 500 error.
1223 |
1224 | Most of the time, retrying some seconds later fixes the problem.
1225 |
1226 | """
1227 | pass
1228 |
1229 | class DeliciousMovedPermanentlyWarning(DeliciousWarning):
1230 | """Used to indicate that Delicious.com returned a 301 Found (Moved Permanently) redirection."""
1231 | pass
1232 |
1233 | class DeliciousMovedTemporarilyWarning(DeliciousWarning):
1234 | """Used to indicate that Delicious.com returned a 302 Found (Moved Temporarily) redirection."""
1235 | pass
1236 |
1237 | __all__ = ['DeliciousAPI', 'DeliciousUser', 'DeliciousURL', 'DeliciousError', 'DeliciousThrottleError', 'DeliciousUnauthorizedError', 'DeliciousUnknownError', 'DeliciousNotFoundError', 'DeliciousForbiddenError', 'Delicious500Error', 'DeliciousWarning', 'DeliciousMovedPermanentlyWarning', 'DeliciousMovedTemporarilyWarning']
1238 |
1239 | if __name__ == "__main__":
1240 | d = DeliciousAPI()
1241 | max_bookmarks = 50
1242 | url = 'http://www.michael-noll.com/wiki/Del.icio.us_Python_API'
1243 | print "Retrieving Delicious.com information about url"
1244 | print "'%s'" % url
1245 | print "Note: This might take some time..."
1246 | print "========================================================="
1247 | document = d.get_url(url, max_bookmarks=max_bookmarks)
1248 | print document
1249 |
--------------------------------------------------------------------------------