├── README.md
├── deliciousmonitor.py
└── deliciousapi.py

/README.md:
--------------------------------------------------------------------------------
1 | DeliciousAPI
2 | ============
3 | 
4 | Unofficial Python API for retrieving data from Delicious.com
5 | 
6 | Features
7 | --------
8 | 
9 | This Python module provides the following features, among others:
10 | 
11 | * Retrieving a URL's full public bookmarking history, including:
12 |     * users who bookmarked the URL, including tags used for such bookmarks and the creation time of the bookmark (up to YYYY-MM-DD granularity)
13 |     * top tags (up to a maximum of 10) including tag count
14 |     * title as stored on Delicious.com
15 |     * total number of bookmarks/users for this URL at Delicious.com
16 | * Retrieving a user's full bookmark collection, including any private bookmarks if you know the corresponding password
17 | * Retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
18 | * Retrieving a user's network information (network members and network fans)
19 | * HTTP proxy support
20 | 
21 | The official Delicious.com API and the JSON/RSS feeds do not provide all the functionality mentioned above. In such cases, this module queries the Delicious.com website directly and extracts the required information by parsing the HTML code of the resulting Web pages (a kind of poor man's web mining). The module is able to detect IP throttling, which is employed by Delicious.com to temporarily block abusive HTTP request behavior, and will raise a custom Python error to indicate that. Please be a nice netizen and do not stress the Delicious.com service more than necessary.
22 | 
23 | Installation
24 | ------------
25 | 
26 | You can download and install DeliciousAPI from the Python Package Index (aka the Python Cheese Shop) via setuptools/easy_install; the package includes only deliciousapi.py. Just run
27 | 
28 |     $ easy_install DeliciousAPI
29 | 
30 | After installation, a simple `import deliciousapi` in your Python scripts will do the trick.
31 | 
32 | An alternative installation method is downloading the code straight from the git repository.
33 | 
34 | Updates
35 | -------
36 | 
37 | If you used setuptools/easy_install for installation, you can update DeliciousAPI via
38 | 
39 |     $ easy_install -U DeliciousAPI
40 | 
41 | Alternatively, if you downloaded the code from the git repository, simply pull the latest changes.
42 | 
43 | Usage
44 | -----
45 | 
46 | For now, please refer to the documentation available at [http://www.michael-noll.com/projects/delicious-python-api/](http://www.michael-noll.com/projects/delicious-python-api/).
47 | 
48 | Important
49 | ---------
50 | 
51 | It is strongly advised that you read the Delicious.com Terms of Use prior to using this Python module. In particular, read section 5 'Intellectual Property'.
52 | 
53 | License
54 | -------
55 | 
56 | The code is licensed to you under version 2 of the GNU General Public License.
57 | 
58 | Copyright
59 | ---------
60 | 
61 | Copyright 2006-2010 Michael G. Noll
62 | 
--------------------------------------------------------------------------------
/deliciousmonitor.py:
--------------------------------------------------------------------------------
1 | """
2 | A module to monitor a delicious.com bookmark RSS feed and store it, along with some additional metadata, to file.
3 | 
4 | (c) 2006-2008 Michael G.
Noll 5 | 6 | """ 7 | import codecs 8 | import datetime 9 | import os 10 | import sys 11 | import time 12 | 13 | try: 14 | import deliciousapi 15 | except: 16 | print "ERROR: could not import DeliciousAPI module" 17 | print 18 | print "You can download DeliciousAPI from the Python Cheese Shop at" 19 | print "http://pypi.python.org/pypi/DeliciousAPI" 20 | print 21 | 22 | try: 23 | import feedparser 24 | except: 25 | print "ERROR: could not import Universal Feed Parser module" 26 | print 27 | print "You can download Universal Feed Parser from the Python Cheese Shop at" 28 | print "http://pypi.python.org/pypi/FeedParser" 29 | print 30 | raise 31 | 32 | 33 | class DeliciousMonitor(object): 34 | """Monitors a delicious.com bookmark RSS feed, retrieves metadata for each bookmark and stores it to file. 35 | 36 | By default, the delicious.com hotlist (i.e. the front page) is monitored. 37 | Whenever the monitor discovers a new URL in a bookmark, it retrieves 38 | some metadata for it from delicious.com (currently, common tags and number 39 | of bookmarks) and stores this information to file. 40 | 41 | Note that URLs which have been processed in previous runs will NOT be 42 | processed again, i.e. delicious.com metadata will NOT be updated. 43 | 44 | """ 45 | 46 | def __init__(self, rss_url="http://feeds.delicious.com/v2/rss", filename="delicious-monitor.xml", log_filename="delicious-monitor.log", interval=30, verbose=True): 47 | """ 48 | Parameters: 49 | rss_url (optional, default: "http://feeds.delicious.com/v2/rss") 50 | The URL of the RSS feed to monitor. 51 | 52 | filename (optional, default: "delicious-monitor.xml") 53 | The name of the file to which metadata about the RSS feed will be stored. 54 | 55 | log_filename (optional, default: "delicious-monitor.log") 56 | The name of the log file, which is used to identify "new" entries in the RSS feed. 57 | 58 | interval (optional, default: 30) 59 | Time between monitor runs in minutes. 60 | 61 | verbose (optional, default: True) 62 | Whether to print non-critical processing information to STDOUT or not. 63 | 64 | """ 65 | self.rss_url = rss_url 66 | self._delicious = deliciousapi.DeliciousAPI() 67 | self.filename = filename 68 | self.log_filename = log_filename 69 | self.interval = interval 70 | self.verbose = verbose 71 | self.urls = [] 72 | # ensure that the name of the output file and log file is not None etc. 
73 | assert self.filename 74 | assert self.log_filename 75 | 76 | def run(self): 77 | """Start the monitor.""" 78 | while True: 79 | time_before_run = datetime.datetime.now() 80 | 81 | # do the actual monitoring work 82 | if self.verbose: 83 | print "[MONITOR] Starting monitor run - %s" % time_before_run.strftime("%Y-%m-%d @ %H:%M:%S") 84 | self.monitor() 85 | time_after_run = datetime.datetime.now() 86 | 87 | # calculate the number of seconds to wait until the next run 88 | interval = datetime.timedelta(seconds=60*self.interval) 89 | next_run_time = time_before_run + interval 90 | elapsed = time_after_run - time_before_run 91 | if interval >= elapsed: 92 | wait_seconds = (interval - elapsed).seconds 93 | else: 94 | # the run took longer than our interval time between runs; 95 | # in this case, we continue immediately but will still wait 96 | # three seconds in order not to stress delicious.com too much 97 | wait_seconds = 3 98 | next_run_time = datetime.datetime.now() + datetime.timedelta(seconds=wait_seconds) 99 | 100 | # sleep until the next run 101 | if self.verbose: 102 | print "[MONITOR] Next monitor run on %s (sleeping for %s seconds)" % (next_run_time.strftime("%Y-%m-%d @ %H:%M:%S"), wait_seconds) 103 | time.sleep(wait_seconds) 104 | 105 | def monitor(self): 106 | """Monitors an RSS feed.""" 107 | 108 | # download and parse RSS feed 109 | f = feedparser.parse(self.rss_url) 110 | 111 | output_file = codecs.open(self.filename, "a", "utf8") 112 | log_file = None 113 | 114 | if os.access(self.log_filename, os.F_OK): 115 | if self.verbose: 116 | print "[MONITOR] Log file found. Trying to resume...", 117 | try: 118 | # read in previous log data for resuming 119 | log_file = open(self.log_filename, 'r') 120 | # remove leading and trailing whitespace if any (incl. 
newlines)
121 |                 self.urls = [line.strip() for line in log_file.readlines()]
122 |                 log_file.close()
123 |                 if self.verbose:
124 |                     print "done"
125 |             except IOError:
126 |                 # most probably, the log file does not exist (yet)
127 |                 if self.verbose:
128 |                     print "failed"
129 |         else:
130 |             # log file does not exist, so there isn't any resume data
131 |             # to read in
132 |             pass
133 | 
134 |         try:
135 |             # now open it for writing (i.e., appending) and logging
136 |             if self.verbose:
137 |                 print "[MONITOR] Open log file for appending...",
138 |             log_file = open(self.log_filename, 'a')
139 |             if self.verbose:
140 |                 print "done"
141 |         except IOError:
142 |             if self.verbose:
143 |                 print "failed"
144 |             print "[MONITOR] ERROR: could not open log file for appending"
145 |             output_file.close()
146 |             return
147 | 
148 |         # get only new entries, i.e. entries whose URLs have not been
149 |         # processed in a previous run
150 |         new_entries = [entry for entry in f.entries
151 |                        if entry.link not in self.urls]
152 | 
153 |         if self.verbose:
154 |             print "[MONITOR] Found %s new entries" % len(new_entries)
155 | 
156 |         # query metadata about each entry from delicious.com
157 |         for index, entry in enumerate(new_entries):
158 |             url = entry.link
159 | 
160 |             if self.verbose:
161 |                 print "[MONITOR] Processing entry #%s: '%s'" % (index + 1, url),
162 |             try:
163 |                 time.sleep(1) # be nice and wait 1 sec between connects to delicious.com
164 |                 document = self._delicious.get_url(url)
165 |             except (deliciousapi.DeliciousError,), error_string:
166 |                 if self.verbose:
167 |                     print "failed"
168 |                 print "[MONITOR] ERROR: %s" % error_string
169 |                 # clean up
170 |                 output_file.close()
171 |                 log_file.close()
172 |                 return
173 | 
174 |             if self.verbose:
175 |                 print "done"
176 | 
177 |             # update log file
178 |             log_file.write("%s\n" % url)
179 |             # update output file (the XML element/attribute names below are illustrative; adjust them to your preferred schema)
180 |             output_file.write('<url href="%s" total_bookmarks="%s" top_tag_count="%s">\n' % (url, document.total_bookmarks, len(document.top_tags)))
181 |             for tag, count in document.top_tags:
182 |                 output_file.write('    <tag name="%s" count="%s" />\n' % (tag, count))
183 |             output_file.write('</url>\n')
184 |             output_file.flush()
185 | 
186 |         # clean up
187 |         output_file.close()
188 |         log_file.close()
189 | 
190 | 
191 | if __name__ == "__main__":
192 |     monitor = DeliciousMonitor(interval=30)
193 |     monitor.run()
194 | 
--------------------------------------------------------------------------------
/deliciousapi.py:
--------------------------------------------------------------------------------
1 | """
2 | Unofficial Python API for retrieving data from Delicious.com.
3 | 
4 | This module provides the following features plus some more:
5 | 
6 | * retrieving a URL's full public bookmarking history including
7 |     * users who bookmarked the URL including tags used for such bookmarks
8 |       and the creation time of the bookmark (up to YYYY-MM-DD granularity)
9 |     * top tags (up to a maximum of 10) including tag count
10 |     * title as stored on Delicious.com
11 |     * total number of bookmarks/users for this URL at Delicious.com
12 | * retrieving a user's full bookmark collection, including any private bookmarks
13 |   if you know the corresponding password
14 | * retrieving a user's full public tagging vocabulary, i.e.
tags and tag counts 15 | * retrieving a user's network information (network members and network fans) 16 | * HTTP proxy support 17 | * updated to support Delicious.com "version 2" (mini-relaunch as of August 2008) 18 | 19 | The official Delicious.com API and the JSON/RSS feeds do not provide all 20 | the functionality mentioned above, and in such cases this module will query 21 | the Delicious.com *website* directly and extract the required information 22 | by parsing the HTML code of the resulting Web pages (a kind of poor man's 23 | web mining). The module is able to detect IP throttling, which is employed 24 | by Delicious.com to temporarily block abusive HTTP request behavior, and 25 | will raise a custom Python error to indicate that. Please be a nice netizen 26 | and do not stress the Delicious.com service more than necessary. 27 | 28 | It is strongly advised that you read the Delicious.com Terms of Use 29 | before using this Python module. In particular, read section 5 30 | 'Intellectual Property'. 31 | 32 | The code is licensed to you under version 2 of the GNU General Public 33 | License. 34 | 35 | More information about this module can be found at 36 | http://www.michael-noll.com/wiki/Del.icio.us_Python_API 37 | 38 | Changelog is available at 39 | http://code.michael-noll.com/?p=deliciousapi;a=log 40 | 41 | Copyright 2006-2010 Michael G. Noll 42 | 43 | """ 44 | 45 | __author__ = "Michael G. Noll" 46 | __copyright__ = "(c) 2006-2010 Michael G. Noll" 47 | __description__ = "Unofficial Python API for retrieving data from Delicious.com" 48 | __email__ = "coding[AT]michael-REMOVEME-noll[DOT]com" 49 | __license__ = "GPLv2" 50 | __maintainer__ = "Michael G. Noll" 51 | __status__ = "Development" 52 | __url__ = "http://www.michael-noll.com/" 53 | __version__ = "1.6.7" 54 | 55 | import cgi 56 | import datetime 57 | import hashlib 58 | from operator import itemgetter 59 | import re 60 | import socket 61 | import time 62 | import urllib2 63 | 64 | try: 65 | from BeautifulSoup import BeautifulSoup 66 | except: 67 | print "ERROR: could not import BeautifulSoup Python module" 68 | print 69 | print "You can download BeautifulSoup from the Python Cheese Shop at" 70 | print "http://cheeseshop.python.org/pypi/BeautifulSoup/" 71 | print "or directly from http://www.crummy.com/software/BeautifulSoup/" 72 | print 73 | raise 74 | 75 | try: 76 | import simplejson 77 | except: 78 | print "ERROR: could not import simplejson module" 79 | print 80 | print "Since version 1.5.0, DeliciousAPI requires the simplejson module." 81 | print "You can download simplejson from the Python Cheese Shop at" 82 | print "http://pypi.python.org/pypi/simplejson" 83 | print 84 | raise 85 | 86 | 87 | class DeliciousUser(object): 88 | """This class wraps all available information about a user into one object. 89 | 90 | Variables: 91 | bookmarks: 92 | A list of (url, tags, title, comment, timestamp) tuples representing 93 | a user's bookmark collection. 94 | 95 | url is a 'unicode' 96 | tags is a 'list' of 'unicode' ([] if no tags) 97 | title is a 'unicode' 98 | comment is a 'unicode' (u"" if no comment) 99 | timestamp is a 'datetime.datetime' 100 | 101 | tags (read-only property): 102 | A list of (tag, tag_count) tuples, aggregated over all a user's 103 | retrieved bookmarks. The tags represent a user's tagging vocabulary. 104 | 105 | username: 106 | The Delicious.com account name of the user. 
107 | 108 | """ 109 | 110 | def __init__(self, username, bookmarks=None): 111 | assert username 112 | self.username = username 113 | self.bookmarks = bookmarks or [] 114 | 115 | def __str__(self): 116 | total_tag_count = 0 117 | total_tags = set() 118 | for url, tags, title, comment, timestamp in self.bookmarks: 119 | if tags: 120 | total_tag_count += len(tags) 121 | for tag in tags: 122 | total_tags.add(tag) 123 | return "[%s] %d bookmarks, %d tags (%d unique)" % \ 124 | (self.username, len(self.bookmarks), total_tag_count, len(total_tags)) 125 | 126 | def __repr__(self): 127 | return self.username 128 | 129 | def get_tags(self): 130 | """Returns a dictionary mapping tags to their tag count. 131 | 132 | For example, if the tag count of tag 'foo' is 23, then 133 | 23 bookmarks were annotated with 'foo'. A different way 134 | to put it is that 23 users used the tag 'foo' when 135 | bookmarking the URL. 136 | 137 | """ 138 | total_tags = {} 139 | for url, tags, title, comment, timestamp in self.bookmarks: 140 | for tag in tags: 141 | total_tags[tag] = total_tags.get(tag, 0) + 1 142 | return total_tags 143 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count") 144 | 145 | 146 | class DeliciousURL(object): 147 | """This class wraps all available information about a web document into one object. 148 | 149 | Variables: 150 | bookmarks: 151 | A list of (user, tags, comment, timestamp) tuples, representing a 152 | document's bookmark history. Generally, this variable is populated 153 | via get_url(), so the number of bookmarks available in this variable 154 | depends on the parameters of get_url(). See get_url() for more 155 | information. 156 | 157 | user is a 'unicode' 158 | tags is a 'list' of 'unicode's ([] if no tags) 159 | comment is a 'unicode' (u"" if no comment) 160 | timestamp is a 'datetime.datetime' (granularity: creation *day*, 161 | i.e. the day but not the time of day) 162 | 163 | tags (read-only property): 164 | A list of (tag, tag_count) tuples, aggregated over all a document's 165 | retrieved bookmarks. 166 | 167 | top_tags: 168 | A list of (tag, tag_count) tuples, representing a document's so-called 169 | "top tags", i.e. the up to 10 most popular tags for this document. 170 | 171 | url: 172 | The URL of the document. 173 | 174 | hash (read-only property): 175 | The MD5 hash of the URL. 176 | 177 | title: 178 | The document's title. 179 | 180 | total_bookmarks: 181 | The number of total bookmarks (posts) of the document. 182 | Note that the value of total_bookmarks can be greater than the 183 | length of "bookmarks" depending on how much (detailed) bookmark 184 | data could be retrieved from Delicious.com. 185 | 186 | Here's some more background information: 187 | The value of total_bookmarks is the "real" number of bookmarks of 188 | URL "url" stored at Delicious.com as reported by Delicious.com 189 | itself (so it's the "ground truth"). On the other hand, the length 190 | of "bookmarks" depends on iteratively scraped bookmarking data. 191 | Since scraping Delicous.com's Web pages has its limits in practice, 192 | this means that DeliciousAPI could most likely not retrieve all 193 | available bookmarks. In such a case, the value reported by 194 | total_bookmarks is greater than the length of "bookmarks". 
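
        For illustration (the numbers are made up): after calling
        get_url(url, max_bookmarks=50) for a very popular URL, you might
        observe

            len(document.bookmarks)    ->  50     (capped by max_bookmarks)
            document.total_bookmarks   ->  11842  (as reported by Delicious.com)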
195 | 196 | """ 197 | 198 | def __init__(self, url, top_tags=None, bookmarks=None, title=u"", total_bookmarks=0): 199 | assert url 200 | self.url = url 201 | self.top_tags = top_tags or [] 202 | self.bookmarks = bookmarks or [] 203 | self.title = title 204 | self.total_bookmarks = total_bookmarks 205 | 206 | def __str__(self): 207 | total_tag_count = 0 208 | total_tags = set() 209 | for user, tags, comment, timestamp in self.bookmarks: 210 | if tags: 211 | total_tag_count += len(tags) 212 | for tag in tags: 213 | total_tags.add(tag) 214 | return "[%s] %d total bookmarks (= users), %d tags (%d unique), %d out of 10 max 'top' tags" % \ 215 | (self.url, self.total_bookmarks, total_tag_count, \ 216 | len(total_tags), len(self.top_tags)) 217 | 218 | def __repr__(self): 219 | return self.url 220 | 221 | def get_tags(self): 222 | """Returns a dictionary mapping tags to their tag count. 223 | 224 | For example, if the tag count of tag 'foo' is 23, then 225 | 23 bookmarks were annotated with 'foo'. A different way 226 | to put it is that 23 users used the tag 'foo' when 227 | bookmarking the URL. 228 | 229 | @return: Dictionary mapping tags to their tag count. 230 | 231 | """ 232 | total_tags = {} 233 | for user, tags, comment, timestamp in self.bookmarks: 234 | for tag in tags: 235 | total_tags[tag] = total_tags.get(tag, 0) + 1 236 | return total_tags 237 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count") 238 | 239 | def get_hash(self): 240 | m = hashlib.md5() 241 | m.update(self.url) 242 | return m.hexdigest() 243 | hash = property(fget=get_hash, doc="Returns the MD5 hash of the URL of this document") 244 | 245 | 246 | class DeliciousAPI(object): 247 | """ 248 | This class provides a custom, unofficial API to the Delicious.com service. 249 | 250 | Instead of using just the functionality provided by the official 251 | Delicious.com API (which has limited features), this class retrieves 252 | information from the Delicious.com website directly and extracts data from 253 | the Web pages. 254 | 255 | Note that Delicious.com will block clients with too many queries in a 256 | certain time frame (similar to their API throttling). So be a nice citizen 257 | and don't stress their website. 258 | 259 | """ 260 | 261 | def __init__(self, 262 | http_proxy="", 263 | tries=3, 264 | wait_seconds=3, 265 | user_agent="DeliciousAPI/%s (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)" % __version__, 266 | timeout=30, 267 | ): 268 | """Set up the API module. 269 | 270 | @param http_proxy: Optional, default: "". 271 | Use an HTTP proxy for HTTP connections. Proxy support for 272 | HTTPS is not available yet. 273 | Format: "hostname:port" (e.g., "localhost:8080") 274 | @type http_proxy: str 275 | 276 | @param tries: Optional, default: 3. 277 | Try the specified number of times when downloading a monitored 278 | document fails. tries must be >= 1. See also wait_seconds. 279 | @type tries: int 280 | 281 | @param wait_seconds: Optional, default: 3. 282 | Wait the specified number of seconds before re-trying to 283 | download a monitored document. wait_seconds must be >= 0. 284 | See also tries. 285 | @type wait_seconds: int 286 | 287 | @param user_agent: Optional, default: "DeliciousAPI/ 288 | (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)". 289 | The User-Agent HTTP Header to use when querying Delicous.com. 290 | @type user_agent: str 291 | 292 | @param timeout: Optional, default: 30. 293 | Set network timeout. timeout must be >= 0. 
294 | @type timeout: int 295 | 296 | """ 297 | assert tries >= 1 298 | assert wait_seconds >= 0 299 | assert timeout >= 0 300 | self.http_proxy = http_proxy 301 | self.tries = tries 302 | self.wait_seconds = wait_seconds 303 | self.user_agent = user_agent 304 | self.timeout = timeout 305 | socket.setdefaulttimeout(self.timeout) 306 | 307 | 308 | def _query(self, path, host="delicious.com", user=None, password=None, use_ssl=False): 309 | """Queries Delicious.com for information, specified by (query) path. 310 | 311 | @param path: The HTTP query path. 312 | @type path: str 313 | 314 | @param host: The host to query, default: "delicious.com". 315 | @type host: str 316 | 317 | @param user: The Delicious.com username if any, default: None. 318 | @type user: str 319 | 320 | @param password: The Delicious.com password of user, default: None. 321 | @type password: unicode/str 322 | 323 | @param use_ssl: Whether to use SSL encryption or not, default: False. 324 | @type use_ssl: bool 325 | 326 | @return: None on errors (i.e. on all HTTP status other than 200). 327 | On success, returns the content of the HTML response. 328 | 329 | """ 330 | opener = None 331 | handlers = [] 332 | 333 | # add HTTP Basic authentication if available 334 | if user and password: 335 | pwd_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm() 336 | pwd_mgr.add_password(None, host, user, password) 337 | basic_auth_handler = urllib2.HTTPBasicAuthHandler(pwd_mgr) 338 | handlers.append(basic_auth_handler) 339 | 340 | # add proxy support if requested 341 | if self.http_proxy: 342 | proxy_handler = urllib2.ProxyHandler({'http': 'http://%s' % self.http_proxy}) 343 | handlers.append(proxy_handler) 344 | 345 | if handlers: 346 | opener = urllib2.build_opener(*handlers) 347 | else: 348 | opener = urllib2.build_opener() 349 | opener.addheaders = [('User-agent', self.user_agent)] 350 | 351 | data = None 352 | tries = self.tries 353 | 354 | if use_ssl: 355 | protocol = "https" 356 | else: 357 | protocol = "http" 358 | url = "%s://%s%s" % (protocol, host, path) 359 | 360 | while tries > 0: 361 | try: 362 | f = opener.open(url) 363 | data = f.read() 364 | f.close() 365 | break 366 | except urllib2.HTTPError, e: 367 | if e.code == 301: 368 | raise DeliciousMovedPermanentlyWarning, "Delicious.com status %s - url moved permanently" % e.code 369 | if e.code == 302: 370 | raise DeliciousMovedTemporarilyWarning, "Delicious.com status %s - url moved temporarily" % e.code 371 | elif e.code == 401: 372 | raise DeliciousUnauthorizedError, "Delicious.com error %s - unauthorized (authentication failed?)" % e.code 373 | elif e.code == 403: 374 | raise DeliciousForbiddenError, "Delicious.com error %s - forbidden" % e.code 375 | elif e.code == 404: 376 | raise DeliciousNotFoundError, "Delicious.com error %s - url not found" % e.code 377 | elif e.code == 500: 378 | raise Delicious500Error, "Delicious.com error %s - server problem" % e.code 379 | elif e.code == 503 or e.code == 999: 380 | raise DeliciousThrottleError, "Delicious.com error %s - unable to process request (your IP address has been throttled/blocked)" % e.code 381 | else: 382 | raise DeliciousUnknownError, "Delicious.com error %s - unknown error" % e.code 383 | break 384 | except urllib2.URLError, e: 385 | time.sleep(self.wait_seconds) 386 | except socket.error, msg: 387 | # sometimes we get a "Connection Refused" error 388 | # wait a bit and then try again 389 | time.sleep(self.wait_seconds) 390 | #finally: 391 | # f.close() 392 | tries -= 1 393 | return data 394 | 395 | 396 | def get_url(self, 
url, max_bookmarks=50, sleep_seconds=1): 397 | """ 398 | Returns a DeliciousURL instance representing the Delicious.com history of url. 399 | 400 | Generally, this method is what you want for getting title, bookmark, tag, 401 | and user information about a URL. 402 | 403 | Delicious only returns up to 50 bookmarks per URL. This means that 404 | we have to do subsequent queries plus parsing if we want to retrieve 405 | more than 50. Roughly speaking, the processing time of get_url() 406 | increases linearly with the number of 50-bookmarks-chunks; i.e. 407 | it will take 10 times longer to retrieve 500 bookmarks than 50. 408 | 409 | @param url: The URL of the web document to be queried for. 410 | @type url: str 411 | 412 | @param max_bookmarks: Optional, default: 50. 413 | See the documentation of get_bookmarks() for more information 414 | as get_url() uses get_bookmarks() to retrieve a url's 415 | bookmarking history. 416 | @type max_bookmarks: int 417 | 418 | @param sleep_seconds: Optional, default: 1. 419 | See the documentation of get_bookmarks() for more information 420 | as get_url() uses get_bookmarks() to retrieve a url's 421 | bookmarking history. sleep_seconds must be >= 1 to comply with 422 | Delicious.com's Terms of Use. 423 | @type sleep_seconds: int 424 | 425 | @return: DeliciousURL instance representing the Delicious.com history 426 | of url. 427 | 428 | """ 429 | # we must wait at least 1 second between subsequent queries to 430 | # comply with Delicious.com's Terms of Use 431 | assert sleep_seconds >= 1 432 | 433 | document = DeliciousURL(url) 434 | 435 | m = hashlib.md5() 436 | m.update(url) 437 | hash = m.hexdigest() 438 | 439 | path = "/v2/json/urlinfo/%s" % hash 440 | data = self._query(path, host="feeds.delicious.com") 441 | if data: 442 | urlinfo = {} 443 | try: 444 | urlinfo = simplejson.loads(data) 445 | if urlinfo: 446 | urlinfo = urlinfo[0] 447 | else: 448 | urlinfo = {} 449 | except TypeError: 450 | pass 451 | try: 452 | document.title = urlinfo['title'] or u"" 453 | except KeyError: 454 | pass 455 | try: 456 | top_tags = urlinfo['top_tags'] or {} 457 | if top_tags: 458 | document.top_tags = sorted(top_tags.iteritems(), key=itemgetter(1), reverse=True) 459 | else: 460 | document.top_tags = [] 461 | except KeyError: 462 | pass 463 | try: 464 | document.total_bookmarks = int(urlinfo['total_posts']) 465 | except (KeyError, ValueError): 466 | pass 467 | document.bookmarks = self.get_bookmarks(url=url, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds) 468 | 469 | 470 | return document 471 | 472 | def get_network(self, username): 473 | """ 474 | Returns the user's list of followees and followers. 475 | 476 | Followees are users in his Delicious "network", i.e. those users whose 477 | bookmark streams he's subscribed to. Followers are his Delicious.com 478 | "fans", i.e. those users who have subscribed to the given user's 479 | bookmark stream). 480 | 481 | Example: 482 | 483 | A --------> --------> C 484 | D --------> B --------> E 485 | F --------> --------> F 486 | 487 | followers followees 488 | of B of B 489 | 490 | Arrows from user A to user B denote that A has subscribed to B's 491 | bookmark stream, i.e. A is "following" or "tracking" B. 492 | 493 | Note that user F is both a followee and a follower of B, i.e. F tracks 494 | B and vice versa. In Delicious.com terms, F is called a "mutual fan" 495 | of B. 
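
        A minimal usage sketch (the username "foo" is just a placeholder):

            api = DeliciousAPI()
            followees, followers = api.get_network("foo")
            if followees is not None:
                for username, tracking_since in followees:
                    print username, tracking_since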
496 | 497 | Comparing this network concept to information retrieval, one could say 498 | that followers are incoming links and followees outgoing links of B. 499 | 500 | @param username: Delicous.com username for which network information is 501 | retrieved. 502 | @type username: unicode/str 503 | 504 | @return: Tuple of two lists ([, []), where each list 505 | contains tuples of (username, tracking_since_timestamp). 506 | If a network is set as private, i.e. hidden from public view, 507 | (None, None) is returned. 508 | If a network is public but empty, ([], []) is returned. 509 | 510 | """ 511 | assert username 512 | followees = followers = None 513 | 514 | # followees (network members) 515 | path = "/v2/json/networkmembers/%s" % username 516 | data = None 517 | try: 518 | data = self._query(path, host="feeds.delicious.com") 519 | except DeliciousForbiddenError: 520 | pass 521 | if data: 522 | followees = [] 523 | 524 | users = [] 525 | try: 526 | users = simplejson.loads(data) 527 | except TypeError: 528 | pass 529 | 530 | uname = tracking_since = None 531 | 532 | for user in users: 533 | # followee's username 534 | try: 535 | uname = user['user'] 536 | except KeyError: 537 | pass 538 | # try to convert uname to Unicode 539 | if uname: 540 | try: 541 | # we assume UTF-8 encoding 542 | uname = uname.decode('utf-8') 543 | except UnicodeDecodeError: 544 | pass 545 | # time when the given user started tracking this user 546 | try: 547 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ") 548 | except KeyError: 549 | pass 550 | if uname: 551 | followees.append( (uname, tracking_since) ) 552 | 553 | # followers (network fans) 554 | path = "/v2/json/networkfans/%s" % username 555 | data = None 556 | try: 557 | data = self._query(path, host="feeds.delicious.com") 558 | except DeliciousForbiddenError: 559 | pass 560 | if data: 561 | followers = [] 562 | 563 | users = [] 564 | try: 565 | users = simplejson.loads(data) 566 | except TypeError: 567 | pass 568 | 569 | uname = tracking_since = None 570 | 571 | for user in users: 572 | # fan's username 573 | try: 574 | uname = user['user'] 575 | except KeyError: 576 | pass 577 | # try to convert uname to Unicode 578 | if uname: 579 | try: 580 | # we assume UTF-8 encoding 581 | uname = uname.decode('utf-8') 582 | except UnicodeDecodeError: 583 | pass 584 | # time when fan started tracking the given user 585 | try: 586 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ") 587 | except KeyError: 588 | pass 589 | if uname: 590 | followers.append( (uname, tracking_since) ) 591 | return ( followees, followers ) 592 | 593 | def get_bookmarks(self, url=None, username=None, max_bookmarks=50, sleep_seconds=1): 594 | """ 595 | Returns the bookmarks of url or user, respectively. 596 | 597 | Delicious.com only returns up to 50 bookmarks per URL on its website. 598 | This means that we have to do subsequent queries plus parsing if 599 | we want to retrieve more than 50. Roughly speaking, the processing 600 | time of get_bookmarks() increases linearly with the number of 601 | 50-bookmarks-chunks; i.e. it will take 10 times longer to retrieve 602 | 500 bookmarks than 50. 603 | 604 | @param url: The URL of the web document to be queried for. 605 | Cannot be used together with 'username'. 606 | @type url: str 607 | 608 | @param username: The Delicious.com username to be queried for. 609 | Cannot be used together with 'url'. 610 | @type username: str 611 | 612 | @param max_bookmarks: Optional, default: 50. 
613 | Maximum number of bookmarks to retrieve. Set to 0 to disable 614 | this limitation/the maximum and retrieve all available 615 | bookmarks of the given url. 616 | 617 | Bookmarks are sorted so that newer bookmarks are first. 618 | Setting max_bookmarks to 50 means that get_bookmarks() will retrieve 619 | the 50 most recent bookmarks of the given url. 620 | 621 | In the case of getting bookmarks of a URL (url is set), 622 | get_bookmarks() will take *considerably* longer to run 623 | for pages with lots of bookmarks when setting max_bookmarks 624 | to a high number or when you completely disable the limit. 625 | Delicious returns only up to 50 bookmarks per result page, 626 | so for example retrieving 250 bookmarks requires 5 HTTP 627 | connections and parsing 5 HTML pages plus wait time between 628 | queries (to comply with delicious' Terms of Use; see 629 | also parameter 'sleep_seconds'). 630 | 631 | In the case of getting bookmarks of a user (username is set), 632 | the same restrictions as for a URL apply with the exception 633 | that we can retrieve up to 100 bookmarks per HTTP query 634 | (instead of only up to 50 per HTTP query for a URL). 635 | @type max_bookmarks: int 636 | 637 | @param sleep_seconds: Optional, default: 1. 638 | Wait the specified number of seconds between subsequent 639 | queries in case that there are multiple pages of bookmarks 640 | for the given url. sleep_seconds must be >= 1 to comply with 641 | Delicious.com's Terms of Use. 642 | See also parameter 'max_bookmarks'. 643 | @type sleep_seconds: int 644 | 645 | @return: Returns the bookmarks of url or user, respectively. 646 | For urls, it returns a list of (user, tags, comment, timestamp) 647 | tuples. 648 | For users, it returns a list of (url, tags, title, comment, 649 | timestamp) tuples. 650 | 651 | Bookmarks are sorted "descendingly" by creation time, i.e. newer 652 | bookmarks come first. 653 | 654 | """ 655 | # we must wait at least 1 second between subsequent queries to 656 | # comply with delicious' Terms of Use 657 | assert sleep_seconds >= 1 658 | 659 | # url XOR username 660 | assert bool(username) is not bool(url) 661 | 662 | # maximum number of urls/posts Delicious.com will display 663 | # per page on its website 664 | max_html_count = 100 665 | # maximum number of pages that Delicious.com will display; 666 | # currently, the maximum number of pages is 20. Delicious.com 667 | # allows to go beyond page 20 via pagination, but page N (for 668 | # N > 20) will always display the same content as page 20. 
669 | max_html_pages = 20 670 | 671 | path = None 672 | if url: 673 | m = hashlib.md5() 674 | m.update(url) 675 | hash = m.hexdigest() 676 | 677 | # path will change later on if there are multiple pages of boomarks 678 | # for the given url 679 | path = "/url/%s" % hash 680 | elif username: 681 | # path will change later on if there are multiple pages of boomarks 682 | # for the given username 683 | path = "/%s?setcount=%d" % (username, max_html_count) 684 | else: 685 | raise Exception('You must specify either url or user.') 686 | 687 | page_index = 1 688 | bookmarks = [] 689 | while path and page_index <= max_html_pages: 690 | data = self._query(path) 691 | path = None 692 | if data: 693 | # extract bookmarks from current page 694 | if url: 695 | bookmarks.extend(self._extract_bookmarks_from_url_history(data)) 696 | else: 697 | bookmarks.extend(self._extract_bookmarks_from_user_history(data)) 698 | 699 | # stop scraping if we already have as many bookmarks as we want 700 | if (len(bookmarks) >= max_bookmarks) and max_bookmarks != 0: 701 | break 702 | else: 703 | # check if there are multiple pages of bookmarks for this 704 | # url on Delicious.com 705 | soup = BeautifulSoup(data) 706 | paginations = soup.findAll("div", id="pagination") 707 | if paginations: 708 | # find next path 709 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" }) 710 | if nexts and (max_bookmarks == 0 or len(bookmarks) < max_bookmarks) and len(bookmarks) > 0: 711 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2 712 | path = nexts[0]['href'] 713 | if username: 714 | path += "&setcount=%d" % max_html_count 715 | page_index += 1 716 | # wait one second between queries to be compliant with 717 | # delicious' Terms of Use 718 | time.sleep(sleep_seconds) 719 | if max_bookmarks > 0: 720 | return bookmarks[:max_bookmarks] 721 | else: 722 | return bookmarks 723 | 724 | 725 | def _extract_bookmarks_from_url_history(self, data): 726 | """ 727 | Extracts user bookmarks from a URL's history page on Delicious.com. 728 | 729 | The Python library BeautifulSoup is used to parse the HTML page. 730 | 731 | @param data: The HTML source of a URL history Web page on Delicious.com. 
732 | @type data: str 733 | 734 | @return: list of user bookmarks of the corresponding URL 735 | 736 | """ 737 | bookmarks = [] 738 | soup = BeautifulSoup(data) 739 | 740 | bookmark_elements = soup.findAll("div", attrs={"class": re.compile("^bookmark\s*")}) 741 | timestamp = None 742 | for bookmark_element in bookmark_elements: 743 | 744 | # extract bookmark creation time 745 | # 746 | # this timestamp has to "persist" until a new timestamp is 747 | # found (delicious only provides the creation time data for the 748 | # first bookmark in the list of bookmarks for a given day 749 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"}) 750 | if dategroups: 751 | spans = dategroups[0].findAll('span') 752 | if spans: 753 | date_str = spans[0].contents[0].strip() 754 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y') 755 | 756 | # extract comments 757 | comment = u"" 758 | datas = bookmark_element.findAll("div", attrs={"class": "data"}) 759 | if datas: 760 | divs = datas[0].findAll("div", attrs={"class": "description"}) 761 | if divs: 762 | comment = divs[0].contents[0].strip() 763 | 764 | # extract tags 765 | user_tags = [] 766 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"}) 767 | if tagdisplays: 768 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"}) 769 | for a in aset: 770 | tag = a.contents[0] 771 | user_tags.append(tag) 772 | 773 | # extract user information 774 | metas = bookmark_element.findAll("div", attrs={"class": "meta"}) 775 | if metas: 776 | links = metas[0].findAll("a", attrs={"class": "user user-tag"}) 777 | if links: 778 | try: 779 | user = links[0]['href'][1:] 780 | except IndexError: 781 | # WORKAROUND: it seems there is a bug on Delicious.com where 782 | # sometimes a bookmark is shown in a URL history without any 783 | # associated Delicious username (username is empty); this could 784 | # be caused by special characters in the username or other things 785 | # 786 | # this problem of Delicious is very rare, so we just skip such 787 | # entries until they find a fix 788 | pass 789 | bookmarks.append( (user, user_tags, comment, timestamp) ) 790 | 791 | return bookmarks 792 | 793 | def _extract_bookmarks_from_user_history(self, data): 794 | """ 795 | Extracts a user's bookmarks from his user page on Delicious.com. 796 | 797 | The Python library BeautifulSoup is used to parse the HTML page. 798 | 799 | @param data: The HTML source of a user page on Delicious.com. 
800 | @type data: str 801 | 802 | @return: list of bookmarks of the corresponding user 803 | 804 | """ 805 | bookmarks = [] 806 | soup = BeautifulSoup(data) 807 | 808 | ul = soup.find("ul", id="bookmarklist") 809 | if ul: 810 | bookmark_elements = ul.findAll("div", attrs={"class": re.compile("^bookmark\s*")}) 811 | timestamp = None 812 | for bookmark_element in bookmark_elements: 813 | 814 | # extract bookmark creation time 815 | # 816 | # this timestamp has to "persist" until a new timestamp is 817 | # found (delicious only provides the creation time data for the 818 | # first bookmark in the list of bookmarks for a given day 819 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"}) 820 | if dategroups: 821 | spans = dategroups[0].findAll('span') 822 | if spans: 823 | date_str = spans[0].contents[0].strip() 824 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y') 825 | 826 | # extract url, title and comments 827 | url = u"" 828 | title = u"" 829 | comment = u"" 830 | datas = bookmark_element.findAll("div", attrs={"class": "data"}) 831 | if datas: 832 | links = datas[0].findAll("a", attrs={"class": re.compile("^taggedlink\s*")}) 833 | if links and links[0].contents: 834 | title = links[0].contents[0].strip() 835 | url = links[0]['href'] 836 | divs = datas[0].findAll("div", attrs={"class": "description"}) 837 | if divs: 838 | comment = divs[0].contents[0].strip() 839 | 840 | # extract tags 841 | url_tags = [] 842 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"}) 843 | if tagdisplays: 844 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"}) 845 | for a in aset: 846 | tag = a.contents[0] 847 | url_tags.append(tag) 848 | 849 | bookmarks.append( (url, url_tags, title, comment, timestamp) ) 850 | 851 | return bookmarks 852 | 853 | 854 | def get_user(self, username, password=None, max_bookmarks=50, sleep_seconds=1): 855 | """Retrieves a user's bookmarks from Delicious.com. 856 | 857 | If a correct username AND password are supplied, a user's *full* 858 | bookmark collection (which also includes private bookmarks) is 859 | retrieved. Data communication is encrypted using SSL in this case. 860 | 861 | If no password is supplied, only the *public* bookmarks of the user 862 | are retrieved. Here, the parameter 'max_bookmarks' specifies how 863 | many public bookmarks will be retrieved (default: 50). Set the 864 | parameter to 0 to retrieve all public bookmarks. 865 | 866 | This function can be used to backup all of a user's bookmarks if 867 | called with a username and password. 868 | 869 | @param username: The Delicious.com username. 870 | @type username: str 871 | 872 | @param password: Optional, default: None. 873 | The user's Delicious.com password. If password is set, 874 | all communication with Delicious.com is SSL-encrypted. 875 | @type password: unicode/str 876 | 877 | @param max_bookmarks: Optional, default: 50. 878 | See the documentation of get_bookmarks() for more 879 | information as get_url() uses get_bookmarks() to 880 | retrieve a url's bookmarking history. 881 | The parameter is NOT used when a password is specified 882 | because in this case the *full* bookmark collection of 883 | a user will be retrieved. 884 | @type max_bookmarks: int 885 | 886 | @param sleep_seconds: Optional, default: 1. 887 | See the documentation of get_bookmarks() for more information as 888 | get_url() uses get_bookmarks() to retrieve a url's bookmarking 889 | history. 
sleep_seconds must be >= 1 to comply with Delicious.com's 890 | Terms of Use. 891 | @type sleep_seconds: int 892 | 893 | @return: DeliciousUser instance 894 | 895 | """ 896 | assert username 897 | user = DeliciousUser(username) 898 | bookmarks = [] 899 | if password: 900 | # We have username AND password, so we call 901 | # the official Delicious.com API. 902 | path = "/v1/posts/all" 903 | data = self._query(path, host="api.del.icio.us", use_ssl=True, user=username, password=password) 904 | if data: 905 | soup = BeautifulSoup(data) 906 | elements = soup.findAll("post") 907 | for element in elements: 908 | url = element["href"] 909 | title = element["description"] or u"" 910 | comment = element["extended"] or u"" 911 | tags = [] 912 | if element["tag"]: 913 | tags = element["tag"].split() 914 | timestamp = datetime.datetime.strptime(element["time"], "%Y-%m-%dT%H:%M:%SZ") 915 | bookmarks.append( (url, tags, title, comment, timestamp) ) 916 | user.bookmarks = bookmarks 917 | else: 918 | # We have only the username, so we extract data from 919 | # the user's JSON feed. However, the feed is restricted 920 | # to the most recent public bookmarks of the user, which 921 | # is about 100 if any. So if we need more than 100, we start 922 | # scraping the Delicious.com website directly 923 | if max_bookmarks > 0 and max_bookmarks <= 100: 924 | path = "/v2/json/%s?count=100" % username 925 | data = self._query(path, host="feeds.delicious.com", user=username) 926 | if data: 927 | posts = [] 928 | try: 929 | posts = simplejson.loads(data) 930 | except TypeError: 931 | pass 932 | 933 | url = timestamp = None 934 | title = comment = u"" 935 | tags = [] 936 | 937 | for post in posts: 938 | # url 939 | try: 940 | url = post['u'] 941 | except KeyError: 942 | pass 943 | # title 944 | try: 945 | title = post['d'] 946 | except KeyError: 947 | pass 948 | # tags 949 | try: 950 | tags = post['t'] 951 | except KeyError: 952 | pass 953 | if not tags: 954 | tags = [u"system:unfiled"] 955 | # comment / notes 956 | try: 957 | comment = post['n'] 958 | except KeyError: 959 | pass 960 | # bookmark creation time 961 | try: 962 | timestamp = datetime.datetime.strptime(post['dt'], "%Y-%m-%dT%H:%M:%SZ") 963 | except KeyError: 964 | pass 965 | bookmarks.append( (url, tags, title, comment, timestamp) ) 966 | user.bookmarks = bookmarks[:max_bookmarks] 967 | else: 968 | # TODO: retrieve the first 100 bookmarks via JSON before 969 | # falling back to scraping the delicous.com website 970 | user.bookmarks = self.get_bookmarks(username=username, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds) 971 | return user 972 | 973 | def get_urls(self, tag=None, popular=True, max_urls=100, sleep_seconds=1): 974 | """ 975 | Returns the list of recent URLs (of web documents) tagged with a given tag. 976 | 977 | This is very similar to parsing Delicious' RSS/JSON feeds directly, 978 | but this function will return up to 2,000 links compared to a maximum 979 | of 100 links when using the official feeds (with query parameter 980 | count=100). 981 | 982 | The return list of links will be sorted by recency in descending order, 983 | i.e. newest items first. 984 | 985 | Note that even when setting max_urls, get_urls() cannot guarantee that 986 | it can retrieve *at least* this many URLs. It is really just an upper 987 | bound. 988 | 989 | @param tag: Retrieve links which have been tagged with the given tag. 990 | If tag is not set (default), links will be retrieved from the 991 | Delicious.com front page (aka "delicious hotlist"). 
992 | @type tag: unicode/str 993 | 994 | @param popular: If true (default), retrieve only popular links (i.e. 995 | /popular/). Otherwise, the most recent links tagged with 996 | the given tag will be retrieved (i.e. /tag/). 997 | 998 | As of January 2009, it seems that Delicious.com modified the list 999 | of popular tags to contain only up to a maximum of 15 URLs. 1000 | This also means that setting max_urls to values larger than 15 1001 | will not change the results of get_urls(). 1002 | So if you are interested in more URLs, set the "popular" parameter 1003 | to false. 1004 | 1005 | Note that if you set popular to False, the returned list of URLs 1006 | might contain duplicate items. This is due to the way Delicious.com 1007 | creates its /tag/ Web pages. So if you need a certain 1008 | number of unique URLs, you have to take care of that in your 1009 | own code. 1010 | @type popular: bool 1011 | 1012 | @param max_urls: Retrieve at most max_urls links. The default is 100, 1013 | which is the maximum number of links that can be retrieved by 1014 | parsing the official JSON feeds. The maximum value of max_urls 1015 | in practice is 2000 (currently). If it is set higher, Delicious 1016 | will return the same links over and over again, giving lots of 1017 | duplicate items. 1018 | @type max_urls: int 1019 | 1020 | @param sleep_seconds: Optional, default: 1. 1021 | Wait the specified number of seconds between subsequent queries in 1022 | case that there are multiple pages of bookmarks for the given url. 1023 | Must be greater than or equal to 1 to comply with Delicious.com's 1024 | Terms of Use. 1025 | See also parameter 'max_urls'. 1026 | @type sleep_seconds: int 1027 | 1028 | @return: The list of recent URLs (of web documents) tagged with a given tag. 1029 | 1030 | """ 1031 | assert sleep_seconds >= 1 1032 | urls = [] 1033 | path = None 1034 | if tag is None or (tag is not None and max_urls > 0 and max_urls <= 100): 1035 | # use official JSON feeds 1036 | max_json_count = 100 1037 | if tag: 1038 | # tag-specific JSON feed 1039 | if popular: 1040 | path = "/v2/json/popular/%s?count=%d" % (tag, max_json_count) 1041 | else: 1042 | path = "/v2/json/tag/%s?count=%d" % (tag, max_json_count) 1043 | else: 1044 | # Delicious.com hotlist 1045 | path = "/v2/json/?count=%d" % (max_json_count) 1046 | data = self._query(path, host="feeds.delicious.com") 1047 | if data: 1048 | posts = [] 1049 | try: 1050 | posts = simplejson.loads(data) 1051 | except TypeError: 1052 | pass 1053 | 1054 | for post in posts: 1055 | # url 1056 | try: 1057 | url = post['u'] 1058 | if url: 1059 | urls.append(url) 1060 | except KeyError: 1061 | pass 1062 | else: 1063 | # maximum number of urls/posts Delicious.com will display 1064 | # per page on its website 1065 | max_html_count = 100 1066 | # maximum number of pages that Delicious.com will display; 1067 | # currently, the maximum number of pages is 20. Delicious.com 1068 | # allows to go beyond page 20 via pagination, but page N (for 1069 | # N > 20) will always display the same content as page 20. 
1070 | max_html_pages = 20 1071 | 1072 | if popular: 1073 | path = "/popular/%s?setcount=%d" % (tag, max_html_count) 1074 | else: 1075 | path = "/tag/%s?setcount=%d" % (tag, max_html_count) 1076 | 1077 | page_index = 1 1078 | urls = [] 1079 | while path and page_index <= max_html_pages: 1080 | data = self._query(path) 1081 | path = None 1082 | if data: 1083 | # extract urls from current page 1084 | soup = BeautifulSoup(data) 1085 | links = soup.findAll("a", attrs={"class": re.compile("^taggedlink\s*")}) 1086 | for link in links: 1087 | try: 1088 | url = link['href'] 1089 | if url: 1090 | urls.append(url) 1091 | except KeyError: 1092 | pass 1093 | 1094 | # check if there are more multiple pages of urls 1095 | soup = BeautifulSoup(data) 1096 | paginations = soup.findAll("div", id="pagination") 1097 | if paginations: 1098 | # find next path 1099 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" }) 1100 | if nexts and (max_urls == 0 or len(urls) < max_urls) and len(urls) > 0: 1101 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2 1102 | path = nexts[0]['href'] 1103 | path += "&setcount=%d" % max_html_count 1104 | page_index += 1 1105 | # wait between queries to Delicious.com to be 1106 | # compliant with its Terms of Use 1107 | time.sleep(sleep_seconds) 1108 | if max_urls > 0: 1109 | return urls[:max_urls] 1110 | else: 1111 | return urls 1112 | 1113 | 1114 | def get_tags_of_user(self, username): 1115 | """ 1116 | Retrieves user's public tags and their tag counts from Delicious.com. 1117 | The tags represent a user's full public tagging vocabulary. 1118 | 1119 | DeliciousAPI uses the official JSON feed of the user. We could use 1120 | RSS here, but the JSON feed has proven to be faster in practice. 1121 | 1122 | @param username: The Delicious.com username. 1123 | @type username: str 1124 | 1125 | @return: Dictionary mapping tags to their tag counts. 1126 | 1127 | """ 1128 | tags = {} 1129 | path = "/v2/json/tags/%s" % username 1130 | data = self._query(path, host="feeds.delicious.com") 1131 | if data: 1132 | try: 1133 | tags = simplejson.loads(data) 1134 | except TypeError: 1135 | pass 1136 | return tags 1137 | 1138 | def get_number_of_users(self, url): 1139 | """get_number_of_users() is obsolete and has been removed. Please use get_url() instead.""" 1140 | reason = "get_number_of_users() is obsolete and has been removed. Please use get_url() instead." 1141 | raise Exception(reason) 1142 | 1143 | def get_common_tags_of_url(self, url): 1144 | """get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead.""" 1145 | reason = "get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead." 1146 | raise Exception(reason) 1147 | 1148 | def _html_escape(self, s): 1149 | """HTML-escape a string or object. 1150 | 1151 | This converts any non-string objects passed into it to strings 1152 | (actually, using unicode()). All values returned are 1153 | non-unicode strings (using "&#num;" entities for all non-ASCII 1154 | characters). 1155 | 1156 | None is treated specially, and returns the empty string. 1157 | 1158 | @param s: The string that needs to be escaped. 1159 | @type s: str 1160 | 1161 | @return: The escaped string. 
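
        Example (illustrative):

            >>> api = DeliciousAPI()
            >>> api._html_escape(u'Tom & Jerry <3')
            'Tom &amp; Jerry &lt;3'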
1162 | 1163 | """ 1164 | if s is None: 1165 | return '' 1166 | if not isinstance(s, basestring): 1167 | if hasattr(s, '__unicode__'): 1168 | s = unicode(s) 1169 | else: 1170 | s = str(s) 1171 | s = cgi.escape(s, True) 1172 | if isinstance(s, unicode): 1173 | s = s.encode('ascii', 'xmlcharrefreplace') 1174 | return s 1175 | 1176 | 1177 | class DeliciousError(Exception): 1178 | """Used to indicate that an error occurred when trying to access Delicious.com via its API.""" 1179 | 1180 | class DeliciousWarning(Exception): 1181 | """Used to indicate a warning when trying to access Delicious.com via its API. 1182 | 1183 | Warnings are raised when it is useful to alert the user of some condition 1184 | where that condition doesn't warrant raising an exception and terminating 1185 | the program. For example, we issue a warning when Delicious.com returns a 1186 | HTTP status code for redirections (3xx). 1187 | """ 1188 | 1189 | class DeliciousThrottleError(DeliciousError): 1190 | """Used to indicate that the client computer (i.e. its IP address) has been temporarily blocked by Delicious.com.""" 1191 | pass 1192 | 1193 | class DeliciousUnknownError(DeliciousError): 1194 | """Used to indicate that Delicious.com returned an (HTTP) error which we don't know how to handle yet.""" 1195 | pass 1196 | 1197 | class DeliciousUnauthorizedError(DeliciousError): 1198 | """Used to indicate that Delicious.com returned a 401 Unauthorized error. 1199 | 1200 | Most of the time, the user credentials for accessing restricted functions 1201 | of the official Delicious.com API are incorrect. 1202 | 1203 | """ 1204 | pass 1205 | 1206 | class DeliciousForbiddenError(DeliciousError): 1207 | """Used to indicate that Delicious.com returned a 403 Forbidden error. 1208 | """ 1209 | pass 1210 | 1211 | 1212 | class DeliciousNotFoundError(DeliciousError): 1213 | """Used to indicate that Delicious.com returned a 404 Not Found error. 1214 | 1215 | Most of the time, retrying some seconds later fixes the problem 1216 | (because we only query existing pages with this API). 1217 | 1218 | """ 1219 | pass 1220 | 1221 | class Delicious500Error(DeliciousError): 1222 | """Used to indicate that Delicious.com returned a 500 error. 1223 | 1224 | Most of the time, retrying some seconds later fixes the problem. 1225 | 1226 | """ 1227 | pass 1228 | 1229 | class DeliciousMovedPermanentlyWarning(DeliciousWarning): 1230 | """Used to indicate that Delicious.com returned a 301 Found (Moved Permanently) redirection.""" 1231 | pass 1232 | 1233 | class DeliciousMovedTemporarilyWarning(DeliciousWarning): 1234 | """Used to indicate that Delicious.com returned a 302 Found (Moved Temporarily) redirection.""" 1235 | pass 1236 | 1237 | __all__ = ['DeliciousAPI', 'DeliciousURL', 'DeliciousError', 'DeliciousThrottleError', 'DeliciousUnauthorizedError', 'DeliciousUnknownError', 'DeliciousNotFoundError' , 'Delicious500Error', 'DeliciousMovedTemporarilyWarning'] 1238 | 1239 | if __name__ == "__main__": 1240 | d = DeliciousAPI() 1241 | max_bookmarks = 50 1242 | url = 'http://www.michael-noll.com/wiki/Del.icio.us_Python_API' 1243 | print "Retrieving Delicious.com information about url" 1244 | print "'%s'" % url 1245 | print "Note: This might take some time..." 1246 | print "=========================================================" 1247 | document = d.get_url(url, max_bookmarks=max_bookmarks) 1248 | print document 1249 | --------------------------------------------------------------------------------
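
Usage sketch: a minimal example of how deliciousapi.py can be driven. The username "foo" and the URL below are placeholders, and every call goes to the live Delicious.com service, so expect delays and possible throttling.

    import deliciousapi

    d = deliciousapi.DeliciousAPI()

    # full public bookmarking history of a URL
    doc = d.get_url("http://www.michael-noll.com/wiki/Del.icio.us_Python_API",
                    max_bookmarks=50)
    print doc.title
    print doc.total_bookmarks          # number reported by Delicious.com itself
    print doc.top_tags                 # list of (tag, tag_count) tuples
    for user, tags, comment, timestamp in doc.bookmarks:
        print user, tags, timestamp

    # public bookmarks and tagging vocabulary of a user
    user = d.get_user("foo", max_bookmarks=50)
    for url, tags, title, comment, timestamp in user.bookmarks:
        print url, title, tags
    print d.get_tags_of_user("foo")    # dict mapping tag -> tag count

    # network information: (followees, followers), or (None, None) if private
    followees, followers = d.get_network("foo")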