├── README.md
├── deliciousmonitor.py
└── deliciousapi.py
/README.md:
--------------------------------------------------------------------------------
1 | DeliciousAPI
2 | ============
3 |
4 | Unofficial Python API for retrieving data from Delicious.com
5 |
6 | Features
7 | --------
8 |
9 | This Python module provides the following features plus some more:
10 |
11 | * Retrieving a URL's full public bookmarking history, i.e. the users who bookmarked the URL, the tags used for each bookmark, and the creation time of each bookmark (up to YYYY-MM-DD granularity)
12 | * Retrieving a URL's top tags (up to a maximum of 10) including tag counts, its title as stored on Delicious.com, and its total number of bookmarks/users on Delicious.com
13 | * Retrieving a user's full bookmark collection, including any private bookmarks if you know the corresponding password
14 | * Retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
15 | * Retrieving a user's network information (network members and network fans)
16 | * HTTP proxy support
17 |
18 | The official Delicious.com API and the JSON/RSS feeds do not provide all the functionality mentioned above, and in such cases this module will query the Delicious.com website directly and extract the required information by parsing the HTML code of the resulting Web pages (a kind of poor man's web mining). The module is able to detect IP throttling, which is employed by Delicious.com to temporarily block abusive HTTP request behavior, and will raise a custom Python error to indicate that. Please be a nice netizen and do not stress the Delicious.com service more than necessary.
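When Delicious.com does throttle your IP address, DeliciousAPI raises DeliciousThrottleError, which you can catch and handle gracefully, e.g. by backing off for a while. A minimal sketch (the URL below is only a placeholder):

    import time
    import deliciousapi

    api = deliciousapi.DeliciousAPI()
    try:
        document = api.get_url("http://www.example.com/")
    except deliciousapi.DeliciousThrottleError:
        # our IP address has been blocked temporarily; back off and retry later
        time.sleep(600)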
19 |
20 | Installation
21 | ------------
22 |
23 | You can download and install DeliciousAPI (the package contains only deliciousapi.py) from the Python Package Index (aka Python Cheese Shop) via setuptools/easy_install. Just run
24 |
25 | $ easy_install DeliciousAPI
26 |
27 | After installation, a simple `import deliciousapi` in your Python scripts will do the trick.
28 |
29 | An alternative installation method is downloading the code straight from the git repository.
30 |
31 | Updates
32 | -------
33 |
34 | If you used setuptools/easy_install for installation, you can update DeliciousAPI via
35 |
36 | $ easy_install -U DeliciousAPI
37 |
38 | Alternatively, if you downloaded the code from the git repository, simply pull the latest changes.
39 |
40 | Usage
41 | -----
42 |
43 | For now, please refer to the documentation available at [http://www.michael-noll.com/projects/delicious-python-api/](http://www.michael-noll.com/projects/delicious-python-api/).
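In the meantime, here is a small sketch of typical usage (the URL and username below are placeholders):

    import deliciousapi

    api = deliciousapi.DeliciousAPI()

    # bookmarking information about a URL
    document = api.get_url("http://www.example.com/")
    print document.title, document.total_bookmarks
    for tag, count in document.top_tags:
        print "%s (%d)" % (tag, count)

    # a user's public bookmarks and tagging vocabulary
    user = api.get_user("someuser", max_bookmarks=100)
    print user.tags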
44 |
45 | Important
46 | ---------
47 |
48 | It is strongly advised that you read the Delicious.com Terms of Use prior to using this Python module. In particular, read section 5 'Intellectual Property'.
49 |
50 | License
51 | -------
52 |
53 | The code is licensed to you under version 2 of the GNU General Public License.
54 |
55 | Copyright
56 | ---------
57 |
58 | Copyright 2006-2010 Michael G. Noll
59 |
60 |
--------------------------------------------------------------------------------
/deliciousmonitor.py:
--------------------------------------------------------------------------------
1 | """
2 | A module to monitor a delicious.com bookmark RSS feed and store it with some additional metadata to file.
3 |
4 | (c) 2006-2008 Michael G. Noll
5 |
6 | """
7 | import codecs
8 | import datetime
9 | import os
10 | import sys
11 | import time
12 |
13 | try:
14 | import deliciousapi
15 | except ImportError:
16 | print "ERROR: could not import DeliciousAPI module"
17 | print
18 | print "You can download DeliciousAPI from the Python Cheese Shop at"
19 | print "http://pypi.python.org/pypi/DeliciousAPI"
20 | print
21 |     raise
22 | try:
23 | import feedparser
24 | except ImportError:
25 | print "ERROR: could not import Universal Feed Parser module"
26 | print
27 | print "You can download Universal Feed Parser from the Python Cheese Shop at"
28 | print "http://pypi.python.org/pypi/FeedParser"
29 | print
30 | raise
31 |
32 |
33 | class DeliciousMonitor(object):
34 | """Monitors a delicious.com bookmark RSS feed, retrieves metadata for each bookmark and stores it to file.
35 |
36 | By default, the delicious.com hotlist (i.e. the front page) is monitored.
37 | Whenever the monitor discovers a new URL in a bookmark, it retrieves
38 | some metadata for it from delicious.com (currently, common tags and number
39 | of bookmarks) and stores this information to file.
40 |
41 | Note that URLs which have been processed in previous runs will NOT be
42 | processed again, i.e. delicious.com metadata will NOT be updated.
43 |
44 | """
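    # Construction sketch (the file names and interval below are arbitrary
    # example values):
    #
    #   monitor = DeliciousMonitor(filename="hotlist.xml",
    #                              log_filename="hotlist.log",
    #                              interval=60)
    #   monitor.run()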
45 |
46 | def __init__(self, rss_url="http://feeds.delicious.com/v2/rss", filename="delicious-monitor.xml", log_filename="delicious-monitor.log", interval=30, verbose=True):
47 | """
48 | Parameters:
49 | rss_url (optional, default: "http://feeds.delicious.com/v2/rss")
50 | The URL of the RSS feed to monitor.
51 |
52 | filename (optional, default: "delicious-monitor.xml")
53 | The name of the file to which metadata about the RSS feed will be stored.
54 |
55 | log_filename (optional, default: "delicious-monitor.log")
56 | The name of the log file, which is used to identify "new" entries in the RSS feed.
57 |
58 | interval (optional, default: 30)
59 | Time between monitor runs in minutes.
60 |
61 | verbose (optional, default: True)
62 | Whether to print non-critical processing information to STDOUT or not.
63 |
64 | """
65 | self.rss_url = rss_url
66 | self._delicious = deliciousapi.DeliciousAPI()
67 | self.filename = filename
68 | self.log_filename = log_filename
69 | self.interval = interval
70 | self.verbose = verbose
71 | self.urls = []
72 | # ensure that the name of the output file and log file is not None etc.
73 | assert self.filename
74 | assert self.log_filename
75 |
76 | def run(self):
77 | """Start the monitor."""
78 | while True:
79 | time_before_run = datetime.datetime.now()
80 |
81 | # do the actual monitoring work
82 | if self.verbose:
83 | print "[MONITOR] Starting monitor run - %s" % time_before_run.strftime("%Y-%m-%d @ %H:%M:%S")
84 | self.monitor()
85 | time_after_run = datetime.datetime.now()
86 |
87 | # calculate the number of seconds to wait until the next run
88 | interval = datetime.timedelta(seconds=60*self.interval)
89 | next_run_time = time_before_run + interval
90 | elapsed = time_after_run - time_before_run
91 | if interval >= elapsed:
92 | wait_seconds = (interval - elapsed).seconds
93 | else:
94 | # the run took longer than our interval time between runs;
95 | # in this case, we continue immediately but will still wait
96 | # three seconds in order not to stress delicious.com too much
97 | wait_seconds = 3
98 | next_run_time = datetime.datetime.now() + datetime.timedelta(seconds=wait_seconds)
99 |
100 | # sleep until the next run
101 | if self.verbose:
102 | print "[MONITOR] Next monitor run on %s (sleeping for %s seconds)" % (next_run_time.strftime("%Y-%m-%d @ %H:%M:%S"), wait_seconds)
103 | time.sleep(wait_seconds)
104 |
105 | def monitor(self):
106 | """Monitors an RSS feed."""
107 |
108 | # download and parse RSS feed
109 | f = feedparser.parse(self.rss_url)
110 |
111 | output_file = codecs.open(self.filename, "a", "utf8")
112 | log_file = None
113 |
114 | if os.access(self.log_filename, os.F_OK):
115 | if self.verbose:
116 | print "[MONITOR] Log file found. Trying to resume...",
117 | try:
118 | # read in previous log data for resuming
119 | log_file = open(self.log_filename, 'r')
120 | # remove leading and trailing whitespace if any (incl. newlines)
121 | self.urls = [line.strip() for line in log_file.readlines()]
122 | log_file.close()
123 | if self.verbose:
124 | print "done"
125 | except IOError:
126 | # most probably, the log file does not exist (yet)
127 | if self.verbose:
128 | print "failed"
129 | else:
130 | # log file does not exist, so there isn't any resume data
131 | # to read in
132 | pass
133 |
134 | try:
135 | # now open it for writing (i.e., appending) and logging
136 | if self.verbose:
137 | print "[MONITOR] Open log file for appending...",
138 | log_file = open(self.log_filename, 'a')
139 | if self.verbose:
140 | print "done"
141 | except IOError:
142 | if self.verbose:
143 | print "failed"
144 | print "[MONITOR] ERROR: could not open log file for appending"
145 |             output_file.close()
146 | return
147 |
148 | # get only new entries
149 |         # keep only entries whose URLs have not been processed in a previous run
150 |         new_entries = [entry for entry in f.entries
151 |                        if entry.link not in self.urls]
152 |
153 | if self.verbose:
154 | print "[MONITOR] Found %s new entries" % len(new_entries)
155 |
156 | # query metadata about each entry from delicious.com
157 | for index, entry in enumerate(new_entries):
158 | url = entry.link
159 |
160 | if self.verbose:
161 | print "[MONITOR] Processing entry #%s: '%s'" % (index + 1, url),
162 | try:
163 | time.sleep(1) # be nice and wait 1 sec between connects to delicious.com
164 | document = self._delicious.get_url(url)
165 | except (deliciousapi.DeliciousError,), error_string:
166 | if self.verbose:
167 | print "failed"
168 | print "[MONITOR] ERROR: %s" % error_string
169 | # clean up
170 | output_file.close()
171 | log_file.close()
172 | return
173 |
174 | if self.verbose:
175 | print "done"
176 |
177 | # update log file
178 | log_file.write("%s\n" % url)
179 |             # update output file (one XML record per URL, one child element per top tag)
180 |             output_file.write('<url href="%s" total_bookmarks="%d" top_tags="%d">\n' % (url, document.total_bookmarks, len(document.top_tags)))
181 |             for tag, count in document.top_tags:
182 |                 output_file.write('    <tag name="%s" count="%d" />\n' % (tag, count))
183 |             output_file.write('</url>\n')
184 | output_file.flush()
185 |
186 | # clean up
187 | output_file.close()
188 | log_file.close()
189 |
190 |
191 | if __name__ == "__main__":
192 | monitor = DeliciousMonitor(interval=30)
193 | monitor.run()
194 |
--------------------------------------------------------------------------------
/deliciousapi.py:
--------------------------------------------------------------------------------
1 | """
2 | Unofficial Python API for retrieving data from Delicious.com.
3 |
4 | This module provides the following features plus some more:
5 |
6 | * retrieving a URL's full public bookmarking history including
7 | * users who bookmarked the URL including tags used for such bookmarks
8 | and the creation time of the bookmark (up to YYYY-MM-DD granularity)
9 | * top tags (up to a maximum of 10) including tag count
10 | * title as stored on Delicious.com
11 | * total number of bookmarks/users for this URL at Delicious.com
12 | * retrieving a user's full bookmark collection, including any private bookmarks
13 | if you know the corresponding password
14 | * retrieving a user's full public tagging vocabulary, i.e. tags and tag counts
15 | * retrieving a user's network information (network members and network fans)
16 | * HTTP proxy support
17 | * updated to support Delicious.com "version 2" (mini-relaunch as of August 2008)
18 |
19 | The official Delicious.com API and the JSON/RSS feeds do not provide all
20 | the functionality mentioned above, and in such cases this module will query
21 | the Delicious.com *website* directly and extract the required information
22 | by parsing the HTML code of the resulting Web pages (a kind of poor man's
23 | web mining). The module is able to detect IP throttling, which is employed
24 | by Delicious.com to temporarily block abusive HTTP request behavior, and
25 | will raise a custom Python error to indicate that. Please be a nice netizen
26 | and do not stress the Delicious.com service more than necessary.
27 |
28 | It is strongly advised that you read the Delicious.com Terms of Use
29 | before using this Python module. In particular, read section 5
30 | 'Intellectual Property'.
31 |
32 | The code is licensed to you under version 2 of the GNU General Public
33 | License.
34 |
35 | More information about this module can be found at
36 | http://www.michael-noll.com/wiki/Del.icio.us_Python_API
37 |
38 | Changelog is available at
39 | http://code.michael-noll.com/?p=deliciousapi;a=log
40 |
41 | Copyright 2006-2010 Michael G. Noll
42 |
43 | """
44 |
45 | __author__ = "Michael G. Noll"
46 | __copyright__ = "(c) 2006-2010 Michael G. Noll"
47 | __description__ = "Unofficial Python API for retrieving data from Delicious.com"
48 | __email__ = "coding[AT]michael-REMOVEME-noll[DOT]com"
49 | __license__ = "GPLv2"
50 | __maintainer__ = "Michael G. Noll"
51 | __status__ = "Development"
52 | __url__ = "http://www.michael-noll.com/"
53 | __version__ = "1.6.7"
54 |
55 | import cgi
56 | import datetime
57 | import hashlib
58 | from operator import itemgetter
59 | import re
60 | import socket
61 | import time
62 | import urllib2
63 |
64 | try:
65 | from BeautifulSoup import BeautifulSoup
66 | except ImportError:
67 | print "ERROR: could not import BeautifulSoup Python module"
68 | print
69 | print "You can download BeautifulSoup from the Python Cheese Shop at"
70 | print "http://cheeseshop.python.org/pypi/BeautifulSoup/"
71 | print "or directly from http://www.crummy.com/software/BeautifulSoup/"
72 | print
73 | raise
74 |
75 | try:
76 | import simplejson
77 | except ImportError:
78 | print "ERROR: could not import simplejson module"
79 | print
80 | print "Since version 1.5.0, DeliciousAPI requires the simplejson module."
81 | print "You can download simplejson from the Python Cheese Shop at"
82 | print "http://pypi.python.org/pypi/simplejson"
83 | print
84 | raise
85 |
86 |
87 | class DeliciousUser(object):
88 | """This class wraps all available information about a user into one object.
89 |
90 | Variables:
91 | bookmarks:
92 | A list of (url, tags, title, comment, timestamp) tuples representing
93 | a user's bookmark collection.
94 |
95 | url is a 'unicode'
96 | tags is a 'list' of 'unicode' ([] if no tags)
97 | title is a 'unicode'
98 | comment is a 'unicode' (u"" if no comment)
99 | timestamp is a 'datetime.datetime'
100 |
101 | tags (read-only property):
102 | A list of (tag, tag_count) tuples, aggregated over all a user's
103 | retrieved bookmarks. The tags represent a user's tagging vocabulary.
104 |
105 | username:
106 | The Delicious.com account name of the user.
107 |
108 | """
109 |
110 | def __init__(self, username, bookmarks=None):
111 | assert username
112 | self.username = username
113 | self.bookmarks = bookmarks or []
114 |
115 | def __str__(self):
116 | total_tag_count = 0
117 | total_tags = set()
118 | for url, tags, title, comment, timestamp in self.bookmarks:
119 | if tags:
120 | total_tag_count += len(tags)
121 | for tag in tags:
122 | total_tags.add(tag)
123 | return "[%s] %d bookmarks, %d tags (%d unique)" % \
124 | (self.username, len(self.bookmarks), total_tag_count, len(total_tags))
125 |
126 | def __repr__(self):
127 | return self.username
128 |
129 | def get_tags(self):
130 | """Returns a dictionary mapping tags to their tag count.
131 |
132 | For example, if the tag count of tag 'foo' is 23, then
133 | 23 bookmarks were annotated with 'foo'. A different way
134 | to put it is that 23 users used the tag 'foo' when
135 | bookmarking the URL.
136 |
137 | """
138 | total_tags = {}
139 | for url, tags, title, comment, timestamp in self.bookmarks:
140 | for tag in tags:
141 | total_tags[tag] = total_tags.get(tag, 0) + 1
142 | return total_tags
143 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
144 |
145 |
146 | class DeliciousURL(object):
147 | """This class wraps all available information about a web document into one object.
148 |
149 | Variables:
150 | bookmarks:
151 | A list of (user, tags, comment, timestamp) tuples, representing a
152 | document's bookmark history. Generally, this variable is populated
153 | via get_url(), so the number of bookmarks available in this variable
154 | depends on the parameters of get_url(). See get_url() for more
155 | information.
156 |
157 | user is a 'unicode'
158 | tags is a 'list' of 'unicode's ([] if no tags)
159 | comment is a 'unicode' (u"" if no comment)
160 | timestamp is a 'datetime.datetime' (granularity: creation *day*,
161 | i.e. the day but not the time of day)
162 |
163 | tags (read-only property):
164 | A list of (tag, tag_count) tuples, aggregated over all a document's
165 | retrieved bookmarks.
166 |
167 | top_tags:
168 | A list of (tag, tag_count) tuples, representing a document's so-called
169 | "top tags", i.e. the up to 10 most popular tags for this document.
170 |
171 | url:
172 | The URL of the document.
173 |
174 | hash (read-only property):
175 | The MD5 hash of the URL.
176 |
177 | title:
178 | The document's title.
179 |
180 | total_bookmarks:
181 | The number of total bookmarks (posts) of the document.
182 | Note that the value of total_bookmarks can be greater than the
183 | length of "bookmarks" depending on how much (detailed) bookmark
184 | data could be retrieved from Delicious.com.
185 |
186 | Here's some more background information:
187 | The value of total_bookmarks is the "real" number of bookmarks of
188 | URL "url" stored at Delicious.com as reported by Delicious.com
189 | itself (so it's the "ground truth"). On the other hand, the length
190 | of "bookmarks" depends on iteratively scraped bookmarking data.
191 |             Since scraping Delicious.com's Web pages has its limits in practice,
192 |             this means that DeliciousAPI will most likely not be able to retrieve
193 |             all available bookmarks. In such a case, the value reported by
194 |             total_bookmarks is greater than the length of "bookmarks".
195 |
196 | """
197 |
198 | def __init__(self, url, top_tags=None, bookmarks=None, title=u"", total_bookmarks=0):
199 | assert url
200 | self.url = url
201 | self.top_tags = top_tags or []
202 | self.bookmarks = bookmarks or []
203 | self.title = title
204 | self.total_bookmarks = total_bookmarks
205 |
206 | def __str__(self):
207 | total_tag_count = 0
208 | total_tags = set()
209 | for user, tags, comment, timestamp in self.bookmarks:
210 | if tags:
211 | total_tag_count += len(tags)
212 | for tag in tags:
213 | total_tags.add(tag)
214 | return "[%s] %d total bookmarks (= users), %d tags (%d unique), %d out of 10 max 'top' tags" % \
215 | (self.url, self.total_bookmarks, total_tag_count, \
216 | len(total_tags), len(self.top_tags))
217 |
218 | def __repr__(self):
219 | return self.url
220 |
221 | def get_tags(self):
222 | """Returns a dictionary mapping tags to their tag count.
223 |
224 | For example, if the tag count of tag 'foo' is 23, then
225 | 23 bookmarks were annotated with 'foo'. A different way
226 | to put it is that 23 users used the tag 'foo' when
227 | bookmarking the URL.
228 |
229 | @return: Dictionary mapping tags to their tag count.
230 |
231 | """
232 | total_tags = {}
233 | for user, tags, comment, timestamp in self.bookmarks:
234 | for tag in tags:
235 | total_tags[tag] = total_tags.get(tag, 0) + 1
236 | return total_tags
237 | tags = property(fget=get_tags, doc="Returns a dictionary mapping tags to their tag count")
238 |
239 | def get_hash(self):
240 | m = hashlib.md5()
241 | m.update(self.url)
242 | return m.hexdigest()
243 | hash = property(fget=get_hash, doc="Returns the MD5 hash of the URL of this document")
244 |
245 |
246 | class DeliciousAPI(object):
247 | """
248 | This class provides a custom, unofficial API to the Delicious.com service.
249 |
250 | Instead of using just the functionality provided by the official
251 | Delicious.com API (which has limited features), this class retrieves
252 | information from the Delicious.com website directly and extracts data from
253 | the Web pages.
254 |
255 | Note that Delicious.com will block clients with too many queries in a
256 | certain time frame (similar to their API throttling). So be a nice citizen
257 | and don't stress their website.
258 |
259 | """
260 |
261 | def __init__(self,
262 | http_proxy="",
263 | tries=3,
264 | wait_seconds=3,
265 | user_agent="DeliciousAPI/%s (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)" % __version__,
266 | timeout=30,
267 | ):
268 | """Set up the API module.
269 |
270 | @param http_proxy: Optional, default: "".
271 | Use an HTTP proxy for HTTP connections. Proxy support for
272 | HTTPS is not available yet.
273 | Format: "hostname:port" (e.g., "localhost:8080")
274 | @type http_proxy: str
275 |
276 | @param tries: Optional, default: 3.
277 | Try the specified number of times when downloading a monitored
278 | document fails. tries must be >= 1. See also wait_seconds.
279 | @type tries: int
280 |
281 | @param wait_seconds: Optional, default: 3.
282 | Wait the specified number of seconds before re-trying to
283 | download a monitored document. wait_seconds must be >= 0.
284 | See also tries.
285 | @type wait_seconds: int
286 |
287 |         @param user_agent: Optional, default: "DeliciousAPI/<version>
288 |             (+http://www.michael-noll.com/wiki/Del.icio.us_Python_API)".
289 |             The User-Agent HTTP header to use when querying Delicious.com.
290 | @type user_agent: str
291 |
292 | @param timeout: Optional, default: 30.
293 | Set network timeout. timeout must be >= 0.
294 | @type timeout: int
295 |
296 | """
297 | assert tries >= 1
298 | assert wait_seconds >= 0
299 | assert timeout >= 0
300 | self.http_proxy = http_proxy
301 | self.tries = tries
302 | self.wait_seconds = wait_seconds
303 | self.user_agent = user_agent
304 | self.timeout = timeout
305 | socket.setdefaulttimeout(self.timeout)
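    # Construction sketch (the proxy host/port below are placeholders; the
    # format follows the http_proxy parameter documented above):
    #
    #   api = DeliciousAPI(http_proxy="localhost:8080", tries=5, wait_seconds=5)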
306 |
307 |
308 | def _query(self, path, host="delicious.com", user=None, password=None, use_ssl=False):
309 | """Queries Delicious.com for information, specified by (query) path.
310 |
311 | @param path: The HTTP query path.
312 | @type path: str
313 |
314 | @param host: The host to query, default: "delicious.com".
315 | @type host: str
316 |
317 | @param user: The Delicious.com username if any, default: None.
318 | @type user: str
319 |
320 | @param password: The Delicious.com password of user, default: None.
321 | @type password: unicode/str
322 |
323 | @param use_ssl: Whether to use SSL encryption or not, default: False.
324 | @type use_ssl: bool
325 |
326 | @return: None on errors (i.e. on all HTTP status other than 200).
327 | On success, returns the content of the HTML response.
328 |
329 | """
330 | opener = None
331 | handlers = []
332 |
333 | # add HTTP Basic authentication if available
334 | if user and password:
335 | pwd_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
336 | pwd_mgr.add_password(None, host, user, password)
337 | basic_auth_handler = urllib2.HTTPBasicAuthHandler(pwd_mgr)
338 | handlers.append(basic_auth_handler)
339 |
340 | # add proxy support if requested
341 | if self.http_proxy:
342 | proxy_handler = urllib2.ProxyHandler({'http': 'http://%s' % self.http_proxy})
343 | handlers.append(proxy_handler)
344 |
345 | if handlers:
346 | opener = urllib2.build_opener(*handlers)
347 | else:
348 | opener = urllib2.build_opener()
349 | opener.addheaders = [('User-agent', self.user_agent)]
350 |
351 | data = None
352 | tries = self.tries
353 |
354 | if use_ssl:
355 | protocol = "https"
356 | else:
357 | protocol = "http"
358 | url = "%s://%s%s" % (protocol, host, path)
359 |
360 | while tries > 0:
361 | try:
362 | f = opener.open(url)
363 | data = f.read()
364 | f.close()
365 | break
366 | except urllib2.HTTPError, e:
367 | if e.code == 301:
368 | raise DeliciousMovedPermanentlyWarning, "Delicious.com status %s - url moved permanently" % e.code
369 |                     elif e.code == 302:
370 | raise DeliciousMovedTemporarilyWarning, "Delicious.com status %s - url moved temporarily" % e.code
371 | elif e.code == 401:
372 | raise DeliciousUnauthorizedError, "Delicious.com error %s - unauthorized (authentication failed?)" % e.code
373 | elif e.code == 403:
374 | raise DeliciousForbiddenError, "Delicious.com error %s - forbidden" % e.code
375 | elif e.code == 404:
376 | raise DeliciousNotFoundError, "Delicious.com error %s - url not found" % e.code
377 | elif e.code == 500:
378 | raise Delicious500Error, "Delicious.com error %s - server problem" % e.code
379 | elif e.code == 503 or e.code == 999:
380 | raise DeliciousThrottleError, "Delicious.com error %s - unable to process request (your IP address has been throttled/blocked)" % e.code
381 | else:
382 | raise DeliciousUnknownError, "Delicious.com error %s - unknown error" % e.code
383 | break
384 | except urllib2.URLError, e:
385 | time.sleep(self.wait_seconds)
386 | except socket.error, msg:
387 | # sometimes we get a "Connection Refused" error
388 | # wait a bit and then try again
389 | time.sleep(self.wait_seconds)
390 | #finally:
391 | # f.close()
392 | tries -= 1
393 | return data
394 |
395 |
396 | def get_url(self, url, max_bookmarks=50, sleep_seconds=1):
397 | """
398 | Returns a DeliciousURL instance representing the Delicious.com history of url.
399 |
400 | Generally, this method is what you want for getting title, bookmark, tag,
401 | and user information about a URL.
402 |
403 | Delicious only returns up to 50 bookmarks per URL. This means that
404 | we have to do subsequent queries plus parsing if we want to retrieve
405 | more than 50. Roughly speaking, the processing time of get_url()
406 | increases linearly with the number of 50-bookmarks-chunks; i.e.
407 | it will take 10 times longer to retrieve 500 bookmarks than 50.
408 |
409 | @param url: The URL of the web document to be queried for.
410 | @type url: str
411 |
412 | @param max_bookmarks: Optional, default: 50.
413 | See the documentation of get_bookmarks() for more information
414 | as get_url() uses get_bookmarks() to retrieve a url's
415 | bookmarking history.
416 | @type max_bookmarks: int
417 |
418 | @param sleep_seconds: Optional, default: 1.
419 | See the documentation of get_bookmarks() for more information
420 | as get_url() uses get_bookmarks() to retrieve a url's
421 | bookmarking history. sleep_seconds must be >= 1 to comply with
422 | Delicious.com's Terms of Use.
423 | @type sleep_seconds: int
424 |
425 | @return: DeliciousURL instance representing the Delicious.com history
426 | of url.
427 |
428 | """
429 | # we must wait at least 1 second between subsequent queries to
430 | # comply with Delicious.com's Terms of Use
431 | assert sleep_seconds >= 1
432 |
433 | document = DeliciousURL(url)
434 |
435 | m = hashlib.md5()
436 | m.update(url)
437 | hash = m.hexdigest()
438 |
439 | path = "/v2/json/urlinfo/%s" % hash
440 | data = self._query(path, host="feeds.delicious.com")
441 | if data:
442 | urlinfo = {}
443 | try:
444 | urlinfo = simplejson.loads(data)
445 | if urlinfo:
446 | urlinfo = urlinfo[0]
447 | else:
448 | urlinfo = {}
449 | except TypeError:
450 | pass
451 | try:
452 | document.title = urlinfo['title'] or u""
453 | except KeyError:
454 | pass
455 | try:
456 | top_tags = urlinfo['top_tags'] or {}
457 | if top_tags:
458 | document.top_tags = sorted(top_tags.iteritems(), key=itemgetter(1), reverse=True)
459 | else:
460 | document.top_tags = []
461 | except KeyError:
462 | pass
463 | try:
464 | document.total_bookmarks = int(urlinfo['total_posts'])
465 | except (KeyError, ValueError):
466 | pass
467 | document.bookmarks = self.get_bookmarks(url=url, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
468 |
469 |
470 | return document
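    # Usage sketch for get_url() (the URL below is only a placeholder):
    #
    #   api = DeliciousAPI()
    #   document = api.get_url("http://www.example.com/", max_bookmarks=50)
    #   print document.title, document.total_bookmarks
    #   for tag, count in document.top_tags:
    #       print "%s (%d)" % (tag, count)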
471 |
472 | def get_network(self, username):
473 | """
474 | Returns the user's list of followees and followers.
475 |
476 | Followees are users in his Delicious "network", i.e. those users whose
477 | bookmark streams he's subscribed to. Followers are his Delicious.com
478 | "fans", i.e. those users who have subscribed to the given user's
479 |         bookmark stream.
480 |
481 | Example:
482 |
483 |         A -------->   --------> C
484 |         D --------> B --------> E
485 |         F -------->   --------> F
486 |
487 |           followers      followees
488 |             of B           of B
489 |
490 | Arrows from user A to user B denote that A has subscribed to B's
491 | bookmark stream, i.e. A is "following" or "tracking" B.
492 |
493 | Note that user F is both a followee and a follower of B, i.e. F tracks
494 | B and vice versa. In Delicious.com terms, F is called a "mutual fan"
495 | of B.
496 |
497 | Comparing this network concept to information retrieval, one could say
498 | that followers are incoming links and followees outgoing links of B.
499 |
500 |         @param username: Delicious.com username for which network information is
501 | retrieved.
502 | @type username: unicode/str
503 |
504 |         @return: Tuple of two lists (followees, followers), where each list
505 | contains tuples of (username, tracking_since_timestamp).
506 | If a network is set as private, i.e. hidden from public view,
507 | (None, None) is returned.
508 | If a network is public but empty, ([], []) is returned.
509 |
510 | """
511 | assert username
512 | followees = followers = None
513 |
514 | # followees (network members)
515 | path = "/v2/json/networkmembers/%s" % username
516 | data = None
517 | try:
518 | data = self._query(path, host="feeds.delicious.com")
519 | except DeliciousForbiddenError:
520 | pass
521 | if data:
522 | followees = []
523 |
524 | users = []
525 | try:
526 | users = simplejson.loads(data)
527 | except TypeError:
528 | pass
529 |
530 | uname = tracking_since = None
531 |
532 | for user in users:
533 | # followee's username
534 | try:
535 | uname = user['user']
536 | except KeyError:
537 | pass
538 | # try to convert uname to Unicode
539 | if uname:
540 | try:
541 | # we assume UTF-8 encoding
542 | uname = uname.decode('utf-8')
543 | except UnicodeDecodeError:
544 | pass
545 | # time when the given user started tracking this user
546 | try:
547 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
548 | except KeyError:
549 | pass
550 | if uname:
551 | followees.append( (uname, tracking_since) )
552 |
553 | # followers (network fans)
554 | path = "/v2/json/networkfans/%s" % username
555 | data = None
556 | try:
557 | data = self._query(path, host="feeds.delicious.com")
558 | except DeliciousForbiddenError:
559 | pass
560 | if data:
561 | followers = []
562 |
563 | users = []
564 | try:
565 | users = simplejson.loads(data)
566 | except TypeError:
567 | pass
568 |
569 | uname = tracking_since = None
570 |
571 | for user in users:
572 | # fan's username
573 | try:
574 | uname = user['user']
575 | except KeyError:
576 | pass
577 | # try to convert uname to Unicode
578 | if uname:
579 | try:
580 | # we assume UTF-8 encoding
581 | uname = uname.decode('utf-8')
582 | except UnicodeDecodeError:
583 | pass
584 | # time when fan started tracking the given user
585 | try:
586 | tracking_since = datetime.datetime.strptime(user['dt'], "%Y-%m-%dT%H:%M:%SZ")
587 | except KeyError:
588 | pass
589 | if uname:
590 | followers.append( (uname, tracking_since) )
591 | return ( followees, followers )
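    # Usage sketch for get_network() ("someuser" is a placeholder username):
    #
    #   followees, followers = DeliciousAPI().get_network("someuser")
    #   if followees is not None:
    #       for uname, tracking_since in followees:
    #           print uname, tracking_since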
592 |
593 | def get_bookmarks(self, url=None, username=None, max_bookmarks=50, sleep_seconds=1):
594 | """
595 | Returns the bookmarks of url or user, respectively.
596 |
597 | Delicious.com only returns up to 50 bookmarks per URL on its website.
598 | This means that we have to do subsequent queries plus parsing if
599 | we want to retrieve more than 50. Roughly speaking, the processing
600 | time of get_bookmarks() increases linearly with the number of
601 | 50-bookmarks-chunks; i.e. it will take 10 times longer to retrieve
602 | 500 bookmarks than 50.
603 |
604 | @param url: The URL of the web document to be queried for.
605 | Cannot be used together with 'username'.
606 | @type url: str
607 |
608 | @param username: The Delicious.com username to be queried for.
609 | Cannot be used together with 'url'.
610 | @type username: str
611 |
612 | @param max_bookmarks: Optional, default: 50.
613 | Maximum number of bookmarks to retrieve. Set to 0 to disable
614 | this limitation/the maximum and retrieve all available
615 | bookmarks of the given url.
616 |
617 | Bookmarks are sorted so that newer bookmarks are first.
618 | Setting max_bookmarks to 50 means that get_bookmarks() will retrieve
619 | the 50 most recent bookmarks of the given url.
620 |
621 | In the case of getting bookmarks of a URL (url is set),
622 | get_bookmarks() will take *considerably* longer to run
623 | for pages with lots of bookmarks when setting max_bookmarks
624 | to a high number or when you completely disable the limit.
625 | Delicious returns only up to 50 bookmarks per result page,
626 | so for example retrieving 250 bookmarks requires 5 HTTP
627 | connections and parsing 5 HTML pages plus wait time between
628 | queries (to comply with delicious' Terms of Use; see
629 | also parameter 'sleep_seconds').
630 |
631 | In the case of getting bookmarks of a user (username is set),
632 | the same restrictions as for a URL apply with the exception
633 | that we can retrieve up to 100 bookmarks per HTTP query
634 | (instead of only up to 50 per HTTP query for a URL).
635 | @type max_bookmarks: int
636 |
637 | @param sleep_seconds: Optional, default: 1.
638 | Wait the specified number of seconds between subsequent
639 | queries in case that there are multiple pages of bookmarks
640 | for the given url. sleep_seconds must be >= 1 to comply with
641 | Delicious.com's Terms of Use.
642 | See also parameter 'max_bookmarks'.
643 | @type sleep_seconds: int
644 |
645 | @return: Returns the bookmarks of url or user, respectively.
646 | For urls, it returns a list of (user, tags, comment, timestamp)
647 | tuples.
648 | For users, it returns a list of (url, tags, title, comment,
649 | timestamp) tuples.
650 |
651 | Bookmarks are sorted "descendingly" by creation time, i.e. newer
652 | bookmarks come first.
653 |
654 | """
655 | # we must wait at least 1 second between subsequent queries to
656 | # comply with delicious' Terms of Use
657 | assert sleep_seconds >= 1
658 |
659 | # url XOR username
660 | assert bool(username) is not bool(url)
661 |
662 | # maximum number of urls/posts Delicious.com will display
663 | # per page on its website
664 | max_html_count = 100
665 | # maximum number of pages that Delicious.com will display;
666 | # currently, the maximum number of pages is 20. Delicious.com
667 |             # allows you to go beyond page 20 via pagination, but page N (for
668 | # N > 20) will always display the same content as page 20.
669 | max_html_pages = 20
670 |
671 | path = None
672 | if url:
673 | m = hashlib.md5()
674 | m.update(url)
675 | hash = m.hexdigest()
676 |
677 |             # path will change later on if there are multiple pages of bookmarks
678 | # for the given url
679 | path = "/url/%s" % hash
680 | elif username:
681 |             # path will change later on if there are multiple pages of bookmarks
682 | # for the given username
683 | path = "/%s?setcount=%d" % (username, max_html_count)
684 | else:
685 | raise Exception('You must specify either url or user.')
686 |
687 | page_index = 1
688 | bookmarks = []
689 | while path and page_index <= max_html_pages:
690 | data = self._query(path)
691 | path = None
692 | if data:
693 | # extract bookmarks from current page
694 | if url:
695 | bookmarks.extend(self._extract_bookmarks_from_url_history(data))
696 | else:
697 | bookmarks.extend(self._extract_bookmarks_from_user_history(data))
698 |
699 | # stop scraping if we already have as many bookmarks as we want
700 | if (len(bookmarks) >= max_bookmarks) and max_bookmarks != 0:
701 | break
702 | else:
703 | # check if there are multiple pages of bookmarks for this
704 | # url on Delicious.com
705 | soup = BeautifulSoup(data)
706 | paginations = soup.findAll("div", id="pagination")
707 | if paginations:
708 | # find next path
709 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" })
710 | if nexts and (max_bookmarks == 0 or len(bookmarks) < max_bookmarks) and len(bookmarks) > 0:
711 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2
712 | path = nexts[0]['href']
713 | if username:
714 | path += "&setcount=%d" % max_html_count
715 | page_index += 1
716 | # wait one second between queries to be compliant with
717 | # delicious' Terms of Use
718 | time.sleep(sleep_seconds)
719 | if max_bookmarks > 0:
720 | return bookmarks[:max_bookmarks]
721 | else:
722 | return bookmarks
723 |
724 |
725 | def _extract_bookmarks_from_url_history(self, data):
726 | """
727 | Extracts user bookmarks from a URL's history page on Delicious.com.
728 |
729 | The Python library BeautifulSoup is used to parse the HTML page.
730 |
731 | @param data: The HTML source of a URL history Web page on Delicious.com.
732 | @type data: str
733 |
734 | @return: list of user bookmarks of the corresponding URL
735 |
736 | """
737 | bookmarks = []
738 | soup = BeautifulSoup(data)
739 |
740 | bookmark_elements = soup.findAll("div", attrs={"class": re.compile("^bookmark\s*")})
741 | timestamp = None
742 | for bookmark_element in bookmark_elements:
743 |
744 | # extract bookmark creation time
745 | #
746 | # this timestamp has to "persist" until a new timestamp is
747 | # found (delicious only provides the creation time data for the
748 | # first bookmark in the list of bookmarks for a given day
749 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
750 | if dategroups:
751 | spans = dategroups[0].findAll('span')
752 | if spans:
753 | date_str = spans[0].contents[0].strip()
754 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
755 |
756 | # extract comments
757 | comment = u""
758 | datas = bookmark_element.findAll("div", attrs={"class": "data"})
759 | if datas:
760 | divs = datas[0].findAll("div", attrs={"class": "description"})
761 | if divs:
762 | comment = divs[0].contents[0].strip()
763 |
764 | # extract tags
765 | user_tags = []
766 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
767 | if tagdisplays:
768 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
769 | for a in aset:
770 | tag = a.contents[0]
771 | user_tags.append(tag)
772 |
773 |             user = u""  # extract user information; may be missing (see workaround below)
774 | metas = bookmark_element.findAll("div", attrs={"class": "meta"})
775 | if metas:
776 | links = metas[0].findAll("a", attrs={"class": "user user-tag"})
777 | if links:
778 | try:
779 | user = links[0]['href'][1:]
780 | except IndexError:
781 | # WORKAROUND: it seems there is a bug on Delicious.com where
782 | # sometimes a bookmark is shown in a URL history without any
783 | # associated Delicious username (username is empty); this could
784 | # be caused by special characters in the username or other things
785 | #
786 | # this problem of Delicious is very rare, so we just skip such
787 | # entries until they find a fix
788 | pass
789 |             if user:
790 |                 bookmarks.append( (user, user_tags, comment, timestamp) )
790 |
791 | return bookmarks
792 |
793 | def _extract_bookmarks_from_user_history(self, data):
794 | """
795 | Extracts a user's bookmarks from his user page on Delicious.com.
796 |
797 | The Python library BeautifulSoup is used to parse the HTML page.
798 |
799 | @param data: The HTML source of a user page on Delicious.com.
800 | @type data: str
801 |
802 | @return: list of bookmarks of the corresponding user
803 |
804 | """
805 | bookmarks = []
806 | soup = BeautifulSoup(data)
807 |
808 | ul = soup.find("ul", id="bookmarklist")
809 | if ul:
810 | bookmark_elements = ul.findAll("div", attrs={"class": re.compile("^bookmark\s*")})
811 | timestamp = None
812 | for bookmark_element in bookmark_elements:
813 |
814 | # extract bookmark creation time
815 | #
816 | # this timestamp has to "persist" until a new timestamp is
817 | # found (delicious only provides the creation time data for the
818 | # first bookmark in the list of bookmarks for a given day
819 | dategroups = bookmark_element.findAll("div", attrs={"class": "dateGroup"})
820 | if dategroups:
821 | spans = dategroups[0].findAll('span')
822 | if spans:
823 | date_str = spans[0].contents[0].strip()
824 | timestamp = datetime.datetime.strptime(date_str, '%d %b %y')
825 |
826 | # extract url, title and comments
827 | url = u""
828 | title = u""
829 | comment = u""
830 | datas = bookmark_element.findAll("div", attrs={"class": "data"})
831 | if datas:
832 | links = datas[0].findAll("a", attrs={"class": re.compile("^taggedlink\s*")})
833 | if links and links[0].contents:
834 | title = links[0].contents[0].strip()
835 | url = links[0]['href']
836 | divs = datas[0].findAll("div", attrs={"class": "description"})
837 | if divs:
838 | comment = divs[0].contents[0].strip()
839 |
840 | # extract tags
841 | url_tags = []
842 | tagdisplays = bookmark_element.findAll("div", attrs={"class": "tagdisplay"})
843 | if tagdisplays:
844 | aset = tagdisplays[0].findAll("a", attrs={"class": "tag noplay"})
845 | for a in aset:
846 | tag = a.contents[0]
847 | url_tags.append(tag)
848 |
849 | bookmarks.append( (url, url_tags, title, comment, timestamp) )
850 |
851 | return bookmarks
852 |
853 |
854 | def get_user(self, username, password=None, max_bookmarks=50, sleep_seconds=1):
855 | """Retrieves a user's bookmarks from Delicious.com.
856 |
857 | If a correct username AND password are supplied, a user's *full*
858 | bookmark collection (which also includes private bookmarks) is
859 | retrieved. Data communication is encrypted using SSL in this case.
860 |
861 | If no password is supplied, only the *public* bookmarks of the user
862 | are retrieved. Here, the parameter 'max_bookmarks' specifies how
863 | many public bookmarks will be retrieved (default: 50). Set the
864 | parameter to 0 to retrieve all public bookmarks.
865 |
866 | This function can be used to backup all of a user's bookmarks if
867 | called with a username and password.
868 |
869 | @param username: The Delicious.com username.
870 | @type username: str
871 |
872 | @param password: Optional, default: None.
873 | The user's Delicious.com password. If password is set,
874 | all communication with Delicious.com is SSL-encrypted.
875 | @type password: unicode/str
876 |
877 | @param max_bookmarks: Optional, default: 50.
878 | See the documentation of get_bookmarks() for more
879 |             information as get_user() uses get_bookmarks() to
880 |             retrieve a user's bookmarking history.
881 | The parameter is NOT used when a password is specified
882 | because in this case the *full* bookmark collection of
883 | a user will be retrieved.
884 | @type max_bookmarks: int
885 |
886 | @param sleep_seconds: Optional, default: 1.
887 | See the documentation of get_bookmarks() for more information as
888 |             get_user() uses get_bookmarks() to retrieve a user's bookmarking
889 |             history. sleep_seconds must be >= 1 to comply with Delicious.com's
890 | Terms of Use.
891 | @type sleep_seconds: int
892 |
893 | @return: DeliciousUser instance
894 |
895 | """
896 | assert username
897 | user = DeliciousUser(username)
898 | bookmarks = []
899 | if password:
900 | # We have username AND password, so we call
901 | # the official Delicious.com API.
902 | path = "/v1/posts/all"
903 | data = self._query(path, host="api.del.icio.us", use_ssl=True, user=username, password=password)
904 | if data:
905 | soup = BeautifulSoup(data)
906 | elements = soup.findAll("post")
907 | for element in elements:
908 | url = element["href"]
909 | title = element["description"] or u""
910 | comment = element["extended"] or u""
911 | tags = []
912 | if element["tag"]:
913 | tags = element["tag"].split()
914 | timestamp = datetime.datetime.strptime(element["time"], "%Y-%m-%dT%H:%M:%SZ")
915 | bookmarks.append( (url, tags, title, comment, timestamp) )
916 | user.bookmarks = bookmarks
917 | else:
918 | # We have only the username, so we extract data from
919 | # the user's JSON feed. However, the feed is restricted
920 |             # to the user's most recent public bookmarks (about 100 at most).
921 |             # So if we need more than 100, we start scraping the
922 |             # Delicious.com website directly.
923 | if max_bookmarks > 0 and max_bookmarks <= 100:
924 | path = "/v2/json/%s?count=100" % username
925 | data = self._query(path, host="feeds.delicious.com", user=username)
926 | if data:
927 | posts = []
928 | try:
929 | posts = simplejson.loads(data)
930 | except TypeError:
931 | pass
932 |
933 | url = timestamp = None
934 | title = comment = u""
935 | tags = []
936 |
937 | for post in posts:
938 | # url
939 | try:
940 | url = post['u']
941 | except KeyError:
942 | pass
943 | # title
944 | try:
945 | title = post['d']
946 | except KeyError:
947 | pass
948 | # tags
949 | try:
950 | tags = post['t']
951 | except KeyError:
952 | pass
953 | if not tags:
954 | tags = [u"system:unfiled"]
955 | # comment / notes
956 | try:
957 | comment = post['n']
958 | except KeyError:
959 | pass
960 | # bookmark creation time
961 | try:
962 | timestamp = datetime.datetime.strptime(post['dt'], "%Y-%m-%dT%H:%M:%SZ")
963 | except KeyError:
964 | pass
965 | bookmarks.append( (url, tags, title, comment, timestamp) )
966 | user.bookmarks = bookmarks[:max_bookmarks]
967 | else:
968 | # TODO: retrieve the first 100 bookmarks via JSON before
969 |                 # falling back to scraping the delicious.com website
970 | user.bookmarks = self.get_bookmarks(username=username, max_bookmarks=max_bookmarks, sleep_seconds=sleep_seconds)
971 | return user
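    # Usage sketch for get_user() (username and password are placeholders;
    # with a password the full collection is fetched via the official API
    # over SSL, without one only public bookmarks are returned):
    #
    #   api = DeliciousAPI()
    #   public_user = api.get_user("someuser", max_bookmarks=100)
    #   full_user = api.get_user("someuser", password="secret")
    #   for url, tags, title, comment, timestamp in full_user.bookmarks:
    #       print url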
972 |
973 | def get_urls(self, tag=None, popular=True, max_urls=100, sleep_seconds=1):
974 | """
975 | Returns the list of recent URLs (of web documents) tagged with a given tag.
976 |
977 | This is very similar to parsing Delicious' RSS/JSON feeds directly,
978 | but this function will return up to 2,000 links compared to a maximum
979 | of 100 links when using the official feeds (with query parameter
980 | count=100).
981 |
982 | The return list of links will be sorted by recency in descending order,
983 | i.e. newest items first.
984 |
985 | Note that even when setting max_urls, get_urls() cannot guarantee that
986 | it can retrieve *at least* this many URLs. It is really just an upper
987 | bound.
988 |
989 | @param tag: Retrieve links which have been tagged with the given tag.
990 | If tag is not set (default), links will be retrieved from the
991 | Delicious.com front page (aka "delicious hotlist").
992 | @type tag: unicode/str
993 |
994 |         @param popular: If true (default), retrieve only popular links (i.e.
995 |             /popular/<tag>). Otherwise, the most recent links tagged with
996 |             the given tag will be retrieved (i.e. /tag/<tag>).
997 |
998 | As of January 2009, it seems that Delicious.com modified the list
999 | of popular tags to contain only up to a maximum of 15 URLs.
1000 | This also means that setting max_urls to values larger than 15
1001 | will not change the results of get_urls().
1002 | So if you are interested in more URLs, set the "popular" parameter
1003 | to false.
1004 |
1005 | Note that if you set popular to False, the returned list of URLs
1006 | might contain duplicate items. This is due to the way Delicious.com
1007 |             creates its /tag/<tag> Web pages. So if you need a certain
1008 | number of unique URLs, you have to take care of that in your
1009 | own code.
1010 | @type popular: bool
1011 |
1012 | @param max_urls: Retrieve at most max_urls links. The default is 100,
1013 | which is the maximum number of links that can be retrieved by
1014 | parsing the official JSON feeds. The maximum value of max_urls
1015 | in practice is 2000 (currently). If it is set higher, Delicious
1016 | will return the same links over and over again, giving lots of
1017 | duplicate items.
1018 | @type max_urls: int
1019 |
1020 | @param sleep_seconds: Optional, default: 1.
1021 | Wait the specified number of seconds between subsequent queries in
1022 | case that there are multiple pages of bookmarks for the given url.
1023 | Must be greater than or equal to 1 to comply with Delicious.com's
1024 | Terms of Use.
1025 | See also parameter 'max_urls'.
1026 | @type sleep_seconds: int
1027 |
1028 | @return: The list of recent URLs (of web documents) tagged with a given tag.
1029 |
1030 | """
1031 | assert sleep_seconds >= 1
1032 | urls = []
1033 | path = None
1034 | if tag is None or (tag is not None and max_urls > 0 and max_urls <= 100):
1035 | # use official JSON feeds
1036 | max_json_count = 100
1037 | if tag:
1038 | # tag-specific JSON feed
1039 | if popular:
1040 | path = "/v2/json/popular/%s?count=%d" % (tag, max_json_count)
1041 | else:
1042 | path = "/v2/json/tag/%s?count=%d" % (tag, max_json_count)
1043 | else:
1044 | # Delicious.com hotlist
1045 | path = "/v2/json/?count=%d" % (max_json_count)
1046 | data = self._query(path, host="feeds.delicious.com")
1047 | if data:
1048 | posts = []
1049 | try:
1050 | posts = simplejson.loads(data)
1051 | except TypeError:
1052 | pass
1053 |
1054 | for post in posts:
1055 | # url
1056 | try:
1057 | url = post['u']
1058 | if url:
1059 | urls.append(url)
1060 | except KeyError:
1061 | pass
1062 | else:
1063 | # maximum number of urls/posts Delicious.com will display
1064 | # per page on its website
1065 | max_html_count = 100
1066 | # maximum number of pages that Delicious.com will display;
1067 | # currently, the maximum number of pages is 20. Delicious.com
1068 |             # allows you to go beyond page 20 via pagination, but page N (for
1069 | # N > 20) will always display the same content as page 20.
1070 | max_html_pages = 20
1071 |
1072 | if popular:
1073 | path = "/popular/%s?setcount=%d" % (tag, max_html_count)
1074 | else:
1075 | path = "/tag/%s?setcount=%d" % (tag, max_html_count)
1076 |
1077 | page_index = 1
1078 | urls = []
1079 | while path and page_index <= max_html_pages:
1080 | data = self._query(path)
1081 | path = None
1082 | if data:
1083 | # extract urls from current page
1084 | soup = BeautifulSoup(data)
1085 | links = soup.findAll("a", attrs={"class": re.compile("^taggedlink\s*")})
1086 | for link in links:
1087 | try:
1088 | url = link['href']
1089 | if url:
1090 | urls.append(url)
1091 | except KeyError:
1092 | pass
1093 |
1094 |                     # check if there are multiple pages of urls
1095 | soup = BeautifulSoup(data)
1096 | paginations = soup.findAll("div", id="pagination")
1097 | if paginations:
1098 | # find next path
1099 | nexts = paginations[0].findAll("a", attrs={ "class": "pn next" })
1100 | if nexts and (max_urls == 0 or len(urls) < max_urls) and len(urls) > 0:
1101 | # e.g. /url/2bb293d594a93e77d45c2caaf120e1b1?show=all&page=2
1102 | path = nexts[0]['href']
1103 | path += "&setcount=%d" % max_html_count
1104 | page_index += 1
1105 | # wait between queries to Delicious.com to be
1106 | # compliant with its Terms of Use
1107 | time.sleep(sleep_seconds)
1108 | if max_urls > 0:
1109 | return urls[:max_urls]
1110 | else:
1111 | return urls
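    # Usage sketch for get_urls() (the tag "python" is just an example):
    #
    #   api = DeliciousAPI()
    #   recent = api.get_urls(tag="python", popular=False, max_urls=200)
    #   hotlist = api.get_urls()   # front page a.k.a. "hotlist"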
1112 |
1113 |
1114 | def get_tags_of_user(self, username):
1115 | """
1116 | Retrieves user's public tags and their tag counts from Delicious.com.
1117 | The tags represent a user's full public tagging vocabulary.
1118 |
1119 | DeliciousAPI uses the official JSON feed of the user. We could use
1120 | RSS here, but the JSON feed has proven to be faster in practice.
1121 |
1122 | @param username: The Delicious.com username.
1123 | @type username: str
1124 |
1125 | @return: Dictionary mapping tags to their tag counts.
1126 |
1127 | """
1128 | tags = {}
1129 | path = "/v2/json/tags/%s" % username
1130 | data = self._query(path, host="feeds.delicious.com")
1131 | if data:
1132 | try:
1133 | tags = simplejson.loads(data)
1134 | except TypeError:
1135 | pass
1136 | return tags
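    # Usage sketch for get_tags_of_user() ("someuser" is a placeholder):
    #
    #   tags = DeliciousAPI().get_tags_of_user("someuser")
    #   for tag, count in sorted(tags.iteritems(), key=lambda item: item[1], reverse=True):
    #       print "%s: %d" % (tag, count)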
1137 |
1138 | def get_number_of_users(self, url):
1139 | """get_number_of_users() is obsolete and has been removed. Please use get_url() instead."""
1140 | reason = "get_number_of_users() is obsolete and has been removed. Please use get_url() instead."
1141 | raise Exception(reason)
1142 |
1143 | def get_common_tags_of_url(self, url):
1144 | """get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead."""
1145 | reason = "get_common_tags_of_url() is obsolete and has been removed. Please use get_url() instead."
1146 | raise Exception(reason)
1147 |
1148 | def _html_escape(self, s):
1149 | """HTML-escape a string or object.
1150 |
1151 | This converts any non-string objects passed into it to strings
1152 | (actually, using unicode()). All values returned are
1153 |         non-unicode strings (using "&#num;" entities for all non-ASCII
1154 | characters).
1155 |
1156 | None is treated specially, and returns the empty string.
1157 |
1158 | @param s: The string that needs to be escaped.
1159 | @type s: str
1160 |
1161 | @return: The escaped string.
1162 |
1163 | """
1164 | if s is None:
1165 | return ''
1166 | if not isinstance(s, basestring):
1167 | if hasattr(s, '__unicode__'):
1168 | s = unicode(s)
1169 | else:
1170 | s = str(s)
1171 | s = cgi.escape(s, True)
1172 | if isinstance(s, unicode):
1173 | s = s.encode('ascii', 'xmlcharrefreplace')
1174 | return s
1175 |
1176 |
1177 | class DeliciousError(Exception):
1178 | """Used to indicate that an error occurred when trying to access Delicious.com via its API."""
1179 |
1180 | class DeliciousWarning(Exception):
1181 | """Used to indicate a warning when trying to access Delicious.com via its API.
1182 |
1183 | Warnings are raised when it is useful to alert the user of some condition
1184 | where that condition doesn't warrant raising an exception and terminating
1185 |     the program. For example, we issue a warning when Delicious.com returns an
1186 | HTTP status code for redirections (3xx).
1187 | """
1188 |
1189 | class DeliciousThrottleError(DeliciousError):
1190 | """Used to indicate that the client computer (i.e. its IP address) has been temporarily blocked by Delicious.com."""
1191 | pass
1192 |
1193 | class DeliciousUnknownError(DeliciousError):
1194 | """Used to indicate that Delicious.com returned an (HTTP) error which we don't know how to handle yet."""
1195 | pass
1196 |
1197 | class DeliciousUnauthorizedError(DeliciousError):
1198 | """Used to indicate that Delicious.com returned a 401 Unauthorized error.
1199 |
1200 | Most of the time, the user credentials for accessing restricted functions
1201 | of the official Delicious.com API are incorrect.
1202 |
1203 | """
1204 | pass
1205 |
1206 | class DeliciousForbiddenError(DeliciousError):
1207 | """Used to indicate that Delicious.com returned a 403 Forbidden error.
1208 | """
1209 | pass
1210 |
1211 |
1212 | class DeliciousNotFoundError(DeliciousError):
1213 | """Used to indicate that Delicious.com returned a 404 Not Found error.
1214 |
1215 | Most of the time, retrying some seconds later fixes the problem
1216 | (because we only query existing pages with this API).
1217 |
1218 | """
1219 | pass
1220 |
1221 | class Delicious500Error(DeliciousError):
1222 | """Used to indicate that Delicious.com returned a 500 error.
1223 |
1224 | Most of the time, retrying some seconds later fixes the problem.
1225 |
1226 | """
1227 | pass
1228 |
1229 | class DeliciousMovedPermanentlyWarning(DeliciousWarning):
1230 | """Used to indicate that Delicious.com returned a 301 Found (Moved Permanently) redirection."""
1231 | pass
1232 |
1233 | class DeliciousMovedTemporarilyWarning(DeliciousWarning):
1234 | """Used to indicate that Delicious.com returned a 302 Found (Moved Temporarily) redirection."""
1235 | pass
1236 |
1237 | __all__ = ['DeliciousAPI', 'DeliciousUser', 'DeliciousURL', 'DeliciousError', 'DeliciousThrottleError', 'DeliciousUnauthorizedError', 'DeliciousUnknownError', 'DeliciousNotFoundError', 'DeliciousForbiddenError', 'Delicious500Error', 'DeliciousWarning', 'DeliciousMovedPermanentlyWarning', 'DeliciousMovedTemporarilyWarning']
1238 |
1239 | if __name__ == "__main__":
1240 | d = DeliciousAPI()
1241 | max_bookmarks = 50
1242 | url = 'http://www.michael-noll.com/wiki/Del.icio.us_Python_API'
1243 | print "Retrieving Delicious.com information about url"
1244 | print "'%s'" % url
1245 | print "Note: This might take some time..."
1246 | print "========================================================="
1247 | document = d.get_url(url, max_bookmarks=max_bookmarks)
1248 | print document
1249 |
--------------------------------------------------------------------------------