├── .gitignore
├── MANIFEST
├── setup.py
├── duckduckgo.py
├── ddg
└── README.rst


/.gitignore:
--------------------------------------------------------------------------------
build/
dist/
*.pyc
*.swp

--------------------------------------------------------------------------------
/MANIFEST:
--------------------------------------------------------------------------------
duckduckgo.py
setup.py
ddg
README.rst

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from distutils.core import setup

# Installs the duckduckgo module and copies the ddg script to /usr/bin.
setup(name='duckduckgo',
      version='0.1.0',
      py_modules=['duckduckgo'],
      data_files=[('/usr/bin', ['ddg'])])

--------------------------------------------------------------------------------
/duckduckgo.py:
--------------------------------------------------------------------------------
import requests
from lxml import html
import time


def search(keywords, max_results=None):
    """Yield result URLs scraped from the DuckDuckGo HTML results pages."""
    url = 'https://duckduckgo.com/html/'
    params = {
        'q': keywords,
        's': '0',
    }

    yielded = 0
    while True:
        res = requests.post(url, data=params)
        doc = html.fromstring(res.text)

        results = [a.get('href') for a in doc.cssselect('#links .links_main a')]
        for result in results:
            yield result
            # Throttle: pause 100 ms after each result to go easy on the servers.
            time.sleep(0.1)
            yielded += 1
            if max_results and yielded >= max_results:
                return

        # Follow the last "more results" form; stop when the page has none.
        try:
            form = doc.cssselect('.results_links_more form')[-1]
        except IndexError:
            return
        params = dict(form.fields)

--------------------------------------------------------------------------------
/ddg:
--------------------------------------------------------------------------------
#!/usr/bin/env python
import duckduckgo
import sys
import getopt


def usage():
    sys.stderr.write('Usage: ddg [options] \n')
    sys.stderr.write(' -n N return N search results\n')


def main():
    # 0 means "no limit": search() treats a falsy max_results as unbounded.
    max_results = 0

    try:
        optlist, args = getopt.getopt(sys.argv[1:], 'n:')
    except getopt.GetoptError as err:
        sys.stderr.write(str(err) + '\n')
        sys.exit(2)

    for o, a in optlist:
        if o == '-n':
            max_results = int(a)

    if len(args) < 1:
        usage()
        sys.exit(2)

    # Join the remaining arguments into a single query string.
    keywords = ' '.join(args)

    for result in duckduckgo.search(keywords, max_results=max_results):
        print(result)


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
duckduckgo.py
=============

duckduckgo.py is a simple Python module that scrapes DuckDuckGo search results. The install script also installs a ``ddg`` command-line utility that can be used conveniently in a shell pipeline.

A word of warning
-----------------

This code is intended as a demonstration and, like all scraping utilities, should be used with great caution. By default the code pauses for 100 milliseconds each time it yields a result, to avoid overloading the DDG servers.

Usage
-----

It can be used as a Python module:

.. code-block:: pycon

    >>> import duckduckgo
    >>> for link in duckduckgo.search('duckduckgo', max_results=10):
    ...     print(link)
    ...
    https://duckduckgo.com/
    https://en.wikipedia.org/wiki/DuckDuckGo
    https://duckduckgo.com/about.html
    [...]


Or as a command-line tool:

.. code-block:: bash

    $ ddg -n 10 duckduckgo


Installation
------------

.. code-block:: bash

    $ python setup.py install

--------------------------------------------------------------------------------
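The ``search`` generator in duckduckgo.py combines three ideas: yield results one at a time, stop once ``max_results`` is reached, and otherwise follow the next-page form until none is left. The sketch below isolates that control flow so it can be run without network access. It is an illustration only, not part of the repository: ``paginate``, ``fetch_page``, ``fake_fetch``, ``FAKE_PAGES`` and the example.com URLs are all hypothetical stand-ins for the HTTP request and HTML parsing.

```python
import time


def paginate(fetch_page, params, max_results=None, delay=0.0):
    """Yield results page by page until the cap is hit or pages run out.

    fetch_page is a hypothetical callable: params -> (results, next_params),
    where next_params is None when there is no further page.
    """
    yielded = 0
    while True:
        results, next_params = fetch_page(params)
        for result in results:
            yield result
            time.sleep(delay)  # throttle between results, as search() does
            yielded += 1
            # A falsy max_results (None or 0) means "no limit".
            if max_results and yielded >= max_results:
                return
        if next_params is None:
            return
        params = next_params


# Stand-in for the HTTP request: three fake pages of two links each.
FAKE_PAGES = {
    0: (['https://example.com/a', 'https://example.com/b'], {'s': 1}),
    1: (['https://example.com/c', 'https://example.com/d'], {'s': 2}),
    2: (['https://example.com/e', 'https://example.com/f'], None),
}


def fake_fetch(params):
    """Return the fake page selected by the 's' offset parameter."""
    return FAKE_PAGES[params['s']]


first_three = list(paginate(fake_fetch, {'s': 0}, max_results=3))
everything = list(paginate(fake_fetch, {'s': 0}))
```

Because the generator is lazy, a caller that stops iterating early never triggers the request for the next page; the ``max_results`` cap behaves the same way from inside the loop.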