├── .gitignore
├── README.md
├── README.rst
├── deploy.py
├── requests_viewer
│   ├── __init__.py
│   ├── image.py
│   ├── js.py
│   ├── main.py
│   ├── web.py
│   └── web_compat.py
├── resources
│   └── screenshot1.png
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   └── run_test.py
└── tox.ini

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.pyc
*#*
*.DS_STORE
*.log
*Data.fs*
*flymake*
*egg*
build/
__pycache__/
/.Python
/bin/
/include/
/lib/
/pip-selfcheck.json
.tox/
comments/
dist/
*silly*
extras/
.cache/

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## requests_viewer

`requests_viewer` quickly shows information about your `requests` response objects.

It opens an HTML page with information about the request.

```python
import requests
from requests_viewer.web import view_request
view_request(requests.get("https://xkcd.com/"))

# or

from requests_viewer import main
main("https://xkcd.com/")  # dispatches on the response's mime type
```

### Main features

- The HTML page is shown exactly as the crawler sees it
  * Can extract the domain and hot-link assets so that the result looks almost indistinguishable from the original
- Contains other convenient functions to show lxml tree nodes
- Can visually show diffs between two HTML pages / trees

```python
from requests_viewer.web import view_diff_tree, get_tree
url1 = "http://xkcd.com/"
url2 = "http://xkcd.com/1/"
tree1 = get_tree(url1)  # fetches the URL and builds a tree
tree2 = get_tree(url2)  # use `make_tree` instead if you already have the HTML
view_diff_tree(tree1, tree2)
```

Results in:
<img src="resources/screenshot1.png">
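If you only want the raw diff markup without opening a browser, the idea can be illustrated with the standard library's `difflib` (this is just an illustration; `requests_viewer` itself renders diffs with `lxml.html.diff.htmldiff`):

```python
import difflib

html1 = "<p>first</p>\n<p>same</p>\n"
html2 = "<p>second</p>\n<p>same</p>\n"

# Line-by-line unified diff of the two HTML snippets
diff = "".join(difflib.unified_diff(
    html1.splitlines(keepends=True),
    html2.splitlines(keepends=True),
    fromfile="page1", tofile="page2"))
print(diff)
```

Changed lines show up prefixed with `-` and `+`, which is roughly what the `<del>`/`<ins>` highlighting above visualizes.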

### Installation

    pip install requests_viewer
    pip3 install requests_viewer

Note that in order to do the real fancy stuff, you should install:

    pip install requests_viewer[fancy]
    pip3 install requests_viewer[fancy]

This will install `lxml`, `bs4` and `tldextract`.

### Types it can currently show

- text/html
- image/*
- application/json

### API overview

Some of the functions `web.py` exposes:

```python
def slugify(value):
def view_request(r, domain=None):
def view_html(x):
def view_node(node, attach_head=False, question_contains=None):
def view_tree(node):
def view_diff_tree(tree1, tree2, url, diff_method):
def view_diff_html(html1, html2, url, diff_method):
def view_diff(html1, html2, tree1, tree2, url, diff_method):
def make_parent_line(node, attach_head=False, question_contains=None):
def extract_domain(url):
def make_tree(html, domain=None):
def get_tree(url, domain=None):
def get_local_tree(url, domain=None):
```

### Contribute

This package is very small at the moment. I very much encourage you to contribute:

- Most likely we will want to show the response headers at the top of the generated HTML page
- Make the encoding an argument (instead of fixed utf8)

Note that I use [yapf](https://github.com/google/yapf) with max-line=100 to avoid any styling discussion.

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
## requests_viewer

`requests_viewer` quickly shows information about your `requests` response objects.

It opens an HTML page with information about the request.

```python
import requests
from requests_viewer.web import view_request
view_request(requests.get("https://xkcd.com/"))

# or

from requests_viewer import main
main("https://xkcd.com/")  # dispatches on the response's mime type
```

### Main features

- The HTML page is shown exactly as the crawler sees it
  * Can extract the domain and hot-link assets so that the result looks almost indistinguishable from the original
- Contains other convenient functions to show lxml tree nodes
- Can visually show diffs between two HTML pages / trees

```python
from requests_viewer.web import view_diff_tree, get_tree
url1 = "http://xkcd.com/"
url2 = "http://xkcd.com/1/"
tree1 = get_tree(url1)  # fetches the URL and builds a tree
tree2 = get_tree(url2)  # use `make_tree` instead if you already have the HTML
view_diff_tree(tree1, tree2)
```

Results in:
<img src="resources/screenshot1.png">

### Installation

    pip install requests_viewer
    pip3 install requests_viewer

Note that in order to do the real fancy stuff, you should install:

    pip install requests_viewer[fancy]
    pip3 install requests_viewer[fancy]

This will install `lxml`, `bs4` and `tldextract`.

### Types it can currently show

- text/html
- image/*
- application/json

### API overview

Some of the functions `web.py` exposes:

```python
def slugify(value):
def view_request(r, domain=None):
def view_html(x):
def view_node(node, attach_head=False, question_contains=None):
def view_tree(node):
def view_diff_tree(tree1, tree2, url, diff_method):
def view_diff_html(html1, html2, url, diff_method):
def view_diff(html1, html2, tree1, tree2, url, diff_method):
def make_parent_line(node, attach_head=False, question_contains=None):
def extract_domain(url):
def make_tree(html, domain=None):
def get_tree(url, domain=None):
def get_local_tree(url, domain=None):
```

### Contribute

This package is very small at the moment.
I very much encourage you to contribute:

- Most likely we will want to show the response headers at the top of the generated HTML page

--------------------------------------------------------------------------------
/deploy.py:
--------------------------------------------------------------------------------
""" File unrelated to the package, except for convenience in deploying """
import re
import sh
import os

# Use the total commit count as the micro version number
commit_count = sh.git('rev-list', ['--all']).count('\n')

with open('setup.py') as f:
    setup = f.read()

setup = re.sub("MICRO_VERSION = '[0-9]+'", "MICRO_VERSION = '{}'".format(commit_count), setup)

major = re.search("MAJOR_VERSION = '([0-9]+)'", setup).groups()[0]
minor = re.search("MINOR_VERSION = '([0-9]+)'", setup).groups()[0]
micro = re.search("MICRO_VERSION = '([0-9]+)'", setup).groups()[0]
version = '{}.{}.{}'.format(major, minor, micro)

with open('setup.py', 'w') as f:
    f.write(setup)

# Keep the version in requests_viewer/__init__.py in sync with setup.py
# (the pattern uses single quotes to match the string in __init__.py)
with open('requests_viewer/__init__.py') as f:
    init = f.read()

with open('requests_viewer/__init__.py', 'w') as f:
    f.write(
        re.sub("__version__ = '[0-9.]+'",
               "__version__ = '{}'".format(version), init))

py_version = "python3.5" if sh.which("python3.5") is not None else "python"
os.system('{} setup.py sdist bdist_wheel upload'.format(py_version))

--------------------------------------------------------------------------------
/requests_viewer/__init__.py:
--------------------------------------------------------------------------------
""" requests_viewer: quickly shows what requests responses look like """

__project__ = 'requests_viewer'
__version__ = '0.0.1'

from requests_viewer.main import main

try:
    from requests_viewer.web import get_tree
    from requests_viewer.web import view_tree
except ImportError:
    print("Cannot import `lxml`, limited functionality.")

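The image viewer below works by embedding the raw response bytes directly in the page as a base64 data URI, so nothing has to be written to a separate image file. A minimal stdlib sketch of that trick (`to_data_uri` is an illustrative name, not part of the package):

```python
import base64


def to_data_uri(content_type, raw_bytes):
    # Browsers render data URIs inline, so the bytes need no separate file
    b64 = base64.b64encode(raw_bytes).decode("utf8")
    return '<img src="data:{};base64,{}">'.format(content_type, b64)


# e.g. the 8-byte PNG magic number as a (truncated) payload
html = to_data_uri("image/png", b"\x89PNG\r\n\x1a\n")
```

Writing this string into a temporary `.html` file and opening it in a browser is exactly the pattern `image.py` follows.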
--------------------------------------------------------------------------------
/requests_viewer/image.py:
--------------------------------------------------------------------------------
import base64
import tempfile
import time
import webbrowser


def wrap_img_into_html(content_type, x):
    # Embed the image bytes as a base64 data URI
    return '<img src="data:{};base64,{}">'.format(content_type, x.decode("utf8"))


def view_request(r):
    with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
        f.write(wrap_img_into_html(r.headers['Content-Type'], base64.b64encode(r.content)))
        f.flush()
        webbrowser.open('file://' + f.name)
        time.sleep(1)

--------------------------------------------------------------------------------
/requests_viewer/js.py:
--------------------------------------------------------------------------------
import json
import tempfile
import time
import webbrowser


def wrap_json_into_html(x):
    return "<pre>{}</pre>".format(x)


def view_request(r):
    js = json.dumps(r.json(), indent=4)
    with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
        f.write(wrap_json_into_html(js))
        f.flush()
        webbrowser.open('file://' + f.name)
        time.sleep(1)

--------------------------------------------------------------------------------
/requests_viewer/main.py:
--------------------------------------------------------------------------------
import sys
import requests
import requests_viewer.js as js
import requests_viewer.image as image

try:
    import requests_viewer.web as web
except ImportError:
    import requests_viewer.web_compat as web


def main(url=None, default=None):
    if url is None:
        url = sys.argv[1]
    r = requests.get(url)
    content_type = r.headers.get('Content-Type', default)
    if content_type is None:
        raise TypeError("Content type header not set and default=None")
    if content_type.startswith("text/html"):
        web.view_request(r)
    elif content_type.startswith("image"):
        image.view_request(r)
    elif content_type.startswith("application/json"):
        js.view_request(r)
    else:
        raise TypeError("Content type not supported: " + content_type)

--------------------------------------------------------------------------------
/requests_viewer/web.py:
--------------------------------------------------------------------------------
import lxml.html.diff
import lxml.html
from bs4 import UnicodeDammit
import re
import requests
import time
import webbrowser
import tempfile


def slugify(value):
    return re.sub(r'[^\w\s-]', '', re.sub(r'[-\s]+', '-', value)).strip().lower()


def view_request(r, domain=None):
    if domain is None:
        domain = extract_domain(r.url)
    view_tree(make_tree(r.content, domain))


def view_html(x):
    with tempfile.NamedTemporaryFile(mode="w", suffix='.html', delete=False) as f:
        f.write(x)
        f.flush()
        webbrowser.open('file://' + f.name)
        time.sleep(1)


def view_node(node, attach_head=False, question_contains=None):
    newstr = make_parent_line(node, attach_head, question_contains)
    view_html(newstr)


def view_tree(node):
    view_html(lxml.html.tostring(node).decode('utf8'))


def view_diff_tree(tree1, tree2, url='', diff_method=lxml.html.diff.htmldiff):
    html1 = lxml.html.tostring(tree1).decode('utf8')
    html2 = lxml.html.tostring(tree2).decode('utf8')
    view_diff(html1, html2, tree1, tree2, url, diff_method)


def view_diff_html(html1, html2, url='', diff_method=lxml.html.diff.htmldiff):
    tree1 = lxml.html.fromstring(html1)
    tree2 = lxml.html.fromstring(html2)
    view_diff(html1, html2, tree1, tree2, url, diff_method)


def view_diff(html1, html2, tree1, tree2, url='', diff_method=lxml.html.diff.htmldiff):
    diff_html = diff_method(tree1, tree2)
    diff_tree = lxml.html.fromstring(diff_html)
    ins_counts = diff_tree.xpath('count(//ins)')
    del_counts = diff_tree.xpath('count(//del)')
    pure_diff = ''
    for y in [z for z in diff_tree.iter() if z.tag in ['ins', 'del']]:
        if y.text is not None:
            color = 'lightgreen' if 'ins' in y.tag else 'red'
            pure_diff += '<div style="background-color:{}">{}</div>'.format(color, y.text)
    print('From t1 to t2, {} insertions and {} deletions'.format(ins_counts, del_counts))
    diff = 'diff' + diff_html
    view_html(diff)
    view_html(html1)
    view_html(html2)
    view_html('{}'.format(str(pure_diff)))


def make_parent_line(node, attach_head=False, question_contains=None):
    # Add how much text context is given. e.g. 2 would mean 2 parent's text
    # nodes are also displayed
    if question_contains is not None:
        newstr = does_this_element_contain(question_contains,
                                           lxml.html.tostring(node).decode('utf8'))
    else:
        newstr = lxml.html.tostring(node).decode('utf8')
    parent = node.getparent()
    while parent is not None:
        if attach_head and parent.tag == 'html':
            newstr = lxml.html.tostring(parent.find(
                './/head'), encoding='utf8').decode('utf8') + newstr
        tag, items = parent.tag, parent.items()
        attrs = " ".join(['{}="{}"'.format(x[0], x[1]) for x in items if len(x) == 2])
        newstr = '<{} {}>{}</{}>'.format(tag, attrs, newstr, tag)
        parent = parent.getparent()
    return newstr


def extract_domain(url):
    import tldextract
    tld = ".".join([x for x in tldextract.extract(url) if x])
    protocol = url.split('//', 1)[0]
    if protocol == 'file:':
        protocol += '///'
    else:
        protocol += '//'
    return protocol + tld


def does_this_element_contain(text='pagination', node_str=''):
    templ = '<html>'
    templ += '<h1>'
    templ += 'Does this element contain {}?'
    templ += '</h1>{}</html>'
    return templ.format(text, node_str)


def make_tree(html, domain=None):
    ud = UnicodeDammit(html, is_html=True)

    tree = lxml.html.fromstring(ud.unicode_markup)

    if domain is not None:
        tree.make_links_absolute(domain)

    for el in tree.iter():
        # remove comments
        if isinstance(el, lxml.html.HtmlComment):
            el.getparent().remove(el)
            continue
        if el.tag == 'script':
            el.getparent().remove(el)

    return tree


def get_tree(url, domain=None):
    r = requests.get(url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
    if domain is None:
        domain = extract_domain(url)
    return make_tree(r.text, domain)


def get_html(url, domain=None):
    return lxml.html.tostring(get_tree(url, domain)).decode("utf8")


def get_local_tree(url, domain=None):
    if domain is None:
        domain = extract_domain(url)
    with open(url) as f:
        html = f.read()
    return make_tree(html, domain)


def normalize(s):
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', s).strip()


def get_text_and_tail(node):
    text = node.text if node.text else ''
    tail = node.tail if node.tail else ''
    return text + ' ' + tail

--------------------------------------------------------------------------------
/requests_viewer/web_compat.py:
--------------------------------------------------------------------------------
import tempfile
import time
import webbrowser


def view_request(r):
    with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
        f.write(r.text)
        f.flush()
        webbrowser.open('file://' + f.name)
        time.sleep(1)

--------------------------------------------------------------------------------
/resources/screenshot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kootenpv/requests_viewer/3dd21c014814a3d081632b12483ba27f7a1e8cf5/resources/screenshot1.png

--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[metadata]
description-file = README.md

[bdist_rpm]
doc_files = README.md

[wheel]
universal = 1

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
from setuptools import find_packages
from setuptools import setup

MAJOR_VERSION = '0'
MINOR_VERSION = '0'
MICRO_VERSION = '10'
VERSION = "{}.{}.{}".format(MAJOR_VERSION, MINOR_VERSION, MICRO_VERSION)

setup(name='requests_viewer',
      version=VERSION,
      description="requests_viewer!",
      url='https://github.com/kootenpv/requests_viewer',
      author='Pascal van Kooten',
      author_email='kootenpv@gmail.com',
      license='MIT',
      install_requires=[
          'requests',
      ],
      extras_require={
          'fancy': ['lxml', 'bs4', 'tldextract']
      },
      packages=find_packages(),
      zip_safe=False,
      platforms='any')

--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
""" Test folder separated. """

--------------------------------------------------------------------------------
/tests/run_test.py:
--------------------------------------------------------------------------------
""" Contains py.test tests.
""" 2 | 3 | from requests_viewer.main import main 4 | from requests_viewer.web_compat import view_request 5 | 6 | 7 | def test_integration(): 8 | main("https://pypi.python.org/pypi/requests_viewer") 9 | # from requests_viewer.web import view_diff_tree, get_tree 10 | # url1 = "http://xkcd.com/" 11 | # url2 = "http://xkcd.com/1/" 12 | # tree1, tree2 = get_tree(url1), get_tree(url2) 13 | # view_diff_tree(tree1, tree2) 14 | -------------------------------------------------------------------------------- /tox.ini: -------------------------------------------------------------------------------- 1 | [tox] 2 | envlist = py35,py27 3 | 4 | [testenv] 5 | # If you add a new dep here you probably need to add it in setup.py as well 6 | deps = 7 | pytest 8 | commands = py.test -v tests/run_tests.py 9 | --------------------------------------------------------------------------------