├── .gitignore
├── README.md
├── README.rst
├── deploy.py
├── requests_viewer
│   ├── __init__.py
│   ├── image.py
│   ├── js.py
│   ├── main.py
│   ├── web.py
│   └── web_compat.py
├── resources
│   └── screenshot1.png
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   └── run_test.py
└── tox.ini
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *#*
3 | *.DS_STORE
4 | *.log
5 | *Data.fs*
6 | *flymake*
7 | *egg*
8 | build/
9 | __pycache__/
10 | /.Python
11 | /bin/
12 | /include/
13 | /lib/
14 | /pip-selfcheck.json
15 | .tox/
16 | comments/
17 | dist/
18 | *silly*
19 | extras/
20 | .cache/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## requests_viewer
2 |
3 | The idea is that `requests_viewer` can quickly show information about your `requests` response objects.
4 |
5 | It opens an HTML page with information about the request.
6 |
7 | ```python
8 | import requests
9 | from requests_viewer.web import view_request
10 | view_request(requests.get("https://xkcd.com/"))
11 |
12 | # or
13 |
14 | from requests_viewer import main
15 | main("https://xkcd.com/") # considers different mime types
16 | ```
17 |
18 | ### Main features
19 |
20 | - The HTML page is shown as the crawler sees it
21 |   * Can extract the domain and hot-link assets so that the page looks almost indistinguishable from the original
22 | - Contains other nice functions to show lxml tree nodes
23 | - Can visually show diffs between two HTML pages / trees
24 |
25 | ```python
26 | from requests_viewer.web import view_diff_tree, get_tree
27 | url1 = "http://xkcd.com/"
28 | url2 = "http://xkcd.com/1/"
29 | tree1 = get_tree(url1)  # builds a tree straight from the URL
30 | tree2 = get_tree(url2)  # use `make_tree` instead if you already have the HTML
31 | view_diff_tree(tree1, tree2)
32 | ```
33 |
34 | Results in:
35 |
36 | ![screenshot1](resources/screenshot1.png)
37 |
38 |
39 |
40 | ### Installation
41 |
42 | pip install requests_viewer
43 | pip3 install requests_viewer
44 |
45 | Note that in order to do the real fancy stuff (HTML trees, diffs), you should install the extras:
46 |
47 | pip install requests_viewer[fancy]
48 | pip3 install requests_viewer[fancy]
49 |
50 | This will install `lxml`, `bs4` and `tldextract`.
51 |
52 | ### Types it can show currently:
53 |
54 | - text/html
55 | - image/*
56 | - application/json
57 |
58 | ### Usability
59 |
60 | Some example `web.py` functions:
61 |
62 | ``` python
63 | def slugify(value):
64 | def view_request(r, domain=None):
65 | def view_html(x):
66 | def view_node(node, attach_head=False, question_contains=None):
67 | def view_tree(node):
68 | def view_diff_tree(tree1, tree2, url, diff_method):
69 | def view_diff_html(html1, html2, url, diff_method):
70 | def view_diff(html1, html2, tree1, tree2, url, diff_method):
71 | def make_parent_line(node, attach_head=False, question_contains=None):
72 | def extract_domain(url):
73 | def make_tree(html, domain=None):
74 | def get_tree(url, domain=None):
75 | def get_local_tree(url, domain=None):
76 | ```
77 |
78 | ### Contribute
79 |
80 | This package is very small at the moment. I very much encourage you to contribute:
81 |
82 | - Most likely we will want to show the response headers at the top of the page (HTML)
83 | - Make the encoding an argument (instead of fixed UTF-8)
84 |
85 | Note that I use [yapf](https://github.com/google/yapf) with max-line=100 to avoid any styling discussion.
86 |
--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
1 | ## requests_viewer
2 |
3 | The idea is that `requests_viewer` can quickly show information about your `requests` response objects.
4 |
5 | It opens an HTML page with information about the request.
6 |
7 | ```python
8 | import requests
9 | from requests_viewer.web import view_request
10 | view_request(requests.get("https://xkcd.com/"))
11 |
12 | # or
13 |
14 | from requests_viewer import main
15 | main("https://xkcd.com/") # considers different mime types
16 | ```
17 |
18 | ### Main features
19 |
20 | - The HTML page is shown as the crawler sees it
21 |   * Can extract the domain and hot-link assets so that the page looks almost indistinguishable from the original
22 | - Contains other nice functions to show lxml tree nodes
23 | - Can visually show diffs between two HTML pages / trees
24 |
25 | ```python
26 | from requests_viewer.web import view_diff_tree, get_tree
27 | url1 = "http://xkcd.com/"
28 | url2 = "http://xkcd.com/1/"
29 | tree1 = get_tree(url1)  # builds a tree straight from the URL
30 | tree2 = get_tree(url2)  # use `make_tree` instead if you already have the HTML
31 | view_diff_tree(tree1, tree2)
32 | ```
33 |
34 | Results in:
35 |
36 | ![screenshot1](resources/screenshot1.png)
37 |
38 |
39 |
40 | ### Installation
41 |
42 | pip install requests_viewer
43 | pip3 install requests_viewer
44 |
45 | Note that in order to do the real fancy stuff (HTML trees, diffs), you should install the extras:
46 |
47 | pip install requests_viewer[fancy]
48 | pip3 install requests_viewer[fancy]
49 |
50 | This will install `lxml`, `bs4` and `tldextract`.
51 |
52 | ### Types it can show currently:
53 |
54 | - text/html
55 | - image/*
56 | - application/json
57 |
58 | ### Usability
59 |
60 | Some example `web.py` functions:
61 |
62 | ``` python
63 | def slugify(value):
64 | def view_request(r, domain=None):
65 | def view_html(x):
66 | def view_node(node, attach_head=False, question_contains=None):
67 | def view_tree(node):
68 | def view_diff_tree(tree1, tree2, url, diff_method):
69 | def view_diff_html(html1, html2, url, diff_method):
70 | def view_diff(html1, html2, tree1, tree2, url, diff_method):
71 | def make_parent_line(node, attach_head=False, question_contains=None):
72 | def extract_domain(url):
73 | def make_tree(html, domain=None):
74 | def get_tree(url, domain=None):
75 | def get_local_tree(url, domain=None):
76 | ```
77 |
78 | ### Contribute
79 |
80 | This package is very small at the moment. I very much encourage you to contribute:
81 |
82 | - Most likely we will want to show the response headers at the top of the page (HTML)
83 |
--------------------------------------------------------------------------------
/deploy.py:
--------------------------------------------------------------------------------
1 | """ File unrelated to the package, except for convenience in deploying """
2 | import re
3 | import sh
4 | import os
5 |
6 | commit_count = sh.git('rev-list', ['--all']).count('\n')
7 |
8 | with open('setup.py') as f:
9 | setup = f.read()
10 |
11 | setup = re.sub("MICRO_VERSION = '[0-9]+'", "MICRO_VERSION = '{}'".format(commit_count), setup)
12 |
13 | major = re.search("MAJOR_VERSION = '([0-9]+)'", setup).groups()[0]
14 | minor = re.search("MINOR_VERSION = '([0-9]+)'", setup).groups()[0]
15 | micro = re.search("MICRO_VERSION = '([0-9]+)'", setup).groups()[0]
16 | version = '{}.{}.{}'.format(major, minor, micro)
17 |
18 | with open('setup.py', 'w') as f:
19 | f.write(setup)
20 |
21 | with open('requests_viewer/__init__.py') as f:
22 | init = f.read()
23 |
24 | with open('requests_viewer/__init__.py', 'w') as f:
25 |     f.write(
26 |         re.sub(r'__version__ = [\'"][0-9.]+[\'"]',
27 |                '__version__ = "{}"'.format(version), init))
28 |
29 | py_version = "python3.5" if sh.which("python3.5") is not None else "python"
30 | os.system('{} setup.py sdist bdist_wheel upload'.format(py_version))
31 |
--------------------------------------------------------------------------------
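The version bump in `deploy.py` is a plain regex substitution; a standalone sketch of the same step, with a literal string standing in for the `setup.py` contents and a fixed number standing in for the `git rev-list --all` commit count:

```python
import re

setup_src = "MICRO_VERSION = '7'"  # stand-in for the setup.py file contents
commit_count = 42                  # stand-in for the commit count from git

# Same substitution deploy.py performs before rewriting setup.py:
bumped = re.sub("MICRO_VERSION = '[0-9]+'",
                "MICRO_VERSION = '{}'".format(commit_count), setup_src)
```

Using the commit count as the micro version means every deploy gets a monotonically increasing version without manual bookkeeping.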
/requests_viewer/__init__.py:
--------------------------------------------------------------------------------
1 | """ requests_viewer; able to show how requests look like """
2 |
3 | __project__ = 'requests_viewer'
4 | __version__ = '0.0.1'
5 |
6 | from requests_viewer.main import main
7 |
8 | try:
9 | from requests_viewer.web import get_tree
10 | from requests_viewer.web import view_tree
11 | except ImportError:
12 | print("Cannot import `lxml`, limited functionality.")
13 |
--------------------------------------------------------------------------------
/requests_viewer/image.py:
--------------------------------------------------------------------------------
1 | import base64
2 | import tempfile
3 | import time
4 | import webbrowser
5 |
6 |
7 | def wrap_img_into_html(content_type, x):
8 |     return '<img src="data:{};base64,{}">'.format(content_type, x.decode("utf8"))
9 |
10 |
11 | def view_request(r):
12 | with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
13 | f.write(wrap_img_into_html(r.headers['Content-Type'], base64.b64encode(r.content)))
14 | f.flush()
15 | webbrowser.open('file://' + f.name)
16 | time.sleep(1)
17 |
--------------------------------------------------------------------------------
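`image.py` avoids writing a separate image file by inlining the response bytes as a base64 data URI; a minimal standalone sketch of that embedding (the payload bytes below are a stand-in for `r.content`):

```python
import base64

content_type = "image/png"
payload = b"\x89PNG\r\n"  # stand-in bytes; normally this is r.content

# Base64-encode the bytes and embed them directly in the img tag:
html = '<img src="data:{};base64,{}">'.format(
    content_type, base64.b64encode(payload).decode("utf8"))
```

A data URI keeps the preview to a single temporary HTML file, so no image file needs to sit next to it on disk.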
/requests_viewer/js.py:
--------------------------------------------------------------------------------
1 | import json
2 | import tempfile
3 | import time
4 | import webbrowser
5 |
6 |
7 | def wrap_json_into_html(x):
8 | return "{}
".format(x)
9 |
10 |
11 | def view_request(r):
12 | js = json.dumps(r.json(), indent=4)
13 | with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
14 | f.write(wrap_json_into_html(js))
15 | f.flush()
16 | webbrowser.open('file://' + f.name)
17 | time.sleep(1)
18 |
--------------------------------------------------------------------------------
/requests_viewer/main.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import requests
3 | import requests_viewer.js as js
4 | import requests_viewer.image as image
5 |
6 | try:
7 | import requests_viewer.web as web
8 | except ImportError:
9 | import requests_viewer.web_compat as web
10 |
11 |
12 | def main(url=None, default=None):
13 | if url is None:
14 | url = sys.argv[1]
15 | r = requests.get(url)
16 | content_type = r.headers.get('Content-Type', default)
17 | if content_type is None:
18 | raise TypeError("Content type header not set and default=None")
19 | if content_type.startswith("text/html"):
20 | web.view_request(r)
21 | elif content_type.startswith("image"):
22 | image.view_request(r)
23 | elif content_type.startswith("application/json"):
24 | js.view_request(r)
25 | else:
26 | raise TypeError("Content type not supported: " + content_type)
27 |
--------------------------------------------------------------------------------
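The `Content-Type` dispatch in `main()` can be checked without network access; a sketch with a hypothetical `pick_viewer` helper (not part of the package API) that mirrors the branching above:

```python
def pick_viewer(content_type):
    # Mirrors main(): first match on the media-type prefix wins.
    if content_type.startswith("text/html"):
        return "web"
    elif content_type.startswith("image"):
        return "image"
    elif content_type.startswith("application/json"):
        return "js"
    raise TypeError("Content type not supported: " + content_type)
```

Matching on prefixes is deliberate: real headers usually carry extra parameters such as `text/html; charset=utf-8`, which a strict equality check would miss.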
/requests_viewer/web.py:
--------------------------------------------------------------------------------
1 | import lxml.html.diff
2 | import lxml.html
3 | from bs4 import UnicodeDammit
4 | import re
5 | import requests
6 | import time
7 | import webbrowser
8 | import tempfile
9 |
10 |
11 | def slugify(value):
12 | return re.sub(r'[^\w\s-]', '', re.sub(r'[-\s]+', '-', value)).strip().lower()
13 |
14 |
15 | def view_request(r, domain=None):
16 | if domain is None:
17 | domain = extract_domain(r.url)
18 | view_tree(make_tree(r.content, domain))
19 |
20 |
21 | def view_html(x):
22 | with tempfile.NamedTemporaryFile(mode="w", suffix='.html', delete=False) as f:
23 | f.write(x)
24 | f.flush()
25 | webbrowser.open('file://' + f.name)
26 | time.sleep(1)
27 |
28 |
29 | def view_node(node, attach_head=False, question_contains=None):
30 | newstr = make_parent_line(node, attach_head, question_contains)
31 | view_tree(newstr)
32 |
33 |
34 | def view_tree(node):
35 | view_html(lxml.html.tostring(node).decode('utf8'))
36 |
37 |
38 | def view_diff_tree(tree1, tree2, url='', diff_method=lxml.html.diff.htmldiff):
39 | html1 = lxml.html.tostring(tree1).decode('utf8')
40 | html2 = lxml.html.tostring(tree2).decode('utf8')
41 | view_diff(html1, html2, tree1, tree2, url, diff_method)
42 |
43 |
44 | def view_diff_html(html1, html2, url='', diff_method=lxml.html.diff.htmldiff):
45 | tree1 = lxml.html.fromstring(html1)
46 | tree2 = lxml.html.fromstring(html2)
47 | view_diff(html1, html2, tree1, tree2, url, diff_method)
48 |
49 |
50 | def view_diff(html1, html2, tree1, tree2, url='', diff_method=lxml.html.diff.htmldiff):
51 | diff_html = diff_method(tree1, tree2)
52 | diff_tree = lxml.html.fromstring(diff_html)
53 | ins_counts = diff_tree.xpath('count(//ins)')
54 | del_counts = diff_tree.xpath('count(//del)')
55 | pure_diff = ''
56 |     for y in [z for z in diff_tree.iter() if z.tag in ['ins', 'del']]:
57 |         if y.text is not None:
58 |             color = 'lightgreen' if 'ins' in y.tag else 'red'
59 |             pure_diff += '<span style="background-color: {}">{}</span><br>'.format(color, y.text)
60 |     print('From t1 to t2, {} insertions and {} deletions'.format(ins_counts, del_counts))
61 |     diff = ('<style>ins { background-color: lightgreen } '
62 |             'del { background-color: red }</style>'
63 |             + diff_html)
64 |     view_html(diff)
65 | view_html(html1)
66 | view_html(html2)
67 |     view_html('<html><body>{}</body></html>'.format(pure_diff))
68 |
69 |
70 | def make_parent_line(node, attach_head=False, question_contains=None):
71 | # Add how much text context is given. e.g. 2 would mean 2 parent's text
72 | # nodes are also displayed
73 | if question_contains is not None:
74 | newstr = does_this_element_contain(question_contains, lxml.html.tostring(node))
75 | else:
76 | newstr = lxml.html.tostring(node)
77 | parent = node.getparent()
78 | while parent is not None:
79 | if attach_head and parent.tag == 'html':
80 | newstr = lxml.html.tostring(parent.find(
81 | './/head'), encoding='utf8').decode('utf8') + newstr
82 | tag, items = parent.tag, parent.items()
83 | attrs = " ".join(['{}="{}"'.format(x[0], x[1]) for x in items if len(x) == 2])
84 |         newstr = '<{} {}>{}</{}>'.format(tag, attrs, newstr, tag)
85 | parent = parent.getparent()
86 | return newstr
87 |
88 |
89 | def extract_domain(url):
90 | import tldextract
91 | tld = ".".join([x for x in tldextract.extract(url) if x])
92 | protocol = url.split('//', 1)[0]
93 | if protocol == 'file:':
94 | protocol += '///'
95 | else:
96 | protocol += '//'
97 | return protocol + tld
98 |
99 |
100 | def does_this_element_contain(text='pagination', node_str=''):
101 |     templ = ''
102 |     templ += '<br>'
103 |     templ += '<b>Does this element contain {}?</b>'
104 |     templ += '<br>{}<br>'
105 |     return templ.format(text, node_str)
106 |
107 |
108 | def make_tree(html, domain=None):
109 |
110 | ud = UnicodeDammit(html, is_html=True)
111 |
112 | tree = lxml.html.fromstring(ud.unicode_markup)
113 |
114 | if domain is not None:
115 | tree.make_links_absolute(domain)
116 |
117 |     for el in list(tree.iter()):  # iterate over a copy: elements are removed below
118 |
119 | # remove comments
120 | if isinstance(el, lxml.html.HtmlComment):
121 | el.getparent().remove(el)
122 | continue
123 |
124 | if el.tag == 'script':
125 | el.getparent().remove(el)
126 | continue
127 |
128 | return tree
129 |
130 |
131 | def get_tree(url, domain=None):
132 | r = requests.get(url, headers={
133 |         'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'})
134 | if domain is None:
135 | domain = extract_domain(url)
136 | return make_tree(r.text, domain)
137 |
138 |
139 | def get_html(url, domain=None):
140 | return lxml.html.tostring(get_tree(url, domain)).decode("utf8")
141 |
142 |
143 | def get_local_tree(url, domain=None):
144 | if domain is None:
145 | domain = extract_domain(url)
146 | with open(url) as f:
147 | html = f.read()
148 | return make_tree(html, domain)
149 |
150 |
151 | def normalize(s):
152 | return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', s).strip()
153 |
154 |
155 | def get_text_and_tail(node):
156 | text = node.text if node.text else ''
157 | tail = node.tail if node.tail else ''
158 | return text + ' ' + tail
159 |
--------------------------------------------------------------------------------
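The pure string helpers at the bottom of `web.py` (`slugify`, `normalize`) are easy to exercise standalone; copies of their logic, shown here for illustration only:

```python
import re

def slugify(value):
    # Same logic as web.py: collapse runs of dashes/whitespace to a single
    # dash, drop remaining punctuation, then lowercase.
    return re.sub(r'[^\w\s-]', '', re.sub(r'[-\s]+', '-', value)).strip().lower()

def normalize(s):
    # Collapse each whitespace run; keep a newline if the run contained one.
    return re.sub(r'\s+', lambda x: '\n' if '\n' in x.group(0) else ' ', s).strip()
```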
/requests_viewer/web_compat.py:
--------------------------------------------------------------------------------
1 | import tempfile
2 | import time
3 | import webbrowser
4 |
5 |
6 | def view_request(r):
7 |     with tempfile.NamedTemporaryFile("w", suffix='.html', delete=False) as f:
8 |         f.write(r.text)
9 |         f.flush()
10 |         webbrowser.open('file://' + f.name)
11 |         time.sleep(1)
12 |
--------------------------------------------------------------------------------
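This browser-preview pattern depends on `delete=False`: with the default `delete=True` the temporary file vanishes when the `with` block exits, before the browser has a chance to read it. A quick standalone demonstration:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
    f.write("<p>hello</p>")
    path = f.name

# With delete=False the file survives the context manager,
# so a browser pointed at file://<path> can still read it.
exists_after = os.path.exists(path)
os.remove(path)  # clean up manually, since nothing else will
```

The trade-off is that the caller becomes responsible for cleanup; requests_viewer accepts leaving the preview files in the temp directory.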
/resources/screenshot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/kootenpv/requests_viewer/3dd21c014814a3d081632b12483ba27f7a1e8cf5/resources/screenshot1.png
--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.md
3 |
4 | [bdist_rpm]
5 | doc_files = README.md
6 |
7 | [wheel]
8 | universal = 1
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import find_packages
2 | from setuptools import setup
3 |
4 | MAJOR_VERSION = '0'
5 | MINOR_VERSION = '0'
6 | MICRO_VERSION = '10'
7 | VERSION = "{}.{}.{}".format(MAJOR_VERSION, MINOR_VERSION, MICRO_VERSION)
8 |
9 | setup(name='requests_viewer',
10 | version=VERSION,
11 | description="requests_viewer!",
12 | url='https://github.com/kootenpv/requests_viewer',
13 | author='Pascal van Kooten',
14 | author_email='kootenpv@gmail.com',
15 | license='MIT',
16 | install_requires=[
17 | 'requests',
18 | ],
19 | extras_require={
20 | 'fancy': ['lxml', 'bs4', 'tldextract']
21 | },
22 | packages=find_packages(),
23 | zip_safe=False,
24 | platforms='any')
25 |
--------------------------------------------------------------------------------
/tests/__init__.py:
--------------------------------------------------------------------------------
1 | """ Test folder separated. """
2 |
--------------------------------------------------------------------------------
/tests/run_test.py:
--------------------------------------------------------------------------------
1 | """ Contains py.test tests. """
2 |
3 | from requests_viewer.main import main
4 | from requests_viewer.web_compat import view_request
5 |
6 |
7 | def test_integration():
8 | main("https://pypi.python.org/pypi/requests_viewer")
9 | # from requests_viewer.web import view_diff_tree, get_tree
10 | # url1 = "http://xkcd.com/"
11 | # url2 = "http://xkcd.com/1/"
12 | # tree1, tree2 = get_tree(url1), get_tree(url2)
13 | # view_diff_tree(tree1, tree2)
14 |
--------------------------------------------------------------------------------
/tox.ini:
--------------------------------------------------------------------------------
1 | [tox]
2 | envlist = py35,py27
3 |
4 | [testenv]
5 | # If you add a new dep here you probably need to add it in setup.py as well
6 | deps =
7 | pytest
8 | commands = py.test -v tests/run_test.py
9 |
--------------------------------------------------------------------------------