test

├── .travis.yml
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── html2markdown.py
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests.py


/.travis.yml:
--------------------------------------------------------------------------------
 1 | language: python
 2 | python:
 3 |   - 2.7
 4 |   - 3.6
 5 | cache: pip
 6 | install:
 7 |   - pip install -r requirements.txt
 8 |   - pip install markdown
 9 | script:
10 |   - python tests.py


--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
1 | Copyright 2017 David L (dlon)
2 | 
3 | Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
4 | 
5 | The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
6 | 
7 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
1 | include README.rst


--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
 1 | =============
 2 | html2markdown
 3 | =============
 4 | 
 5 | .. image:: https://travis-ci.com/dlon/html2markdown.svg?branch=master
 6 |    :target: https://travis-ci.com/dlon/html2markdown
 7 | 
 8 | **Experimental**
 9 | 
10 | **Purpose**: Converts html to markdown while preserving unsupported html markup. The goal is to generate markdown that can be converted back into html. This is the major difference between html2markdown and html2text. The latter doesn't purport to be reversible.
11 | 
12 | Usage example
13 | =============
14 | ::
15 | 
16 | 	import html2markdown
17 | 	print(html2markdown.convert('<h2>Test</h2><pre><code>Here is some code</code></pre>'))
18 | 
19 | Output::
20 | 
21 | 	## Test
22 | 	
23 | 	    Here is some code
24 | 
25 | Information and caveats
26 | =======================
27 | 
28 | Does not convert the content of block-type tags other than ``<p>`` -- such as ``<div>`` tags -- into Markdown
29 | -------------------------------------------------------------------------------------------------------------
30 | 
31 | The content of inline-type tags, e.g. ``<span>``, is converted to Markdown.
32 | 
33 | **Input**: ``<div>this is stuff. <strong>stuff</strong></div>``
34 | 
35 | **Result**: ``<div>this is stuff. <strong>stuff</strong></div>``  
36 | 
37 | **Input**: ``<p>this is stuff. <strong>stuff</strong></p>``  
38 | 
39 | **Result**: ``this is stuff. __stuff__`` (surrounded by a newline on either side)  
40 | 
41 | **Input**: ``<span style="text-decoration:line-through;">strike <strong>through</strong> some text</span> here``  
42 | 
43 | **Result**: ``<span style="text-decoration:line-through;">strike __through__ some text</span> here``  
44 | 
45 | Except in unprocessed block-type tags, formatting characters are escaped
46 | ------------------------------------------------------------------------
47 | 
48 | **Input**: ``<p>**escape me?**</p>`` (in html, we would use \<strong\> here)  
49 | 
50 | **Result**: ``\*\*escape me?\*\*``  
51 | 
52 | **Input**: ``<span>**escape me?**</span>``  
53 | 
54 | **Result**: ``<span>\*\*escape me?\*\*</span>``  
55 | 
56 | **Input**: ``<div>**escape me?**</div>``  
57 | 
58 | **Result**: ``<div>**escape me?**</div>`` (block-type)  
59 | 
60 | Attributes not supported by Markdown are kept
61 | ---------------------------------------------
62 | 
63 | **Example**: ``<a href="http://myaddress" title="click me"><strong>link</strong></a>``  
64 | 
65 | **Result**: ``[__link__](http://myaddress "click me")``  
66 | 
67 | **Example**: ``<a onclick="javascript:dostuff()" href="http://myaddress" title="click me"><strong>link</strong></a>``  
68 | 
69 | **Result**: ``<a onclick="javascript:dostuff()" href="http://myaddress" title="click me">__link__</a>`` (the attribute *onclick* is not supported, so the tag is left alone)  
70 | 
71 | 
72 | Limitations
73 | ===========
74 | 
75 | - Tables are kept as html.
76 | 
77 | Changes
78 | =======
79 | 
80 | 0.1.7:
81 | 
82 | - Improved handling of inline tags.
83 | - Fix: Ignore ``<a>`` tags without an href attribute.
84 | - Improve escaping.
85 | 
86 | 0.1.6: Added tests and support for Python versions below 2.7.
87 | 
88 | 0.1.5: Fix Unicode issue in Python 3.
89 | 
90 | 0.1.0: First version.
91 | 
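
The escaping rule described in the README above can be sketched with the standard library alone. This is a minimal illustration and not part of the module's API: `escape_markdown` and `ESCAPE_CHARS` are hypothetical names, and the character set is assumed to mirror `_escapeCharSequence` in `html2markdown.py`.

```python
import re

# Characters that get a backslash prefix in text nodes.
# Assumption: same set as _escapeCharSequence in html2markdown.py.
ESCAPE_CHARS = tuple(r'\`*_[]#')

# One character class that matches any single character from the set.
_ESCAPE_RE = re.compile(
    '([{}])'.format(''.join(re.escape(c) for c in ESCAPE_CHARS))
)


def escape_markdown(text):
    """Prefix each Markdown formatting character in text with a backslash.

    Hypothetical helper, for illustration only.
    """
    return _ESCAPE_RE.sub(r'\\\1', text)


print(escape_markdown('**escape me?**'))
```

Characters outside the set, such as `?`, pass through unchanged, which is why the README result keeps the question mark unescaped.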


--------------------------------------------------------------------------------
/html2markdown.py:
--------------------------------------------------------------------------------
  1 | # -*- coding:utf8 -*-
  2 | """html2markdown converts an html string to markdown while preserving unsupported markup."""
  3 | #
  4 | # Copyright 2017-2018 David Lönnhager (dlon)
  5 | #
  6 | # Permission is hereby granted, free of charge, to any person obtaining a copy of
  7 | # this software and associated documentation files (the "Software"), to deal in
  8 | # the Software without restriction, including without limitation the rights to
  9 | # use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
 10 | # of the Software, and to permit persons to whom the Software is furnished
 11 | # to do so, subject to the following conditions:
 12 | #
 13 | # The above copyright notice and this permission notice shall be included in all
 14 | # copies or substantial portions of the Software.
 15 | #
 16 | # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
 17 | # INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
 18 | # PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
 19 | # COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
 20 | # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 21 | # WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 22 | #
 23 | 
 24 | import bs4
 25 | from bs4 import BeautifulSoup
 26 | import re
 27 | 
 28 | import sys
 29 | if sys.version_info[0] > 2:
 30 | 	unicode = str
 31 | 
 32 | _supportedTags = {
 33 | 	# NOTE: will be ignored if they have unsupported attributes (cf. _supportedAttributes)
 34 | 	'blockquote',
 35 | 	'p',
 36 | 	'a',
 37 | 	'h1','h2','h3','h4','h5','h6',
 38 | 	'strong','b',
 39 | 	'em','i',
 40 | 	'ul','ol','li',
 41 | 	'br',
 42 | 	'img',
 43 | 	'pre','code',
 44 | 	'hr'
 45 | }
 46 | _supportedAttributes = (
 47 | 	'a href',
 48 | 	'a title',
 49 | 	'img alt',
 50 | 	'img src',
 51 | 	'img title',
 52 | )
 53 | 
 54 | _inlineTags = {
 55 | 	# these can be mixed with markdown (when unprocessed)
 56 | 	# block tags will be surrounded by newlines and be unprocessed inside
 57 | 	# (unless supported tag + supported attribute[s])
 58 | 	'a',
 59 | 	'abbr',
 60 | 	'acronym',
 61 | 	'audio',
 62 | 	'b',
 63 | 	'bdi',
 64 | 	'bdo',
 65 | 	'big',
 66 | 	#'br',
 67 | 	'button',
 68 | 	#'canvas',
 69 | 	'cite',
 70 | 	'code',
 71 | 	'data',
 72 | 	'datalist',
 73 | 	'del',
 74 | 	'dfn',
 75 | 	'em',
 76 | 	#'embed',
 77 | 	'i',
 78 | 	#'iframe',
 79 | 	#'img',
 80 | 	#'input',
 81 | 	'ins',
 82 | 	'kbd',
 83 | 	'label',
 84 | 	'map',
 85 | 	'mark',
 86 | 	'meter',
 87 | 	#'noscript',
 88 | 	'object',
 89 | 	#'output',
 90 | 	'picture',
 91 | 	#'progress',
 92 | 	'q',
 93 | 	'ruby',
 94 | 	's',
 95 | 	'samp',
 96 | 	#'script',
 97 | 	'select',
 98 | 	'slot',
 99 | 	'small',
100 | 	'span',
101 | 	'strike',
102 | 	'strong',
103 | 	'sub',
104 | 	'sup',
105 | 	'svg',
106 | 	'template',
107 | 	'textarea',
108 | 	'time',
109 | 	'u',
110 | 	'tt',
111 | 	'var',
112 | 	#'video',
113 | 	'wbr',
114 | }
115 | 
116 | def _supportedAttrs(tag):
117 | 	sAttrs = [attr.split(' ')[1] for attr in _supportedAttributes if attr.split(' ')[0]==tag.name]
118 | 	for attr in tag.attrs:
119 | 		if attr not in sAttrs:
120 | 			return False
121 | 	return True
122 | 
123 | def _recursivelyValid(tag):
124 | 	# not all tags require this property
125 | 	# requires: <blockquote><p style="...">asdf</p></blockquote>
126 | 	# does not: <div><p style="...">asdf</p></div>
127 | 	children = tag.find_all(recursive = False)
128 | 	for child in children:
129 | 		if not _recursivelyValid(child):
130 | 			return False
131 | 	if tag.name == '[document]':
132 | 		return True
133 | 	elif tag.name in _inlineTags:
134 | 		return True
135 | 	elif tag.name not in _supportedTags:
136 | 		return False
137 | 	if not _supportedAttrs(tag):
138 | 		return False
139 | 	return True
140 | 
141 | 
142 | 
143 | _escapeCharSequence = tuple(r'\`*_[]#')
144 | _escapeCharRegexStr = '([{}])'.format(''.join(re.escape(c) for c in _escapeCharSequence))
145 | _escapeCharSub = re.compile(_escapeCharRegexStr).sub
146 | 
147 | 
148 | def _escapeCharacters(tag):
149 | 	"""non-recursively backslash-escape markdown formatting characters
150 | 	(see _escapeCharSequence) in the tag's text nodes"""
151 | 	for i,c in enumerate(tag.contents):
152 | 		if type(c) != bs4.element.NavigableString:
153 | 			continue
154 | 		c.replace_with(_escapeCharSub(r'\\\1', c))
155 | 
156 | def _breakRemNewlines(tag):
157 | 	"""non-recursively collapse runs of spaces and remove newlines in the tag"""
158 | 	for i,c in enumerate(tag.contents):
159 | 		if type(c) != bs4.element.NavigableString:
160 | 			continue
161 | 		c.replace_with(re.sub(r' {2,}', ' ', c).replace('\n',''))
162 | 
163 | def _markdownify(tag, _listType=None, _blockQuote=False, _listIndex=1):
164 | 	"""recursively converts a tag into markdown"""
165 | 	children = tag.find_all(recursive=False)
166 | 
167 | 	if tag.name == '[document]':
168 | 		for child in children:
169 | 			_markdownify(child)
170 | 		return
171 | 
172 | 	if tag.name not in _supportedTags or not _supportedAttrs(tag):
173 | 		if tag.name not in _inlineTags:
174 | 			tag.insert_before('\n\n')
175 | 			tag.insert_after('\n\n')
176 | 		else:
177 | 			_escapeCharacters(tag)
178 | 			for child in children:
179 | 				_markdownify(child)
180 | 		return
181 | 	if tag.name not in ('pre', 'code'):
182 | 		_escapeCharacters(tag)
183 | 		_breakRemNewlines(tag)
184 | 	if tag.name == 'p':
185 | 		if tag.string is not None:
186 | 			if tag.string.strip() == u'':
187 | 				tag.string = u'\xa0'
188 | 				tag.unwrap()
189 | 				return
190 | 		if not _blockQuote:
191 | 			tag.insert_before('\n\n')
192 | 			tag.insert_after('\n\n')
193 | 		else:
194 | 			tag.insert_before('\n')
195 | 			tag.insert_after('\n')
196 | 		tag.unwrap()
197 | 
198 | 		for child in children:
199 | 			_markdownify(child)
200 | 	elif tag.name == 'br':
201 | 		tag.string = '  \n'
202 | 		tag.unwrap()
203 | 	elif tag.name == 'img':
204 | 		alt = ''
205 | 		title = ''
206 | 		if tag.has_attr('alt'):
207 | 			alt = tag['alt']
208 | 		if tag.has_attr('title') and tag['title']:
209 | 			title = ' "%s"' % tag['title']
210 | 		tag.string = '![%s](%s%s)' % (alt, tag['src'], title)
211 | 		tag.unwrap()
212 | 	elif tag.name == 'hr':
213 | 		tag.string = '\n---\n'
214 | 		tag.unwrap()
215 | 	elif tag.name == 'pre':
216 | 		tag.insert_before('\n\n')
217 | 		tag.insert_after('\n\n')
218 | 		if tag.code:
219 | 			if not _supportedAttrs(tag.code):
220 | 				return
221 | 			for child in tag.code.find_all(recursive=False):
222 | 				if child.name != 'br':
223 | 					return
224 | 			# code block
225 | 			for br in tag.code.find_all('br'):
226 | 				br.string = '\n'
227 | 				br.unwrap()
228 | 			tag.code.unwrap()
229 | 			lines = unicode(tag).strip().split('\n')
230 | 			lines[0] = lines[0][5:]
231 | 			lines[-1] = lines[-1][:-6]
232 | 			if not lines[-1]:
233 | 				lines.pop()
234 | 			for i,line in enumerate(lines):
235 | 				line = line.replace(u'\xa0', ' ')
236 | 				lines[i] = '    %s' % line
237 | 			tag.replace_with(BeautifulSoup('\n'.join(lines), 'html.parser'))
238 | 		return
239 | 	elif tag.name == 'code':
240 | 		# inline code
241 | 		if children:
242 | 			return
243 | 		tag.insert_before('`` ')
244 | 		tag.insert_after(' ``')
245 | 		tag.unwrap()
246 | 	elif _recursivelyValid(tag):
247 | 		if tag.name == 'blockquote':
248 | 			# ! FIXME: hack
249 | 			tag.insert_before('<<<BLOCKQUOTE: ')
250 | 			tag.insert_after('>>>')
251 | 			tag.unwrap()
252 | 			for child in children:
253 | 				_markdownify(child, _blockQuote=True)
254 | 			return
255 | 		elif tag.name == 'a':
256 | 			# process children first
257 | 			for child in children:
258 | 				_markdownify(child)
259 | 			if not tag.has_attr('href'):
260 | 				return
261 | 			if tag.string != tag.get('href') or tag.has_attr('title'):
262 | 				title = ''
263 | 				if tag.has_attr('title'):
264 | 					title = ' "%s"' % tag['title']
265 | 				tag.string = '[%s](%s%s)' % (BeautifulSoup(unicode(tag), 'html.parser').string,
266 | 					tag.get('href', ''),
267 | 					title)
268 | 			else:
269 | 				# ! FIXME: hack
270 | 				tag.string = '<<<FLOATING LINK: %s>>>' % tag.string
271 | 			tag.unwrap()
272 | 			return
273 | 		elif tag.name == 'h1':
274 | 			tag.insert_before('\n\n# ')
275 | 			tag.insert_after('\n\n')
276 | 			tag.unwrap()
277 | 		elif tag.name == 'h2':
278 | 			tag.insert_before('\n\n## ')
279 | 			tag.insert_after('\n\n')
280 | 			tag.unwrap()
281 | 		elif tag.name == 'h3':
282 | 			tag.insert_before('\n\n### ')
283 | 			tag.insert_after('\n\n')
284 | 			tag.unwrap()
285 | 		elif tag.name == 'h4':
286 | 			tag.insert_before('\n\n#### ')
287 | 			tag.insert_after('\n\n')
288 | 			tag.unwrap()
289 | 		elif tag.name == 'h5':
290 | 			tag.insert_before('\n\n##### ')
291 | 			tag.insert_after('\n\n')
292 | 			tag.unwrap()
293 | 		elif tag.name == 'h6':
294 | 			tag.insert_before('\n\n###### ')
295 | 			tag.insert_after('\n\n')
296 | 			tag.unwrap()
297 | 		elif tag.name in ('ul', 'ol'):
298 | 			tag.insert_before('\n\n')
299 | 			tag.insert_after('\n\n')
300 | 			tag.unwrap()
301 | 			for i, child in enumerate(children):
302 | 				_markdownify(child, _listType=tag.name, _listIndex=i+1)
303 | 			return
304 | 		elif tag.name == 'li':
305 | 			if not _listType:
306 | 				# <li> outside of list; ignore
307 | 				return
308 | 			if _listType == 'ul':
309 | 				tag.insert_before('*   ')
310 | 			else:
311 | 				tag.insert_before('%d.   ' % _listIndex)
312 | 			for child in children:
313 | 				_markdownify(child)
314 | 			for c in tag.contents:
315 | 				if type(c) != bs4.element.NavigableString:
316 | 					continue
317 | 				c.replace_with('\n    '.join(c.split('\n')))
318 | 			tag.insert_after('\n')
319 | 			tag.unwrap()
320 | 			return
321 | 		elif tag.name in ('strong','b'):
322 | 			tag.insert_before('__')
323 | 			tag.insert_after('__')
324 | 			tag.unwrap()
325 | 		elif tag.name in ('em','i'):
326 | 			tag.insert_before('_')
327 | 			tag.insert_after('_')
328 | 			tag.unwrap()
329 | 		for child in children:
330 | 			_markdownify(child)
331 | 
332 | def convert(html):
333 | 	"""converts an html string to markdown while preserving unsupported markup."""
334 | 	bs = BeautifulSoup(html, 'html.parser')
335 | 	_markdownify(bs)
336 | 	ret = unicode(bs).replace(u'\xa0', '&nbsp;')
337 | 	ret = re.sub(r'\n{3,}', r'\n\n', ret)
338 | 	# ! FIXME: hack
339 | 	ret = re.sub(r'&lt;&lt;&lt;FLOATING LINK: (.+)&gt;&gt;&gt;', r'<\1>', ret)
340 | 	# ! FIXME: hack
341 | 	sp = re.split(r'(&lt;&lt;&lt;BLOCKQUOTE: .*?&gt;&gt;&gt;)', ret, flags=re.DOTALL)
342 | 	for i,e in enumerate(sp):
343 | 		if e[:len('&lt;&lt;&lt;BLOCKQUOTE:')] == '&lt;&lt;&lt;BLOCKQUOTE:':
344 | 			sp[i] = '> ' + e[len('&lt;&lt;&lt;BLOCKQUOTE:') : -len('&gt;&gt;&gt;')]
345 | 			sp[i] = sp[i].replace('\n', '\n> ')
346 | 	ret = ''.join(sp)
347 | 	return ret.strip('\n')
348 | 
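
The attribute check that decides whether a tag is converted or left as HTML can be sketched without BeautifulSoup. This is an illustrative stand-in for `_supportedAttrs` above, assuming a tag is represented by its name plus a dict of attributes; `has_only_supported_attrs` and `SUPPORTED_ATTRIBUTES` are hypothetical names, not part of the module.

```python
# "tag attribute" pairs that Markdown can express.
# Assumption: same entries as _supportedAttributes in html2markdown.py.
SUPPORTED_ATTRIBUTES = (
    'a href',
    'a title',
    'img alt',
    'img src',
    'img title',
)


def has_only_supported_attrs(tag_name, attrs):
    """Return True if every attribute in attrs is supported for tag_name.

    Hypothetical helper, for illustration only.
    """
    # Collect the attribute names allowed for this particular tag.
    allowed = set()
    for entry in SUPPORTED_ATTRIBUTES:
        name, attr = entry.split(' ')
        if name == tag_name:
            allowed.add(attr)
    # A single unsupported attribute is enough to reject the tag.
    return all(attr in allowed for attr in attrs)


print(has_only_supported_attrs('a', {'href': 'http://myaddress', 'title': 'click me'}))
print(has_only_supported_attrs('a', {'href': 'http://myaddress', 'onclick': 'dostuff()'}))
```

A tag with no attributes always passes, while any attribute on a tag without listed entries (for example `style` on `<p>`) fails the check.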


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4


--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
1 | [metadata]
2 | description-file = README.rst


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf8 -*-
 2 | 
 3 | from setuptools import setup
 4 | 
 5 | from os import path
 6 | import io
 7 | this_directory = path.abspath(path.dirname(__file__))
 8 | with io.open(path.join(this_directory, 'README.rst'), encoding='utf-8') as f:
 9 |     longdesc = f.read()
10 | 
11 | setup(
12 |     name='html2markdown',
13 |     py_modules=['html2markdown'],
14 |     version='0.1.7',
15 |     description='Conservatively convert html to markdown',
16 |     author='David Lönnhager',
17 |     author_email='dv.lnh.d@gmail.com',
18 |     url='https://github.com/dlon/html2markdown',
19 |     install_requires=[
20 |         'beautifulsoup4'
21 |     ],
22 |     long_description=longdesc,
23 |     long_description_content_type='text/x-rst',
24 | )
25 | 


--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
  1 | import unittest
  2 | import html2markdown
  3 | import markdown
  4 | import bs4
  5 | 
  6 | 
  7 | class TestGenericTags(unittest.TestCase):
  8 | 
  9 | 	emptyElements = {
 10 | 		'embed',
 11 | 		'img',
 12 | 		'input',
 13 | 		'wbr',
 14 | 	}
 15 | 
 16 | 	def test_block_tag_content(self):
 17 | 		"""content of block-type tags should not be converted (except <p>)"""
 18 | 		testStr = '<div>this is stuff. <strong>stuff</strong></div>'
 19 | 		mdStr = html2markdown.convert(testStr)
 20 | 		self.assertEqual(mdStr, testStr)
 21 | 
 22 | 	def test_p_content(self):
 23 | 		"""<p>'s content should be converted"""
 24 | 		testStr = '<p>this is stuff. <strong>stuff</strong></p>'
 25 | 		expectedStr = 'this is stuff. __stuff__'
 26 | 		mdStr = html2markdown.convert(testStr)
 27 | 		self.assertEqual(mdStr, expectedStr)
 28 | 
 29 | 	def test_inline_tag_break(self):
 30 | 		"""inline-type tags should not cause line breaks"""
 31 | 		emptyElements = self.emptyElements
 32 | 		for tag in html2markdown._inlineTags:
 33 | 			if tag not in emptyElements:
 34 | 				testStr = '<p>test <%s>test</%s> test</p>' % (tag, tag)
 35 | 			else:
 36 | 				testStr = '<p>test <%s /> test</p>' % tag
 37 | 			mdStr = html2markdown.convert(testStr)
 38 | 			bs = bs4.BeautifulSoup(markdown.markdown(mdStr), 'html.parser')
 39 | 
 40 | 			self.assertEqual(len(bs.find_all('p')), 1)
 41 | 
 42 | 	def test_inline_tag_content(self):
 43 | 		"""content of inline-type tags should be converted"""
 44 | 		emptyElements = self.emptyElements
 45 | 		for tag in html2markdown._inlineTags:
 46 | 			if tag in emptyElements:
 47 | 				continue
 48 | 
 49 | 			testStr = '<%s style="text-decoration:line-through;">strike <strong>through</strong> some text</%s> here' % (tag, tag)
 50 | 			expectedStr = '<%s style="text-decoration:line-through;">strike __through__ some text</%s> here' % (tag, tag)
 51 | 
 52 | 			mdStr = html2markdown.convert(testStr)
 53 | 
 54 | 			self.assertEqual(mdStr, expectedStr, 'Tag: {}'.format(tag))
 55 | 
 56 | 			bs = bs4.BeautifulSoup(markdown.markdown(mdStr), 'html.parser')
 57 | 			self.assertEqual(
 58 | 				len(bs.find_all('strong')), 1 if tag != 'strong' else 2,
 59 | 				'Tag: {}. Conversion: {}'.format(tag, mdStr)
 60 | 			)
 61 | 
 62 | class TestEscaping(unittest.TestCase):
 63 | 
 64 | 	escapableChars = r'\`*_{}[]()#+-.!'
 65 | 
 66 | 	@classmethod
 67 | 	def setUpClass(cls):
 68 | 		cls.escapedChars = html2markdown._escapeCharSequence
 69 | 
 70 | 	def test_block_tag_escaping(self):
 71 | 		"""formatting characters should NOT be escaped for block-type tags (except <p>)"""
 72 | 		for escChar in self.escapableChars:
 73 | 			testStr = '<div>**escape me**</div>'.replace('*', escChar)
 74 | 			expectedStr = '<div>**escape me**</div>'.replace('*', escChar)
 75 | 			mdStr = html2markdown.convert(testStr)
 76 | 			self.assertEqual(mdStr, expectedStr)
 77 | 
 78 | 	def test_p_escaping(self):
 79 | 		"""formatting characters should be escaped for p tags"""
 80 | 		for escChar in self.escapedChars:
 81 | 			testStr = '<p>**escape me**</p>'.replace('*', escChar)
 82 | 			expectedStr = r'\*\*escape me\*\*'.replace('*', escChar)
 83 | 			mdStr = html2markdown.convert(testStr)
 84 | 			self.assertEqual(mdStr, expectedStr)
 85 | 
 86 | 	def test_p_escaping_2(self):
 87 | 		"""ensure all escapable characters are retained for <p>"""
 88 | 		for escChar in self.escapableChars:
 89 | 			testStr = '<p>**escape me**</p>'.replace('*', escChar)
 90 | 			mdStr = html2markdown.convert(testStr)
 91 | 			reconstructedStr = markdown.markdown(mdStr)
 92 | 			self.assertEqual(reconstructedStr, testStr)
 93 | 
 94 | 	def test_inline_tag_escaping(self):
 95 | 		"""formatting characters should be escaped for inline-type tags"""
 96 | 		for escChar in self.escapedChars:
 97 | 			testStr = '<span>**escape me**</span>'.replace('*', escChar)
 98 | 			expectedStr = r'<span>\*\*escape me\*\*</span>'.replace('*', escChar)
 99 | 			mdStr = html2markdown.convert(testStr)
100 | 			self.assertEqual(mdStr, expectedStr)
101 | 
102 | 	def test_inline_tag_escaping_2(self):
103 | 		"""ensure all escapable characters are retained for inline-type tags"""
104 | 		for escChar in self.escapableChars:
105 | 			testStr = '<p><span>**escape me**</span></p>'
106 | 			mdStr = html2markdown.convert(testStr)
107 | 			reconstructedStr = markdown.markdown(mdStr)
108 | 			self.assertEqual(reconstructedStr, testStr)
109 | 
110 | 	def test_header(self):
111 | 		result = html2markdown.convert('<p># test</p>')
112 | 		bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
113 | 		self.assertEqual(len(bs.find_all('h1')), 0)
114 | 
115 | 		result = html2markdown.convert('<p><h1>test</h1></p>')
116 | 		bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
117 | 		self.assertEqual(len(bs.find_all('h1')), 1)
118 | 
119 | 	def test_links(self):
120 | 		result = html2markdown.convert('<p>[http://google.com](test)</p>')
121 | 		bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
122 | 		self.assertEqual(len(bs.find_all('a')), 0)
123 | 
124 | 		result = html2markdown.convert('<p><a href="http://google.com">test</a></p>')
125 | 		bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
126 | 		self.assertEqual(len(bs.find_all('a')), 1)
127 | 
128 | class TestTags(unittest.TestCase):
129 | 
130 | 	genericStr = '<div><p>asdf</p></div><h2>Test</h2><pre><code>Here is some code</code></pre>'
131 | 	problematic_a_string_1 = "before <a>test</a> after"
132 | 	problematic_a_string_2 = "before <a title=\"test_title\">test</a> after"
133 | 	problematic_a_string_3 = "<a></a>"
134 | 	problematic_a_string_4 = "<a href=\"test\" title=\"test\">test</a>"
135 | 	problematic_a_string_5 = "<a href=\"test\">test</a>"
136 | 	problematic_a_string_6 = "<a href=\"test2\">test</a>"
137 | 
138 | 	def test_h2(self):
139 | 		mdStr = html2markdown.convert(self.genericStr)
140 | 		reconstructedStr = markdown.markdown(mdStr)
141 | 
142 | 		bs = bs4.BeautifulSoup(reconstructedStr, 'html.parser')
143 | 		childTags = bs.find_all(recursive=False)
144 | 
145 | 		self.assertEqual(childTags[1].name, 'h2')
146 | 		self.assertEqual(childTags[1].string, 'Test')
147 | 
148 | 	def test_a(self):
149 | 		mdStr = html2markdown.convert(self.problematic_a_string_1)
150 | 		self.assertEqual(mdStr, self.problematic_a_string_1,
151 | 			"<a> tag without an href attribute should be left alone")
152 | 
153 | 		mdStr = html2markdown.convert(self.problematic_a_string_2)
154 | 		self.assertEqual(mdStr, self.problematic_a_string_2,
155 | 			"<a> tag without an href attribute should be left alone")
156 | 
157 | 		mdStr = html2markdown.convert(self.problematic_a_string_3)
158 | 		self.assertEqual(mdStr, self.problematic_a_string_3,
159 | 			"<a> tag without an href attribute should be left alone")
160 | 
161 | 		mdStr = html2markdown.convert(self.problematic_a_string_4)
162 | 		self.assertEqual(mdStr, '[test](test "test")')
163 | 
164 | 		mdStr = html2markdown.convert(self.problematic_a_string_5)
165 | 		self.assertEqual(mdStr, '<test>')
166 | 
167 | 		mdStr = html2markdown.convert(self.problematic_a_string_6)
168 | 		self.assertEqual(mdStr, '[test](test2)')
169 | 
170 | 	def test_span(self):
171 | 		"""content of inline-type tags should be converted"""
172 | 		testStr = '<span style="text-decoration:line-through;">strike <strong>through</strong> some text</span> here'
173 | 		expectedStr = '<span style="text-decoration:line-through;">strike __through__ some text</span> here'
174 | 		mdStr = html2markdown.convert(testStr)
175 | 		self.assertEqual(mdStr, expectedStr)
176 | 
177 | if __name__ == '__main__':
178 | 	unittest.main()
179 | 


--------------------------------------------------------------------------------