├── .travis.yml
├── LICENSE.txt
├── MANIFEST.in
├── README.rst
├── html2markdown.py
├── requirements.txt
├── setup.cfg
├── setup.py
└── tests.py


/.travis.yml:
--------------------------------------------------------------------------------
language: python
python:
  - 2.7
  - 3.6
cache: pip
install:
  - pip install -r requirements.txt
  - pip install markdown
script:
  - python tests.py

--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
Copyright 2017 David L (dlon)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
--------------------------------------------------------------------------------
/MANIFEST.in:
--------------------------------------------------------------------------------
include README.rst

--------------------------------------------------------------------------------
/README.rst:
--------------------------------------------------------------------------------
=============
html2markdown
=============

.. image:: https://travis-ci.com/dlon/html2markdown.svg?branch=master
    :target: https://travis-ci.com/dlon/html2markdown

**Experimental**

**Purpose**: Converts html to markdown while preserving unsupported html markup. The goal is to generate markdown that can be converted back into html. This is the major difference between html2markdown and html2text. The latter doesn't purport to be reversible.

Usage example
=============
::

    import html2markdown
    print(html2markdown.convert('<h2>Test</h2><pre><code>Here is some code</code></pre>'))

Output::

    ## Test

        Here is some code

Information and caveats
=======================

Does not convert the content of block-type tags other than ``<p>`` -- such as ``<div>`` tags -- into Markdown
--------------------------------------------------------------------------------------------------------------

It does convert to markdown the content of inline-type tags, e.g. ``<span>``.

**Input**: ``<div>this is stuff. <strong>stuff</strong></div>``

**Result**: ``<div>this is stuff. <strong>stuff</strong></div>``

**Input**: ``<p>this is stuff. <strong>stuff</strong></p>``

**Result**: ``this is stuff. __stuff__`` (surrounded by a newline on either side)

**Input**: ``<span style="text-decoration:line-through;">strike <strong>through</strong> some text here</span>``

**Result**: ``<span style="text-decoration:line-through;">strike __through__ some text here</span>``

Except in unprocessed block-type tags, formatting characters are escaped
------------------------------------------------------------------------

**Input**: ``<p>**escape me?**</p>`` (in html, we would use ``<strong>`` here)

**Result**: ``\*\*escape me?\*\*``

**Input**: ``<span>**escape me?**</span>``

**Result**: ``<span>\*\*escape me?\*\*</span>``

**Input**: ``<div>**escape me?**</div>``

**Result**: ``<div>**escape me?**</div>`` (block-type)

Attributes not supported by Markdown are kept
---------------------------------------------

**Example**: ``<a href="http://myaddress" title="click me"><strong>link</strong></a>``

**Result**: ``[__link__](http://myaddress "click me")``

**Example**: ``<a onclick="..."><strong>link</strong></a>``

**Result**: ``<a onclick="...">__link__</a>`` (the attribute *onclick* is not supported, so the tag is left alone)


Limitations
===========

- Tables are kept as html.

Changes
=======

0.1.7:

- Improved handling of inline tags.
- Fix: Ignore ``<a>`` tags without an href attribute.
- Improve escaping.

0.1.6: Added tests and support for Python versions below 2.7.

0.1.5: Fix Unicode issue in Python 3.

0.1.0: First version.

--------------------------------------------------------------------------------
/html2markdown.py:
--------------------------------------------------------------------------------
# -*- coding:utf8 -*-
"""html2markdown converts an html string to markdown while preserving unsupported markup."""
#
# Copyright 2017-2018 David Lönnhager (dlon)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of
# this software and associated documentation files (the "Software"), to deal in
# the Software without restriction, including without limitation the rights to
# use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
# of the Software, and to permit persons to whom the Software is furnished
# to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR
# COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#

import bs4
from bs4 import BeautifulSoup
import re

import sys
if sys.version_info[0] > 2:
    unicode = str

_supportedTags = {
    # NOTE: will be ignored if they have unsupported attributes (cf. _supportedAttributes)
    'blockquote',
    'p',
    'a',
    'h1','h2','h3','h4','h5','h6',
    'strong','b',
    'em','i',
    'ul','ol','li',
    'br',
    'img',
    'pre','code',
    'hr'
}
_supportedAttributes = (
    'a href',
    'a title',
    'img alt',
    'img src',
    'img title',
)

_inlineTags = {
    # these can be mixed with markdown (when unprocessed)
    # block tags will be surrounded by newlines and be unprocessed inside
    # (unless supported tag + supported attribute[s])
    'a',
    'abbr',
    'acronym',
    'audio',
    'b',
    'bdi',
    'bdo',
    'big',
    #'br',
    'button',
    #'canvas',
    'cite',
    'code',
    'data',
    'datalist',
    'del',
    'dfn',
    'em',
    #'embed',
    'i',
    #'iframe',
    #'img',
    #'input',
    'ins',
    'kbd',
    'label',
    'map',
    'mark',
    'meter',
    #'noscript',
    'object',
    #'output',
    'picture',
    #'progress',
    'q',
    'ruby',
    's',
    'samp',
    #'script',
    'select',
    'slot',
    'small',
    'span',
    'strike',
    'strong',
    'sub',
    'sup',
    'svg',
    'template',
    'textarea',
    'time',
    'u',
    'tt',
    'var',
    #'video',
    'wbr',
}

def _supportedAttrs(tag):
    sAttrs = [attr.split(' ')[1] for attr in _supportedAttributes if attr.split(' ')[0]==tag.name]
    for attr in tag.attrs:
        if attr not in sAttrs:
            return False
    return True

def _recursivelyValid(tag):
    # not all tags require this property
    # requires: asdf
    # does not: asdf
    children = tag.find_all(recursive = False)
    for child in children:
        if not _recursivelyValid(child):
            return False
    if tag.name == '[document]':
        return True
    elif tag.name in _inlineTags:
        return True
    elif tag.name not in _supportedTags:
        return False
    if not _supportedAttrs(tag):
        return False
    return True



_escapeCharSequence = tuple(r'\`*_[]#')
_escapeCharRegexStr = '([{}])'.format(''.join(re.escape(c) for c in _escapeCharSequence))
_escapeCharSub = re.compile(_escapeCharRegexStr).sub


def _escapeCharacters(tag):
    """non-recursively escape underlines and asterisks
    in the tag"""
    for i,c in enumerate(tag.contents):
        if type(c) != bs4.element.NavigableString:
            continue
        c.replace_with(_escapeCharSub(r'\\\1', c))

def _breakRemNewlines(tag):
    """non-recursively break spaces and remove newlines in the tag"""
    for i,c in enumerate(tag.contents):
        if type(c) != bs4.element.NavigableString:
            continue
        c.replace_with(re.sub(r' {2,}', ' ', c).replace('\n',''))

def _markdownify(tag, _listType=None, _blockQuote=False, _listIndex=1):
    """recursively converts a tag into markdown"""
    children = tag.find_all(recursive=False)

    if tag.name == '[document]':
        for child in children:
            _markdownify(child)
        return

    if tag.name not in _supportedTags or not _supportedAttrs(tag):
        if tag.name not in _inlineTags:
            tag.insert_before('\n\n')
            tag.insert_after('\n\n')
        else:
            _escapeCharacters(tag)
            for child in children:
                _markdownify(child)
        return
    if tag.name not in ('pre', 'code'):
        _escapeCharacters(tag)
        _breakRemNewlines(tag)
    if tag.name == 'p':
        if tag.string != None:
            if tag.string.strip() == u'':
                tag.string = u'\xa0'
                tag.unwrap()
                return
        if not _blockQuote:
            tag.insert_before('\n\n')
            tag.insert_after('\n\n')
        else:
            tag.insert_before('\n')
            tag.insert_after('\n')
        tag.unwrap()

        for child in children:
            _markdownify(child)
    elif tag.name == 'br':
        # two trailing spaces make a Markdown hard line break
        tag.string = '  \n'
        tag.unwrap()
    elif tag.name == 'img':
        alt = ''
        title = ''
        if tag.has_attr('alt'):
            alt = tag['alt']
        if tag.has_attr('title') and tag['title']:
            title = ' "%s"' % tag['title']
        tag.string = '![%s](%s%s)' % (alt, tag['src'], title)
        tag.unwrap()
    elif tag.name == 'hr':
        tag.string = '\n---\n'
        tag.unwrap()
    elif tag.name == 'pre':
        tag.insert_before('\n\n')
        tag.insert_after('\n\n')
        if tag.code:
            if not _supportedAttrs(tag.code):
                return
            for child in tag.code.find_all(recursive=False):
                if child.name != 'br':
                    return
            # code block
            for br in tag.code.find_all('br'):
                br.string = '\n'
                br.unwrap()
            tag.code.unwrap()
            lines = unicode(tag).strip().split('\n')
            # strip the leading '<pre>' and trailing '</pre>'
            lines[0] = lines[0][5:]
            lines[-1] = lines[-1][:-6]
            if not lines[-1]:
                lines.pop()
            for i,line in enumerate(lines):
                line = line.replace(u'\xa0', ' ')
                # four-space indent marks a Markdown code block
                lines[i] = '    %s' % line
            tag.replace_with(BeautifulSoup('\n'.join(lines), 'html.parser'))
        return
    elif tag.name == 'code':
        # inline code
        if children:
            return
        tag.insert_before('`` ')
        tag.insert_after(' ``')
        tag.unwrap()
    elif _recursivelyValid(tag):
        if tag.name == 'blockquote':
            # ! FIXME: hack
            tag.insert_before('<<<BLOCKQUOTE: ')
            tag.insert_after('>>>')
            tag.unwrap()
            for child in children:
                _markdownify(child, _blockQuote=True)
            return
        elif tag.name == 'a':
            # process children first
            for child in children:
                _markdownify(child)
            if not tag.has_attr('href'):
                return
            if tag.string != tag.get('href') or tag.has_attr('title'):
                title = ''
                if tag.has_attr('title'):
                    title = ' "%s"' % tag['title']
                tag.string = '[%s](%s%s)' % (BeautifulSoup(unicode(tag), 'html.parser').string,
                    tag.get('href', ''),
                    title)
            else:
                # ! FIXME: hack
                tag.string = '<<<FLOATING LINK: %s>>>' % tag.string
            tag.unwrap()
            return
        elif tag.name == 'h1':
            tag.insert_before('\n\n# ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name == 'h2':
            tag.insert_before('\n\n## ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name == 'h3':
            tag.insert_before('\n\n### ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name == 'h4':
            tag.insert_before('\n\n#### ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name == 'h5':
            tag.insert_before('\n\n##### ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name == 'h6':
            tag.insert_before('\n\n###### ')
            tag.insert_after('\n\n')
            tag.unwrap()
        elif tag.name in ('ul', 'ol'):
            tag.insert_before('\n\n')
            tag.insert_after('\n\n')
            tag.unwrap()
            for i, child in enumerate(children):
                _markdownify(child, _listType=tag.name, _listIndex=i+1)
            return
        elif tag.name == 'li':
            if not _listType:
                # <li> outside of list; ignore
                return
            if _listType == 'ul':
                tag.insert_before('* ')
            else:
                tag.insert_before('%d. ' % _listIndex)
            for child in children:
                _markdownify(child)
            for c in tag.contents:
                if type(c) != bs4.element.NavigableString:
                    continue
                c.replace_with('\n '.join(c.split('\n')))
            tag.insert_after('\n')
            tag.unwrap()
            return
        elif tag.name in ('strong','b'):
            tag.insert_before('__')
            tag.insert_after('__')
            tag.unwrap()
        elif tag.name in ('em','i'):
            tag.insert_before('_')
            tag.insert_after('_')
            tag.unwrap()
        for child in children:
            _markdownify(child)

def convert(html):
    """converts an html string to markdown while preserving unsupported markup."""
    bs = BeautifulSoup(html, 'html.parser')
    _markdownify(bs)
    ret = unicode(bs).replace(u'\xa0', ' ')
    ret = re.sub(r'\n{3,}', r'\n\n', ret)
    # The placeholder markers are text nodes, so BeautifulSoup serializes
    # their angle brackets as &lt; / &gt; entities.
    # ! FIXME: hack
    ret = re.sub(r'&lt;&lt;&lt;FLOATING LINK: (.+)&gt;&gt;&gt;', r'<\1>', ret)
    # ! FIXME: hack
    sp = re.split(r'(&lt;&lt;&lt;BLOCKQUOTE: .*?&gt;&gt;&gt;)', ret, flags=re.DOTALL)
    for i,e in enumerate(sp):
        if e[:len('&lt;&lt;&lt;BLOCKQUOTE:')] == '&lt;&lt;&lt;BLOCKQUOTE:':
            sp[i] = '> ' + e[len('&lt;&lt;&lt;BLOCKQUOTE:') : -len('&gt;&gt;&gt;')]
            sp[i] = sp[i].replace('\n', '\n> ')
    ret = ''.join(sp)
    return ret.strip('\n')

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4

--------------------------------------------------------------------------------
/setup.cfg:
--------------------------------------------------------------------------------
[metadata]
description-file = README.rst

--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
# -*- coding: utf8 -*-

from setuptools import setup

from os import path
import io
this_directory = path.abspath(path.dirname(__file__))
with io.open(path.join(this_directory, 'README.rst'), encoding='utf-8') as f:
    longdesc = f.read()

setup(
    name='html2markdown',
    py_modules=['html2markdown'],
    version='0.1.7',
    description='Conservatively convert html to markdown',
    author='David Lönnhager',
    author_email='dv.lnh.d@gmail.com',
    url='https://github.com/dlon/html2markdown',
    install_requires=[
        'beautifulsoup4'
    ],
    long_description=longdesc,
    long_description_content_type='text/x-rst',
)

--------------------------------------------------------------------------------
/tests.py:
--------------------------------------------------------------------------------
import unittest
import html2markdown
import markdown
import bs4


class TestGenericTags(unittest.TestCase):

    emptyElements = {
        'embed',
        'img',
        'input',
        'wbr',
    }

    def test_block_tag_content(self):
        """content of block-type tags should not be converted (except <p>)"""
        testStr = '<div>this is stuff. <strong>stuff</strong></div>'
        mdStr = html2markdown.convert(testStr)
        self.assertEqual(mdStr, testStr)

    def test_p_content(self):
        """<p>'s content should be converted"""
        testStr = '<p>this is stuff. <strong>stuff</strong></p>'
        expectedStr = 'this is stuff. __stuff__'
        mdStr = html2markdown.convert(testStr)
        self.assertEqual(mdStr, expectedStr)

    def test_inline_tag_break(self):
        """inline-type tags should not cause line breaks"""
        emptyElements = self.emptyElements
        for tag in html2markdown._inlineTags:
            if tag not in emptyElements:
                testStr = '<p>test <%s>test</%s> test</p>' % (tag, tag)
            else:
                testStr = '<p>test <%s /> test</p>' % tag
            mdStr = html2markdown.convert(testStr)
            bs = bs4.BeautifulSoup(markdown.markdown(mdStr), 'html.parser')

            self.assertEqual(len(bs.find_all('p')), 1)

    def test_inline_tag_content(self):
        """content of inline-type tags should be converted"""
        emptyElements = self.emptyElements
        for tag in html2markdown._inlineTags:
            if tag in emptyElements:
                continue

            testStr = '<%s style="text-decoration:line-through;">strike <strong>through</strong> some text here</%s>' % (tag, tag)
            expectedStr = '<%s style="text-decoration:line-through;">strike __through__ some text here</%s>' % (tag, tag)

            mdStr = html2markdown.convert(testStr)

            self.assertEqual(mdStr, expectedStr, 'Tag: {}'.format(tag))

            bs = bs4.BeautifulSoup(markdown.markdown(mdStr), 'html.parser')
            self.assertEqual(
                len(bs.find_all('strong')), 1 if tag != 'strong' else 2,
                'Tag: {}. Conversion: {}'.format(tag, mdStr)
            )

class TestEscaping(unittest.TestCase):

    escapableChars = r'\`*_{}[]()#+-.!'

    @classmethod
    def setUpClass(cls):
        cls.escapedChars = html2markdown._escapeCharSequence

    def test_block_tag_escaping(self):
        """formatting characters should NOT be escaped for block-type tags (except <p>)"""
        for escChar in self.escapableChars:
            testStr = '<div>**escape me**</div>'.replace('*', escChar)
            expectedStr = '<div>**escape me**</div>'.replace('*', escChar)
            mdStr = html2markdown.convert(testStr)
            self.assertEqual(mdStr, expectedStr)

    def test_p_escaping(self):
        """formatting characters should be escaped for p tags"""
        for escChar in self.escapedChars:
            testStr = '<p>**escape me**</p>'.replace('*', escChar)
            expectedStr = '\*\*escape me\*\*'.replace('*', escChar)
            mdStr = html2markdown.convert(testStr)
            self.assertEqual(mdStr, expectedStr)

    def test_p_escaping_2(self):
        """ensure all escapable characters are retained for <p>"""
        for escChar in self.escapableChars:
            testStr = '<p>**escape me**</p>'.replace('*', escChar)
            mdStr = html2markdown.convert(testStr)
            reconstructedStr = markdown.markdown(mdStr)
            self.assertEqual(reconstructedStr, testStr)

    def test_inline_tag_escaping(self):
        """formatting characters should be escaped for inline-type tags"""
        for escChar in self.escapedChars:
            testStr = '<span>**escape me**</span>'
            expectedStr = '<span>\*\*escape me\*\*</span>'
            mdStr = html2markdown.convert(testStr)
            self.assertEqual(mdStr, expectedStr)

    def test_inline_tag_escaping_2(self):
        """ensure all escapable characters are retained for inline-type tags"""
        for escChar in self.escapableChars:
            testStr = '<p><span>**escape me**</span></p>'
            mdStr = html2markdown.convert(testStr)
            reconstructedStr = markdown.markdown(mdStr)
            self.assertEqual(reconstructedStr, testStr)

    def test_header(self):
        result = html2markdown.convert('<p># test</p>')
        bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
        self.assertEqual(len(bs.find_all('h1')), 0)

        result = html2markdown.convert('<h1>test</h1>')
        bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
        self.assertEqual(len(bs.find_all('h1')), 1)

    def test_links(self):
        result = html2markdown.convert('<p>[http://google.com](test)</p>')
        bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
        self.assertEqual(len(bs.find_all('a')), 0)

        result = html2markdown.convert('<p><a href="...">test</a></p>')
        bs = bs4.BeautifulSoup(markdown.markdown(result), 'html.parser')
        self.assertEqual(len(bs.find_all('a')), 1)

class TestTags(unittest.TestCase):

    genericStr = '<p>asdf</p><h2>Test</h2><pre><code>Here is some code</code></pre>'
    problematic_a_string_1 = "before test after"
    problematic_a_string_2 = "before test after"
    problematic_a_string_3 = ""
    problematic_a_string_4 = "<a href='test' title='test'>test</a>"
    problematic_a_string_5 = "<a href='test'>test</a>"
    problematic_a_string_6 = "<a href='test2'>test</a>"

    def test_h2(self):
        mdStr = html2markdown.convert(self.genericStr)
        reconstructedStr = markdown.markdown(mdStr)

        bs = bs4.BeautifulSoup(reconstructedStr, 'html.parser')
        childTags = bs.find_all(recursive=False)

        self.assertEqual(childTags[1].name, 'h2')
        self.assertEqual(childTags[1].string, 'Test')

    def test_a(self):
        mdStr = html2markdown.convert(self.problematic_a_string_1)
        self.assertEqual(mdStr, self.problematic_a_string_1,
            "<a> tag without an href attribute should be left alone")

        mdStr = html2markdown.convert(self.problematic_a_string_2)
        self.assertEqual(mdStr, self.problematic_a_string_2,
            "<a> tag without an href attribute should be left alone")

        mdStr = html2markdown.convert(self.problematic_a_string_3)
        self.assertEqual(mdStr, self.problematic_a_string_3,
            "<a> tag without an href attribute should be left alone")

        mdStr = html2markdown.convert(self.problematic_a_string_4)
        self.assertEqual(mdStr, '[test](test "test")')

        mdStr = html2markdown.convert(self.problematic_a_string_5)
        self.assertEqual(mdStr, '<test>')

        mdStr = html2markdown.convert(self.problematic_a_string_6)
        self.assertEqual(mdStr, '[test](test2)')

    def test_span(self):
        """content of inline-type tags should be converted"""
        testStr = '<span style="text-decoration:line-through;">strike <strong>through</strong> some text here</span>'
        expectedStr = '<span style="text-decoration:line-through;">strike __through__ some text here</span>'
        mdStr = html2markdown.convert(testStr)
        self.assertEqual(mdStr, expectedStr)

if __name__ == '__main__':
    unittest.main()

--------------------------------------------------------------------------------
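The escaping rule that `TestEscaping` exercises can be tried on its own, without BeautifulSoup. The following is a minimal standard-library sketch and not a file from the repository: `escape_markdown` and the upper-case constant names are hypothetical, chosen for illustration. It assumes the same character set and substitution pattern that `html2markdown.py` builds in `_escapeCharSequence` and `_escapeCharSub`.

```python
import re

# Assumption: same character set as _escapeCharSequence in html2markdown.py.
ESCAPE_CHARS = tuple(r'\`*_[]#')

# One character class that matches any of the characters above.
ESCAPE_RE = re.compile(
    '([{}])'.format(''.join(re.escape(c) for c in ESCAPE_CHARS))
)


def escape_markdown(text):
    """Hypothetical helper: put a backslash before each formatting character."""
    # r'\\\1' is a literal backslash followed by the matched character.
    return ESCAPE_RE.sub(r'\\\1', text)


if __name__ == '__main__':
    print(escape_markdown('**escape me?**'))
```

Characters outside the set, such as `.` or `!`, pass through unchanged, which is why the tests keep `escapableChars` (everything Markdown allows to be escaped) separate from `escapedChars` (what the module actually escapes).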