├── .gitignore ├── .travis.yml ├── LICENSE ├── README.md ├── go.mod ├── go.sum ├── html2gemini.go ├── html2gemini_test.go └── testdata ├── utf8.html └── utf8_with_bom.xhtml /.gitignore: -------------------------------------------------------------------------------- 1 | # Compiled Object files, Static and Dynamic libs (Shared Objects) 2 | *.o 3 | *.a 4 | *.so 5 | 6 | # Folders 7 | _obj 8 | _test 9 | 10 | # Architecture specific extensions/prefixes 11 | *.[568vq] 12 | [568vq].out 13 | 14 | *.cgo1.go 15 | *.cgo2.c 16 | _cgo_defun.c 17 | _cgo_gotypes.go 18 | _cgo_export.* 19 | 20 | _testmain.go 21 | 22 | *.exe 23 | *.test 24 | *.prof 25 | .hgignore 26 | .hg/ 27 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: go 2 | go: 3 | # n.b. For golang release history, see https://golang.org/doc/devel/release.html 4 | - tip 5 | - "1.13.8" 6 | - "1.12.17" 7 | - "1.11.13" 8 | - "1.10.8" 9 | - "1.9.7" 10 | notifications: 11 | email: 12 | on_success: change 13 | on_failure: always 14 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2015 Jay Taylor 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | 23 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # html2gemini 2 | 3 | ## A Go library that converts HTML into Gemini text/gemini (gemtext) 4 | 5 | This is forked from https://jaytaylor.com/html2text with the following changes: 6 | 7 | * output text/gemini format 8 | * use footnote style references 9 | 10 | ## Introduction 11 | 12 | Turns HTML into text/gemini to be served over gemini, or incorporated into a client. 13 | 14 | html2gemini is a simple golang package for rendering HTML into plaintext. 15 | 16 | 17 | ## Download the package 18 | 19 | ```bash 20 | go get github.com/LukeEmmet/html2gemini 21 | ``` 22 | 23 | ## Example usage 24 | 25 | See https://github.com/LukeEmmet/html2gmi which is a practical command line application that uses this library. Also see https://github.com/LukeEmmet/duckling-proxy which is an HTTP via Gemini proxy server so you can browse the web from any Gemini client that supports scheme-specific proxies. 26 | 27 | You could simplify or sanitise the HTML before passing it to this library, for example using https://github.com/philipjkim/goreadability 28 | 29 | ## Unit-tests 30 | 31 | Running the unit-tests is straightforward and standard: 32 | 33 | ```bash 34 | go test 35 | ``` 36 | 37 | 38 | ## License 39 | 40 | Permissive MIT license. 
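## Footnote-style references (illustrative sketch)

The "footnote style references" change listed near the top of this README can be pictured with a small standalone program. This is only a sketch under stated assumptions: it uses the standard library alone, and the names `link`, `marker` and `linkLine` are hypothetical helpers invented for this illustration, not part of the html2gemini API. It mirrors the general shape the library writes for gathered links (`=> URL [n] label`, with spaces in the URL escaped as `%20`).

```go
package main

import (
	"fmt"
	"strings"
)

// link is a hypothetical stand-in for a gathered citation:
// a running index, the target URL and the display text.
type link struct {
	index   int
	url     string
	display string
}

// marker returns the inline footnote marker for an index, e.g. "[1]".
func marker(idx int) string {
	return fmt.Sprintf("[%d]", idx)
}

// linkLine formats one gemtext link line of the form "=> URL [n] label".
// Spaces in the URL are percent-escaped so the URL stays a single token.
func linkLine(l link) string {
	url := strings.ReplaceAll(l.url, " ", "%20")
	return "=> " + url + " " + marker(l.index) + " " + l.display
}

func main() {
	links := []link{
		{1, "https://example.org/a page", "first link"},
		{2, "gemini://example.org/", "second link"},
	}

	// Body text carries only the markers; the link lines follow afterwards.
	fmt.Println("Some text " + marker(1) + " and more text " + marker(2))
	fmt.Println()
	for _, l := range links {
		fmt.Println(linkLine(l))
	}
}
```

Keeping the markers inline and the link lines separate fits gemtext, where a link has to occupy a line of its own.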
41 | 42 | 43 | ## Contact 44 | 45 | Email: luke [at] marmaladefoo [dot] com 46 | 47 | If you appreciate this library please feel free to drop me a line and tell me, and please send a note of appreciation to Jay Taylor (url below) who wrote the original html2text on which this is based, and who should receive most of the credit. 48 | 49 | https://jaytaylor.com/html2text 50 | 51 | -------------------------------------------------------------------------------- /go.mod: -------------------------------------------------------------------------------- 1 | module github.com/LukeEmmet/html2gemini 2 | 3 | go 1.14 4 | 5 | require ( 6 | github.com/olekukonko/tablewriter v0.0.4 7 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf 8 | golang.org/x/net v0.0.0-20200822124328-c89045814202 9 | ) 10 | -------------------------------------------------------------------------------- /go.sum: -------------------------------------------------------------------------------- 1 | github.com/mattn/go-runewidth v0.0.7 h1:Ei8KR0497xHyKJPAv59M1dkC+rOZCMBJ+t3fZ+twI54= 2 | github.com/mattn/go-runewidth v0.0.7/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m2gUSrubnMI= 3 | github.com/olekukonko/tablewriter v0.0.4 h1:vHD/YYe1Wolo78koG299f7V/VAS08c6IpCLn+Ejf/w8= 4 | github.com/olekukonko/tablewriter v0.0.4/go.mod h1:zq6QwlOf5SlnkVbMSr5EoBv3636FWnp+qbPhuoO21uA= 5 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo= 6 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM= 7 | golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= 8 | golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto= 9 | golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= 10 | golang.org/x/net v0.0.0-20200822124328-c89045814202 
h1:VvcQYSHwXgi7W+TpUR6A9g6Up98WAHf3f/ulnJ62IyA= 11 | golang.org/x/net v0.0.0-20200822124328-c89045814202/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA= 12 | golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= 13 | golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= 14 | golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= 15 | golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= 16 | -------------------------------------------------------------------------------- /html2gemini.go: -------------------------------------------------------------------------------- 1 | package html2gemini 2 | 3 | import ( 4 | "bytes" 5 | "fmt" 6 | "io" 7 | "path/filepath" 8 | "regexp" 9 | "strings" 10 | "unicode" 11 | 12 | "github.com/olekukonko/tablewriter" 13 | "github.com/ssor/bom" 14 | "golang.org/x/net/html" 15 | "golang.org/x/net/html/atom" 16 | ) 17 | 18 | // Options provide toggles and overrides to control specific rendering behaviors. 19 | type Options struct { 20 | PrettyTables bool // Turns on pretty ASCII rendering for table elements. 21 | PrettyTablesOptions *PrettyTablesOptions // Configures pretty ASCII rendering for table elements. 22 | OmitLinks bool // Turns on omitting links 23 | CitationStart int //Start Citations from this number (default 1) 24 | CitationMarkers bool //use footnote style citation markers 25 | LinkEmitFrequency int //emit gathered links after approximately every n paras (otherwise when new heading, or blockquote) 26 | NumberedLinks bool // number the links [1], [2] etc to match citation markers 27 | EmitImagesAsLinks bool //emit referenced images as links e.g. 28 | ImageMarkerPrefix string //prefix when emitting images 29 | EmptyLinkPrefix string //prefix when emitting empty links (e.g. 
30 | ListItemToLinkWordThreshold int //max number of words in a list item having a single link that is converted to a plain gemini link 31 | } 32 | 33 | //NewOptions creates Options with default settings 34 | func NewOptions() *Options { 35 | return &Options{ 36 | PrettyTables: false, 37 | PrettyTablesOptions: NewPrettyTablesOptions(), 38 | OmitLinks: false, 39 | CitationStart: 1, 40 | CitationMarkers: true, 41 | NumberedLinks: true, 42 | LinkEmitFrequency: 2, 43 | EmitImagesAsLinks: true, 44 | ImageMarkerPrefix: "‡", 45 | EmptyLinkPrefix: ">>", 46 | ListItemToLinkWordThreshold: 30, 47 | } 48 | } 49 | 50 | // PrettyTablesOptions overrides tablewriter behaviors 51 | type PrettyTablesOptions struct { 52 | AutoFormatHeader bool 53 | AutoWrapText bool 54 | ReflowDuringAutoWrap bool 55 | ColWidth int 56 | ColumnSeparator string 57 | RowSeparator string 58 | CenterSeparator string 59 | HeaderAlignment int 60 | FooterAlignment int 61 | Alignment int 62 | ColumnAlignment []int 63 | NewLine string 64 | HeaderLine bool 65 | RowLine bool 66 | AutoMergeCells bool 67 | Borders tablewriter.Border 68 | } 69 | 70 | // NewPrettyTablesOptions creates PrettyTablesOptions with default settings 71 | func NewPrettyTablesOptions() *PrettyTablesOptions { 72 | return &PrettyTablesOptions{ 73 | AutoFormatHeader: true, 74 | AutoWrapText: true, 75 | ReflowDuringAutoWrap: true, 76 | ColWidth: tablewriter.MAX_ROW_WIDTH, 77 | ColumnSeparator: tablewriter.COLUMN, 78 | RowSeparator: tablewriter.ROW, 79 | CenterSeparator: tablewriter.CENTER, 80 | HeaderAlignment: tablewriter.ALIGN_DEFAULT, 81 | FooterAlignment: tablewriter.ALIGN_DEFAULT, 82 | Alignment: tablewriter.ALIGN_DEFAULT, 83 | ColumnAlignment: []int{}, 84 | NewLine: tablewriter.NEWLINE, 85 | HeaderLine: true, 86 | RowLine: false, 87 | AutoMergeCells: false, 88 | Borders: tablewriter.Border{Left: true, Right: true, Bottom: true, Top: true}, 89 | } 90 | } 91 | 92 | // FlushCitations emits a list of Gemini links gathered up to this point, if 
the para count exceeds the 93 | // emit frequency 94 | func (ctx *TextifyTraverseContext) CheckFlushCitations() { 95 | 96 | // if ctx.linkAccumulator.emitParaCount > ctx.options.LinkEmitFrequency && ctx.citationCount > 0 { 97 | if ctx.linkAccumulator.emitParaCount > ctx.options.LinkEmitFrequency && len(ctx.linkAccumulator.linkArray) > (ctx.linkAccumulator.flushedToIndex+1) { 98 | ctx.FlushCitations() 99 | } else { 100 | ctx.linkAccumulator.emitParaCount += 1 101 | } 102 | } 103 | 104 | func (ctx *TextifyTraverseContext) FlushCitations() { 105 | ctx.emitGeminiCitations() 106 | } 107 | 108 | func (ctx *TextifyTraverseContext) ResetCitationCounters() { 109 | ctx.linkAccumulator.flushedToIndex = len(ctx.linkAccumulator.linkArray) - 1 110 | ctx.linkAccumulator.emitParaCount = 0 111 | } 112 | 113 | // FromHTMLNode renders text output from a pre-parsed HTML document. 114 | func FromHTMLNode(doc *html.Node, ctx TextifyTraverseContext) (string, error) { 115 | 116 | if err := ctx.traverse(doc); err != nil { 117 | return "", err 118 | } 119 | //flush any remaining citations at the end 120 | ctx.forceFlushGeminiCitations() 121 | 122 | text := strings.TrimSpace(newlineRe.ReplaceAllString( 123 | strings.Replace(ctx.buf.String(), "\n ", "\n", -1), "\n\n"), 124 | ) 125 | 126 | //somewhat hacky tidying up of start and end of blockquotes 127 | startQuote := regexp.MustCompile(`\n *\n+> \n`) 128 | text = startQuote.ReplaceAllString(text, "\n\n") 129 | endQuote := regexp.MustCompile(`\n> \n\n+`) 130 | text = endQuote.ReplaceAllString(text, "\n\n") 131 | text = endQuote.ReplaceAllString(text, "\n\n") 132 | 133 | return text, nil 134 | } 135 | 136 | // FromReader renders text output after parsing HTML for the specified 137 | // io.Reader. 
138 | func FromReader(reader io.Reader, ctx TextifyTraverseContext) (string, error) { 139 | newReader, err := bom.NewReaderWithoutBom(reader) 140 | if err != nil { 141 | return "", err 142 | } 143 | doc, err := html.Parse(newReader) 144 | if err != nil { 145 | return "", err 146 | } 147 | 148 | return FromHTMLNode(doc, ctx) 149 | } 150 | 151 | // FromString parses HTML from the input string, then renders the text form. 152 | func FromString(input string, ctx TextifyTraverseContext) (string, error) { 153 | bs := bom.CleanBom([]byte(input)) 154 | text, err := FromReader(bytes.NewReader(bs), ctx) 155 | if err != nil { 156 | return "", err 157 | } 158 | return text, nil 159 | } 160 | 161 | var ( 162 | spacingRe = regexp.MustCompile(`[ \r\n\t]+`) 163 | newlineRe = regexp.MustCompile(`\n\n+`) 164 | ) 165 | 166 | // TextifyTraverseContext holds text-related context. 167 | type TextifyTraverseContext struct { 168 | buf bytes.Buffer 169 | 170 | prefix string 171 | tableCtx tableTraverseContext 172 | options Options 173 | endsWithSpace bool 174 | justClosedDiv bool 175 | blockquoteLevel int 176 | lineLength int 177 | isPre bool 178 | linkAccumulator linkAccumulatorType 179 | } 180 | 181 | type linkAccumulatorType struct { 182 | emitParaCount int 183 | linkArray []citationLink 184 | flushedToIndex int 185 | tableNestLevel int 186 | } 187 | 188 | func newlinkAccumulator() *linkAccumulatorType { 189 | return &linkAccumulatorType{ 190 | flushedToIndex: -1, 191 | } 192 | } 193 | 194 | type citationLink struct { 195 | index int 196 | url string 197 | display string 198 | } 199 | 200 | // tableTraverseContext holds table ASCII-form related context. 
201 | type tableTraverseContext struct { 202 | header []string 203 | body [][]string 204 | footer []string 205 | tmpRow int 206 | isInFooter bool 207 | } 208 | 209 | func (tableCtx *tableTraverseContext) init() { 210 | tableCtx.body = [][]string{} 211 | tableCtx.header = []string{} 212 | tableCtx.footer = []string{} 213 | tableCtx.isInFooter = false 214 | tableCtx.tmpRow = 0 215 | } 216 | 217 | func NewTraverseContext(options Options) *TextifyTraverseContext { 218 | 219 | // If no options were provided, set defaults for fields whose useful 220 | // default is not the zero value. 221 | // Start citations at 1, not 0, unless the caller chose another start index. 222 | if options.CitationStart == 0 { 223 | options.CitationStart = 1 224 | } 225 | var ctx = TextifyTraverseContext{ 226 | buf: bytes.Buffer{}, 227 | options: options, 228 | } 229 | 230 | ctx.linkAccumulator = *newlinkAccumulator() 231 | 232 | return &ctx 233 | } 234 | func (ctx *TextifyTraverseContext) handleElement(node *html.Node) error { 235 | ctx.justClosedDiv = false 236 | 237 | prefix := "" 238 | 239 | switch node.DataAtom { 240 | case atom.Br: 241 | return ctx.emit("\n") 242 | 243 | case atom.H1, atom.H2, atom.H3: 244 | 245 | if node.DataAtom == atom.H1 { 246 | ctx.FlushCitations() 247 | prefix = "# " 248 | } 249 | if node.DataAtom == atom.H2 { 250 | ctx.FlushCitations() 251 | prefix = "## " 252 | } 253 | 254 | if node.DataAtom == atom.H3 { 255 | ctx.FlushCitations() 256 | prefix = "### " 257 | } 258 | 259 | ctx.emit("\n\n" + prefix) 260 | if err := ctx.traverseChildren(node); err != nil { 261 | return err 262 | } 263 | return ctx.emit("\n\n") 264 | 265 | case atom.Blockquote: 266 | ctx.FlushCitations() 267 | //if err := ctx.emit("\n"); err != nil { 268 | // return err 269 | //} 270 | ctx.blockquoteLevel++ 271 | ctx.prefix = strings.Repeat(">", ctx.blockquoteLevel) + " " 272 | //if ctx.blockquoteLevel == 1 { 273 | // if err := ctx.emit("\n"); err != nil { 274 | // return err 275 | // } 276 | //} 277 | if err := ctx.traverseChildren(node); err != nil { 
278 | return err 279 | } 280 | ctx.blockquoteLevel-- 281 | ctx.prefix = strings.Repeat(">", ctx.blockquoteLevel) 282 | if ctx.blockquoteLevel > 0 { 283 | ctx.prefix += " " 284 | } 285 | //return ctx.emit("\n\n") 286 | return ctx.emit("") 287 | 288 | case atom.Div: 289 | 290 | if ctx.lineLength > 0 { 291 | if err := ctx.emit("\n"); err != nil { 292 | return err 293 | } 294 | } 295 | if err := ctx.traverseChildren(node); err != nil { 296 | return err 297 | } 298 | var err error 299 | if !ctx.justClosedDiv { 300 | err = ctx.emit("\n") 301 | } 302 | ctx.justClosedDiv = true 303 | return err 304 | 305 | case atom.Li: 306 | 307 | //a test context to examine the list element to see if it just has a single link 308 | //in which case we'll output a link line, or no links in which case we output just a bullet 309 | testCtx := TextifyTraverseContext{} 310 | if err := testCtx.traverseChildren(node); err != nil { 311 | return err 312 | } 313 | 314 | //if content contains just one link, output a link instead of a bullet if within a specified number of 315 | //words 316 | maxSingletonLinkLength := ctx.options.ListItemToLinkWordThreshold 317 | if (len(strings.Split(testCtx.buf.String(), " ")) < maxSingletonLinkLength) && (len(testCtx.linkAccumulator.linkArray) == 1) { 318 | return ctx.emit("=> " + testCtx.linkAccumulator.linkArray[0].url + " " + testCtx.buf.String() + "\n") 319 | } 320 | 321 | //if no links, just emit a bullet with the text, ignoring any sub elements 322 | if len(testCtx.linkAccumulator.linkArray) == 0 { 323 | return ctx.emit("* " + testCtx.buf.String() + "\n") 324 | } 325 | 326 | //otherwise is mixed content, so keep traversing 327 | if err := ctx.emit("* "); err != nil { 328 | return err 329 | } 330 | 331 | if err := ctx.traverseChildren(node); err != nil { 332 | return err 333 | } 334 | 335 | return ctx.emit("\n") 336 | 337 | case atom.Img: 338 | //output images with a link to the image 339 | hrefLink := "" 340 | altText := "" 341 | if altText = 
getAttrVal(node, "alt"); altText != "" { 342 | // alt text is present, so use it as-is 343 | } else { 344 | if src := getAttrVal(node, "src"); src != "" { 345 | //try to get the last element of the path 346 | fileName := filepath.Base(src) 347 | fileBase := strings.TrimSuffix(fileName, filepath.Ext(fileName)) 348 | altText = fileBase 349 | } 350 | } 351 | altText = "[" + ctx.options.ImageMarkerPrefix + " " + altText + "]" 352 | altText = strings.ReplaceAll(altText, "_", " ") 353 | altText = strings.ReplaceAll(altText, "-", " ") 354 | altText = strings.ReplaceAll(altText, " ", " ") 355 | 356 | if ctx.options.EmitImagesAsLinks { 357 | if err := ctx.emit(altText); err != nil { 358 | return err 359 | } 360 | 361 | if attrVal := getAttrVal(node, "src"); attrVal != "" { 362 | attrVal = ctx.normalizeHrefLink(attrVal) 363 | if !ctx.options.OmitLinks && attrVal != "" && altText != attrVal { 364 | hrefLink = ctx.addGeminiCitation(attrVal, altText) 365 | } 366 | } 367 | return ctx.emit(hrefLink) 368 | } else { 369 | return ctx.emit(altText) 370 | } 371 | 372 | case atom.A: 373 | linkText := "" 374 | // For simple link element content with single text node only, peek at the link text. 375 | if node.FirstChild != nil && node.FirstChild.NextSibling == nil && node.FirstChild.Type == html.TextNode { 376 | linkText = node.FirstChild.Data 377 | } 378 | 379 | if err := ctx.traverseChildren(node); err != nil { 380 | return err 381 | } 382 | 383 | // If image is the only child, the image will have been shown as a link with its alt text etc 384 | // so choose a simple marker for the link itself 385 | if img := node.FirstChild; img != nil && node.LastChild == img && img.DataAtom == atom.Img { 386 | linkText = ctx.options.EmptyLinkPrefix 387 | ctx.emit(" " + linkText) 388 | } 389 | 390 | hrefLink := "" 391 | if attrVal := getAttrVal(node, "href"); attrVal != "" { 392 | attrVal = ctx.normalizeHrefLink(attrVal) 393 | // Don't print link href if it matches link element content or if the link is empty. 
394 | if !ctx.options.OmitLinks && attrVal != "" && linkText != attrVal { 395 | hrefLink = ctx.addGeminiCitation(attrVal, linkText) 396 | } 397 | } 398 | 399 | return ctx.emit(hrefLink) 400 | 401 | case atom.Ul: 402 | 403 | return ctx.paragraphHandler(node) 404 | 405 | case atom.P: 406 | 407 | //a test context to examine the list element to see if it just has a single link 408 | //in which case we'll output a link line, or no links in which case we output just a bullet 409 | testCtx := TextifyTraverseContext{} 410 | if err := testCtx.traverseChildren(node); err != nil { 411 | return err 412 | } 413 | 414 | //if content contains just one link, output a link instead of a para if within a specified number of 415 | //words 416 | maxSingletonLinkLength := ctx.options.ListItemToLinkWordThreshold 417 | if (len(strings.Split(testCtx.buf.String(), " ")) < maxSingletonLinkLength) && (len(testCtx.linkAccumulator.linkArray) == 1) { 418 | return ctx.emit("=> " + testCtx.linkAccumulator.linkArray[0].url + " " + testCtx.buf.String() + "\n") 419 | } 420 | 421 | //if no links, just emit a para with the text, ignoring any sub elements 422 | if len(testCtx.linkAccumulator.linkArray) == 0 { 423 | return ctx.emit(testCtx.buf.String() + "\n") 424 | } 425 | 426 | //else - mixed content 427 | return ctx.paragraphHandler(node) 428 | 429 | case atom.Table, atom.Tfoot, atom.Th, atom.Tr, atom.Td: 430 | 431 | if ctx.options.PrettyTables { 432 | return ctx.handleTableElement(node) 433 | } else if node.DataAtom == atom.Table { 434 | //just treat tables as a type of paragraph 435 | ctx.emit("\n\n⊞ table ⊞\n\n") 436 | return ctx.paragraphHandler(node) 437 | } 438 | 439 | if node.DataAtom == atom.Tr { 440 | //start a new line 441 | ctx.emit("\n") 442 | } 443 | 444 | return ctx.traverseChildren(node) 445 | 446 | case atom.Pre: 447 | ctx.emit("\n\n```\n") 448 | ctx.isPre = true 449 | err := ctx.traverseChildren(node) 450 | ctx.isPre = false 451 | ctx.emit("\n```\n\n") 452 | return err 453 | 454 | 
case atom.Style, atom.Script, atom.Head: 455 | // Ignore the subtree. 456 | return nil 457 | 458 | default: 459 | return ctx.traverseChildren(node) 460 | } 461 | } 462 | 463 | // paragraphHandler renders node children surrounded by double newlines. 464 | func (ctx *TextifyTraverseContext) paragraphHandler(node *html.Node) error { 465 | ctx.CheckFlushCitations() 466 | 467 | if err := ctx.emit("\n\n"); err != nil { 468 | return err 469 | } 470 | 471 | if err := ctx.traverseChildren(node); err != nil { 472 | return err 473 | } 474 | if err := ctx.emit("\n\n"); err != nil { 475 | return err 476 | } 477 | 478 | return nil 479 | } 480 | 481 | // handleTableElement is only to be invoked when options.PrettyTables is active. 482 | func (ctx *TextifyTraverseContext) handleTableElement(node *html.Node) error { 483 | if !ctx.options.PrettyTables { 484 | panic("handleTableElement invoked when PrettyTables not active") 485 | } 486 | 487 | switch node.DataAtom { 488 | case atom.Table: 489 | 490 | if ctx.linkAccumulator.tableNestLevel == 0 { 491 | if err := ctx.emit("\n\n```\n"); err != nil { 492 | return err 493 | } 494 | } else { 495 | if err := ctx.emit("\n\n"); err != nil { 496 | return err 497 | } 498 | } 499 | 500 | ctx.linkAccumulator.tableNestLevel++ 501 | 502 | // Re-initialize all table context. 503 | ctx.tableCtx.init() 504 | 505 | // Browse children, enriching context with table data. 
506 | if err := ctx.traverseChildren(node); err != nil { 507 | return err 508 | } 509 | 510 | buf := &bytes.Buffer{} 511 | table := tablewriter.NewWriter(buf) 512 | if ctx.options.PrettyTablesOptions != nil { 513 | options := ctx.options.PrettyTablesOptions 514 | table.SetAutoFormatHeaders(options.AutoFormatHeader) 515 | table.SetAutoWrapText(options.AutoWrapText) 516 | table.SetReflowDuringAutoWrap(options.ReflowDuringAutoWrap) 517 | table.SetColWidth(options.ColWidth) 518 | table.SetColumnSeparator(options.ColumnSeparator) 519 | table.SetRowSeparator(options.RowSeparator) 520 | table.SetCenterSeparator(options.CenterSeparator) 521 | table.SetHeaderAlignment(options.HeaderAlignment) 522 | table.SetFooterAlignment(options.FooterAlignment) 523 | table.SetAlignment(options.Alignment) 524 | table.SetColumnAlignment(options.ColumnAlignment) 525 | table.SetNewLine(options.NewLine) 526 | table.SetHeaderLine(options.HeaderLine) 527 | table.SetRowLine(options.RowLine) 528 | table.SetAutoMergeCells(options.AutoMergeCells) 529 | table.SetBorders(options.Borders) 530 | } 531 | table.SetHeader(ctx.tableCtx.header) 532 | table.SetFooter(ctx.tableCtx.footer) 533 | table.AppendBulk(ctx.tableCtx.body) 534 | 535 | // Render the table using ASCII. 
536 | table.Render() 537 | if err := ctx.emit(buf.String()); err != nil { 538 | return err 539 | } 540 | 541 | ctx.linkAccumulator.tableNestLevel-- 542 | 543 | if ctx.linkAccumulator.tableNestLevel == 0 { 544 | return ctx.emit("```\n\n") 545 | } else { 546 | return ctx.emit("\n\n") 547 | } 548 | 549 | case atom.Tfoot: 550 | ctx.tableCtx.isInFooter = true 551 | if err := ctx.traverseChildren(node); err != nil { 552 | return err 553 | } 554 | ctx.tableCtx.isInFooter = false 555 | 556 | case atom.Tr: 557 | ctx.tableCtx.body = append(ctx.tableCtx.body, []string{}) 558 | if err := ctx.traverseChildren(node); err != nil { 559 | return err 560 | } 561 | ctx.tableCtx.tmpRow++ 562 | 563 | case atom.Th: 564 | res, err := ctx.renderEachChild(node) 565 | if err != nil { 566 | return err 567 | } 568 | 569 | ctx.tableCtx.header = append(ctx.tableCtx.header, res) 570 | 571 | case atom.Td: 572 | res, err := ctx.renderEachChild(node) 573 | if err != nil { 574 | return err 575 | } 576 | 577 | if ctx.tableCtx.isInFooter { 578 | ctx.tableCtx.footer = append(ctx.tableCtx.footer, res) 579 | } else { 580 | ctx.tableCtx.body[ctx.tableCtx.tmpRow] = append(ctx.tableCtx.body[ctx.tableCtx.tmpRow], res) 581 | } 582 | 583 | } 584 | return nil 585 | } 586 | 587 | func (ctx *TextifyTraverseContext) traverse(node *html.Node) error { 588 | switch node.Type { 589 | default: 590 | return ctx.traverseChildren(node) 591 | 592 | case html.TextNode: 593 | var data string 594 | if ctx.isPre { 595 | data = node.Data 596 | } else { 597 | data = strings.TrimSpace(spacingRe.ReplaceAllString(node.Data, " ")) 598 | } 599 | return ctx.emit(data) 600 | 601 | case html.ElementNode: 602 | return ctx.handleElement(node) 603 | } 604 | } 605 | 606 | func (ctx *TextifyTraverseContext) traverseChildren(node *html.Node) error { 607 | for c := node.FirstChild; c != nil; c = c.NextSibling { 608 | if err := ctx.traverse(c); err != nil { 609 | return err 610 | } 611 | } 612 | 613 | return nil 614 | } 615 | 616 | // Tests r 
for being a character where no space should be inserted in front of. 617 | func punctNoSpaceBefore(r rune) bool { 618 | switch r { 619 | case '.', ',', ';', '!', '?', ')', ']', '>': 620 | return true 621 | default: 622 | return false 623 | } 624 | } 625 | 626 | // Tests r for being a character where no space should be inserted after. 627 | func punctNoSpaceAfter(r rune) bool { 628 | switch r { 629 | case '(', '[', '<': 630 | return true 631 | default: 632 | return false 633 | } 634 | } 635 | func (ctx *TextifyTraverseContext) emit(data string) error { 636 | if data == "" { 637 | return nil 638 | } 639 | 640 | var lines = []string{data} 641 | 642 | for _, line := range lines { 643 | runes := []rune(line) 644 | startsWithSpace := unicode.IsSpace(runes[0]) || punctNoSpaceBefore(runes[0]) 645 | if !startsWithSpace && !ctx.endsWithSpace { 646 | if err := ctx.buf.WriteByte(' '); err != nil { 647 | return err 648 | } 649 | ctx.lineLength++ 650 | } 651 | ctx.endsWithSpace = unicode.IsSpace(runes[len(runes)-1]) || punctNoSpaceAfter(runes[len(runes)-1]) 652 | for _, c := range line { 653 | if _, err := ctx.buf.WriteString(string(c)); err != nil { 654 | return err 655 | } 656 | ctx.lineLength++ 657 | if c == '\n' { 658 | ctx.lineLength = 0 659 | if ctx.prefix != "" { 660 | if _, err := ctx.buf.WriteString(ctx.prefix); err != nil { 661 | return err 662 | } 663 | } 664 | } 665 | } 666 | } 667 | return nil 668 | } 669 | 670 | func (ctx *TextifyTraverseContext) normalizeHrefLink(link string) string { 671 | link = strings.TrimSpace(link) 672 | link = strings.TrimPrefix(link, "mailto:") 673 | return link 674 | } 675 | 676 | func formatGeminiCitation(idx int, showMarker bool) string { 677 | if showMarker { 678 | return fmt.Sprintf("[%d]", idx) 679 | } else { 680 | return "" 681 | } 682 | 683 | } 684 | 685 | func (ctx *TextifyTraverseContext) addGeminiCitation(url string, display string) string { 686 | 687 | if url[0:1] == "#" { 688 | //dont emit bookmarks to the same page (url 
starts #) 689 | return "" 690 | } else { 691 | citation := citationLink{ 692 | index: len(ctx.linkAccumulator.linkArray) + ctx.options.CitationStart, 693 | display: display, 694 | url: url, 695 | } 696 | 697 | //spaces would mess up the gemini link, so check for them 698 | if strings.Contains(citation.url, " ") { 699 | //escape the spaces 700 | citation.url = strings.ReplaceAll(citation.url, " ", "%20") 701 | 702 | } 703 | ctx.linkAccumulator.linkArray = append(ctx.linkAccumulator.linkArray, citation) 704 | return formatGeminiCitation(citation.index, ctx.options.CitationMarkers) 705 | } 706 | 707 | } 708 | 709 | func (ctx *TextifyTraverseContext) forceFlushGeminiCitations() { 710 | // this method writes to the buffer directly instead of using `emit`, b/c we do not want to split long links 711 | 712 | if ctx.linkAccumulator.tableNestLevel > 0 { 713 | //dont emit citation list inside a table 714 | return 715 | } 716 | 717 | ctx.buf.WriteString("\n") 718 | 719 | //ctx.buf.WriteString("flushedtoindex: ") 720 | //ctx.buf.WriteString(formatGeminiCitation(ctx.linkAccumulator.flushedToIndex)) 721 | ctx.buf.WriteByte('\n') 722 | 723 | for i, link := range ctx.linkAccumulator.linkArray { 724 | // ctx.buf.WriteString(formatGeminiCitation(i)) 725 | 726 | if i > ctx.linkAccumulator.flushedToIndex { 727 | ctx.buf.WriteString("=> ") 728 | ctx.buf.WriteString(link.url) 729 | ctx.buf.WriteByte(' ') 730 | ctx.buf.WriteString(formatGeminiCitation(link.index, ctx.options.NumberedLinks)) 731 | ctx.buf.WriteByte(' ') 732 | ctx.buf.WriteString(link.display) 733 | ctx.buf.WriteByte('\n') 734 | } 735 | } 736 | 737 | ctx.buf.WriteByte('\n') 738 | 739 | ctx.ResetCitationCounters() 740 | 741 | } 742 | func (ctx *TextifyTraverseContext) emitGeminiCitations() { 743 | 744 | if len(ctx.linkAccumulator.linkArray) > ctx.linkAccumulator.flushedToIndex { 745 | //there are unflushed links 746 | ctx.forceFlushGeminiCitations() 747 | } 748 | } 749 | 750 | // renderEachChild visits each direct child of a 
node and collects the sequence of 751 | // textual representations separated by a single newline. 752 | func (ctx *TextifyTraverseContext) renderEachChild(node *html.Node) (string, error) { 753 | buf := &bytes.Buffer{} 754 | for c := node.FirstChild; c != nil; c = c.NextSibling { 755 | s, err := FromHTMLNode(c, *ctx) 756 | if err != nil { 757 | return "", err 758 | } 759 | if _, err = buf.WriteString(s); err != nil { 760 | return "", err 761 | } 762 | if c.NextSibling != nil { 763 | if err = buf.WriteByte('\n'); err != nil { 764 | return "", err 765 | } 766 | } 767 | } 768 | return buf.String(), nil 769 | } 770 | 771 | func getAttrVal(node *html.Node, attrName string) string { 772 | for _, attr := range node.Attr { 773 | if attr.Key == attrName { 774 | return attr.Val 775 | } 776 | } 777 | 778 | return "" 779 | } 780 | -------------------------------------------------------------------------------- /html2gemini_test.go: -------------------------------------------------------------------------------- 1 | package html2gemini 2 | 3 | import ( 4 | "bytes" 5 | "fmt" 6 | "io/ioutil" 7 | "os" 8 | "path" 9 | "regexp" 10 | "strings" 11 | "testing" 12 | ) 13 | 14 | const destPath = "testdata" 15 | 16 | // EnableExtraLogging turns on additional testing log output. 17 | // Extra test logging can be enabled by setting the environment variable 18 | // HTML2TEXT_EXTRA_LOGGING to "1" or "true". 19 | var EnableExtraLogging bool 20 | 21 | func init() { 22 | if v := os.Getenv("HTML2TEXT_EXTRA_LOGGING"); v == "1" || v == "true" { 23 | EnableExtraLogging = true 24 | } 25 | } 26 | 27 | // TODO Add tests for FromHTMLNode and FromReader. 
28 | 29 | func TestParseUTF8(t *testing.T) { 30 | htmlFiles := []struct { 31 | file string 32 | keywordShouldNotExist string 33 | keywordShouldExist string 34 | }{ 35 | { 36 | "utf8.html", 37 | "学习之道:美国公认学习第一书title", 38 | "次世界冠军赛上,我几近疯狂", 39 | }, 40 | { 41 | "utf8_with_bom.xhtml", 42 | "1892年波兰文版序言title", 43 | "种新的波兰文本已成为必要", 44 | }, 45 | } 46 | 47 | for _, htmlFile := range htmlFiles { 48 | bs, err := ioutil.ReadFile(path.Join(destPath, htmlFile.file)) 49 | if err != nil { 50 | t.Fatal(err) 51 | } 52 | ctx := NewTraverseContext(Options{}) 53 | text, err := FromReader(bytes.NewReader(bs), *ctx) 54 | if err != nil { 55 | t.Fatal(err) 56 | } 57 | if !strings.Contains(text, htmlFile.keywordShouldExist) { 58 | t.Fatalf("keyword %s should exist in file %s", htmlFile.keywordShouldExist, htmlFile.file) 59 | } 60 | if strings.Contains(text, htmlFile.keywordShouldNotExist) { 61 | t.Fatalf("keyword %s should not exist in file %s", htmlFile.keywordShouldNotExist, htmlFile.file) 62 | } 63 | } 64 | } 65 | 66 | func TestStrippingWhitespace(t *testing.T) { 67 | testCases := []struct { 68 | input string 69 | output string 70 | }{ 71 | { 72 | "test text", 73 | "test text", 74 | }, 75 | { 76 | " \ttext\ntext\n", 77 | "text text", 78 | }, 79 | { 80 | " \na \n\t \n \n a \t", 81 | "a a", 82 | }, 83 | { 84 | "test text", 85 | "test text", 86 | }, 87 | { 88 | "test    text ", 89 | "test    text", 90 | }, 91 | } 92 | 93 | for _, testCase := range testCases { 94 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 95 | t.Error(err) 96 | } else if len(msg) > 0 { 97 | t.Log(msg) 98 | } 99 | } 100 | } 101 | 102 | func TestParagraphsAndBreaks(t *testing.T) { 103 | testCases := []struct { 104 | input string 105 | output string 106 | }{ 107 | { 108 | "Test text", 109 | "Test text", 110 | }, 111 | { 112 | "Test text<br>", 113 | "Test text", 114 | }, 115 | { 116 | "Test text<br>Test", 117 | "Test text\nTest", 118 | }, 119 | { 120 | "<p>Test text</p>", 121 | "Test text", 122 | }, 123 | { 124 | "<p>Test text</p><p>Test text</p>", 125 | "Test text\n\nTest text", 126 | }, 127 | { 128 | "\n<p>Test text</p>\n\n\n\t<p>Test text</p>\n", 129 | "Test text\n\nTest text", 130 | }, 131 | { 132 | "\n<p>Test text<br>Test text</p>\n", 133 | "Test text\nTest text", 134 | }, 135 | { 136 | "\n<p>Test text<br>\tTest text</p>\n", 137 | "Test text\nTest text", 138 | }, 139 | { 140 | "Test text<br><br>Test text", 141 | "Test text\n\nTest text", 142 | }, 143 | { 144 | "<pre>test1\ntest 2\n\ntest  3</pre>", 145 | "```\ntest1\ntest 2\n\ntest 3\n```", 146 | }, 147 | } 148 | 149 | for _, testCase := range testCases { 150 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 151 | t.Error(err) 152 | } else if len(msg) > 0 { 153 | t.Log(msg) 154 | } 155 | } 156 | } 157 | 158 | func TestTables(t *testing.T) { 159 | testCases := []struct { 160 | input string 161 | tabularOutput string 162 | plaintextOutput string 163 | }{ 164 | { 165 | "<table><tr><td></td><td></td></tr></table>", 166 | // Empty table 167 | // +--+--+ 168 | // | | | 169 | // +--+--+ 170 | "```\n+--+--+\n| | |\n+--+--+\n```", 171 | "", 172 | }, 173 | { 174 | "<table><tr><td>cell1</td><td>cell2</td></tr></table>", 175 | // +-------+-------+ 176 | // | cell1 | cell2 | 177 | // +-------+-------+ 178 | "```\n+-------+-------+\n| cell1 | cell2 |\n+-------+-------+\n```", 179 | "cell1 cell2", 180 | }, 181 | { 182 | "<table><tr><td>row1</td></tr><tr><td>row2</td></tr></table>", 183 | // +------+ 184 | // | row1 | 185 | // | row2 | 186 | // +------+ 187 | "```\n+------+\n| row1 |\n| row2 |\n+------+\n```", 188 | "row1 row2", 189 | }, 190 | { 191 | ` 192 | 193 | 194 | 195 | 196 |

Row-1-Col-1-Msg123456789012345

Row-1-Col-1-Msg2

Row-1-Col-2
Row-2-Col-1Row-2-Col-2
`, 197 | // +--------------------------------+-------------+ 198 | // | Row-1-Col-1-Msg123456789012345 | Row-1-Col-2 | 199 | // | Row-1-Col-1-Msg2 | | 200 | // | Row-2-Col-1 | Row-2-Col-2 | 201 | // +--------------------------------+-------------+ 202 | "```\n" + `+--------------------------------+-------------+ 203 | | Row-1-Col-1-Msg123456789012345 | Row-1-Col-2 | 204 | | Row-1-Col-1-Msg2 | | 205 | | Row-2-Col-1 | Row-2-Col-2 | 206 | +--------------------------------+-------------+` + "\n```", 207 | `Row-1-Col-1-Msg123456789012345 208 | 209 | Row-1-Col-1-Msg2 210 | 211 | Row-1-Col-2 Row-2-Col-1 Row-2-Col-2` , 212 | }, 213 | { 214 | ` 215 | 216 | 217 |
cell1-1cell1-2
cell2-1cell2-2
`, 218 | // +---------+---------+ 219 | // | cell1-1 | cell1-2 | 220 | // | cell2-1 | cell2-2 | 221 | // +---------+---------+ 222 | "```\n+---------+---------+\n| cell1-1 | cell1-2 |\n| cell2-1 | cell2-2 |\n+---------+---------+\n```", 223 | "cell1-1 cell1-2 cell2-1 cell2-2", 224 | }, 225 | { 226 | ` 227 | 228 | 229 | 230 | 231 | 232 | 233 | 234 | 235 | 236 | 237 |
Header 1Header 2
Footer 1Footer 2
Row 1 Col 1Row 1 Col 2
Row 2 Col 1Row 2 Col 2
`, 238 | "```\n" + `+-------------+-------------+ 239 | | HEADER 1 | HEADER 2 | 240 | +-------------+-------------+ 241 | | Row 1 Col 1 | Row 1 Col 2 | 242 | | Row 2 Col 1 | Row 2 Col 2 | 243 | +-------------+-------------+ 244 | | FOOTER 1 | FOOTER 2 | 245 | +-------------+-------------+` + "\n```", 246 | "Header 1 Header 2 Footer 1 Footer 2 Row 1 Col 1 Row 1 Col 2 Row 2 Col 1 Row 2 Col 2", 247 | }, 248 | // Two tables in same HTML (goal is to test that context is 249 | // reinitialized correctly). 250 | { 251 | `

252 | 253 | 254 | 255 | 256 | 257 | 258 | 259 | 260 | 261 | 262 | 263 |
Table 1 Header 1Table 1 Header 2
Table 1 Footer 1Table 1 Footer 2
Table 1 Row 1 Col 1Table 1 Row 1 Col 2
Table 1 Row 2 Col 1Table 1 Row 2 Col 2
264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 |
Table 2 Header 1Table 2 Header 2
Table 2 Footer 1Table 2 Footer 2
Table 2 Row 1 Col 1Table 2 Row 1 Col 2
Table 2 Row 2 Col 1Table 2 Row 2 Col 2
276 |

`, 277 | "```\n" + `+---------------------+---------------------+ 278 | | TABLE 1 HEADER 1 | TABLE 1 HEADER 2 | 279 | +---------------------+---------------------+ 280 | | Table 1 Row 1 Col 1 | Table 1 Row 1 Col 2 | 281 | | Table 1 Row 2 Col 1 | Table 1 Row 2 Col 2 | 282 | +---------------------+---------------------+ 283 | | TABLE 1 FOOTER 1 | TABLE 1 FOOTER 2 | 284 | +---------------------+---------------------+ 285 | ` + "```\n" + "\n```" + ` 286 | +---------------------+---------------------+ 287 | | TABLE 2 HEADER 1 | TABLE 2 HEADER 2 | 288 | +---------------------+---------------------+ 289 | | Table 2 Row 1 Col 1 | Table 2 Row 1 Col 2 | 290 | | Table 2 Row 2 Col 1 | Table 2 Row 2 Col 2 | 291 | +---------------------+---------------------+ 292 | | TABLE 2 FOOTER 1 | TABLE 2 FOOTER 2 | 293 | +---------------------+---------------------+` + "\n```", 294 | `Table 1 Header 1 Table 1 Header 2 Table 1 Footer 1 Table 1 Footer 2 Table 1 Row 1 Col 1 Table 1 Row 1 Col 2 Table 1 Row 2 Col 1 Table 1 Row 2 Col 2 295 | 296 | Table 2 Header 1 Table 2 Header 2 Table 2 Footer 1 Table 2 Footer 2 Table 2 Row 1 Col 1 Table 2 Row 1 Col 2 Table 2 Row 2 Col 1 Table 2 Row 2 Col 2`, 297 | }, 298 | { 299 | "_
cell
_", 300 | "_\n\n```\n+------+\n| cell |\n+------+\n```\n\n_", 301 | "_\n\ncell\n\n_", 302 | }, 303 | { 304 | ` 305 | 306 | 307 | 308 | 309 | 310 | 311 | 312 | 313 | 314 | 315 | 316 | 317 | 318 | 319 | 320 |
ItemDescriptionPrice
GolangOpen source programming language that makes it easy to build simple, reliable, and efficient software$10.99
HermesProgrammatically create beautiful e-mails using Golang.$1.99
`, 321 | "```\n" + `+--------+--------------------------------+--------+ 322 | | ITEM | DESCRIPTION | PRICE | 323 | +--------+--------------------------------+--------+ 324 | | Golang | Open source programming | $10.99 | 325 | | | language that makes it easy | | 326 | | | to build simple, reliable, and | | 327 | | | efficient software | | 328 | | Hermes | Programmatically create | $1.99 | 329 | | | beautiful e-mails using | | 330 | | | Golang. | | 331 | +--------+--------------------------------+--------+` + "\n```", 332 | "Item Description Price Golang Open source programming language that makes it easy to build simple, reliable, and efficient software $10.99 Hermes Programmatically create beautiful e-mails using Golang. $1.99", 333 | }, 334 | } 335 | 336 | for _, testCase := range testCases { 337 | options := Options{ 338 | PrettyTables: true, 339 | PrettyTablesOptions: NewPrettyTablesOptions(), 340 | } 341 | // Check pretty tabular ASCII version. 342 | if msg, err := wantString(testCase.input, testCase.tabularOutput, options); err != nil { 343 | t.Error(err) 344 | } else if len(msg) > 0 { 345 | t.Log(msg) 346 | } 347 | 348 | // Check plain version. 349 | if msg, err := wantString(testCase.input, testCase.plaintextOutput); err != nil { 350 | t.Error(err) 351 | } else if len(msg) > 0 { 352 | t.Log(msg) 353 | } 354 | } 355 | } 356 | 357 | func TestStrippingLists(t *testing.T) { 358 | testCases := []struct { 359 | input string 360 | output string 361 | }{ 362 | { 363 | "", 364 | "", 365 | }, 366 | { 367 | "_", 368 | "* item\n\n_", 369 | }, 370 | { 371 | "
  • item 1
  • item 2
  • \n_", 372 | "* item 1\n* item 2\n_", 373 | }, 374 | { 375 | "
  • item 1
  • \t\n
  • item 2
  • item 3
  • \n_", 376 | "* item 1\n* item 2\n* item 3\n_", 377 | }, 378 | } 379 | 380 | for _, testCase := range testCases { 381 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 382 | t.Error(err) 383 | } else if len(msg) > 0 { 384 | t.Log(msg) 385 | } 386 | } 387 | } 388 | 389 | 390 | func TestOmitLinks(t *testing.T) { 391 | testCases := []struct { 392 | input string 393 | output string 394 | }{ 395 | { 396 | ``, 397 | ``, 398 | }, 399 | { 400 | ``, 401 | ``, 402 | }, 403 | { 404 | ``, 405 | ``, 406 | }, 407 | { 408 | `Link`, 409 | `Link`, 410 | }, 411 | { 412 | `Link`, 413 | `Link`, 414 | }, 415 | { 416 | `Link`, 417 | `Link`, 418 | }, 419 | { 420 | "\n\tLink\n\t", 421 | `Link`, 422 | }, 423 | { 424 | `Example`, 425 | `Example`, 426 | }, 427 | } 428 | 429 | for _, testCase := range testCases { 430 | if msg, err := wantString(testCase.input, testCase.output, Options{OmitLinks: true}); err != nil { 431 | t.Error(err) 432 | } else if len(msg) > 0 { 433 | t.Log(msg) 434 | } 435 | } 436 | } 437 | 438 | func TestLinkEscaping(t *testing.T) { 439 | testCases := []struct { 440 | input string 441 | output string 442 | }{ 443 | { 444 | `display`, 445 | "display\n\n=> foo display", //minor bug with extra space at present 446 | }, 447 | { 448 | `display`, 449 | "display\n\n=> foo%20spaced display", //minor bug with extra space at present 450 | }, 451 | { 452 | `display`, 453 | "display\n\n=> foo?bar+baz display", //minor bug with extra space at present 454 | }, 455 | } 456 | for _, testCase := range testCases { 457 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 458 | t.Error(err) 459 | } else if len(msg) > 0 { 460 | t.Log(msg) 461 | } 462 | } 463 | } 464 | 465 | func TestCitationStyleLinks(t *testing.T) { 466 | testCases := []struct { 467 | input string 468 | output string 469 | }{ 470 | { 471 | ``, 472 | ``, 473 | }, 474 | { 475 | ``, 476 | ``, 477 | }, 478 | { 479 | ``, 480 | "[1]\n\n=> http://example.com/ [1]", 481 | }, 482 | { 
483 | `Link`, 484 | "Link", 485 | }, 486 | { 487 | `Link1Link2`, 488 | "Link1 [1] Link2 [2]\n\n=> http://example1.com/ [1] Link1\n=> http://example2.com/ [2] Link2", 489 | }, 490 | { 491 | `Link1 (Link2)`, 492 | "Link1 [1] (Link2 [2])\n\n=> http://example1.com/ [1] Link1\n=> http://example2.com/ [2] Link2", 493 | }, 494 | { 495 | `Link1? Link2!`, 496 | "Link1 [1]? Link2 [2]!\n\n=> http://example1.com/ [1] Link1\n=> http://example2.com/ [2] Link2", 497 | }, 498 | { 499 | `Link1Link1 again`, 500 | "Link1 [1] Link1 again [2]\n\n=> http://example1.com/ [1] Link1\n=> http://example1.com/ [2] Link1 again", 501 | }, 502 | { 503 | `Link`, 504 | "Link [1]\n\n=> http://example.com/ [1] Link", 505 | }, 506 | { 507 | "\n\tLink\n\t", 508 | "Link [1]\n\n=> http://example.com/ [1] Link", 509 | }, 510 | { 511 | `Example`, 512 | "Example [1]\n\n=> http://example.com/ [1] Example", 513 | }, 514 | } 515 | 516 | for _, testCase := range testCases { 517 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 518 | t.Error(err) 519 | } else if len(msg) > 0 { 520 | t.Log(msg) 521 | } 522 | } 523 | } 524 | 525 | func TestImageAltTags(t *testing.T) { 526 | testCases := []struct { 527 | input string 528 | output string 529 | }{ 530 | { 531 | ``, 532 | ``, 533 | }, 534 | { 535 | ``, 536 | ``, 537 | }, 538 | { 539 | `Example`, 540 | ``, 541 | }, 542 | { 543 | `Example`, 544 | ``, 545 | }, 546 | // Images do matter if they are in a link. 
547 | { 548 | `Example`, 549 | `Example [1]\n\n=> http://example.com/ [1] Example`, 550 | }, 551 | { 552 | `Example`, 553 | `Example ( http://example.com/ )`, 554 | }, 555 | { 556 | `Example`, 557 | `Example ( http://example.com/ )`, 558 | }, 559 | { 560 | `Example`, 561 | `Example ( http://example.com/ )`, 562 | }, 563 | } 564 | 565 | for _, testCase := range testCases { 566 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 567 | t.Error(err) 568 | } else if len(msg) > 0 { 569 | t.Log(msg) 570 | } 571 | } 572 | } 573 | 574 | func TestHeadings(t *testing.T) { 575 | testCases := []struct { 576 | input string 577 | output string 578 | }{ 579 | { 580 | "

    Test

    ", 581 | "# Test", 582 | }, 583 | { 584 | "\t

    \nTest

    ", 585 | "# Test", 586 | }, 587 | { 588 | "\t

    \nTest line 1
    Test 2

    ", 589 | "# Test line 1\nTest 2", 590 | }, 591 | { 592 | "

    Test

    Test

    ", 593 | "# Test\n\n# Test", 594 | }, 595 | { 596 | "

    Test

    ", 597 | "## Test", 598 | }, 599 | { 600 | "

    Test

    ", 601 | "# Test [1]", 602 | }, 603 | { 604 | "

    Test

    ", 605 | "### Test", 606 | }, 607 | } 608 | 609 | for _, testCase := range testCases { 610 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 611 | t.Error(err) 612 | } else if len(msg) > 0 { 613 | t.Log(msg) 614 | } 615 | } 616 | 617 | } 618 | 619 | func TestBold(t *testing.T) { 620 | testCases := []struct { 621 | input string 622 | output string 623 | }{ 624 | { 625 | "Test", 626 | "*Test*", 627 | }, 628 | { 629 | "\tTest ", 630 | "*Test*", 631 | }, 632 | { 633 | "\tTest line 1
    Test 2
    ", 634 | "*Test line 1\nTest 2*", 635 | }, 636 | { 637 | "Test Test", 638 | "*Test* *Test*", 639 | }, 640 | } 641 | 642 | for _, testCase := range testCases { 643 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 644 | t.Error(err) 645 | } else if len(msg) > 0 { 646 | t.Log(msg) 647 | } 648 | } 649 | 650 | } 651 | 652 | func TestDiv(t *testing.T) { 653 | testCases := []struct { 654 | input string 655 | output string 656 | }{ 657 | { 658 | "
    Test
    ", 659 | "Test", 660 | }, 661 | { 662 | "\t
    Test
    ", 663 | "Test", 664 | }, 665 | { 666 | "
    Test line 1
    Test 2
    ", 667 | "Test line 1\nTest 2", 668 | }, 669 | { 670 | "Test 1
    Test 2
    Test 3
    Test 4", 671 | "Test 1\nTest 2\nTest 3\nTest 4", 672 | }, 673 | { 674 | "Test 1
     Test 2 
    ", 675 | "Test 1\nTest 2", 676 | }, 677 | } 678 | 679 | for _, testCase := range testCases { 680 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 681 | t.Error(err) 682 | } else if len(msg) > 0 { 683 | t.Log(msg) 684 | } 685 | } 686 | 687 | } 688 | 689 | func TestBlockquotes(t *testing.T) { 690 | testCases := []struct { 691 | input string 692 | output string 693 | }{ 694 | { 695 | "
    level 0
    level 1
    level 2
    level 1
    level 0
    ", 696 | "level 0\n> \n> level 1\n> \n>> level 2\n> \n> level 1\n\nlevel 0", 697 | }, 698 | { 699 | "
    Test
    Test", 700 | "> \n> Test\n\nTest", 701 | }, 702 | { 703 | "\t
    \nTest
    ", 704 | "> \n> Test\n>", 705 | }, 706 | { 707 | "\t
    \nTest line 1
    Test 2
    ", 708 | "> \n> Test line 1\n> Test 2", 709 | }, 710 | { 711 | "
    Test
    Test
    Other Test", 712 | "> \n> Test\n\n> \n> Test\n\nOther Test", 713 | }, 714 | { 715 | "
    Lorem ipsum Commodo id consectetur pariatur ea occaecat minim aliqua ad sit consequat quis ex commodo Duis incididunt eu mollit consectetur fugiat voluptate dolore in pariatur in commodo occaecat Ut occaecat velit esse labore aute quis commodo non sit dolore officia Excepteur cillum amet cupidatat culpa velit labore ullamco dolore mollit elit in aliqua dolor irure do
    ", 716 | "> \n> Lorem ipsum Commodo id consectetur pariatur ea occaecat minim aliqua ad\n> sit consequat quis ex commodo Duis incididunt eu mollit consectetur fugiat\n> voluptate dolore in pariatur in commodo occaecat Ut occaecat velit esse\n> labore aute quis commodo non sit dolore officia Excepteur cillum amet\n> cupidatat culpa velit labore ullamco dolore mollit elit in aliqua dolor\n> irure do", 717 | }, 718 | { 719 | "
    LoremipsumCommodoidconsecteturpariatureaoccaecatminimaliquaadsitconsequatquisexcommodoDuisincididunteumollitconsecteturfugiatvoluptatedoloreinpariaturincommodooccaecatUtoccaecatvelitesselaboreautequiscommodononsitdoloreofficiaExcepteurcillumametcupidatatculpavelitlaboreullamcodoloremollitelitinaliquadoloriruredo
    ", 720 | "> \n> Lorem *ipsum* *Commodo* *id* *consectetur* *pariatur* *ea* *occaecat* *minim*\n> *aliqua* *ad* *sit* *consequat* *quis* *ex* *commodo* *Duis* *incididunt* *eu*\n> *mollit* *consectetur* *fugiat* *voluptate* *dolore* *in* *pariatur* *in* *commodo*\n> *occaecat* *Ut* *occaecat* *velit* *esse* *labore* *aute* *quis* *commodo*\n> *non* *sit* *dolore* *officia* *Excepteur* *cillum* *amet* *cupidatat* *culpa*\n> *velit* *labore* *ullamco* *dolore* *mollit* *elit* *in* *aliqua* *dolor* *irure*\n> *do*", 721 | }, 722 | } 723 | 724 | for _, testCase := range testCases { 725 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 726 | t.Error(err) 727 | } else if len(msg) > 0 { 728 | t.Log(msg) 729 | } 730 | } 731 | 732 | } 733 | 734 | func TestIgnoreStylesScriptsHead(t *testing.T) { 735 | testCases := []struct { 736 | input string 737 | output string 738 | }{ 739 | { 740 | "", 741 | "", 742 | }, 743 | { 744 | "", 745 | "", 746 | }, 747 | { 748 | "", 749 | "", 750 | }, 751 | { 752 | "", 753 | "", 754 | }, 755 | { 756 | "", 757 | "", 758 | }, 759 | { 760 | "", 761 | "", 762 | }, 763 | { 764 | "", 765 | "", 766 | }, 767 | { 768 | "", 769 | "", 770 | }, 771 | { 772 | "", 773 | "", 774 | }, 775 | { 776 | `Title`, 777 | "", 778 | }, 779 | } 780 | 781 | for _, testCase := range testCases { 782 | if msg, err := wantString(testCase.input, testCase.output); err != nil { 783 | t.Error(err) 784 | } else if len(msg) > 0 { 785 | t.Log(msg) 786 | } 787 | } 788 | } 789 | 790 | func TestText(t *testing.T) { 791 | testCases := []struct { 792 | input string 793 | expr string 794 | }{ 795 | { 796 | `
  • 797 | New repository 798 |
  • `, 799 | `\* New repository \( /new \)`, 800 | }, 801 | { 802 | `hi 803 | 804 |
    805 | 806 | hello google 807 |

    808 | test

    List:

    809 | 810 | 815 | `, 816 | `hi 817 | hello google[1] 818 | 819 | test 820 | 821 | List: 822 | 823 | * Foo[2] 824 | * Barsoap[3] 825 | * Baz`, 826 | }, 827 | // Malformed input html. 828 | { 829 | `hi 830 | 831 | hello google 832 | 833 | test

    List:

    834 | 835 | 841 | `, 842 | `hi hello google[1] test 843 | 844 | List: 845 | 846 | * Foo[2] 847 | * Bar[3] 848 | * Baz`, 849 | }, 850 | } 851 | 852 | for _, testCase := range testCases { 853 | if msg, err := wantRegExp(testCase.input, testCase.expr); err != nil { 854 | t.Error(err) 855 | } else if len(msg) > 0 { 856 | t.Log(msg) 857 | } 858 | } 859 | } 860 | 861 | func TestPeriod(t *testing.T) { 862 | testCases := []struct { 863 | input string 864 | expr string 865 | }{ 866 | { 867 | `

    Lorem ipsum test.

    `, 868 | `Lorem ipsum test\.`, 869 | }, 870 | { 871 | `

    Lorem ipsum test.

    `, 872 | `Lorem ipsum test\.`, 873 | }, 874 | } 875 | 876 | for _, testCase := range testCases { 877 | if msg, err := wantRegExp(testCase.input, testCase.expr); err != nil { 878 | t.Error(err) 879 | } else if len(msg) > 0 { 880 | t.Log(msg) 881 | } 882 | } 883 | } 884 | 885 | type StringMatcher interface { 886 | MatchString(string) bool 887 | String() string 888 | } 889 | 890 | type RegexpStringMatcher string 891 | 892 | func (m RegexpStringMatcher) MatchString(str string) bool { 893 | return regexp.MustCompile(string(m)).MatchString(str) 894 | } 895 | func (m RegexpStringMatcher) String() string { 896 | return string(m) 897 | } 898 | 899 | type ExactStringMatcher string 900 | 901 | func (m ExactStringMatcher) MatchString(str string) bool { 902 | return string(m) == str 903 | } 904 | func (m ExactStringMatcher) String() string { 905 | return string(m) 906 | } 907 | 908 | func wantRegExp(input string, outputRE string, options ...Options) (string, error) { 909 | return match(input, RegexpStringMatcher(outputRE), options...) 910 | } 911 | 912 | func wantString(input string, output string, options ...Options) (string, error) { 913 | return match(input, ExactStringMatcher(output), options...) 
914 | } 915 | 916 | func match(input string, matcher StringMatcher, options ...Options) (string, error) { 917 | var ctxOptions Options 918 | if len(options) > 0 { 919 | ctxOptions = options[0] 920 | } 921 | ctx := NewTraverseContext(ctxOptions) 922 | text, err := FromString(input, *ctx) 923 | if err != nil { 924 | return "", err 925 | } 926 | if !matcher.MatchString(text) { 927 | return "", fmt.Errorf(`error: input did not match specified expression 928 | Input: 929 | >>>> 930 | %v 931 | <<<< 932 | 933 | Output: 934 | >>>> 935 | %v 936 | <<<< 937 | 938 | Expected: 939 | >>>> 940 | %v 941 | <<<<`, 942 | input, 943 | text, 944 | matcher.String(), 945 | ) 946 | } 947 | 948 | var msg string 949 | 950 | if EnableExtraLogging { 951 | msg = fmt.Sprintf( 952 | ` 953 | input: 954 | 955 | %v 956 | 957 | output: 958 | 959 | %v 960 | `, 961 | input, 962 | text, 963 | ) 964 | } 965 | return msg, nil 966 | } 967 | 968 | func Example() { 969 | inputHTML := ` 970 | 971 | 972 | My Mega Service 973 | 974 | 975 | 976 | 977 | 978 | 981 | 982 |

    Welcome to your new account on my service!

    983 | 984 |

    985 | Here is some more information: 986 | 987 |

    992 |

    993 | 994 | 995 | 996 | 997 | 998 | 999 | 1000 | 1001 | 1002 | 1003 | 1004 | 1005 |
    Header 1Header 2
    Footer 1Footer 2
    Row 1 Col 1Row 1 Col 2
    Row 2 Col 1Row 2 Col 2
    1006 | 1007 |
    1008 | Preformatted content    with    spaces
    1009 |     and indentation
    1010 | 
    1011 | 1012 | ` 1013 | 1014 | ctx := NewTraverseContext(Options{PrettyTables: true, LinkEmitFrequency: 100}) 1015 | text, err := FromString(inputHTML, *ctx) 1016 | if err != nil { 1017 | panic(err) 1018 | } 1019 | fmt.Println(text) 1020 | 1021 | // Output: 1022 | // Mega Service [1] 1023 | // 1024 | // # Welcome to your new account on my service! 1025 | // 1026 | // Here is some more information: 1027 | // 1028 | // * Link 1: Example.com [2] 1029 | // * Link 2: Example2.com [3] 1030 | // * Something else 1031 | // 1032 | // ``` 1033 | // +-------------+-------------+ 1034 | // | HEADER 1 | HEADER 2 | 1035 | // +-------------+-------------+ 1036 | // | Row 1 Col 1 | Row 1 Col 2 | 1037 | // | Row 2 Col 1 | Row 2 Col 2 | 1038 | // +-------------+-------------+ 1039 | // | FOOTER 1 | FOOTER 2 | 1040 | // +-------------+-------------+ 1041 | // ``` 1042 | // 1043 | //``` 1044 | //Preformatted content with spaces 1045 | // and indentation 1046 | // 1047 | //``` 1048 | // 1049 | // => http://jaytaylor.com/ [1] http://jaytaylor.com/ 1050 | // => https://example.com [2] https://example.com 1051 | // => https://example2.com [3] https://example2.com 1052 | } 1053 | -------------------------------------------------------------------------------- /testdata/utf8.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 学习之道:美国公认学习第一书title 7 | 8 | 9 | 10 | 11 | 12 |

    写在前面的话 13 |

    14 |

    在台湾的那次世界冠军赛上,我几近疯狂,直至两年后的今天,我仍沉浸在这次的经历中。这是我生平第一次如此深入地审视我自己,甚至是第一次尝试审视自己。这个过程令人很是兴奋,同时也有点感觉怪异。我重新认识了自我,看到了自己的另外一面,自己从未发觉的另外一面。为了生存,为了取胜,我成了一名角斗士,彻头彻尾,简单纯粹。我并没有意识到这一角色早已在我的心中生根发芽,呼之欲出。也许,他的出现已是不可避免。

    15 |

    而我这全新的一面,与我一直熟识的那个乔希,那个曾经害怕黑暗的孩子,那个象棋手,那个狂热于雨水、反复诵读杰克·克鲁亚克作品的年轻人之间,又有什么样的联系呢?这些都是我正在努力弄清楚的问题。

    16 |

    自台湾赛事之后,我急切非常,一心想要回到训练中去,摆脱自己已经达到巅峰的想法。在过去的两年中,我已经重新开始。这是一个新的起点。前方的路还很长,有待进一步的探索。

    17 |

    这本书的创作耗费了相当多的时间和精力。在成长的过程中,我在我的小房间里从未想过等待我的会是这样的战斗。在创作中,我的思想逐渐成熟;爱恋从分崩离析,到失而复得,世界冠军头衔从失之交臂,到囊中取物。如果说在我人生的第一个二十九年中,我学到了什么,那就是,我们永远无法预测结局,无论是重要的比赛、冒险,还是轰轰烈烈的爱情。我们唯一可以肯定的只有,出乎意料。不管我们做了多么万全的准备,在生活的真实场景中,我们总是会处于陌生的境地。我们也许会无法冷静,失去理智,感觉似乎整个世界都在针对我们。在这个时候,我们所要做的是要付出加倍的努力,要表现得比预想得更好。我认为,关键在于准备好随机应变,准备好在所能想象的高压下发挥出创造力。

    18 |

    读者朋友们,我非常希望你们在读过这本书后,可以得到启发,甚至会得到触动,从而能够根据各自的天赋与特长,去实现自己的梦想。这就是我写作此书的目的。我在字里行间所传达的理念曾经使我受益匪浅,我很希望它们可以为大家提供一个基本的框架和方向。如果我的方法言之有理,那么就请接受它,琢磨它,并加之自己的见解。忘记我的那些数字。真正的掌握需要通过自己发现一些最能够引起共鸣的信息,并将其彻底地融合进来,直至成为一体,这样我们才能随心所欲地驾驭它。

    19 |
    20 | 21 | 22 | -------------------------------------------------------------------------------- /testdata/utf8_with_bom.xhtml: -------------------------------------------------------------------------------- 1 |  2 | 3 | 4 | 5 | 6 | 1892年波兰文版序言title 7 | 8 | 9 | 10 | 11 |
    12 |

    1892年波兰文版序言[18]

    13 |

    出版共产主义宣言的一种新的波兰文本已成为必要,这一事实,引起了许多感想。

    14 |

    首先值得注意的是,近来宣言在一定程度上已成为欧洲大陆大工业发展的一种尺度。一个国家的大工业越发展,该国工人中想认清自己作为工人阶级在有产阶级面前所处地位的要求就越增加,他们中间的社会主义运动也越扩大,因而对宣言的需求也越增长。这样,根据宣言用某国文字销行的份数,不仅能够相当确切地断定该国工人运动的状况,而且还能够相当确切地断定该国大工业发展的程度。

    15 |

    因此,波兰文的新版本标志着波兰工业的决定性进步。从十年前发表的上一个版本以来确实有了这种进步,对此丝毫不容置疑。俄国的波兰,会议的波兰[19],成了俄罗斯帝国巨大的工业区。俄国大工业是零星分散的,一部分在芬兰湾沿岸,一部分在中央区(莫斯科和弗拉基米尔),第三部分在黑海和亚速海沿岸,还有另一些散布在别处;而波兰工业则紧缩于相对狭小的地区,享受到由这种积聚引起的长处与短处。这种长处是竞争着的俄罗斯工厂主所承认的,他们要求实行保护关税以对付波兰,尽管他们渴望使波兰人俄罗斯化。这种短处,对波兰工厂主与俄罗斯政府来说,表现在社会主义思想在波兰工人中间的迅速传播和对宣言需求的增长。

    16 |

    但是,波兰工业的迅速发展——它超过了俄国工业——本身是波兰人民的坚强生命力的一个新证明,是波兰人民临近的民族复兴的一个新保证。而一个独立强盛的波兰的复兴,不只是一件同波兰人有关、而且是同我们大家有关的事情。只有当每个民族在自己内部完全自主时,欧洲各民族间真诚的国际合作才是可能的。1848年革命在无产阶级旗帜下,使无产阶级的战士最终只作了资产阶级的工作,这次革命通过自己遗嘱的执行者路易·波拿巴和俾斯麦也实现了意大利、德国和匈牙利的独立。然而波兰,它从1792年以来为革命做的比所有这三个国家总共做的还要多,而当它1863年失败于强大十倍的俄军的时候,人们却把它抛弃不顾了。贵族既未能保持住、也未能重新争得波兰的独立;今天波兰的独立对资产阶级至少是无所谓的。然而波兰的独立对于欧洲各民族和谐的合作是必需的。这种独立只有年轻的波兰无产阶级才能争得,而且在它的手中会很好地保持住。因为欧洲所有其余的工人都象波兰工人自己一样也需要波兰的独立。

    17 |

    弗·恩格斯

    18 |

    1892年2月10日于伦敦

    19 |
    20 |
    [18] 恩格斯用德文为《宣言》新的波兰文本写了这篇序言。1892年由波兰社会主义者在伦敦办的《黎明》杂志社出版。序言寄出后,恩格斯写信给门德尔森(1892年2月11日),信中说,他很愿意学会波兰文,并且深入研究波兰工人运动的发展,以便能够为《宣言》的下一版写一篇更详细的序言。——第20页
    21 |
    [19] 指维也纳会议的波兰,即根据1814—1815年维也纳会议的决定,以波兰王国的正式名义割给俄国的那部分波兰土地。——第20页
    22 | 23 | 24 | --------------------------------------------------------------------------------