├── .gitignore
├── .travis.yml
├── LICENSE
├── README.md
├── go.mod
├── go.sum
├── html2gemini.go
├── html2gemini_test.go
└── testdata
    ├── utf8.html
    └── utf8_with_bom.xhtml
/.gitignore:
--------------------------------------------------------------------------------
1 | # Compiled Object files, Static and Dynamic libs (Shared Objects)
2 | *.o
3 | *.a
4 | *.so
5 |
6 | # Folders
7 | _obj
8 | _test
9 |
10 | # Architecture specific extensions/prefixes
11 | *.[568vq]
12 | [568vq].out
13 |
14 | *.cgo1.go
15 | *.cgo2.c
16 | _cgo_defun.c
17 | _cgo_gotypes.go
18 | _cgo_export.*
19 |
20 | _testmain.go
21 |
22 | *.exe
23 | *.test
24 | *.prof
25 | .hgignore
26 | .hg/
27 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: go
2 | go:
3 | # n.b. For golang release history, see https://golang.org/doc/devel/release.html
4 | - tip
5 | - "1.13.8"
6 | - "1.12.17"
7 | - "1.11.13"
8 | - "1.10.8"
9 | - "1.9.7"
10 | notifications:
11 | email:
12 | on_success: change
13 | on_failure: always
14 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 |
3 | Copyright (c) 2015 Jay Taylor
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
23 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # html2gemini
2 |
3 | ## A Go library that converts HTML into text/gemini (gemtext)
4 |
5 | This is forked from https://jaytaylor.com/html2text with the following changes:
6 |
7 | * output text/gemini format
8 | * use footnote style references
9 |
10 | ## Introduction
11 |
12 | Turns HTML into text/gemini to be served over gemini, or incorporated into a client.
13 |
14 | html2gemini is a simple Go package for rendering HTML as text/gemini.
15 |
16 |
17 | ## Download the package
18 |
19 | ```bash
20 | go get github.com/LukeEmmet/html2gemini
21 | ```
22 |
23 | ## Example usage
24 |
25 | See https://github.com/LukeEmmet/html2gmi which is a practical command line application that uses this library. Also see https://github.com/LukeEmmet/duckling-proxy which is an HTTP via Gemini proxy server so you can browse the web from any Gemini client that supports scheme-specific proxies.
26 |
27 | To simplify the HTML passed to this library, you could sanitise it first, for example using https://github.com/philipjkim/goreadability
28 |
29 | ## Unit-tests
30 |
31 | Running the unit-tests is straightforward and standard:
32 |
33 | ```bash
34 | go test
35 | ```
36 |
37 |
38 | ## License
39 |
40 | Permissive MIT license.
41 |
42 |
43 | ## Contact
44 |
45 | Email: luke [at] marmaladefoo [dot] com
46 |
47 | If you appreciate this library please feel free to drop me a line and tell me, and please send a note of appreciation to Jay Taylor (url below) who wrote the original html2text on which this is based, and who should receive most of the credit.
48 |
49 | https://jaytaylor.com/html2text
50 |
51 |
--------------------------------------------------------------------------------
/go.mod:
--------------------------------------------------------------------------------
1 | module github.com/LukeEmmet/html2gemini
2 |
3 | go 1.14
4 |
5 | require (
6 | github.com/olekukonko/tablewriter v0.0.4
7 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf
8 | golang.org/x/net v0.0.0-20200822124328-c89045814202
9 | )
10 |
--------------------------------------------------------------------------------
/go.sum:
--------------------------------------------------------------------------------
1 | github.com/mattn/go-runewidth v0.0.7 h1:Ei8KR0497xHyKJPAv59M1dkC+rOZCMBJ+t3fZ+twI54=
2 | github.com/mattn/go-runewidth v0.0.7/go.mod h1:H031xJmbD/WCDINGzjvQ9THkh0rPKHF+m2gUSrubnMI=
3 | github.com/olekukonko/tablewriter v0.0.4 h1:vHD/YYe1Wolo78koG299f7V/VAS08c6IpCLn+Ejf/w8=
4 | github.com/olekukonko/tablewriter v0.0.4/go.mod h1:zq6QwlOf5SlnkVbMSr5EoBv3636FWnp+qbPhuoO21uA=
5 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf h1:pvbZ0lM0XWPBqUKqFU8cmavspvIl9nulOYwdy6IFRRo=
6 | github.com/ssor/bom v0.0.0-20170718123548-6386211fdfcf/go.mod h1:RJID2RhlZKId02nZ62WenDCkgHFerpIOmW0iT7GKmXM=
7 | golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
8 | golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto=
9 | golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=
10 | golang.org/x/net v0.0.0-20200822124328-c89045814202 h1:VvcQYSHwXgi7W+TpUR6A9g6Up98WAHf3f/ulnJ62IyA=
11 | golang.org/x/net v0.0.0-20200822124328-c89045814202/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA=
12 | golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
13 | golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
14 | golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
15 | golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
16 |
--------------------------------------------------------------------------------
/html2gemini.go:
--------------------------------------------------------------------------------
1 | package html2gemini
2 |
3 | import (
4 | "bytes"
5 | "fmt"
6 | "io"
7 | "path/filepath"
8 | "regexp"
9 | "strings"
10 | "unicode"
11 |
12 | "github.com/olekukonko/tablewriter"
13 | "github.com/ssor/bom"
14 | "golang.org/x/net/html"
15 | "golang.org/x/net/html/atom"
16 | )
17 |
18 | // Options provide toggles and overrides to control specific rendering behaviors.
19 | type Options struct {
20 | PrettyTables bool // Turns on pretty ASCII rendering for table elements.
21 | PrettyTablesOptions *PrettyTablesOptions // Configures pretty ASCII rendering for table elements.
22 | OmitLinks bool // Turns on omitting links
23 | CitationStart int //Start Citations from this number (default 1)
24 | CitationMarkers bool //use footnote style citation markers
25 | LinkEmitFrequency int //emit gathered links after approximately every n paras (otherwise when new heading, or blockquote)
26 | NumberedLinks bool // number the links [1], [2] etc to match citation markers
27 | EmitImagesAsLinks bool //emit referenced images as links e.g.
28 | ImageMarkerPrefix string //prefix when emitting images
29 | EmptyLinkPrefix string //prefix when emitting empty links (e.g.
30 | ListItemToLinkWordThreshold int //max number of words in a list item having a single link that is converted to a plain gemini link
31 | }
32 |
33 | //NewOptions creates Options with default settings
34 | func NewOptions() *Options {
35 | return &Options{
36 | PrettyTables: false,
37 | PrettyTablesOptions: NewPrettyTablesOptions(),
38 | OmitLinks: false,
39 | CitationStart: 1,
40 | CitationMarkers: true,
41 | NumberedLinks: true,
42 | LinkEmitFrequency: 2,
43 | EmitImagesAsLinks: true,
44 | ImageMarkerPrefix: "‡",
45 | EmptyLinkPrefix: ">>",
46 | ListItemToLinkWordThreshold: 30,
47 | }
48 | }
49 |
50 | // PrettyTablesOptions overrides tablewriter behaviors
51 | type PrettyTablesOptions struct {
52 | AutoFormatHeader bool
53 | AutoWrapText bool
54 | ReflowDuringAutoWrap bool
55 | ColWidth int
56 | ColumnSeparator string
57 | RowSeparator string
58 | CenterSeparator string
59 | HeaderAlignment int
60 | FooterAlignment int
61 | Alignment int
62 | ColumnAlignment []int
63 | NewLine string
64 | HeaderLine bool
65 | RowLine bool
66 | AutoMergeCells bool
67 | Borders tablewriter.Border
68 | }
69 |
70 | // NewPrettyTablesOptions creates PrettyTablesOptions with default settings
71 | func NewPrettyTablesOptions() *PrettyTablesOptions {
72 | return &PrettyTablesOptions{
73 | AutoFormatHeader: true,
74 | AutoWrapText: true,
75 | ReflowDuringAutoWrap: true,
76 | ColWidth: tablewriter.MAX_ROW_WIDTH,
77 | ColumnSeparator: tablewriter.COLUMN,
78 | RowSeparator: tablewriter.ROW,
79 | CenterSeparator: tablewriter.CENTER,
80 | HeaderAlignment: tablewriter.ALIGN_DEFAULT,
81 | FooterAlignment: tablewriter.ALIGN_DEFAULT,
82 | Alignment: tablewriter.ALIGN_DEFAULT,
83 | ColumnAlignment: []int{},
84 | NewLine: tablewriter.NEWLINE,
85 | HeaderLine: true,
86 | RowLine: false,
87 | AutoMergeCells: false,
88 | Borders: tablewriter.Border{Left: true, Right: true, Bottom: true, Top: true},
89 | }
90 | }
91 |
92 | // CheckFlushCitations emits the list of Gemini links gathered up to this point if the para count exceeds the
93 | // emit frequency and there are unflushed links; otherwise it increments the para count
94 | func (ctx *TextifyTraverseContext) CheckFlushCitations() {
95 |
96 | // if ctx.linkAccumulator.emitParaCount > ctx.options.LinkEmitFrequency && ctx.citationCount > 0 {
97 | if ctx.linkAccumulator.emitParaCount > ctx.options.LinkEmitFrequency && len(ctx.linkAccumulator.linkArray) > (ctx.linkAccumulator.flushedToIndex+1) {
98 | ctx.FlushCitations()
99 | } else {
100 | ctx.linkAccumulator.emitParaCount += 1
101 | }
102 | }
103 |
104 | func (ctx *TextifyTraverseContext) FlushCitations() {
105 | ctx.emitGeminiCitations()
106 | }
107 |
108 | func (ctx *TextifyTraverseContext) ResetCitationCounters() {
109 | ctx.linkAccumulator.flushedToIndex = len(ctx.linkAccumulator.linkArray) - 1
110 | ctx.linkAccumulator.emitParaCount = 0
111 | }
112 |
113 | // FromHTMLNode renders text output from a pre-parsed HTML document.
114 | func FromHTMLNode(doc *html.Node, ctx TextifyTraverseContext) (string, error) {
115 |
116 | if err := ctx.traverse(doc); err != nil {
117 | return "", err
118 | }
119 | //flush any remaining citations at the end
120 | ctx.forceFlushGeminiCitations()
121 |
122 | text := strings.TrimSpace(newlineRe.ReplaceAllString(
123 | strings.Replace(ctx.buf.String(), "\n ", "\n", -1), "\n\n"),
124 | )
125 |
126 | //somewhat hacky tidying up of start and end of blockquotes
127 | startQuote := regexp.MustCompile(`\n *\n+> \n`)
128 | text = startQuote.ReplaceAllString(text, "\n\n")
129 | endQuote := regexp.MustCompile(`\n> \n\n+`)
130 | text = endQuote.ReplaceAllString(text, "\n\n")
131 | text = endQuote.ReplaceAllString(text, "\n\n")
132 |
133 | return text, nil
134 | }
135 |
136 | // FromReader renders text output after parsing HTML for the specified
137 | // io.Reader.
138 | func FromReader(reader io.Reader, ctx TextifyTraverseContext) (string, error) {
139 | newReader, err := bom.NewReaderWithoutBom(reader)
140 | if err != nil {
141 | return "", err
142 | }
143 | doc, err := html.Parse(newReader)
144 | if err != nil {
145 | return "", err
146 | }
147 |
148 | return FromHTMLNode(doc, ctx)
149 | }
150 |
151 | // FromString parses HTML from the input string, then renders the text form.
152 | func FromString(input string, ctx TextifyTraverseContext) (string, error) {
153 | bs := bom.CleanBom([]byte(input))
154 | text, err := FromReader(bytes.NewReader(bs), ctx)
155 | if err != nil {
156 | return "", err
157 | }
158 | return text, nil
159 | }
160 |
161 | var (
162 | spacingRe = regexp.MustCompile(`[ \r\n\t]+`)
163 | newlineRe = regexp.MustCompile(`\n\n+`)
164 | )
165 |
166 | // TextifyTraverseContext holds text-related context.
167 | type TextifyTraverseContext struct {
168 | buf bytes.Buffer
169 |
170 | prefix string
171 | tableCtx tableTraverseContext
172 | options Options
173 | endsWithSpace bool
174 | justClosedDiv bool
175 | blockquoteLevel int
176 | lineLength int
177 | isPre bool
178 | linkAccumulator linkAccumulatorType
179 | }
180 |
181 | type linkAccumulatorType struct {
182 | emitParaCount int
183 | linkArray []citationLink
184 | flushedToIndex int
185 | tableNestLevel int
186 | }
187 |
188 | func newlinkAccumulator() *linkAccumulatorType {
189 | return &linkAccumulatorType{
190 | flushedToIndex: -1,
191 | }
192 | }
193 |
194 | type citationLink struct {
195 | index int
196 | url string
197 | display string
198 | }
199 |
200 | // tableTraverseContext holds table ASCII-form related context.
201 | type tableTraverseContext struct {
202 | header []string
203 | body [][]string
204 | footer []string
205 | tmpRow int
206 | isInFooter bool
207 | }
208 |
209 | func (tableCtx *tableTraverseContext) init() {
210 | tableCtx.body = [][]string{}
211 | tableCtx.header = []string{}
212 | tableCtx.footer = []string{}
213 | tableCtx.isInFooter = false
214 | tableCtx.tmpRow = 0
215 | }
216 |
217 | func NewTraverseContext(options Options) *TextifyTraverseContext {
218 |
219 | //no options provided we need to set some default options for non-zero
220 | //types.
221 |
222 | //always start links at 1, not 0; note that this overrides any CitationStart set by the caller
223 | options.CitationStart = 1 //otherwise a zero-value Options would start at 0
224 |
225 | var ctx = TextifyTraverseContext{
226 | buf: bytes.Buffer{},
227 | options: options,
228 | }
229 |
230 | ctx.linkAccumulator = *newlinkAccumulator()
231 |
232 | return &ctx
233 | }
234 | func (ctx *TextifyTraverseContext) handleElement(node *html.Node) error {
235 | ctx.justClosedDiv = false
236 |
237 | prefix := ""
238 |
239 | switch node.DataAtom {
240 | case atom.Br:
241 | return ctx.emit("\n")
242 |
243 | case atom.H1, atom.H2, atom.H3:
244 |
245 | if node.DataAtom == atom.H1 {
246 | ctx.FlushCitations()
247 | prefix = "# "
248 | }
249 | if node.DataAtom == atom.H2 {
250 | ctx.FlushCitations()
251 | prefix = "## "
252 | }
253 |
254 | if node.DataAtom == atom.H3 {
255 | ctx.FlushCitations()
256 | prefix = "### "
257 | }
258 |
259 | ctx.emit("\n\n" + prefix)
260 | if err := ctx.traverseChildren(node); err != nil {
261 | return err
262 | }
263 | return ctx.emit("\n\n")
264 |
265 | case atom.Blockquote:
266 | ctx.FlushCitations()
267 | //if err := ctx.emit("\n"); err != nil {
268 | // return err
269 | //}
270 | ctx.blockquoteLevel++
271 | ctx.prefix = strings.Repeat(">", ctx.blockquoteLevel) + " "
272 | //if ctx.blockquoteLevel == 1 {
273 | // if err := ctx.emit("\n"); err != nil {
274 | // return err
275 | // }
276 | //}
277 | if err := ctx.traverseChildren(node); err != nil {
278 | return err
279 | }
280 | ctx.blockquoteLevel--
281 | ctx.prefix = strings.Repeat(">", ctx.blockquoteLevel)
282 | if ctx.blockquoteLevel > 0 {
283 | ctx.prefix += " "
284 | }
285 | //return ctx.emit("\n\n")
286 | return ctx.emit("")
287 |
288 | case atom.Div:
289 |
290 | if ctx.lineLength > 0 {
291 | if err := ctx.emit("\n"); err != nil {
292 | return err
293 | }
294 | }
295 | if err := ctx.traverseChildren(node); err != nil {
296 | return err
297 | }
298 | var err error
299 | if !ctx.justClosedDiv {
300 | err = ctx.emit("\n")
301 | }
302 | ctx.justClosedDiv = true
303 | return err
304 |
305 | case atom.Li:
306 |
307 | //a test context to examine the list element to see if it just has a single link
308 | //in which case we'll output a link line, or no links in which case we output just a bullet
309 | testCtx := TextifyTraverseContext{}
310 | if err := testCtx.traverseChildren(node); err != nil {
311 | return err
312 | }
313 |
314 | //if content contains just one link, output a link instead of a bullet if within a specified number of
315 | //words
316 | maxSingletonLinkLength := ctx.options.ListItemToLinkWordThreshold
317 | if (len(strings.Split(testCtx.buf.String(), " ")) < maxSingletonLinkLength) && (len(testCtx.linkAccumulator.linkArray) == 1) {
318 | return ctx.emit("=> " + testCtx.linkAccumulator.linkArray[0].url + " " + testCtx.buf.String() + "\n")
319 | }
320 |
321 | //if no links, just emit a bullet with the text, ignoring any sub elements
322 | if len(testCtx.linkAccumulator.linkArray) == 0 {
323 | return ctx.emit("* " + testCtx.buf.String() + "\n")
324 | }
325 |
326 | //otherwise is mixed content, so keep traversing
327 | if err := ctx.emit("* "); err != nil {
328 | return err
329 | }
330 |
331 | if err := ctx.traverseChildren(node); err != nil {
332 | return err
333 | }
334 |
335 | return ctx.emit("\n")
336 |
337 | case atom.Img:
338 | //output images with a link to the image
339 | hrefLink := ""
340 | altText := ""
341 | altText = getAttrVal(node, "alt")
342 | if altText == "" {
343 | //no alt text, so fall back to the file name taken from the src attribute
344 | if src := getAttrVal(node, "src"); src != "" {
345 | //try to get the last element of the path
346 | fileName := filepath.Base(src)
347 | fileBase := strings.TrimSuffix(fileName, filepath.Ext(fileName))
348 | altText = fileBase
349 | }
350 | }
351 | altText = "[" + ctx.options.ImageMarkerPrefix + " " + altText + "]"
352 | altText = strings.ReplaceAll(altText, "_", " ")
353 | altText = strings.ReplaceAll(altText, "-", " ")
354 | altText = strings.ReplaceAll(altText, "  ", " ")
355 |
356 | if ctx.options.EmitImagesAsLinks {
357 | if err := ctx.emit(altText); err != nil {
358 | return err
359 | }
360 |
361 | if attrVal := getAttrVal(node, "src"); attrVal != "" {
362 | attrVal = ctx.normalizeHrefLink(attrVal)
363 | if !ctx.options.OmitLinks && attrVal != "" && altText != attrVal {
364 | hrefLink = ctx.addGeminiCitation(attrVal, altText)
365 | }
366 | }
367 | return ctx.emit(hrefLink)
368 | } else {
369 | return ctx.emit(altText)
370 | }
371 |
372 | case atom.A:
373 | linkText := ""
374 | // For simple link element content with single text node only, peek at the link text.
375 | if node.FirstChild != nil && node.FirstChild.NextSibling == nil && node.FirstChild.Type == html.TextNode {
376 | linkText = node.FirstChild.Data
377 | }
378 |
379 | if err := ctx.traverseChildren(node); err != nil {
380 | return err
381 | }
382 |
383 | // If image is the only child, the image will have been shown as a link with its alt text etc
384 | // so choose a simple marker for the link itself
385 | if img := node.FirstChild; img != nil && node.LastChild == img && img.DataAtom == atom.Img {
386 | linkText = ctx.options.EmptyLinkPrefix
387 | ctx.emit(" " + linkText)
388 | }
389 |
390 | hrefLink := ""
391 | if attrVal := getAttrVal(node, "href"); attrVal != "" {
392 | attrVal = ctx.normalizeHrefLink(attrVal)
393 | // Don't print link href if it matches link element content or if the link is empty.
394 | if !ctx.options.OmitLinks && attrVal != "" && linkText != attrVal {
395 | hrefLink = ctx.addGeminiCitation(attrVal, linkText)
396 | }
397 | }
398 |
399 | return ctx.emit(hrefLink)
400 |
401 | case atom.Ul:
402 |
403 | return ctx.paragraphHandler(node)
404 |
405 | case atom.P:
406 |
407 | //a test context to examine the paragraph to see if it just has a single link
408 | //in which case we'll output a link line, or no links in which case we output just the text
409 | testCtx := TextifyTraverseContext{}
410 | if err := testCtx.traverseChildren(node); err != nil {
411 | return err
412 | }
413 |
414 | //if content contains just one link, output a link instead of a para if within a specified number of
415 | //words
416 | maxSingletonLinkLength := ctx.options.ListItemToLinkWordThreshold
417 | if (len(strings.Split(testCtx.buf.String(), " ")) < maxSingletonLinkLength) && (len(testCtx.linkAccumulator.linkArray) == 1) {
418 | return ctx.emit("=> " + testCtx.linkAccumulator.linkArray[0].url + " " + testCtx.buf.String() + "\n")
419 | }
420 |
421 | //if no links, just emit a para with the text, ignoring any sub elements
422 | if len(testCtx.linkAccumulator.linkArray) == 0 {
423 | return ctx.emit(testCtx.buf.String() + "\n")
424 | }
425 |
426 | //else - mixed content
427 | return ctx.paragraphHandler(node)
428 |
429 | case atom.Table, atom.Tfoot, atom.Th, atom.Tr, atom.Td:
430 |
431 | if ctx.options.PrettyTables {
432 | return ctx.handleTableElement(node)
433 | } else if node.DataAtom == atom.Table {
434 | //just treat tables as a type of paragraph
435 | ctx.emit("\n\n⊞ table ⊞\n\n")
436 | return ctx.paragraphHandler(node)
437 | }
438 |
439 | if node.DataAtom == atom.Tr {
440 | //start a new line
441 | ctx.emit("\n")
442 | }
443 |
444 | return ctx.traverseChildren(node)
445 |
446 | case atom.Pre:
447 | ctx.emit("\n\n```\n")
448 | ctx.isPre = true
449 | err := ctx.traverseChildren(node)
450 | ctx.isPre = false
451 | ctx.emit("\n```\n\n")
452 | return err
453 |
454 | case atom.Style, atom.Script, atom.Head:
455 | // Ignore the subtree.
456 | return nil
457 |
458 | default:
459 | return ctx.traverseChildren(node)
460 | }
461 | }
462 |
463 | // paragraphHandler renders node children surrounded by double newlines.
464 | func (ctx *TextifyTraverseContext) paragraphHandler(node *html.Node) error {
465 | ctx.CheckFlushCitations()
466 |
467 | if err := ctx.emit("\n\n"); err != nil {
468 | return err
469 | }
470 |
471 | if err := ctx.traverseChildren(node); err != nil {
472 | return err
473 | }
474 | if err := ctx.emit("\n\n"); err != nil {
475 | return err
476 | }
477 |
478 | return nil
479 | }
480 |
481 | // handleTableElement is only to be invoked when options.PrettyTables is active.
482 | func (ctx *TextifyTraverseContext) handleTableElement(node *html.Node) error {
483 | if !ctx.options.PrettyTables {
484 | panic("handleTableElement invoked when PrettyTables not active")
485 | }
486 |
487 | switch node.DataAtom {
488 | case atom.Table:
489 |
490 | if ctx.linkAccumulator.tableNestLevel == 0 {
491 | if err := ctx.emit("\n\n```\n"); err != nil {
492 | return err
493 | }
494 | } else {
495 | if err := ctx.emit("\n\n"); err != nil {
496 | return err
497 | }
498 | }
499 |
500 | ctx.linkAccumulator.tableNestLevel++
501 |
502 | // Re-initialize all table context.
503 | ctx.tableCtx.init()
504 |
505 | // Browse children, enriching context with table data.
506 | if err := ctx.traverseChildren(node); err != nil {
507 | return err
508 | }
509 |
510 | buf := &bytes.Buffer{}
511 | table := tablewriter.NewWriter(buf)
512 | if ctx.options.PrettyTablesOptions != nil {
513 | options := ctx.options.PrettyTablesOptions
514 | table.SetAutoFormatHeaders(options.AutoFormatHeader)
515 | table.SetAutoWrapText(options.AutoWrapText)
516 | table.SetReflowDuringAutoWrap(options.ReflowDuringAutoWrap)
517 | table.SetColWidth(options.ColWidth)
518 | table.SetColumnSeparator(options.ColumnSeparator)
519 | table.SetRowSeparator(options.RowSeparator)
520 | table.SetCenterSeparator(options.CenterSeparator)
521 | table.SetHeaderAlignment(options.HeaderAlignment)
522 | table.SetFooterAlignment(options.FooterAlignment)
523 | table.SetAlignment(options.Alignment)
524 | table.SetColumnAlignment(options.ColumnAlignment)
525 | table.SetNewLine(options.NewLine)
526 | table.SetHeaderLine(options.HeaderLine)
527 | table.SetRowLine(options.RowLine)
528 | table.SetAutoMergeCells(options.AutoMergeCells)
529 | table.SetBorders(options.Borders)
530 | }
531 | table.SetHeader(ctx.tableCtx.header)
532 | table.SetFooter(ctx.tableCtx.footer)
533 | table.AppendBulk(ctx.tableCtx.body)
534 |
535 | // Render the table using ASCII.
536 | table.Render()
537 | if err := ctx.emit(buf.String()); err != nil {
538 | return err
539 | }
540 |
541 | ctx.linkAccumulator.tableNestLevel--
542 |
543 | if ctx.linkAccumulator.tableNestLevel == 0 {
544 | return ctx.emit("```\n\n")
545 | } else {
546 | return ctx.emit("\n\n")
547 | }
548 |
549 | case atom.Tfoot:
550 | ctx.tableCtx.isInFooter = true
551 | if err := ctx.traverseChildren(node); err != nil {
552 | return err
553 | }
554 | ctx.tableCtx.isInFooter = false
555 |
556 | case atom.Tr:
557 | ctx.tableCtx.body = append(ctx.tableCtx.body, []string{})
558 | if err := ctx.traverseChildren(node); err != nil {
559 | return err
560 | }
561 | ctx.tableCtx.tmpRow++
562 |
563 | case atom.Th:
564 | res, err := ctx.renderEachChild(node)
565 | if err != nil {
566 | return err
567 | }
568 |
569 | ctx.tableCtx.header = append(ctx.tableCtx.header, res)
570 |
571 | case atom.Td:
572 | res, err := ctx.renderEachChild(node)
573 | if err != nil {
574 | return err
575 | }
576 |
577 | if ctx.tableCtx.isInFooter {
578 | ctx.tableCtx.footer = append(ctx.tableCtx.footer, res)
579 | } else {
580 | ctx.tableCtx.body[ctx.tableCtx.tmpRow] = append(ctx.tableCtx.body[ctx.tableCtx.tmpRow], res)
581 | }
582 |
583 | }
584 | return nil
585 | }
586 |
587 | func (ctx *TextifyTraverseContext) traverse(node *html.Node) error {
588 | switch node.Type {
589 | default:
590 | return ctx.traverseChildren(node)
591 |
592 | case html.TextNode:
593 | var data string
594 | if ctx.isPre {
595 | data = node.Data
596 | } else {
597 | data = strings.TrimSpace(spacingRe.ReplaceAllString(node.Data, " "))
598 | }
599 | return ctx.emit(data)
600 |
601 | case html.ElementNode:
602 | return ctx.handleElement(node)
603 | }
604 | }
605 |
606 | func (ctx *TextifyTraverseContext) traverseChildren(node *html.Node) error {
607 | for c := node.FirstChild; c != nil; c = c.NextSibling {
608 | if err := ctx.traverse(c); err != nil {
609 | return err
610 | }
611 | }
612 |
613 | return nil
614 | }
615 |
616 | // punctNoSpaceBefore reports whether r is a character that should not be preceded by a space.
617 | func punctNoSpaceBefore(r rune) bool {
618 | switch r {
619 | case '.', ',', ';', '!', '?', ')', ']', '>':
620 | return true
621 | default:
622 | return false
623 | }
624 | }
625 |
626 | // punctNoSpaceAfter reports whether r is a character that should not be followed by a space.
627 | func punctNoSpaceAfter(r rune) bool {
628 | switch r {
629 | case '(', '[', '<':
630 | return true
631 | default:
632 | return false
633 | }
634 | }
635 | func (ctx *TextifyTraverseContext) emit(data string) error {
636 | if data == "" {
637 | return nil
638 | }
639 |
640 | var lines = []string{data}
641 |
642 | for _, line := range lines {
643 | runes := []rune(line)
644 | startsWithSpace := unicode.IsSpace(runes[0]) || punctNoSpaceBefore(runes[0])
645 | if !startsWithSpace && !ctx.endsWithSpace {
646 | if err := ctx.buf.WriteByte(' '); err != nil {
647 | return err
648 | }
649 | ctx.lineLength++
650 | }
651 | ctx.endsWithSpace = unicode.IsSpace(runes[len(runes)-1]) || punctNoSpaceAfter(runes[len(runes)-1])
652 | for _, c := range line {
653 | if _, err := ctx.buf.WriteString(string(c)); err != nil {
654 | return err
655 | }
656 | ctx.lineLength++
657 | if c == '\n' {
658 | ctx.lineLength = 0
659 | if ctx.prefix != "" {
660 | if _, err := ctx.buf.WriteString(ctx.prefix); err != nil {
661 | return err
662 | }
663 | }
664 | }
665 | }
666 | }
667 | return nil
668 | }
669 |
670 | func (ctx *TextifyTraverseContext) normalizeHrefLink(link string) string {
671 | link = strings.TrimSpace(link)
672 | link = strings.TrimPrefix(link, "mailto:")
673 | return link
674 | }
675 |
676 | func formatGeminiCitation(idx int, showMarker bool) string {
677 | if showMarker {
678 | return fmt.Sprintf("[%d]", idx)
679 | } else {
680 | return ""
681 | }
682 |
683 | }
684 |
685 | func (ctx *TextifyTraverseContext) addGeminiCitation(url string, display string) string {
686 |
687 | if strings.HasPrefix(url, "#") {
688 | //don't emit bookmarks to the same page (url starts with #)
689 | return ""
690 | } else {
691 | citation := citationLink{
692 | index: len(ctx.linkAccumulator.linkArray) + ctx.options.CitationStart,
693 | display: display,
694 | url: url,
695 | }
696 |
697 | //spaces would mess up the gemini link, so check for them
698 | if strings.Contains(citation.url, " ") {
699 | //escape the spaces
700 | citation.url = strings.ReplaceAll(citation.url, " ", "%20")
701 |
702 | }
703 | ctx.linkAccumulator.linkArray = append(ctx.linkAccumulator.linkArray, citation)
704 | return formatGeminiCitation(citation.index, ctx.options.CitationMarkers)
705 | }
706 |
707 | }
708 |
709 | func (ctx *TextifyTraverseContext) forceFlushGeminiCitations() {
710 | // this method writes to the buffer directly instead of using `emit`, b/c we do not want to split long links
711 |
712 | if ctx.linkAccumulator.tableNestLevel > 0 {
713 | //don't emit the citation list inside a table
714 | return
715 | }
716 |
717 | ctx.buf.WriteString("\n")
718 |
719 | //ctx.buf.WriteString("flushedtoindex: ")
720 | //ctx.buf.WriteString(formatGeminiCitation(ctx.linkAccumulator.flushedToIndex))
721 | ctx.buf.WriteByte('\n')
722 |
723 | for i, link := range ctx.linkAccumulator.linkArray {
724 | // ctx.buf.WriteString(formatGeminiCitation(i))
725 |
726 | if i > ctx.linkAccumulator.flushedToIndex {
727 | ctx.buf.WriteString("=> ")
728 | ctx.buf.WriteString(link.url)
729 | ctx.buf.WriteByte(' ')
730 | ctx.buf.WriteString(formatGeminiCitation(link.index, ctx.options.NumberedLinks))
731 | ctx.buf.WriteByte(' ')
732 | ctx.buf.WriteString(link.display)
733 | ctx.buf.WriteByte('\n')
734 | }
735 | }
736 |
737 | ctx.buf.WriteByte('\n')
738 |
739 | ctx.ResetCitationCounters()
740 |
741 | }
742 | func (ctx *TextifyTraverseContext) emitGeminiCitations() {
743 |
744 | if len(ctx.linkAccumulator.linkArray) > ctx.linkAccumulator.flushedToIndex {
745 | //there are unflushed links
746 | ctx.forceFlushGeminiCitations()
747 | }
748 | }
749 |
750 | // renderEachChild visits each direct child of a node and collects the sequence of
751 | // textual representations separated by a single newline.
752 | func (ctx *TextifyTraverseContext) renderEachChild(node *html.Node) (string, error) {
753 | buf := &bytes.Buffer{}
754 | for c := node.FirstChild; c != nil; c = c.NextSibling {
755 | s, err := FromHTMLNode(c, *ctx)
756 | if err != nil {
757 | return "", err
758 | }
759 | if _, err = buf.WriteString(s); err != nil {
760 | return "", err
761 | }
762 | if c.NextSibling != nil {
763 | if err = buf.WriteByte('\n'); err != nil {
764 | return "", err
765 | }
766 | }
767 | }
768 | return buf.String(), nil
769 | }
770 |
771 | func getAttrVal(node *html.Node, attrName string) string {
772 | for _, attr := range node.Attr {
773 | if attr.Key == attrName {
774 | return attr.Val
775 | }
776 | }
777 |
778 | return ""
779 | }
780 |
--------------------------------------------------------------------------------
/html2gemini_test.go:
--------------------------------------------------------------------------------
1 | package html2gemini
2 |
3 | import (
4 | "bytes"
5 | "fmt"
6 | "io/ioutil"
7 | "os"
8 | "path"
9 | "regexp"
10 | "strings"
11 | "testing"
12 | )
13 |
14 | const destPath = "testdata"
15 |
16 | // EnableExtraLogging turns on additional testing log output.
17 | // Extra test logging can be enabled by setting the environment variable
18 | // HTML2TEXT_EXTRA_LOGGING to "1" or "true".
19 | var EnableExtraLogging bool
20 |
21 | func init() {
22 | if v := os.Getenv("HTML2TEXT_EXTRA_LOGGING"); v == "1" || v == "true" {
23 | EnableExtraLogging = true
24 | }
25 | }
26 |
27 | // TODO Add tests for FromHTMLNode and FromReader.
28 |
29 | func TestParseUTF8(t *testing.T) {
30 | htmlFiles := []struct {
31 | file string
32 | keywordShouldNotExist string
33 | keywordShouldExist string
34 | }{
35 | {
36 | "utf8.html",
37 | "学习之道:美国公认学习第一书title",
38 | "次世界冠军赛上,我几近疯狂",
39 | },
40 | {
41 | "utf8_with_bom.xhtml",
42 | "1892年波兰文版序言title",
43 | "种新的波兰文本已成为必要",
44 | },
45 | }
46 |
47 | for _, htmlFile := range htmlFiles {
48 | bs, err := ioutil.ReadFile(path.Join(destPath, htmlFile.file))
49 | if err != nil {
50 | t.Fatal(err)
51 | }
52 | ctx := NewTraverseContext(Options{})
53 | text, err := FromReader(bytes.NewReader(bs), *ctx)
54 | if err != nil {
55 | t.Fatal(err)
56 | }
57 | if !strings.Contains(text, htmlFile.keywordShouldExist) {
58 | t.Fatalf("keyword %s should exist in file %s", htmlFile.keywordShouldExist, htmlFile.file)
59 | }
60 | if strings.Contains(text, htmlFile.keywordShouldNotExist) {
61 | t.Fatalf("keyword %s should not exist in file %s", htmlFile.keywordShouldNotExist, htmlFile.file)
62 | }
63 | }
64 | }
65 |
66 | func TestStrippingWhitespace(t *testing.T) {
67 | testCases := []struct {
68 | input string
69 | output string
70 | }{
71 | {
72 | "test text",
73 | "test text",
74 | },
75 | {
76 | " \ttext\ntext\n",
77 | "text text",
78 | },
79 | {
80 | " \na \n\t \n \n a \t",
81 | "a a",
82 | },
83 | {
84 | "test text",
85 | "test text",
86 | },
87 | {
88 | "test text ",
89 | "test text",
90 | },
91 | }
92 |
93 | for _, testCase := range testCases {
94 | if msg, err := wantString(testCase.input, testCase.output); err != nil {
95 | t.Error(err)
96 | } else if len(msg) > 0 {
97 | t.Log(msg)
98 | }
99 | }
100 | }
101 |
102 | func TestParagraphsAndBreaks(t *testing.T) {
103 | testCases := []struct {
104 | input string
105 | output string
106 | }{
107 | {
108 | "Test text",
109 | "Test text",
110 | },
111 | {
112 | "Test text
",
113 | "Test text",
114 | },
115 | {
116 | "Test text
Test",
117 | "Test text\nTest",
118 | },
119 | {
120 | "
Test text
", 121 | "Test text", 122 | }, 123 | { 124 | "Test text
Test text
", 125 | "Test text\n\nTest text", 126 | }, 127 | { 128 | "\nTest text
\n\n\n\tTest text
\n", 129 | "Test text\n\nTest text", 130 | }, 131 | { 132 | "\nTest text
Test text
Test text
\tTest text
test1\ntest 2\n\ntest 3",
145 | "```\ntest1\ntest 2\n\ntest 3\n```",
146 | },
147 | }
148 |
149 | for _, testCase := range testCases {
150 | if msg, err := wantString(testCase.input, testCase.output); err != nil {
151 | t.Error(err)
152 | } else if len(msg) > 0 {
153 | t.Log(msg)
154 | }
155 | }
156 | }
157 |
158 | func TestTables(t *testing.T) {
159 | testCases := []struct {
160 | input string
161 | tabularOutput string
162 | plaintextOutput string
163 | }{
164 | {
165 | "
cell1 | cell2 |
row1 |
row2 |
Row-1-Col-1-Msg123456789012345 Row-1-Col-1-Msg2 | Row-1-Col-2 |
Row-2-Col-1 | Row-2-Col-2 |
cell1-1 | cell1-2 |
cell2-1 | cell2-2 |
Header 1 | Header 2 |
---|---|
Footer 1 | Footer 2 |
Row 1 Col 1 | Row 1 Col 2 |
Row 2 Col 1 | Row 2 Col 2 |
252 |
Table 1 Header 1 | Table 1 Header 2 |
---|---|
Table 1 Footer 1 | Table 1 Footer 2 |
Table 1 Row 1 Col 1 | Table 1 Row 1 Col 2 |
Table 1 Row 2 Col 1 | Table 1 Row 2 Col 2 |
Table 2 Header 1 | Table 2 Header 2 |
---|---|
Table 2 Footer 1 | Table 2 Footer 2 |
Table 2 Row 1 Col 1 | Table 2 Row 1 Col 2 |
Table 2 Row 2 Col 1 | Table 2 Row 2 Col 2 |
cell |
Item | 307 |Description | 308 |Price | 309 |
---|---|---|
Golang | 312 |Open source programming language that makes it easy to build simple, reliable, and efficient software | 313 |$10.99 | 314 |
Hermes | 317 |Programmatically create beautiful e-mails using Golang. | 318 |$1.99 | 319 |
level 1level 2level 1
TestTest", 700 | "> \n> Test\n\nTest", 701 | }, 702 | { 703 | "\t
\nTest", 704 | "> \n> Test\n>", 705 | }, 706 | { 707 | "\t
\nTest line 1", 708 | "> \n> Test line 1\n> Test 2", 709 | }, 710 | { 711 | "
Test 2
Test
TestOther Test", 712 | "> \n> Test\n\n> \n> Test\n\nOther Test", 713 | }, 714 | { 715 | "
Lorem ipsum Commodo id consectetur pariatur ea occaecat minim aliqua ad sit consequat quis ex commodo Duis incididunt eu mollit consectetur fugiat voluptate dolore in pariatur in commodo occaecat Ut occaecat velit esse labore aute quis commodo non sit dolore officia Excepteur cillum amet cupidatat culpa velit labore ullamco dolore mollit elit in aliqua dolor irure do", 716 | "> \n> Lorem ipsum Commodo id consectetur pariatur ea occaecat minim aliqua ad\n> sit consequat quis ex commodo Duis incididunt eu mollit consectetur fugiat\n> voluptate dolore in pariatur in commodo occaecat Ut occaecat velit esse\n> labore aute quis commodo non sit dolore officia Excepteur cillum amet\n> cupidatat culpa velit labore ullamco dolore mollit elit in aliqua dolor\n> irure do", 717 | }, 718 | { 719 | "
LoremipsumCommodoidconsecteturpariatureaoccaecatminimaliquaadsitconsequatquisexcommodoDuisincididunteumollitconsecteturfugiatvoluptatedoloreinpariaturincommodooccaecatUtoccaecatvelitesselaboreautequiscommodononsitdoloreofficiaExcepteurcillumametcupidatatculpavelitlaboreullamcodoloremollitelitinaliquadoloriruredo",
720 | "> \n> Lorem *ipsum* *Commodo* *id* *consectetur* *pariatur* *ea* *occaecat* *minim*\n> *aliqua* *ad* *sit* *consequat* *quis* *ex* *commodo* *Duis* *incididunt* *eu*\n> *mollit* *consectetur* *fugiat* *voluptate* *dolore* *in* *pariatur* *in* *commodo*\n> *occaecat* *Ut* *occaecat* *velit* *esse* *labore* *aute* *quis* *commodo*\n> *non* *sit* *dolore* *officia* *Excepteur* *cillum* *amet* *cupidatat* *culpa*\n> *velit* *labore* *ullamco* *dolore* *mollit* *elit* *in* *aliqua* *dolor* *irure*\n> *do*",
721 | },
722 | }
723 |
724 | for _, testCase := range testCases {
725 | if msg, err := wantString(testCase.input, testCase.output); err != nil {
726 | t.Error(err)
727 | } else if len(msg) > 0 {
728 | t.Log(msg)
729 | }
730 | }
731 |
732 | }
733 |
734 | func TestIgnoreStylesScriptsHead(t *testing.T) {
735 | testCases := []struct {
736 | input string
737 | output string
738 | }{
739 | {
740 | "",
741 | "",
742 | },
743 | {
744 | "",
745 | "",
746 | },
747 | {
748 | "",
749 | "",
750 | },
751 | {
752 | "",
753 | "",
754 | },
755 | {
756 | "",
757 | "",
758 | },
759 | {
760 | "",
761 | "",
762 | },
763 | {
764 | "",
765 | "",
766 | },
767 | {
768 | "",
769 | "",
770 | },
771 | {
772 | "",
773 | "",
774 | },
775 | {
776 | `
List:
809 |
810 |
815 | `,
816 | `hi
817 | hello google[1]
818 |
819 | test
820 |
821 | List:
822 |
823 | * Foo[2]
824 | * Barsoap[3]
825 | * Baz`,
826 | },
827 | // Malformed input html.
828 | {
829 | `hi
830 |
831 | hello google
832 |
833 | testList:
834 |
835 |
841 | `,
842 | `hi hello google[1] test
843 |
844 | List:
845 |
846 | * Foo[2]
847 | * Bar[3]
848 | * Baz`,
849 | },
850 | }
851 |
852 | for _, testCase := range testCases {
853 | if msg, err := wantRegExp(testCase.input, testCase.expr); err != nil {
854 | t.Error(err)
855 | } else if len(msg) > 0 {
856 | t.Log(msg)
857 | }
858 | }
859 | }
860 |
861 | func TestPeriod(t *testing.T) {
862 | testCases := []struct {
863 | input string
864 | expr string
865 | }{
866 | {
867 | `Lorem ipsum test.`,
868 | `Lorem ipsum test\.`,
869 | },
870 | {
871 | `Lorem ipsum test.`,
872 | `Lorem ipsum test\.`,
873 | },
874 | }
875 |
876 | for _, testCase := range testCases {
877 | if msg, err := wantRegExp(testCase.input, testCase.expr); err != nil {
878 | t.Error(err)
879 | } else if len(msg) > 0 {
880 | t.Log(msg)
881 | }
882 | }
883 | }
884 |
885 | type StringMatcher interface {
886 | MatchString(string) bool
887 | String() string
888 | }
889 |
890 | type RegexpStringMatcher string
891 |
892 | func (m RegexpStringMatcher) MatchString(str string) bool {
893 | return regexp.MustCompile(string(m)).MatchString(str)
894 | }
895 | func (m RegexpStringMatcher) String() string {
896 | return string(m)
897 | }
898 |
899 | type ExactStringMatcher string
900 |
901 | func (m ExactStringMatcher) MatchString(str string) bool {
902 | return string(m) == str
903 | }
904 | func (m ExactStringMatcher) String() string {
905 | return string(m)
906 | }
907 |
908 | func wantRegExp(input string, outputRE string, options ...Options) (string, error) {
909 | return match(input, RegexpStringMatcher(outputRE), options...)
910 | }
911 |
912 | func wantString(input string, output string, options ...Options) (string, error) {
913 | return match(input, ExactStringMatcher(output), options...)
914 | }
915 |
916 | func match(input string, matcher StringMatcher, options ...Options) (string, error) {
917 | var ctxOptions Options
918 | if len(options) > 0 {
919 | ctxOptions = options[0]
920 | }
921 | ctx := NewTraverseContext(ctxOptions)
922 | text, err := FromString(input, *ctx)
923 | if err != nil {
924 | return "", err
925 | }
926 | if !matcher.MatchString(text) {
927 | return "", fmt.Errorf(`error: input did not match specified expression
928 | Input:
929 | >>>>
930 | %v
931 | <<<<
932 |
933 | Output:
934 | >>>>
935 | %v
936 | <<<<
937 |
938 | Expected:
939 | >>>>
940 | %v
941 | <<<<`,
942 | input,
943 | text,
944 | matcher.String(),
945 | )
946 | }
947 |
948 | var msg string
949 |
950 | if EnableExtraLogging {
951 | msg = fmt.Sprintf(
952 | `
953 | input:
954 |
955 | %v
956 |
957 | output:
958 |
959 | %v
960 | `,
961 | input,
962 | text,
963 | )
964 | }
965 | return msg, nil
966 | }
967 |
968 | func Example() {
969 | inputHTML := `
970 |
971 |
972 |
985 | Here is some more information:
986 |
987 |
Header 1 | Header 2 |
---|---|
Footer 1 | Footer 2 |
Row 1 Col 1 | Row 1 Col 2 |
Row 2 Col 1 | Row 2 Col 2 |
1008 | Preformatted content with spaces
1009 | and indentation
1010 |
1011 |
1012 | `
1013 |
1014 | ctx := NewTraverseContext(Options{PrettyTables: true, LinkEmitFrequency: 100})
1015 | text, err := FromString(inputHTML, *ctx)
1016 | if err != nil {
1017 | panic(err)
1018 | }
1019 | fmt.Println(text)
1020 |
1021 | // Output:
1022 | // Mega Service [1]
1023 | //
1024 | // # Welcome to your new account on my service!
1025 | //
1026 | // Here is some more information:
1027 | //
1028 | // * Link 1: Example.com [2]
1029 | // * Link 2: Example2.com [3]
1030 | // * Something else
1031 | //
1032 | // ```
1033 | // +-------------+-------------+
1034 | // | HEADER 1 | HEADER 2 |
1035 | // +-------------+-------------+
1036 | // | Row 1 Col 1 | Row 1 Col 2 |
1037 | // | Row 2 Col 1 | Row 2 Col 2 |
1038 | // +-------------+-------------+
1039 | // | FOOTER 1 | FOOTER 2 |
1040 | // +-------------+-------------+
1041 | // ```
1042 | //
1043 | //```
1044 | //Preformatted content with spaces
1045 | // and indentation
1046 | //
1047 | //```
1048 | //
1049 | // => http://jaytaylor.com/ [1] http://jaytaylor.com/
1050 | // => https://example.com [2] https://example.com
1051 | // => https://example2.com [3] https://example2.com
1052 | }
1053 |
--------------------------------------------------------------------------------
/testdata/utf8.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
写在前面的话 13 |
14 |在台湾的那次世界冠军赛上,我几近疯狂,直至两年后的今天,我仍沉浸在这次的经历中。这是我生平第一次如此深入地审视我自己,甚至是第一次尝试审视自己。这个过程令人很是兴奋,同时也有点感觉怪异。我重新认识了自我,看到了自己的另外一面,自己从未发觉的另外一面。为了生存,为了取胜,我成了一名角斗士,彻头彻尾,简单纯粹。我并没有意识到这一角色早已在我的心中生根发芽,呼之欲出。也许,他的出现已是不可避免。
15 |而我这全新的一面,与我一直熟识的那个乔希,那个曾经害怕黑暗的孩子,那个象棋手,那个狂热于雨水、反复诵读杰克·克鲁亚克作品的年轻人之间,又有什么样的联系呢?这些都是我正在努力弄清楚的问题。
16 |自台湾赛事之后,我急切非常,一心想要回到训练中去,摆脱自己已经达到巅峰的想法。在过去的两年中,我已经重新开始。这是一个新的起点。前方的路还很长,有待进一步的探索。
17 |这本书的创作耗费了相当多的时间和精力。在成长的过程中,我在我的小房间里从未想过等待我的会是这样的战斗。在创作中,我的思想逐渐成熟;爱恋从分崩离析,到失而复得,世界冠军头衔从失之交臂,到囊中取物。如果说在我人生的第一个二十九年中,我学到了什么,那就是,我们永远无法预测结局,无论是重要的比赛、冒险,还是轰轰烈烈的爱情。我们唯一可以肯定的只有,出乎意料。不管我们做了多么万全的准备,在生活的真实场景中,我们总是会处于陌生的境地。我们也许会无法冷静,失去理智,感觉似乎整个世界都在针对我们。在这个时候,我们所要做的是要付出加倍的努力,要表现得比预想得更好。我认为,关键在于准备好随机应变,准备好在所能想象的高压下发挥出创造力。
18 |读者朋友们,我非常希望你们在读过这本书后,可以得到启发,甚至会得到触动,从而能够根据各自的天赋与特长,去实现自己的梦想。这就是我写作此书的目的。我在字里行间所传达的理念曾经使我受益匪浅,我很希望它们可以为大家提供一个基本的框架和方向。如果我的方法言之有理,那么就请接受它,琢磨它,并加之自己的见解。忘记我的那些数字。真正的掌握需要通过自己发现一些最能够引起共鸣的信息,并将其彻底地融合进来,直至成为一体,这样我们才能随心所欲地驾驭它。
19 |
20 |
21 |
22 |
--------------------------------------------------------------------------------
/testdata/utf8_with_bom.xhtml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |出版共产主义宣言的一种新的波兰文本已成为必要,这一事实,引起了许多感想。
14 |首先值得注意的是,近来宣言在一定程度上已成为欧洲大陆大工业发展的一种尺度。一个国家的大工业越发展,该国工人中想认清自己作为工人阶级在有产阶级面前所处地位的要求就越增加,他们中间的社会主义运动也越扩大,因而对宣言的需求也越增长。这样,根据宣言用某国文字销行的份数,不仅能够相当确切地断定该国工人运动的状况,而且还能够相当确切地断定该国大工业发展的程度。
15 |因此,波兰文的新版本标志着波兰工业的决定性进步。从十年前发表的上一个版本以来确实有了这种进步,对此丝毫不容置疑。俄国的波兰,会议的波兰[19],成了俄罗斯帝国巨大的工业区。俄国大工业是零星分散的,一部分在芬兰湾沿岸,一部分在中央区(莫斯科和弗拉基米尔),第三部分在黑海和亚速海沿岸,还有另一些散布在别处;而波兰工业则紧缩于相对狭小的地区,享受到由这种积聚引起的长处与短处。这种长处是竞争着的俄罗斯工厂主所承认的,他们要求实行保护关税以对付波兰,尽管他们渴望使波兰人俄罗斯化。这种短处,对波兰工厂主与俄罗斯政府来说,表现在社会主义思想在波兰工人中间的迅速传播和对宣言需求的增长。
16 |但是,波兰工业的迅速发展——它超过了俄国工业——本身是波兰人民的坚强生命力的一个新证明,是波兰人民临近的民族复兴的一个新保证。而一个独立强盛的波兰的复兴,不只是一件同波兰人有关、而且是同我们大家有关的事情。只有当每个民族在自己内部完全自主时,欧洲各民族间真诚的国际合作才是可能的。1848年革命在无产阶级旗帜下,使无产阶级的战士最终只作了资产阶级的工作,这次革命通过自己遗嘱的执行者路易·波拿巴和俾斯麦也实现了意大利、德国和匈牙利的独立。然而波兰,它从1792年以来为革命做的比所有这三个国家总共做的还要多,而当它1863年失败于强大十倍的俄军的时候,人们却把它抛弃不顾了。贵族既未能保持住、也未能重新争得波兰的独立;今天波兰的独立对资产阶级至少是无所谓的。然而波兰的独立对于欧洲各民族和谐的合作是必需的。这种独立只有年轻的波兰无产阶级才能争得,而且在它的手中会很好地保持住。因为欧洲所有其余的工人都象波兰工人自己一样也需要波兰的独立。
17 |弗·恩格斯
18 |1892年2月10日于伦敦
19 | 20 |