├── .rspec
├── lib
│   ├── html2doc
│   │   ├── version.rb
│   │   ├── xml.rb
│   │   ├── notes.rb
│   │   ├── base.rb
│   │   ├── lists.rb
│   │   ├── math.rb
│   │   ├── mime.rb
│   │   └── wordstyle.css
│   └── html2doc.rb
├── .gitattributes
├── spec
│   ├── 19160-6.png
│   ├── 19160-7.gif
│   ├── 19160-8.jpg
│   ├── examples
│   │   ├── rice_images
│   │   │   ├── rice_image1.gif
│   │   │   ├── rice_image1.png
│   │   │   ├── rice_image2.png
│   │   │   ├── rice_image3_1.png
│   │   │   ├── rice_image3_2.png
│   │   │   └── rice_image3_3.png
│   │   └── header.html
│   ├── odf.svg
│   ├── spec_helper.rb
│   ├── header.html
│   ├── header_img.html
│   ├── wordstyle-custom-lists.css
│   ├── wordstyle-nopagesize.css
│   └── wordstyle-custom.css
├── Rakefile
├── .hound.yml
├── bin
│   ├── setup
│   ├── console
│   ├── rspec
│   └── html2doc
├── .gitignore
├── Gemfile
├── .rubocop.yml
├── .github
│   └── workflows
│       ├── rake.yml
│       └── release.yml
├── html2doc.gemspec
├── LICENSE
├── CODE_OF_CONDUCT.md
└── README.adoc
/.rspec:
--------------------------------------------------------------------------------
1 | --format documentation
2 | --color
3 | --require spec_helper
4 |
--------------------------------------------------------------------------------
/lib/html2doc/version.rb:
--------------------------------------------------------------------------------
1 | class Html2Doc
2 | VERSION = "1.10.1".freeze
3 | end
4 |
--------------------------------------------------------------------------------
/.gitattributes:
--------------------------------------------------------------------------------
1 | rfc2629*.* linguist-vendored
2 | mathml2omml*.* linguist-vendored
3 |
--------------------------------------------------------------------------------
/spec/19160-6.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/19160-6.png
--------------------------------------------------------------------------------
/spec/19160-7.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/19160-7.gif
--------------------------------------------------------------------------------
/spec/19160-8.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/19160-8.jpg
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image1.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image1.gif
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image1.png
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image2.png
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image3_1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image3_1.png
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image3_2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image3_2.png
--------------------------------------------------------------------------------
/spec/examples/rice_images/rice_image3_3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/metanorma/html2doc/HEAD/spec/examples/rice_images/rice_image3_3.png
--------------------------------------------------------------------------------
/Rakefile:
--------------------------------------------------------------------------------
1 | require "bundler/gem_tasks"
2 | require "rspec/core/rake_task"
3 |
4 | RSpec::Core::RakeTask.new(:spec)
5 |
6 | task default: :spec
7 |
--------------------------------------------------------------------------------
/.hound.yml:
--------------------------------------------------------------------------------
1 | # Auto-generated by Cimas: Do not edit it manually!
2 | # See https://github.com/metanorma/cimas
3 | ruby:
4 |   enabled: true
5 |   config_file: .rubocop.yml
6 |
--------------------------------------------------------------------------------
/bin/setup:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env bash
2 | set -euo pipefail
3 | IFS=$'\n\t'
4 | set -vx
5 |
6 | bundle install
7 |
8 | # Do any other automated setup that you need to do here
9 |
--------------------------------------------------------------------------------
/spec/odf.svg:
--------------------------------------------------------------------------------
1 |
2 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | /.bundle/
2 | /.yardoc
3 | /_yardoc/
4 | /coverage/
5 | /doc/
6 | /pkg/
7 | /spec/reports/
8 | /tmp/
9 |
10 | # rspec failure tracking
11 | .rspec_status
12 |
13 | .rubocop-https--*
14 |
--------------------------------------------------------------------------------
/lib/html2doc.rb:
--------------------------------------------------------------------------------
1 | require_relative "html2doc/version"
2 | require_relative "html2doc/base"
3 | require_relative "html2doc/mime"
4 | require_relative "html2doc/notes"
5 | require_relative "html2doc/math"
6 | require_relative "html2doc/lists"
7 | require_relative "html2doc/xml"
8 |
--------------------------------------------------------------------------------
/Gemfile:
--------------------------------------------------------------------------------
1 | Encoding.default_external = Encoding::UTF_8
2 | Encoding.default_internal = Encoding::UTF_8
3 |
4 | source "https://rubygems.org"
5 | git_source(:github) { |repo| "https://github.com/#{repo}" }
6 |
7 | gemspec
8 |
9 | eval_gemfile("Gemfile.devel") rescue nil
10 |
--------------------------------------------------------------------------------
/.rubocop.yml:
--------------------------------------------------------------------------------
1 | # Auto-generated by Cimas: Do not edit it manually!
2 | # See https://github.com/metanorma/cimas
3 | inherit_from:
4 |   - https://raw.githubusercontent.com/riboseinc/oss-guides/master/ci/rubocop.yml
5 |
6 | # local repo-specific modifications
7 | # ...
8 |
9 | AllCops:
10 |   TargetRubyVersion: 2.5
11 |
--------------------------------------------------------------------------------
/bin/console:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env ruby
2 |
3 | require "bundler/setup"
4 | require "html2doc"
5 |
6 | # You can add fixtures and/or initialization code here to make experimenting
7 | # with your gem easier. You can also use a different console, if you like.
8 |
9 | # (If you use this, don't forget to add pry to your Gemfile!)
10 | # require "pry"
11 | # Pry.start
12 |
13 | require "irb"
14 | IRB.start(__FILE__)
15 |
--------------------------------------------------------------------------------
/.github/workflows/rake.yml:
--------------------------------------------------------------------------------
1 | # Auto-generated by Cimas: Do not edit it manually!
2 | # See https://github.com/metanorma/cimas
3 | name: rake
4 |
5 | on:
6 |   push:
7 |     branches: [ master, main ]
8 |     tags: [ v* ]
9 |   pull_request:
10 |
11 | jobs:
12 |   rake:
13 |     uses: metanorma/ci/.github/workflows/generic-rake.yml@main
14 |     secrets:
15 |       pat_token: ${{ secrets.METANORMA_CI_PAT_TOKEN }}
16 |
--------------------------------------------------------------------------------
/bin/rspec:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env ruby
2 |
3 | # This file was generated by Bundler.
4 | #
5 | # The application 'rspec' is installed as part of a gem, and
6 | # this file is here to facilitate running it.
7 | #
8 |
9 | require "pathname"
10 | ENV["BUNDLE_GEMFILE"] ||= File.expand_path(
11 | "../../Gemfile", Pathname.new(__FILE__).realpath
12 | )
13 |
14 | require "rubygems"
15 | require "bundler/setup"
16 |
17 | load Gem.bin_path("rspec-core", "rspec")
18 |
19 |
--------------------------------------------------------------------------------
/spec/spec_helper.rb:
--------------------------------------------------------------------------------
1 | require "simplecov"
2 | SimpleCov.start do
3 | add_filter "/spec/"
4 | end
5 |
6 | require "bundler/setup"
7 | require "rspec/match_fuzzy"
8 | require "html2doc"
9 | require "rspec/matchers"
10 | require "equivalent-xml"
11 |
12 | RSpec.configure do |config|
13 | # Enable flags like --only-failures and --next-failure
14 | config.example_status_persistence_file_path = ".rspec_status"
15 |
16 | # Disable RSpec exposing methods globally on `Module` and `main`
17 | config.disable_monkey_patching!
18 |
19 | config.expect_with :rspec do |c|
20 | c.syntax = :expect
21 | end
22 | end
23 |
--------------------------------------------------------------------------------
/bin/html2doc:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env ruby
2 |
3 | require "html2doc"
4 | require "optparse"
5 |
6 | options = {}
7 | OptionParser.new do |opts|
8 | opts.banner = "Usage: bin/html2doc filename [options]"
9 |
10 | opts.on("--stylesheet FILE.CSS", "Use the provided stylesheet") do |v|
11 | options[:stylesheet] = v
12 | end
13 | opts.on("--header HEADER.HTML", "Use the provided header") do |v|
14 | options[:header] = v
15 | end
16 | end.parse!
17 |
18 | if ARGV.length < 1
19 | puts "Usage: bin/html2doc filename [options]"
20 | exit
21 | end
22 |
23 | Html2Doc.new(
24 | filename: ARGV[0].gsub(/\.html?$/, ""),
25 | stylesheet: options[:stylesheet],
26 | header_file: options[:header],
27 | ).process(File.read(ARGV[0], encoding: "utf-8"))
28 |
--------------------------------------------------------------------------------
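The option handling in bin/html2doc can be tried in isolation. The following is a minimal sketch, not the gem's code: `parse_cli` and `output_stem` are hypothetical helper names, and the sketch uses only Ruby's standard `optparse`. It returns the parsed flags and the remaining positional arguments, then derives the output stem the same way the script does, by stripping a trailing `.htm` or `.html`.

```ruby
require "optparse"

# Hypothetical helper: parse html2doc-style flags from an argv array.
# Returns [options, positional_args].
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: bin/html2doc filename [options]"
    opts.on("--stylesheet FILE.CSS", "Use the provided stylesheet") do |v|
      options[:stylesheet] = v
    end
    opts.on("--header HEADER.HTML", "Use the provided header") do |v|
      options[:header] = v
    end
  end
  # parse (without the bang) works on a copy, so argv is left untouched
  rest = parser.parse(argv)
  [options, rest]
end

# Hypothetical helper: strip a trailing .htm/.html to get the output stem.
def output_stem(path)
  path.sub(/\.html?$/, "")
end

options, rest = parse_cli(["--stylesheet", "custom.css", "report.html"])
puts "stylesheet: #{options[:stylesheet]}, stem: #{output_stem(rest.first)}"
```

Keeping the parsing in a function that takes an array, rather than reading `ARGV` directly, makes it straightforward to exercise from a spec.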
/.github/workflows/release.yml:
--------------------------------------------------------------------------------
1 | # Auto-generated by Cimas: Do not edit it manually!
2 | # See https://github.com/metanorma/cimas
3 | name: release
4 |
5 | on:
6 |   workflow_dispatch:
7 |     inputs:
8 |       next_version:
9 |         description: |
10 |           Next release version. Possible values: x.y.z, major, minor, patch (or pre|rc|etc).
11 |           Also, you can pass 'skip' to skip 'git tag' and do 'gem push' for the current version
12 |         required: true
13 |         default: 'skip'
14 |   repository_dispatch:
15 |     types: [ do-release ]
16 |
17 | jobs:
18 |   release:
19 |     uses: metanorma/ci/.github/workflows/rubygems-release.yml@main
20 |     with:
21 |       next_version: ${{ github.event.inputs.next_version }}
22 |     secrets:
23 |       rubygems-api-key: ${{ secrets.METANORMA_CI_RUBYGEMS_API_KEY }}
24 |       pat_token: ${{ secrets.METANORMA_CI_PAT_TOKEN }}
25 |
26 |
--------------------------------------------------------------------------------
/html2doc.gemspec:
--------------------------------------------------------------------------------
1 | lib = File.expand_path("lib", __dir__)
2 | $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
3 | require "html2doc/version"
4 |
5 | Gem::Specification.new do |spec|
6 | spec.name = "html2doc"
7 | spec.version = Html2Doc::VERSION
8 | spec.authors = ["Ribose Inc."]
9 | spec.email = ["open.source@ribose.com"]
10 |
11 | spec.summary = "Convert HTML document to Microsoft Word document"
12 | spec.description = <<~DESCRIPTION
13 | Convert HTML document to Microsoft Word document.
14 |
15 | This gem is in active development.
16 | DESCRIPTION
17 |
18 | spec.homepage = "https://github.com/metanorma/html2doc"
19 | spec.licenses = ["CC-BY-SA-3.0", "BSD-2-Clause"]
20 |
21 | spec.bindir = "bin"
22 | spec.require_paths = ["lib"]
23 | spec.files = `git ls-files -z`.split("\x0").reject do |f|
24 | f.match(%r{^(test|spec|features|bin|.github)/}) \
25 | || f.match(%r{Rakefile|bin/rspec})
26 | end
27 | spec.required_ruby_version = Gem::Requirement.new(">= 2.7.0")
28 |
29 | spec.add_dependency "base64"
30 | spec.add_dependency "htmlentities", "~> 4.3.4"
31 | spec.add_dependency "lutaml-model", "~> 0.7.0"
32 | spec.add_dependency "metanorma-utils", ">= 1.9.0"
33 | spec.add_dependency "mime-types"
34 | spec.add_dependency "nokogiri", "~> 1.18.3"
35 | spec.add_dependency "plane1converter", "~> 0.0.1"
36 | spec.add_dependency "plurimath", "~> 0.9.0"
37 | spec.add_dependency "thread_safe"
38 | spec.add_dependency "uuidtools"
39 | spec.add_dependency "unitsml"
40 | spec.add_dependency "vectory", "~> 0.8"
41 |
42 | spec.add_development_dependency "debug"
43 | spec.add_development_dependency "equivalent-xml", "~> 0.6"
44 | spec.add_development_dependency "guard", "~> 2.14"
45 | spec.add_development_dependency "guard-rspec", "~> 4.7"
46 | spec.add_development_dependency "rake", "~> 12.0"
47 | spec.add_development_dependency "rspec", "~> 3.6"
48 | spec.add_development_dependency "rspec-match_fuzzy", "~> 0.2.0"
49 | spec.add_development_dependency "rubocop", "~> 1"
50 | spec.add_development_dependency "rubocop-performance"
51 | spec.add_development_dependency "simplecov", "~> 0.15"
52 | spec.add_development_dependency "timecop", "~> 0.9"
53 | end
54 |
--------------------------------------------------------------------------------
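The `spec.files` filter in the gemspec decides which tracked files ship in the gem. The sketch below applies the same two reject patterns to a hand-written file list instead of `git ls-files` output. `packaged_files` is a hypothetical name, the sample paths are illustrative, and the sketch escapes the dot in `.github` so that it matches only a literal dot.

```ruby
# Hypothetical helper mirroring the gemspec's reject block.
def packaged_files(paths)
  paths.reject do |f|
    # Drop anything under these top-level directories...
    f.match(%r{^(test|spec|features|bin|\.github)/}) ||
      # ...and the Rakefile or the rspec binstub wherever they appear
      f.match(%r{Rakefile|bin/rspec})
  end
end

sample = [
  "lib/html2doc.rb",
  "lib/html2doc/base.rb",
  "spec/spec_helper.rb",
  "bin/html2doc",
  "Rakefile",
  ".github/workflows/rake.yml",
  "README.adoc",
]

puts packaged_files(sample)
```

Under these patterns everything in `bin/` is filtered out, including the `html2doc` script itself.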
/lib/html2doc/xml.rb:
--------------------------------------------------------------------------------
1 | class Html2Doc
2 | NOKOHEAD = <<~HERE.freeze
3 |
5 |
6 |
7 |
8 | HERE
9 |
10 | def to_xhtml(xml)
11 | xml.gsub!(/<\?xml[^<>]*>/, "")
12 | unless /' + xml
15 | end
16 | xml = xml.gsub(/")
17 | .gsub(//, "")
18 | Nokogiri::XML.parse(xml)
19 | end
20 |
21 | DOCTYPE = <<~DOCTYPE.freeze
22 |
23 | DOCTYPE
24 |
25 | def from_xhtml(xml)
26 | xml.to_xml.sub(%{ xmlns="http://www.w3.org/1999/xhtml"}, "")
27 | .sub(DOCTYPE, "").gsub(%{ />}, "/>")
28 | .gsub(//, "/, "")
30 | .gsub("\n-->\n", "\n-->\n")
31 | end
32 |
33 | def msword_fix(doc)
34 | # brain damage in MSWord parser
35 | doc.gsub!(%r{ },
36 | " ")
37 | doc.gsub!(%r{ },
38 | ' ')
39 | doc.gsub!(%r{
},
40 | '
')
41 | doc.gsub!(%r{( ")
42 | doc.gsub!(%r{ }, "/>")
46 | doc.gsub!(%r{>}, "/>")
47 | doc.gsub!(%r{>}, "/>")
48 | doc.gsub!(%r{>}, "/>")
49 | doc.gsub!(%r{>}, "/>")
50 | doc.gsub!(%r{>}, "/>")
51 | doc.gsub!(%r{>}, "/>")
52 | doc.gsub!(%r{<(/)?m:(span|em)\b}, "<\\1\\2")
53 | doc.gsub!(%r{&tab;|&tab;},
54 | ' ')
55 | doc.split(%r{(| )}).each_slice(4).map do |a|
56 | a.size > 2 and a[2] = a[2].gsub(/>\s+</, "><")
57 | a
58 | end.join
59 | end
60 |
61 | PRINT_VIEW = <<~XML.freeze
62 |
63 |
64 |
65 | Print
66 | 100
67 |
68 |
69 |
70 |
71 | XML
72 |
73 | def namespace(root)
74 | { o: "urn:schemas-microsoft-com:office:office",
75 | w: "urn:schemas-microsoft-com:office:word",
76 | v: "urn:schemas-microsoft-com:vml",
77 | m: "http://schemas.microsoft.com/office/2004/12/omml" }.each { |k, v| root.add_namespace_definition(k.to_s, v) }
78 | end
79 |
80 | def rootnamespace(root)
81 | root.add_namespace(nil, "http://www.w3.org/TR/REC-html40")
82 | end
83 | end
84 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | This software is dual-licensed:
2 |
3 | 1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
4 | Unported License http://creativecommons.org/licenses/by-sa/3.0/
5 |
6 | 2. http://www.opensource.org/licenses/BSD-2-Clause
7 |
8 | All rights reserved.
9 |
10 | Redistribution and use in source and binary forms, with or without
11 | modification, are permitted provided that the following conditions are met:
12 |
13 | * Redistributions of source code must retain the above copyright notice, this
14 | list of conditions and the following disclaimer.
15 |
16 | * Redistributions in binary form must reproduce the above copyright notice,
17 | this list of conditions and the following disclaimer in the documentation
18 | and/or other materials provided with the distribution.
19 |
20 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
21 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
22 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
23 | DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
24 | FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
25 | DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
26 | SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
27 | CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
28 | OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
29 | OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 |
31 |
32 |
33 |
34 | LICENSE FOR STYLESHEETS DERIVED FROM https://github.com/TEIC/Stylesheets
35 |
36 | This software is dual-licensed:
37 |
38 | 1. Distributed under a Creative Commons Attribution-ShareAlike 3.0
39 | Unported License http://creativecommons.org/licenses/by-sa/3.0/
40 |
41 | 2. http://www.opensource.org/licenses/BSD-2-Clause
42 |
43 | All rights reserved.
44 |
45 | Redistribution and use in source and binary forms, with or without
46 | modification, are permitted provided that the following conditions are
47 | met:
48 |
49 | * Redistributions of source code must retain the above copyright
50 | notice, this list of conditions and the following disclaimer.
51 |
52 | * Redistributions in binary form must reproduce the above copyright
53 | notice, this list of conditions and the following disclaimer in the
54 | documentation and/or other materials provided with the distribution.
55 |
56 | This software is provided by the copyright holders and contributors
57 | "as is" and any express or implied warranties, including, but not
58 | limited to, the implied warranties of merchantability and fitness for
59 | a particular purpose are disclaimed. In no event shall the copyright
60 | holder or contributors be liable for any direct, indirect, incidental,
61 | special, exemplary, or consequential damages (including, but not
62 | limited to, procurement of substitute goods or services; loss of use,
63 | data, or profits; or business interruption) however caused and on any
64 | theory of liability, whether in contract, strict liability, or tort
65 | (including negligence or otherwise) arising in any way out of the use
66 | of this software, even if advised of the possibility of such damage.
67 |
--------------------------------------------------------------------------------
/CODE_OF_CONDUCT.md:
--------------------------------------------------------------------------------
1 | # Contributor Covenant Code of Conduct
2 |
3 | ## Our Pledge
4 |
5 | In the interest of fostering an open and welcoming environment, we as
6 | contributors and maintainers pledge to making participation in our project and
7 | our community a harassment-free experience for everyone, regardless of age, body
8 | size, disability, ethnicity, gender identity and expression, level of experience,
9 | nationality, personal appearance, race, religion, or sexual identity and
10 | orientation.
11 |
12 | ## Our Standards
13 |
14 | Examples of behavior that contributes to creating a positive environment
15 | include:
16 |
17 | * Using welcoming and inclusive language
18 | * Being respectful of differing viewpoints and experiences
19 | * Gracefully accepting constructive criticism
20 | * Focusing on what is best for the community
21 | * Showing empathy towards other community members
22 |
23 | Examples of unacceptable behavior by participants include:
24 |
25 | * The use of sexualized language or imagery and unwelcome sexual attention or
26 | advances
27 | * Trolling, insulting/derogatory comments, and personal or political attacks
28 | * Public or private harassment
29 | * Publishing others' private information, such as a physical or electronic
30 | address, without explicit permission
31 | * Other conduct which could reasonably be considered inappropriate in a
32 | professional setting
33 |
34 | ## Our Responsibilities
35 |
36 | Project maintainers are responsible for clarifying the standards of acceptable
37 | behavior and are expected to take appropriate and fair corrective action in
38 | response to any instances of unacceptable behavior.
39 |
40 | Project maintainers have the right and responsibility to remove, edit, or
41 | reject comments, commits, code, wiki edits, issues, and other contributions
42 | that are not aligned to this Code of Conduct, or to ban temporarily or
43 | permanently any contributor for other behaviors that they deem inappropriate,
44 | threatening, offensive, or harmful.
45 |
46 | ## Scope
47 |
48 | This Code of Conduct applies both within project spaces and in public spaces
49 | when an individual is representing the project or its community. Examples of
50 | representing a project or community include using an official project e-mail
51 | address, posting via an official social media account, or acting as an appointed
52 | representative at an online or offline event. Representation of a project may be
53 | further defined and clarified by project maintainers.
54 |
55 | ## Enforcement
56 |
57 | Instances of abusive, harassing, or otherwise unacceptable behavior may be
58 | reported by contacting the project team at ronald.tse@ribose.com. All
59 | complaints will be reviewed and investigated and will result in a response that
60 | is deemed necessary and appropriate to the circumstances. The project team is
61 | obligated to maintain confidentiality with regard to the reporter of an incident.
62 | Further details of specific enforcement policies may be posted separately.
63 |
64 | Project maintainers who do not follow or enforce the Code of Conduct in good
65 | faith may face temporary or permanent repercussions as determined by other
66 | members of the project's leadership.
67 |
68 | ## Attribution
69 |
70 | This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
71 | available at [http://contributor-covenant.org/version/1/4][version]
72 |
73 | [homepage]: http://contributor-covenant.org
74 | [version]: http://contributor-covenant.org/version/1/4/
75 |
--------------------------------------------------------------------------------
/lib/html2doc/notes.rb:
--------------------------------------------------------------------------------
1 | require "uuidtools"
2 |
3 | class Html2Doc
4 | def footnotes(docxml)
5 | #i = 1
6 | indexes = {}
7 | @footnote_idx = 1
8 | fn = []
9 | docxml.xpath("//a").each do |a|
10 | process_footnote_link(docxml, a, indexes, fn) or next
11 | #i += 1
12 | end
13 | process_footnote_texts(docxml, fn, indexes)
14 | end
15 |
16 | # Currently cannot deal with separate footnote containers in each chapter
17 | # We may eventually need to support that
18 | def process_footnote_texts(docxml, footnotes, indexes)
19 | body = docxml.at("//body")
20 | list = body.add_child("
")
21 | footnotes.each do |f|
22 | #require 'debug'; binding.b
23 | fn = list.first.add_child(footnote_container(docxml, indexes[f["id"]]))
24 | f.parent = fn.first
25 | f["id"] = ""
26 | footnote_div_to_p(f)
27 | end
28 | footnote_cleanup(docxml)
29 | end
30 |
31 | def footnote_div_to_p(elem)
32 | if %w{div aside}.include? elem.name
33 | if elem.at(".//p")
34 | elem.replace(elem.children)
35 | else
36 | elem.name = "p"
37 | elem["class"] = "MsoFootnoteText"
38 | end
39 | end
40 | end
41 |
42 | FN = "".freeze
44 |
45 | def footnote_container(docxml, idx)
46 | ref = docxml&.at("//a[@href='#_ftn#{idx}']")&.children&.to_xml(indent: 0)
47 | &.gsub(/>\n</, "><") || FN
48 | <<~DIV
49 |
52 | DIV
53 | end
54 |
55 | def process_footnote_link(docxml, elem, indexes, footnote)
56 | footnote?(elem) or return false
57 | href = elem["href"].gsub(/^#/, "")
58 | #require "debug"; binding.b
59 | note = docxml.at("//*[@name = '#{href}' or @id = '#{href}']")
60 | note.nil? and return false
61 | unless indexes[href]
62 | indexes[href] = @footnote_idx
63 | @footnote_idx += 1
64 | end
65 | set_footnote_link_attrs(elem, indexes[href])
66 | if elem.at("./span[@class = 'MsoFootnoteReference']")
67 | process_footnote_link1(elem)
68 | else elem.children = FN
69 | end
70 | footnote << transform_footnote_text(note)
71 | end
72 |
73 | def process_footnote_link1(elem)
74 | elem.children.each do |c|
75 | if c.name == "span" && c["class"] == "MsoFootnoteReference"
76 | c.replace(FN)
77 | else
78 | c.wrap("")
79 | end
80 | end
81 | end
82 |
83 | def transform_footnote_text(note)
84 | #note["id"] = ""
85 | note.xpath(".//div").each { |div| div.replace(div.children) }
86 | note.xpath(".//aside | .//p").each do |p|
87 | p.name = "p"
88 | p["class"] = "MsoFootnoteText"
89 | end
90 | note.remove
91 | end
92 |
93 | def footnote?(elem)
94 | elem["epub:type"]&.casecmp("footnote")&.zero? ||
95 | elem["class"]&.casecmp("footnote")&.zero?
96 | end
97 |
98 | def set_footnote_link_attrs(elem, idx)
99 | elem["style"] = "mso-footnote-id:ftn#{idx}"
100 | elem["href"] = "#_ftn#{idx}"
101 | elem["name"] = "_ftnref#{idx}"
102 | elem["title"] = ""
103 | end
104 |
105 | # We expect that the content of the footnote text received is one or
106 | # more text containers, p or aside or div (which we have already converted
107 | # to p). We do not expect any or links back to text; if they
108 | # are present in the HTML, they need to have been cleaned out before
109 | # passing to this gem
110 | def footnote_cleanup(docxml)
111 | docxml.xpath('//div[@style="mso-element:footnote"]/a')
112 | .each do |x|
113 | n = x.next_element
114 | n&.children&.first&.add_previous_sibling(x.remove)
115 | end
116 | docxml
117 | end
118 | end
119 |
--------------------------------------------------------------------------------
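The numbering logic in `process_footnote_link` and `set_footnote_link_attrs` gives each distinct footnote target one index, so repeated links to the same note share a number. The following is a minimal sketch of that bookkeeping alone, using plain strings instead of Nokogiri nodes. `assign_footnote_indexes` is a hypothetical name and the input hrefs are illustrative.

```ruby
# Hypothetical helper: map a sequence of footnote hrefs to the Word-style
# link attributes, reusing the index when a target repeats.
def assign_footnote_indexes(hrefs)
  indexes = {}
  next_idx = 1
  hrefs.map do |href|
    key = href.sub(/^#/, "")
    unless indexes[key]
      indexes[key] = next_idx
      next_idx += 1
    end
    idx = indexes[key]
    {
      style: "mso-footnote-id:ftn#{idx}",
      href: "#_ftn#{idx}",
      name: "_ftnref#{idx}",
    }
  end
end

links = assign_footnote_indexes(["#note-a", "#note-b", "#note-a"])
links.each { |l| puts l[:href] }
```

The first and third links resolve to the same `#_ftn1` target because the index is keyed on the href rather than on the position of the link.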
/lib/html2doc/base.rb:
--------------------------------------------------------------------------------
1 | require "uuidtools"
2 | require "htmlentities"
3 | require "nokogiri"
4 | require "fileutils"
5 |
6 | class Html2Doc
7 | def initialize(hash)
8 | @filename = hash[:filename]
9 | @dir = hash[:dir]
10 | @dir1 = create_dir(@filename, @dir)
11 | @header_file = hash[:header_file]
12 | @asciimathdelims = hash[:asciimathdelims]
13 | @imagedir = hash[:imagedir]
14 | @debug = hash[:debug]
15 | @liststyles = hash[:liststyles]
16 | @stylesheet = read_stylesheet(hash[:stylesheet])
17 | @c = HTMLEntities.new
18 | end
19 |
20 | def process(result)
21 | result = process_html(result)
22 | process_header(@header_file)
23 | generate_filelist(@filename, @dir1)
24 | File.open("#{@filename}.htm", "w:UTF-8") { |f| f.write(result) }
25 | mime_package result, @filename, @dir1
26 | rm_temp_files(@filename, @dir, @dir1) unless @debug
27 | end
28 |
29 | def process_header(headerfile)
30 | headerfile.nil? and return
31 | doc = File.read(headerfile, encoding: "utf-8")
32 | doc = header_image_cleanup(doc, @dir1, @filename,
33 | File.dirname(@filename))
34 | File.open("#{@dir1}/header.html", "w:UTF-8") { |f| f.write(doc) }
35 | end
36 |
37 | def clear_dir(dir)
38 | Dir.foreach(dir) do |f|
39 | fn = File.join(dir, f)
40 | File.delete(fn) if f != "." && f != ".."
41 | end
42 | dir
43 | end
44 |
45 | def create_dir(filename, dir)
46 | dir and return clear_dir(dir)
47 | dir = "#{filename}_files"
48 | FileUtils.mkdir_p(dir)
49 | clear_dir(dir)
50 | end
51 |
52 | def process_html(result)
53 | docxml = to_xhtml(result)
54 | define_head(cleanup(docxml))
55 | msword_fix(from_xhtml(docxml))
56 | end
57 |
58 | def rm_temp_files(filename, dir, dir1)
59 | FileUtils.rm "#{filename}.htm"
60 | FileUtils.rm_f "#{dir1}/header.html"
61 | FileUtils.rm_r dir1 unless dir
62 | end
63 |
64 | def cleanup(docxml)
65 | locate_landscape(docxml)
66 | namespace(docxml.root)
67 | image_cleanup(docxml, @dir1, @imagedir)
68 | mathml_to_ooml(docxml)
69 | lists(docxml, @liststyles)
70 | footnotes(docxml)
71 | bookmarks(docxml)
72 | msonormal(docxml)
73 | docxml
74 | end
75 |
76 | def locate_landscape(_docxml)
77 | @landscape = @stylesheet.scan(/div\.\S+\s+\{\s*page:\s*[^;]+L;\s*\}/m)
78 | .map { |e| e.sub(/^div\.(\S+).*$/m, "\\1") }
79 | end
80 |
81 | def define_head1(docxml, _dir)
82 | docxml.xpath("//*[local-name() = 'head']").each do |h|
83 | h.children.first.add_previous_sibling <<~XML
84 | #{PRINT_VIEW}
85 |
86 | XML
87 | end
88 | end
89 |
90 | def filename_substitute(head, header_filename)
91 | return if header_filename.nil?
92 |
93 | head.xpath(".//*[local-name() = 'style']").each do |s|
94 | s1 = s.to_xml.gsub(/url\("[^"]+"\)/) do |m|
95 | /FILENAME/.match?(m) ? "url(cid:header.html)" : m
96 | end
97 | s.replace(s1)
98 | end
99 | end
100 |
101 | def stylesheet(_filename, _header_filename, _cssname)
102 | stylesheet = "#{@stylesheet}\n#{@newliststyledefs}"
103 | xml = Nokogiri::XML("")
104 | xml.children.first << Nokogiri::XML::CDATA
105 | .new(xml, "\n\n")
106 | xml.root.to_s
107 | end
108 |
109 | def read_stylesheet(cssname)
110 | (cssname.nil? || cssname.empty?) and
111 | cssname = File.join(File.dirname(__FILE__), "wordstyle.css")
112 | File.read(cssname, encoding: "UTF-8")
113 | end
114 |
115 | def define_head(docxml)
116 | title = docxml.at("//*[local-name() = 'head']/*[local-name() = 'title']")
117 | head = docxml.at("//*[local-name() = 'head']")
118 | css = stylesheet(@filename, @header_file, @stylesheet)
119 | add_stylesheet(head, title, css)
120 | filename_substitute(head, @header_file)
121 | define_head1(docxml, @dir1)
122 | rootnamespace(docxml.root)
123 | end
124 |
125 | def add_stylesheet(head, title, css)
126 | if head.children.empty?
127 | head.add_child css
128 | elsif title.nil?
129 | head.children.first.add_previous_sibling css
130 | else title.add_next_sibling css
131 | end
132 | end
133 |
134 | def bookmarks(docxml)
135 | docxml.xpath("//*[@id][not(@name)][not(@style = 'mso-element:footnote')]")
136 | .each do |x|
137 | (x["id"].empty? || x.namespace&.prefix == "v" &&
138 | %w(shapetype shape rect line group).include?(x.name)) and next
139 | if x.children.empty? then x.add_child(" ")
140 | else x.children.first.previous = " "
141 | end
142 | x.delete("id")
143 | end
144 | end
145 |
146 | def msonormal(docxml)
147 | docxml.xpath("//*[local-name() = 'p'][not(self::*[@class])]").each do |p|
148 | p["class"] = "MsoNormal"
149 | end
150 | docxml.xpath("//*[local-name() = 'li'][not(self::*[@class])]").each do |p|
151 | p["class"] = "MsoNormal"
152 | end
153 | end
154 | end
155 |
--------------------------------------------------------------------------------
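`locate_landscape` scans the stylesheet for `div.<class>` rules whose `page:` value ends in `L` and keeps the class names. The sketch below runs the same two regular expressions against a small inline stylesheet. `landscape_classes` is a hypothetical name and the CSS is an illustrative assumption, not an excerpt from wordstyle.css.

```ruby
# Same pattern as in locate_landscape: a div rule whose only declaration
# is a page: value ending in "L".
LANDSCAPE_RULE = /div\.\S+\s+\{\s*page:\s*[^;]+L;\s*\}/m

# Hypothetical helper: return the class names of landscape sections.
def landscape_classes(css)
  css.scan(LANDSCAPE_RULE)
     .map { |rule| rule.sub(/^div\.(\S+).*$/m, "\\1") }
end

css = <<~CSS
  div.WordSection1 { page: WordSection1; }
  div.WordSection2 { page: WordSection2L; }
CSS

puts landscape_classes(css).inspect
```

Only the second rule qualifies, since its page name is the only one that ends in `L` immediately before the semicolon.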
/lib/html2doc/lists.rb:
--------------------------------------------------------------------------------
1 | require "uuidtools"
2 | require "htmlentities"
3 | require "nokogiri"
4 |
5 | class Html2Doc
6 | def style_list(elem, level, liststyle, listnumber)
7 | liststyle or return
8 | if elem["style"]
9 | elem["style"] += ";"
10 | else
11 | elem["style"] = ""
12 | end
13 | elem["style"] += "mso-list:#{liststyle} level#{level} lfo#{listnumber};"
14 | end
15 |
16 | def list_add1(elem, liststyles, listtype, level)
17 | if %i[ul ol].include? listtype
18 | list_add(elem.xpath(".//ul") - elem.xpath(".//ul//ul | .//ol//ul"),
19 | liststyles, :ul, level + 1)
20 | list_add(elem.xpath(".//ol") - elem.xpath(".//ul//ol | .//ol//ol"),
21 | liststyles, :ol, level + 1)
22 | else
23 | list_add(elem.xpath(".//ul") - elem.xpath(".//ul//ul | .//ol//ul"),
24 | liststyles, listtype, level + 1)
25 | list_add(elem.xpath(".//ol") - elem.xpath(".//ul//ol | .//ol//ol"),
26 | liststyles, listtype, level + 1)
27 | end
28 | end
29 |
30 | def list_add(xpath, liststyles, listtype, level)
31 | xpath.each do |l|
32 | level == 1 && l["seen"] = true and @listnumber += 1
33 | l["id"] ||= UUIDTools::UUID.random_create
34 | liststyle = derive_liststyle(l, liststyles[listtype], level)
35 | (l.xpath(".//li") - l.xpath(".//ol//li | .//ul//li")).each do |li|
36 | style_list(li, level, liststyle, @listnumber)
37 | list_add1(li, liststyles, listtype, level)
38 | end
39 | list_add_tail(l, liststyles, listtype, level)
40 | end
41 | end
42 |
43 | def derive_liststyle(list, liststyle, level)
44 | list["start"] && list["start"] != "1" or return liststyle
45 | @liststyledefsidx += 1
46 | ret = "l#{@liststyledefsidx}"
47 | @newliststyledefs += newliststyle(list["start"], liststyle, ret, level)
48 | ret
49 | end
50 |
51 | def newliststyle(start, liststyle, newstylename, level)
52 | s = @liststyledefs[liststyle]
53 | .gsub(/@list\s+#{liststyle}/, "@list #{newstylename}")
54 | .sub(/@list\s+#{newstylename}\s+\{[^}]*\}/m, <<~LISTSTYLE)
55 | @list #{newstylename}\n{mso-list-id:#{rand(100_000_000..999_999_999)};
56 | mso-list-template-ids:#{rand(100_000_000..999_999_999)};}
57 | LISTSTYLE
58 | .sub(/@list\s+#{newstylename}:level#{level}\s+\{/m,
59 | "\\0mso-level-start-at:#{start};\n")
60 | "#{s}\n"
61 | end
62 |
63 | def list_add_tail(list, liststyles, listtype, level)
64 | list.xpath(".//ul[not(ancestor::li/ancestor::*/@id = '#{list['id']}')] | "\
65 | ".//ol[not(ancestor::li/ancestor::*/@id = '#{list['id']}')]")
66 | .each do |li|
67 | list_add1(li.parent, liststyles, listtype, level - 1)
68 | end
69 | end
70 |
71 | def list2para(list)
72 | list.xpath("./li").empty? and return
73 | list2para_position(list)
74 | list.xpath("./li").each do |l|
75 | l.name = "p"
76 | l["class"] ||= "MsoListParagraphCxSpMiddle"
77 | l.first_element_child&.name == "p" or next
78 | l["style"] ||= ""
79 | l["style"] += l.first_element_child["style"]
80 | &.sub(/mso-list[^;]+;/, "") || ""
81 | l.first_element_child.replace(l.first_element_child.children)
82 | end
83 | list.replace(list.children)
84 | end
85 |
86 | def list2para_position(list)
87 | list.xpath("./li").first["class"] ||= "MsoListParagraphCxSpFirst"
88 | list.xpath("./li").last["class"] ||= "MsoListParagraphCxSpLast"
89 | list.xpath("./li/p").each do |p|
90 | p["class"] ||= "MsoListParagraphCxSpMiddle"
91 | end
92 | end
93 |
94 | TOPLIST = "[not(ancestor::ul) and not(ancestor::ol)]".freeze
95 |
96 | def lists1(docxml, liststyles, style)
97 | case style
98 | when :ul then list_add(docxml.xpath("//ul[not(@class)]#{TOPLIST}"),
99 | liststyles, :ul, 1)
100 | when :ol then list_add(docxml.xpath("//ol[not(@class)]#{TOPLIST}"),
101 | liststyles, :ol, 1)
102 | else
103 | list_add(docxml.xpath("//ol[@class = '#{style}']#{TOPLIST} | "\
104 | "//ul[@class = '#{style}']#{TOPLIST}"),
105 | liststyles, style, 1)
106 | end
107 | end
108 |
109 | def lists_unstyled(docxml, liststyles)
110 | liststyles.has_key?(:ul) and
111 | list_add(docxml.xpath("//ul#{TOPLIST}[not(@seen)]"),
112 | liststyles, :ul, 1)
113 | liststyles.has_key?(:ol) and
114 | list_add(docxml.xpath("//ol#{TOPLIST}[not(@seen)]"),
115 | liststyles, :ol, 1)
116 | docxml.xpath("//ul[@seen] | //ol[@seen]").each do |l|
117 | l.delete("seen")
118 | end
119 | end
120 |
121 | def lists(docxml, liststyles)
122 | liststyles.nil? and return
123 | parse_stylesheet_line_styles
124 | liststyles.each_key { |k| lists1(docxml, liststyles, k) }
125 | lists_unstyled(docxml, liststyles)
126 | liststyles.has_key?(:ul) and docxml.xpath("//ul").each { |u| list2para(u) }
127 | liststyles.has_key?(:ol) and docxml.xpath("//ol").each { |u| list2para(u) }
128 | end
129 |
130 | def parse_stylesheet_line_styles
131 | @listnumber = 0
132 | result = process_stylesheet_lines(@stylesheet.split("\n"))
133 | @liststyledefs = clean_result_content(result)
134 | @newliststyledefs = ""
135 | @liststyledefsidx = @liststyledefs.keys.map do |k|
136 | k.sub(/^.*(\d+)$/, "\\1").to_i
137 | end.max
138 | end
139 |
140 | private
141 |
142 | def extract_list_name(line)
143 | match = line.match(/^\s*@list\s+([^:\s]+)(?::.*)?/)
144 | match ? match[1] : nil
145 | end
146 |
147 | def list_declaration?(line)
148 | !extract_list_name(line).nil?
149 | end
150 |
151 | def save_current_list(result, current_base, current_content)
152 | current_base.nil? || current_content.empty? and return result
153 | if result[current_base]
154 | result[current_base] += current_content
155 | else
156 | result[current_base] = current_content
157 | end
158 | result
159 | end
160 |
161 | def process_stylesheet_lines(lines)
162 | result = {}
163 | current_base = nil
164 | current_content = ""
165 | parsing_active = false
166 |
167 | lines.each do |line|
168 | if list_declaration?(line)
169 | base_name = extract_list_name(line)
170 | if current_base == base_name
171 | current_content += "#{line}\n"
172 | else
173 | # save accumulated list style definition, new list style
174 | save_current_list(result, current_base, current_content)
175 | current_base = base_name
176 | current_content = "#{line}\n"
177 | end
178 | parsing_active = true
179 |
180 | elsif parsing_active && line.include?("}")
181 | # End of current block - add this line and stop parsing
182 | current_content += "#{line}\n"
183 | parsing_active = false
184 |
185 | elsif parsing_active
186 | # Continue adding content while parsing is active
187 | current_content += "#{line}\n"
188 | end
189 | # If parsing_active is false and no @list declaration, skip the line
190 | end
191 | # Save the last list if we were still parsing
192 | save_current_list(result, current_base, current_content)
193 | result
194 | end
195 |
196 | def clean_result_content(result)
197 | result.each { |k, v| result[k] = v.rstrip }
198 | result
199 | end
200 | end
201 |
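The stylesheet pass in `process_stylesheet_lines` collects every `@list` block under its base list name, so that `l2` and `l2:level1` end up in one entry. The following is a minimal standalone sketch of that grouping idea, not the gem's code: `group_list_blocks` is a hypothetical helper name, and the sample CSS is invented for illustration.

```ruby
# Hypothetical helper (not part of html2doc): groups "@list" CSS blocks
# by base list name, so "@list l2" and "@list l2:level1" share one entry.
def group_list_blocks(css)
  result = Hash.new { |h, k| h[k] = +"" }
  current = nil
  active = false
  css.each_line(chomp: true) do |line|
    if (m = line.match(/^\s*@list\s+([^:\s]+)/))
      # A new declaration: remember its base name and start collecting.
      current = m[1]
      result[current] << line << "\n"
      active = true
    elsif active
      # Body of the current block; a closing brace ends the block.
      result[current] << line << "\n"
      active = false if line.include?("}")
    end
    # Lines outside any @list block are ignored.
  end
  result.transform_values(&:rstrip)
end

css = <<~CSS
  p.MsoNormal {margin:0cm;}
  @list l2
  {mso-list-id:1;}
  @list l2:level1
  {mso-level-text:%1;}
  @list l3
  {mso-list-id:2;}
CSS

blocks = group_list_blocks(css)
puts blocks.keys.inspect
puts blocks["l2"]
```

Keying on the base name is what lets a later step copy a whole list definition, including all of its levels, under a new name when a list needs a different start value.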
--------------------------------------------------------------------------------
/lib/html2doc/math.rb:
--------------------------------------------------------------------------------
1 | require "uuidtools"
2 | require "plurimath"
3 | require "htmlentities"
4 | require "nokogiri"
5 | require "plane1converter"
6 | require "metanorma-utils"
7 |
8 | module Nokogiri
9 | module XML
10 | class Node
11 | OOXML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math".freeze
12 |
13 | def ooxml_xpath(path)
14 | p = Metanorma::Utils::ns(path).gsub("xmlns:", "m:")
15 | xpath(p, "m" => OOXML_NS)
16 | end
17 | end
18 | end
19 | end
20 |
21 | class Html2Doc
22 | def progress_conv(idx, step, total, threshold, msg)
23 | return unless (idx % step).zero? && total > threshold && idx.positive?
24 |
25 | warn "#{msg} #{idx} of #{total}"
26 | end
27 |
28 | def unwrap_accents(doc)
29 | doc.xpath("//*[@accent = 'true']").each do |x|
30 | x.elements.length > 1 or next
31 | x.elements[1].name == "mrow" and
32 | x.elements[1].replace(x.elements[1].children)
33 | end
34 | doc
35 | end
36 |
37 | MATHML_NS = "http://www.w3.org/1998/Math/MathML".freeze
38 |
39 | # random fixes to MathML input that OOXML needs to render properly
40 | def ooxml_cleanup(math, docnamespaces)
41 | # encode_math(
42 | unwrap_accents(
43 | mathml_preserve_space(
44 | mathml_insert_rows(math, docnamespaces), docnamespaces
45 | ),
46 | )
47 | # )
48 | math.add_namespace(nil, MATHML_NS)
49 | math
50 | end
51 |
52 | def encode_math(elem)
53 | elem.traverse do |e|
54 | e.text? or next
55 | e.text.strip.empty? and next
56 | e.replace(@c.encode(e.text, :hexadecimal))
57 | end
58 | elem
59 | end
60 |
61 | def mathml_insert_rows(math, docnamespaces)
62 | math.xpath(%w(msup msub msubsup munder mover munderover)
63 | .map { |m| ".//xmlns:#{m}" }.join(" | "), docnamespaces).each do |x|
64 | next unless x.next_element && x.next_element.name != "mrow"
65 |
66 | x.next_element.wrap("<mrow/>")
67 | end
68 | math
69 | end
70 |
71 | def mathml_preserve_space(math, docnamespaces)
72 | math.xpath(".//xmlns:mtext", docnamespaces).each do |x|
73 | x.children = x.children.to_xml.gsub(/^\s/, "&#xA0;").gsub(/\s$/, "&#xA0;")
74 | end
75 | math
76 | end
77 |
78 | HTML_NS = 'xmlns="http://www.w3.org/1999/xhtml"'.freeze
79 |
80 | def wrap_text(elem, wrapper)
81 | elem.traverse do |e|
82 | e.text? or next
83 | e.text.strip.empty? and next
84 | e.wrap(wrapper)
85 | end
86 | end
87 |
88 | def unitalic(math)
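89 | math.ooxml_xpath(".//r[rPr[not(m:scr)]/sty[@m:val = 'p']]").each do |x|
90 | wrap_text(x, "<span #{HTML_NS} style='font-style:normal;'></span>")
91 | end
92 | math.ooxml_xpath(".//r[rPr[not(m:scr)]/sty[@m:val = 'bi']]").each do |x|
93 | wrap_text(x,
94 | "<span #{HTML_NS} class='nostem' style='font-weight:bold;'><em></em></span>")
95 | end
96 | math.ooxml_xpath(".//r[rPr[not(m:scr)]/sty[@m:val = 'i']]").each do |x|
97 | wrap_text(x, "<span #{HTML_NS} class='nostem'><em></em></span>")
98 | end
99 | math.ooxml_xpath(".//r[rPr[not(m:scr)]/sty[@m:val = 'b']]").each do |x|
100 | wrap_text(x,
101 | "<span #{HTML_NS} style='font-style:normal;font-weight:bold;'></span>")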
102 | end
103 | math.ooxml_xpath(".//r[rPr/scr[@m:val = 'monospace']]").each do |x|
104 | to_plane1(x, :monospace)
105 | end
106 | math.ooxml_xpath(".//r[rPr/scr[@m:val = 'double-struck']]").each do |x|
107 | to_plane1(x, :doublestruck)
108 | end
109 | math.ooxml_xpath(".//r[rPr[not(m:sty) or m:sty/@m:val = 'p']/scr[@m:val = 'script']]").each do |x|
110 | to_plane1(x, :script)
111 | end
112 | math.ooxml_xpath(".//r[rPr[m:sty/@m:val = 'b']/scr[@m:val = 'script']]").each do |x|
113 | to_plane1(x, :scriptbold)
114 | end
115 | math.ooxml_xpath(".//r[rPr[not(m:sty) or m:sty/@m:val = 'p']/scr[@m:val = 'fraktur']]").each do |x|
116 | to_plane1(x, :fraktur)
117 | end
118 | math.ooxml_xpath(".//r[rPr[m:sty/@m:val = 'b']/scr[@m:val = 'fraktur']]").each do |x|
119 | to_plane1(x, :frakturbold)
120 | end
121 | math.ooxml_xpath(".//r[rPr[not(m:sty) or m:sty/@m:val = 'p']/scr[@m:val = 'sans-serif']]").each do |x|
122 | to_plane1(x, :sans)
123 | end
124 | math.ooxml_xpath(".//r[rPr[m:sty/@m:val = 'b']/scr[@m:val = 'sans-serif']]").each do |x|
125 | to_plane1(x, :sansbold)
126 | end
127 | math.ooxml_xpath(".//r[rPr[m:sty/@m:val = 'i']/scr[@m:val = 'sans-serif']]").each do |x|
128 | to_plane1(x, :sansitalic)
129 | end
130 | math.ooxml_xpath(".//r[rPr[m:sty/@m:val = 'bi']/scr[@m:val = 'sans-serif']]").each do |x|
131 | to_plane1(x, :sansbolditalic)
132 | end
133 | math
134 | end
135 |
136 | def to_plane1(xml, font)
137 | xml.traverse do |n|
138 | next unless n.text?
139 |
140 | n.replace(Plane1Converter.conv(@c.decode(n.text), font))
141 | end
142 | xml
143 | end
144 |
145 | def mathml_to_ooml(docxml)
146 | docnamespaces = docxml.collect_namespaces
147 | m = docxml.xpath("//*[local-name() = 'math']")
148 | m.each_with_index do |x, i|
149 | progress_conv(i, 100, m.size, 500, "Math OOXML")
150 | mathml_to_ooml1(x, docnamespaces)
151 | end
152 | end
153 |
154 | # We need span and em not to be namespaced. Word can't deal with explicit
155 | # namespaces.
156 | # We will end up stripping them out again under Nokogiri 1.11, which correctly
157 | # insists on inheriting namespace from parent.
158 | def ooml_clean(xml)
159 | xml.to_xml(indent: 0)
160 | .gsub(/<\?[^>]+>\s*/, "")
161 | .gsub(/ xmlns(:[^=]+)?="[^"]+"/, "")
162 | # .gsub(%r{<(/)?(?!span)(?!em)([a-z])}, "<\\1m:\\2")
163 | end
164 |
165 | def mathml_to_ooml1(xml, docnamespaces)
166 | doc = Nokogiri::XML::Document::new
167 | doc.root = ooxml_cleanup(xml, docnamespaces)
168 | # d = xml.parent["block"] != "false" # display_style
169 | ooxml = Nokogiri::XML(Plurimath::Math
170 | .parse(doc.root.to_xml(indent: 0), :mathml)
171 | .to_omml(split_on_linebreak: true))
172 | ooxml = unitalic(accent_tr(ooxml))
173 | ooxml = ooml_clean(uncenter(xml, ooxml))
174 | xml.swap(ooxml)
175 | end
176 |
177 | def accent_tr(xml)
178 | xml.ooxml_xpath(".//accPr/chr").each do |x|
179 | x["m:val"] &&= accent_tr1(x["m:val"])
180 | x["val"] &&= accent_tr1(x["val"])
181 | end
182 | xml
183 | end
184 |
185 | def accent_tr1(accent)
186 | case accent
187 | when "\u2192" then "\u20D7"
188 | when "^" then "\u0302"
189 | when "~" then "\u0303"
190 | else accent
191 | end
192 | end
193 |
194 | OOXML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math".freeze
195 |
196 | def math_only_para?(node)
197 | x = node.dup
198 | x.xpath(".//m:math", "m" => MATHML_NS).each(&:remove)
199 | x.xpath(".//m:oMathPara | .//m:oMath", "m" => OOXML_NS).each(&:remove)
200 | x.xpath(".//m:oMathPara | .//m:oMath").each(&:remove)
201 | # namespace can go missing during processing
202 | x.text.strip.empty?
203 | end
204 |
205 | def math_block?(ooxml, mathml)
206 | # ooxml.name == "oMathPara" || mathml["displaystyle"] == "true"
207 | mathml["displaystyle"] == "true" &&
208 | ooxml.xpath("./m:oMath", "m" => OOXML_NS).size <= 1
209 | end
210 |
211 | STYLE_BEARING_NODE =
212 | %w(p div td th li).map { |x| ".//ancestor::#{x}" }.join(" | ").freeze
213 |
214 | # if ooxml has no siblings, by default it is centered; override this with
215 | # left/right if parent is so tagged
216 | # also if ooxml has mathPara already, or is in para with only oMath content
217 | def uncenter(math, ooxml)
218 | alignnode = math.xpath(STYLE_BEARING_NODE).last
219 | ooxml.document? and ooxml = ooxml.root
220 | ret = uncenter_unneeded(math, ooxml, alignnode) and return ret
221 | dir = ooxml_alignment(alignnode)
222 | ooxml.name == "oMathPara" or ooxml.wrap("<m:oMathPara></m:oMathPara>")
223 | ooxml.elements.first.previous =
224 | "<m:oMathParaPr><m:jc m:val='#{dir}'/></m:oMathParaPr>"
225 | ooxml
226 | end
227 |
228 | def ooxml_alignment(alignnode)
229 | dir = "left"
230 | /text-align:\s*right/.match?(alignnode["style"]) and dir = "right"
231 | /text-align:\s*center/.match?(alignnode["style"]) and dir = "center"
232 | dir
233 | end
234 |
235 | def uncenter_unneeded(math, ooxml, alignnode)
236 | (math_block?(ooxml, math) || !alignnode) and return ooxml
237 | math_only_para?(alignnode) and return nil
238 | ooxml.name == "oMathPara" and
239 | ooxml = ooxml.elements.select { |x| %w(oMath r).include?(x.name) }
240 | ooxml.size > 1 ? nil : Nokogiri::XML::NodeSet.new(math.document, ooxml)
241 | end
242 | end
243 |
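`accent_tr1` swaps spacing accent characters for their combining equivalents and passes any other character through unchanged. A table-driven sketch of the same mapping follows; the constant `COMBINING_ACCENTS` and the method `combining_accent` are illustrative names, not part of the gem.

```ruby
# Illustrative names only: the same three mappings as accent_tr1,
# expressed as a lookup table with a pass-through default.
COMBINING_ACCENTS = {
  "\u2192" => "\u20D7", # rightwards arrow -> combining right arrow above
  "^" => "\u0302",      # circumflex -> combining circumflex accent
  "~" => "\u0303",      # tilde -> combining tilde
}.freeze

def combining_accent(char)
  # Hash#fetch with a default returns the input when no mapping exists.
  COMBINING_ACCENTS.fetch(char, char)
end

# Print the code point of the translated circumflex.
puts combining_accent("^").unpack("U*").map { |c| format("U+%04X", c) }.join
```

A table keeps the mapping in one place if more accents need translating later, at the cost of one extra constant.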
--------------------------------------------------------------------------------
/lib/html2doc/mime.rb:
--------------------------------------------------------------------------------
1 | require "uuidtools"
2 | require "base64"
3 | require "mime/types"
4 | require "fileutils"
5 | require "vectory"
6 |
7 | class Html2Doc
8 | def mime_preamble(boundary, filename, result)
9 | <<~"PREAMBLE"
10 | MIME-Version: 1.0
11 | Content-Type: multipart/related; boundary="#{boundary}"
12 |
13 | --#{boundary}
14 | Content-ID: <#{File.basename(filename)}>
15 | Content-Disposition: inline; filename="#{File.basename(filename)}"
16 | Content-Type: text/html; charset="utf-8"
17 |
18 | #{result}
19 |
20 | PREAMBLE
21 | end
22 |
23 | def mime_attachment(boundary, _filename, item, dir)
24 | content_type = mime_type(item)
25 | text_mode = %w[text application].any? { |p| content_type.start_with? p }
26 |
27 | path = File.join(dir, item)
28 | content = text_mode ? File.read(path, encoding: "utf-8") : IO.binread(path)
29 |
30 | encoded_file = Base64.strict_encode64(content).gsub(/(.{76})/, "\\1\n")
31 | <<~"FILE"
32 | --#{boundary}
33 | Content-ID: <#{File.basename(item)}>
34 | Content-Disposition: inline; filename="#{File.basename(item)}"
35 | Content-Transfer-Encoding: base64
36 | Content-Type: #{content_type}
37 |
38 | #{encoded_file}
39 |
40 | FILE
41 | end
42 |
43 | def mime_type(item)
44 | types = MIME::Types.type_for(item)
45 | types.empty? and return 'text/plain; charset="utf-8"'
46 | type = types.first.to_s
47 | /^text/.match?(type) ? %(#{type} charset="utf-8") : type
48 | end
49 |
50 | def mime_boundary
51 | salt = UUIDTools::UUID.random_create.to_s.tr("-", ".")[0..17]
52 | "----=_NextPart_#{salt}"
53 | end
54 |
55 | def mime_package(result, filename, dir)
56 | boundary = mime_boundary
57 | mhtml = mime_preamble(boundary, "#{filename}.htm", result)
58 | mhtml += mime_attachment(boundary, "#{filename}.htm", "filelist.xml", dir)
59 | Dir.foreach(dir) do |item|
60 | next if item == "." || item == ".." || /^\./.match(item) ||
61 | item == "filelist.xml"
62 |
63 | mhtml += mime_attachment(boundary, "#{filename}.htm", item, dir)
64 | end
65 | mhtml += "--#{boundary}--"
66 | File.open("#{filename}.doc", "w:UTF-8") { |f| f.write contentid(mhtml) }
67 | end
68 |
69 | def contentid(mhtml)
70 | mhtml.gsub %r{(<img[^<>]*?src=")([^"'<]+)(['"])}m do |m|
71 | repl = "#{$1}cid:#{File.basename($2)}#{$3}"
72 | /^data:|^https?:/ =~ $2 ? m : repl
73 | end.gsub %r{(<v:imagedata[^<>]*?src=")([^"'<]+)(['"])}m do |m|
74 | repl = "#{$1}cid:#{File.basename($2)}#{$3}"
75 | /^data:|^https?:/ =~ $2 ? m : repl
76 | end
77 | end
78 |
79 | IMAGE_PATH = "//*[local-name() = 'img' or local-name() = 'imagedata']".freeze
80 |
81 | def mkuuid
82 | UUIDTools::UUID.random_create.to_s
83 | end
84 |
85 | def warnsvg(src)
86 | warn "#{src}: SVG not supported" if /\.svg$/i.match?(src)
87 | end
88 |
89 | def localname(src, localdir)
90 | %r{^([A-Z]:)?/}.match?(src) ? src : File.join(localdir, src)
91 | end
92 |
93 | IMAGE_IMAGEDATA =
94 | ".//*[local-name() = 'img' or local-name() = 'imagedata']".freeze
95 |
96 | # only processes locally stored images
97 | def image_cleanup(docxml, dir, localdir)
98 | maxheight, maxwidth = page_dimensions(docxml)
99 | docxml.xpath(IMAGE_IMAGEDATA).each do |i|
100 | skip_image_cleanup?(i) and next
101 | local_filename = rename_image(i, dir, localdir)
102 | if tr_ancestor = i.xpath("ancestor::*[local-name() = 'tr']").first
103 | image_count = tr_ancestor.xpath(IMAGE_IMAGEDATA).count
104 | image_resize(i, local_filename, maxheight, maxwidth, image_count)
105 | else # Normal behavior for non-table images
106 | image_resize(i, local_filename, maxheight, maxwidth, 1)
107 | end
108 | end
109 | docxml
110 | end
111 |
112 | def image_resize(img, local_filename, maxheight, maxwidth, image_count)
113 | img["width"], img["height"] =
114 | if landscape?(img)
115 | Vectory.image_resize(img, local_filename, maxwidth,
116 | maxheight / image_count)
117 | else
118 | Vectory.image_resize(img, local_filename, maxheight,
119 | maxwidth / image_count)
120 | end
121 | end
122 |
123 | def landscape?(img)
124 | img.ancestors.each do |a|
125 | a.name == "div" or next
126 | @landscape.include?(a["class"]) and return true
127 | end
128 | false
129 | end
130 |
131 | def rename_image(img, dir, localdir)
132 | local_filename = localname(img["src"], localdir)
133 | new_filename = "#{mkuuid}#{File.extname(img['src'])}"
134 | FileUtils.cp local_filename, File.join(dir, new_filename)
135 | img["src"] = File.join(File.basename(dir), new_filename)
136 | local_filename
137 | end
138 |
139 | def skip_image_cleanup?(img)
140 | src = img["src"]
141 | (img.element? && %w(img imagedata).include?(img.name)) or return true
142 | (src.nil? || src.empty? || /^http/.match?(src) ||
143 | %r{^data:(image|application)/[^;]+;base64}.match?(src)) and return true
144 | false
145 | end
146 |
147 | # we are going to use the 2nd instance of @page in the Word CSS,
148 | # skipping the cover page. Currently doesn't deal with Landscape.
149 | # Scan both @stylesheet and docxml.to_xml (where @standardstylesheet has ended up)
150 | # Allow 0.9 * height to fit caption
151 | def page_dimensions(docxml)
152 | page_size = find_page_size_in_doc(@stylesheet, docxml.to_xml) or
153 | return [680, 400]
154 | m_size = /size:\s*(\S+)\s+(\S+)\s*;/.match(page_size) or return [680, 400]
155 | m_marg = /margin:\s*(\S+)\s+(\S+)\s*(\S+)\s*(\S+)\s*;/.match(page_size) or
156 | return [680, 400]
157 | [0.9 * (units_to_px(m_size[2]) - units_to_px(m_marg[1]) - units_to_px(m_marg[3])),
158 | units_to_px(m_size[1]) - units_to_px(m_marg[2]) - units_to_px(m_marg[4])]
159 | rescue StandardError
160 | [680, 400]
161 | end
162 |
163 | def find_page_size_in_doc(stylesheet, doc)
164 | find_page_size(stylesheet, "WordSection2", false) ||
165 | find_page_size(stylesheet, "WordSection3", false) ||
166 | find_page_size(doc, "WordSection2", true) ||
167 | find_page_size(doc, "WordSection3", true) ||
168 | find_page_size(stylesheet, "", false) || find_page_size(doc, "", true)
169 | end
170 |
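171 | # if in_xml, CSS is embedded in an XML <![CDATA[ section
172 | def find_page_size(stylesheet, klass, in_xml)
173 | xml_found = false
174 | found = false
175 | ret = ""
176 | stylesheet&.lines&.each do |l|
177 | in_xml && l.include?("<![CDATA[") and xml_found = true
178 | in_xml && l.include?("]]>") and xml_found = false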
179 | /^\s*@page\s+#{klass}/.match?(l) and found = true
180 | found && /^\s*\{?size:/.match?(l) and ret += l
181 | found && /^\s*\{?margin:/.match?(l) and ret += l
182 | if found && /}/.match?(l)
183 | !ret.blank? && (!in_xml || xml_found) and return ret
184 | ret = ""
185 | found = false
186 | end
187 | end
188 | nil
189 | end
190 |
191 | def units_to_px(measure)
192 | m = /^(\S+)(px|pt|cm|in)/.match(measure)
193 | ret = case m[2]
194 | when "px" then (m[1].to_f * 0.75)
195 | when "pt" then m[1].to_f
196 | when "cm" then (m[1].to_f * 28.346456693)
197 | when "in" then (m[1].to_f * 72)
198 | end
199 | ret.to_i
200 | end
201 |
202 | # do not parse the header through Nokogiri, since it will contain
203 | # non-XML like <![if !supportFootnotes]>
204 | def header_image_cleanup(doc, dir, filename, localdir)
205 | doc.split(%r{(<img [^>]*>|<v:imagedata [^>]*>)}).each_slice(2).map do |a|
206 | header_image_cleanup1(a, dir, filename, localdir)
207 | end.join
208 | end
209 |
210 | def header_image_cleanup1(a, dir, _filename, localdir)
211 | if a.size == 2 && !(/ src="https?:/.match a[1]) &&
212 | !(%r{ src="data:(image|application)/[^;]+;base64}.match a[1])
213 | m = / src=['"](?<src>[^"']+)['"]/.match a[1]
214 | m2 = /\.(?<suffix>[a-zA-Z_0-9]+)$/.match m[:src]
215 | new_filename = "#{mkuuid}.#{m2[:suffix]}"
216 | FileUtils.cp localname(m[:src], localdir), File.join(dir, new_filename)
217 | a[1].sub!(%r{ src=['"](?<src>[^"']+)['"]}, " src='cid:#{new_filename}'")
218 | end
219 | a.join
220 | end
221 |
222 | def generate_filelist(filename, dir)
223 | File.open(File.join(dir, "filelist.xml"), "w") do |f|
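224 | f.write %{<xml xmlns:o="urn:schemas-microsoft-com:office:office">
225 | <o:MainFile HRef="../#{filename}.htm"/>}
226 | Dir.entries(dir).sort.each do |item|
227 | (item == "." || item == ".." || /^\./.match(item)) and next
228 | f.write %{ <o:File HRef="#{item}"/>\n}
229 | end
230 | f.write("</xml>\n")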
231 | end
232 | end
233 | end
234 |
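`mime_attachment` base64-encodes each file, breaks the encoded text into 76-character lines, and wraps it in a MIME part delimited by the boundary string. The sketch below shows that shape using only core Ruby. It makes two substitutions that are assumptions rather than the gem's behaviour: `Array#pack("m0")` stands in for `Base64.strict_encode64`, and `SecureRandom.hex` stands in for the UUID-based boundary salt. `mime_part` is a hypothetical name.

```ruby
require "securerandom"

# Hypothetical helper: builds one base64-encoded MIME part.
def mime_part(boundary, name, content_type, bytes)
  # pack("m0") yields base64 without line breaks; scan re-wraps it at 76 columns.
  body = [bytes].pack("m0").scan(/.{1,76}/).join("\n")
  <<~PART
    --#{boundary}
    Content-ID: <#{name}>
    Content-Disposition: inline; filename="#{name}"
    Content-Transfer-Encoding: base64
    Content-Type: #{content_type}

    #{body}

  PART
end

# Assumed boundary format for illustration; the gem derives its salt from a UUID.
boundary = "----=_NextPart_#{SecureRandom.hex(8)}"
part = mime_part(boundary, "sample.bin", "application/octet-stream", "x" * 100)
puts part
```

Wrapping with `scan(/.{1,76}/)` avoids the trailing newline that a `gsub`-based wrap adds when the encoded length is an exact multiple of 76.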
--------------------------------------------------------------------------------
/spec/header.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/spec/examples/header.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/spec/header_img.html:
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
/spec/wordstyle-custom-lists.css:
--------------------------------------------------------------------------------
1 | @list l2
2 | {mso-list-id:_;
3 | mso-list-template-ids:_;}
4 | @list l2:level1
5 | {mso-level-style-link:"Heading 1";
6 | mso-level-text:%1;
7 | mso-level-tab-stop:21.6pt;
8 | mso-level-number-position:left;
9 | margin-left:21.6pt;
10 | text-indent:-21.6pt;
11 | mso-bidi-font-family:"Cambria";
12 | mso-ansi-font-weight:bold;
13 | mso-ansi-font-style:normal;}
14 | @list l2:level2
15 | {mso-level-start-at:3;
16 | mso-level-style-link:"Heading 2";
17 | mso-level-text:"%1\.%2";
18 | mso-level-tab-stop:18.0pt;
19 | mso-level-number-position:left;
20 | margin-left:0cm;
21 | text-indent:0cm;
22 | mso-bidi-font-family:"Cambria";
23 | mso-ansi-font-weight:bold;
24 | mso-ansi-font-style:normal;}
25 | @list l2:level3
26 | {mso-level-style-link:"Heading 3";
27 | mso-level-text:"%1\.%2\.%3";
28 | mso-level-tab-stop:36.0pt;
29 | mso-level-number-position:left;
30 | margin-left:0cm;
31 | text-indent:0cm;
32 | mso-bidi-font-family:"Cambria";
33 | mso-ansi-font-weight:bold;
34 | mso-ansi-font-style:normal;}
35 | @list l2:level4
36 | {mso-level-style-link:"Heading 4";
37 | mso-level-text:"%1\.%2\.%3\.%4";
38 | mso-level-tab-stop:54.0pt;
39 | mso-level-number-position:left;
40 | margin-left:0cm;
41 | text-indent:0cm;
42 | mso-bidi-font-family:"Cambria";
43 | mso-ansi-font-weight:bold;
44 | mso-ansi-font-style:normal;}
45 | @list l2:level5
46 | {mso-level-style-link:"Heading 5";
47 | mso-level-text:"%1\.%2\.%3\.%4\.%5";
48 | mso-level-tab-stop:54.0pt;
49 | mso-level-number-position:left;
50 | margin-left:0cm;
51 | text-indent:0cm;
52 | mso-bidi-font-family:"Cambria";
53 | mso-ansi-font-weight:bold;
54 | mso-ansi-font-style:normal;}
55 | @list l2:level6
56 | {mso-level-style-link:"Heading 6";
57 | mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6";
58 | mso-level-tab-stop:72.0pt;
59 | mso-level-number-position:left;
60 | margin-left:0cm;
61 | text-indent:0cm;
62 | mso-bidi-font-family:"Cambria";
63 | mso-ansi-font-weight:bold;
64 | mso-ansi-font-style:normal;}
65 | @list l2:level7
66 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7";
67 | mso-level-tab-stop:72.0pt;
68 | mso-level-number-position:left;
69 | margin-left:0cm;
70 | text-indent:0cm;
71 | mso-bidi-font-family:"Cambria";}
72 | @list l2:level8
73 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8";
74 | mso-level-tab-stop:90.0pt;
75 | mso-level-number-position:left;
76 | margin-left:0cm;
77 | text-indent:0cm;
78 | mso-bidi-font-family:"Cambria";}
79 | @list l2:level9
80 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8\.%9";
81 | mso-level-tab-stop:90.0pt;
82 | mso-level-number-position:left;
83 | margin-left:0cm;
84 | text-indent:0cm;
85 | mso-bidi-font-family:"Cambria";}
86 | @list l3
87 | {mso-list-id:_;
88 | mso-list-template-ids:_;}
89 | @list l3:level1
90 | {mso-level-style-link:"Heading 1";
91 | mso-level-text:%1;
92 | mso-level-tab-stop:21.6pt;
93 | mso-level-number-position:left;
94 | margin-left:21.6pt;
95 | text-indent:-21.6pt;
96 | mso-bidi-font-family:"Cambria";
97 | mso-ansi-font-weight:bold;
98 | mso-ansi-font-style:normal;}
99 | @list l3:level2
100 | {mso-level-style-link:"Heading 2";
101 | mso-level-text:"%1\.%2";
102 | mso-level-tab-stop:18.0pt;
103 | mso-level-number-position:left;
104 | margin-left:0cm;
105 | text-indent:0cm;
106 | mso-bidi-font-family:"Cambria";
107 | mso-ansi-font-weight:bold;
108 | mso-ansi-font-style:normal;}
109 | @list l3:level3
110 | {mso-level-style-link:"Heading 3";
111 | mso-level-text:"%1\.%2\.%3";
112 | mso-level-tab-stop:36.0pt;
113 | mso-level-number-position:left;
114 | margin-left:0cm;
115 | text-indent:0cm;
116 | mso-bidi-font-family:"Cambria";
117 | mso-ansi-font-weight:bold;
118 | mso-ansi-font-style:normal;}
119 | @list l3:level4
120 | {mso-level-start-at:5;
121 | mso-level-style-link:"Heading 4";
122 | mso-level-text:"%1\.%2\.%3\.%4";
123 | mso-level-tab-stop:54.0pt;
124 | mso-level-number-position:left;
125 | margin-left:0cm;
126 | text-indent:0cm;
127 | mso-bidi-font-family:"Cambria";
128 | mso-ansi-font-weight:bold;
129 | mso-ansi-font-style:normal;}
130 | @list l3:level5
131 | {mso-level-style-link:"Heading 5";
132 | mso-level-text:"%1\.%2\.%3\.%4\.%5";
133 | mso-level-tab-stop:54.0pt;
134 | mso-level-number-position:left;
135 | margin-left:0cm;
136 | text-indent:0cm;
137 | mso-bidi-font-family:"Cambria";
138 | mso-ansi-font-weight:bold;
139 | mso-ansi-font-style:normal;}
140 | @list l3:level6
141 | {mso-level-style-link:"Heading 6";
142 | mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6";
143 | mso-level-tab-stop:72.0pt;
144 | mso-level-number-position:left;
145 | margin-left:0cm;
146 | text-indent:0cm;
147 | mso-bidi-font-family:"Cambria";
148 | mso-ansi-font-weight:bold;
149 | mso-ansi-font-style:normal;}
150 | @list l3:level7
151 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7";
152 | mso-level-tab-stop:72.0pt;
153 | mso-level-number-position:left;
154 | margin-left:0cm;
155 | text-indent:0cm;
156 | mso-bidi-font-family:"Cambria";}
157 | @list l3:level8
158 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8";
159 | mso-level-tab-stop:90.0pt;
160 | mso-level-number-position:left;
161 | margin-left:0cm;
162 | text-indent:0cm;
163 | mso-bidi-font-family:"Cambria";}
164 | @list l3:level9
165 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8\.%9";
166 | mso-level-tab-stop:90.0pt;
167 | mso-level-number-position:left;
168 | margin-left:0cm;
169 | text-indent:0cm;
170 | mso-bidi-font-family:"Cambria";}
171 | @list l4
172 | {mso-list-id:_;
173 | mso-list-template-ids:_;}
174 | @list l4:level1
175 | {mso-level-style-link:"Heading 1";
176 | mso-level-text:%1;
177 | mso-level-tab-stop:21.6pt;
178 | mso-level-number-position:left;
179 | margin-left:21.6pt;
180 | text-indent:-21.6pt;
181 | mso-bidi-font-family:"Cambria";
182 | mso-ansi-font-weight:bold;
183 | mso-ansi-font-style:normal;}
184 | @list l4:level2
185 | {mso-level-style-link:"Heading 2";
186 | mso-level-text:"%1\.%2";
187 | mso-level-tab-stop:18.0pt;
188 | mso-level-number-position:left;
189 | margin-left:0cm;
190 | text-indent:0cm;
191 | mso-bidi-font-family:"Cambria";
192 | mso-ansi-font-weight:bold;
193 | mso-ansi-font-style:normal;}
194 | @list l4:level3
195 | {mso-level-style-link:"Heading 3";
196 | mso-level-text:"%1\.%2\.%3";
197 | mso-level-tab-stop:36.0pt;
198 | mso-level-number-position:left;
199 | margin-left:0cm;
200 | text-indent:0cm;
201 | mso-bidi-font-family:"Cambria";
202 | mso-ansi-font-weight:bold;
203 | mso-ansi-font-style:normal;}
204 | @list l4:level4
205 | {mso-level-style-link:"Heading 4";
206 | mso-level-text:"%1\.%2\.%3\.%4";
207 | mso-level-tab-stop:54.0pt;
208 | mso-level-number-position:left;
209 | margin-left:0cm;
210 | text-indent:0cm;
211 | mso-bidi-font-family:"Cambria";
212 | mso-ansi-font-weight:bold;
213 | mso-ansi-font-style:normal;}
214 | @list l4:level5
215 | {mso-level-start-at:7;
216 | mso-level-style-link:"Heading 5";
217 | mso-level-text:"%1\.%2\.%3\.%4\.%5";
218 | mso-level-tab-stop:54.0pt;
219 | mso-level-number-position:left;
220 | margin-left:0cm;
221 | text-indent:0cm;
222 | mso-bidi-font-family:"Cambria";
223 | mso-ansi-font-weight:bold;
224 | mso-ansi-font-style:normal;}
225 | @list l4:level6
226 | {mso-level-style-link:"Heading 6";
227 | mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6";
228 | mso-level-tab-stop:72.0pt;
229 | mso-level-number-position:left;
230 | margin-left:0cm;
231 | text-indent:0cm;
232 | mso-bidi-font-family:"Cambria";
233 | mso-ansi-font-weight:bold;
234 | mso-ansi-font-style:normal;}
235 | @list l4:level7
236 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7";
237 | mso-level-tab-stop:72.0pt;
238 | mso-level-number-position:left;
239 | margin-left:0cm;
240 | text-indent:0cm;
241 | mso-bidi-font-family:"Cambria";}
242 | @list l4:level8
243 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8";
244 | mso-level-tab-stop:90.0pt;
245 | mso-level-number-position:left;
246 | margin-left:0cm;
247 | text-indent:0cm;
248 | mso-bidi-font-family:"Cambria";}
249 | @list l4:level9
250 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8\.%9";
251 | mso-level-tab-stop:90.0pt;
252 | mso-level-number-position:left;
253 | margin-left:0cm;
254 | text-indent:0cm;
255 | mso-bidi-font-family:"Cambria";}
256 | @list l5
257 | {mso-list-id:_;
258 | mso-list-template-ids:_;}
259 | @list l5:level1
260 | {mso-level-start-at:2;
261 | mso-level-style-link:"Heading 1";
262 | mso-level-text:%1;
263 | mso-level-tab-stop:21.6pt;
264 | mso-level-number-position:left;
265 | margin-left:21.6pt;
266 | text-indent:-21.6pt;
267 | mso-bidi-font-family:"Cambria";
268 | mso-ansi-font-weight:bold;
269 | mso-ansi-font-style:normal;}
270 | @list l5:level2
271 | {mso-level-style-link:"Heading 2";
272 | mso-level-text:"%1\.%2";
273 | mso-level-tab-stop:18.0pt;
274 | mso-level-number-position:left;
275 | margin-left:0cm;
276 | text-indent:0cm;
277 | mso-bidi-font-family:"Cambria";
278 | mso-ansi-font-weight:bold;
279 | mso-ansi-font-style:normal;}
280 | @list l5:level3
281 | {mso-level-style-link:"Heading 3";
282 | mso-level-text:"%1\.%2\.%3";
283 | mso-level-tab-stop:36.0pt;
284 | mso-level-number-position:left;
285 | margin-left:0cm;
286 | text-indent:0cm;
287 | mso-bidi-font-family:"Cambria";
288 | mso-ansi-font-weight:bold;
289 | mso-ansi-font-style:normal;}
290 | @list l5:level4
291 | {mso-level-style-link:"Heading 4";
292 | mso-level-text:"%1\.%2\.%3\.%4";
293 | mso-level-tab-stop:54.0pt;
294 | mso-level-number-position:left;
295 | margin-left:0cm;
296 | text-indent:0cm;
297 | mso-bidi-font-family:"Cambria";
298 | mso-ansi-font-weight:bold;
299 | mso-ansi-font-style:normal;}
300 | @list l5:level5
301 | {mso-level-style-link:"Heading 5";
302 | mso-level-text:"%1\.%2\.%3\.%4\.%5";
303 | mso-level-tab-stop:54.0pt;
304 | mso-level-number-position:left;
305 | margin-left:0cm;
306 | text-indent:0cm;
307 | mso-bidi-font-family:"Cambria";
308 | mso-ansi-font-weight:bold;
309 | mso-ansi-font-style:normal;}
310 | @list l5:level6
311 | {mso-level-style-link:"Heading 6";
312 | mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6";
313 | mso-level-tab-stop:72.0pt;
314 | mso-level-number-position:left;
315 | margin-left:0cm;
316 | text-indent:0cm;
317 | mso-bidi-font-family:"Cambria";
318 | mso-ansi-font-weight:bold;
319 | mso-ansi-font-style:normal;}
320 | @list l5:level7
321 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7";
322 | mso-level-tab-stop:72.0pt;
323 | mso-level-number-position:left;
324 | margin-left:0cm;
325 | text-indent:0cm;
326 | mso-bidi-font-family:"Cambria";}
327 | @list l5:level8
328 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8";
329 | mso-level-tab-stop:90.0pt;
330 | mso-level-number-position:left;
331 | margin-left:0cm;
332 | text-indent:0cm;
333 | mso-bidi-font-family:"Cambria";}
334 | @list l5:level9
335 | {mso-level-text:"%1\.%2\.%3\.%4\.%5\.%6\.%7\.%8\.%9";
336 | mso-level-tab-stop:90.0pt;
337 | mso-level-number-position:left;
338 | margin-left:0cm;
339 | text-indent:0cm;
340 | mso-bidi-font-family:"Cambria";}
341 |
--------------------------------------------------------------------------------
/README.adoc:
--------------------------------------------------------------------------------
1 | = Html2Doc
2 |
3 | https://github.com/metanorma/html2doc/workflows/main/badge.svg
4 |
5 | image:https://img.shields.io/gem/v/html2doc.svg["Gem Version", link="https://rubygems.org/gems/html2doc"]
6 | image:https://github.com/metanorma/html2doc/workflows/rake/badge.svg["Build Status", link="https://github.com/metanorma/html2doc/actions?workflow=rake"]
7 | // image:https://codeclimate.com/github/metanorma/html2doc/badges/gpa.svg["Code Climate", link="https://codeclimate.com/github/metanorma/html2doc"]
8 | image:https://img.shields.io/github/issues-pr-raw/metanorma/html2doc.svg["Pull Requests", link="https://github.com/metanorma/html2doc/pulls"]
9 | image:https://img.shields.io/github/commits-since/metanorma/html2doc/latest.svg["Commits since latest",link="https://github.com/metanorma/html2doc/releases"]
10 |
11 | == Purpose
12 |
13 | Gem to convert an HTML document into Word document (.doc) format. It is intended for automated generation of Microsoft Word documents from HTML documents, which are much more readily crafted.
14 |
15 | == Origin
16 |
17 | This gem originated out of https://github.com/metanorma/metanorma-iso, which creates a Word document from an automatically generated HTML document (created in turn by processing Asciidoc).
18 |
19 | This work is driven by the Word document generation procedure documented in http://sebsauvage.net/wiki/doku.php?id=word_document_generation. For more on the approach taken, and on alternative approaches, see https://github.com/metanorma/html2doc/wiki/Why-not-docx%3F
20 |
21 | == Functions
22 |
23 | The gem currently does the following:
24 |
25 | * Convert any AsciiMath and MathML to Word's native mathematical formatting language, OOXML. Word supports copy-pasting MathML into Word and converting it into OOXML; however, the conversion is not infallible (we have in the past found problems with `\sum`: Word claimed parameters were missing, and inserted dotted squares to indicate as much), and you may need to post-edit the OOXML.
26 | ** The gem does attempt to repair the MathML input, to bring it in line with the expectations of Word's OOXML. If you find any issues with AsciiMath or MathML input, please raise an issue.
27 | * Identify any footnotes in the document (defined as hyperlinks with attributes `class = "Footnote"` or `epub:type = "footnote"`), and render them as Microsoft Word footnotes.
28 | ** The corresponding footnote content is any `div` or `aside` element with the same `@id` attribute as the footnote points to; e.g. `3 `, pointing to `