├── $repo-contents.docx ├── .gitattributes ├── .gitignore ├── LICENSE.txt ├── README.md ├── assets ├── word_web_nav.css ├── word_web_nav.js └── word_web_nav_splitter_bar_icon.png ├── createwebpage ├── $package-contents.docx ├── __init__.py ├── construct_html_sections.py ├── copy_words_embedded_files.py ├── create_web_page.py ├── fix_word_html.py ├── input_parameter_file_keys.py ├── jinja_template.html ├── load_html_files.py ├── load_input_parameter_file.py └── yamllint_config_file.yml ├── docs ├── development-docs │ ├── WWN--input-parameter-file--YAML-use.docx │ ├── WWN--testing--regression-tests.docx │ ├── WWN--testing.docx │ ├── index.docx │ ├── web-page--construction--beautiful-soup--jinja.docx │ ├── web-page--html-tutorials.docx │ ├── web-page--structure--css--jquery.docx │ ├── web-page--structure--css-tutorials.docx │ ├── web-page--structure--design.vsd │ ├── word-html--bugs-and-fixes.docx │ ├── word-html--references-and-info.docx │ └── word-html--table-of-contents.docx ├── index.docx ├── installation.docx └── users-guide.docx ├── readme-figure-1.png ├── readme-figure-2.png ├── requirements.txt ├── templates ├── web_page_create--parameters--all.yml └── web_page_create--parameters--minimum.yml ├── tests └── tests-for-create_web_page_py │ ├── demo.docx │ └── included-photo.jpg └── tools ├── batch_create_web_page.py ├── batch_create_web_page.yml ├── create_web_page_for_all_yml_files.py └── generate_word_html.docm /$repo-contents.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/$repo-contents.docx -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Store Word docs as binary 2 | # * https://stackoverflow.com/questions/30728630/what-should-i-do-if-i-put-ms-office-e-g-docx-or-openoffice-e-g-odt-docum 3 | 
# * https://stackoverflow.com/questions/18331048/how-to-create-a-git-attributes-file 4 | *.docx binary 5 | *.docm binary -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # https://github.com/github/gitignore 2 | 3 | 4 | # Python 5 | # * https://github.com/github/gitignore/blob/master/Python.gitignore 6 | __pycache__/ 7 | 8 | 9 | # Visual Studio Code 10 | # * https://github.com/github/gitignore/blob/master/VisualStudio.gitignore 11 | .vscode/ 12 | workspace.code-workspace 13 | 14 | 15 | # MS Word artifacts 16 | *.wbk 17 | ~$*.docx 18 | ~$*.docm 19 | ~*.tmp 20 | 21 | 22 | # WWN folders 23 | WordWebNav--Word-HTML/ 24 | -------------------------------------------------------------------------------- /LICENSE.txt: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2021-present Jim Yuill 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # WordWebNav: create usable MS-Word web-pages 2 | 3 | [Link to WWN home-page and docs](https://jimyuill.com/software/www/WordWebNav/) 4 | 5 | ## **Overview** 6 | WordWebNav (WWN) is an app that converts a Microsoft-Word document to a usable web-page. 7 |
8 | WWN is free and open-source. 9 | 10 | WWN's web-page features are described in the screen-shots below. The features include: 11 | - A document-text pane, with adjustable width, and support for user-comments at the bottom 12 | - A navigation pane, with hyperlinks to the headings in the document-text pane 13 | - A header-bar for site-navigation, e.g., breadcrumbs 14 | - Fixes for common bugs in Word's HTML, such as: 15 | - Word-HTML's paragraphs span the browser's width, which makes them difficult to read. 16 | - Word-HTML's multi-level lists are misformatted 17 | 18 | ## **Screen-shots** 19 | - WWN web-page components: 20 | 21 | 22 |
23 |
24 | 25 | - The comments section, at the bottom of the document-text pane (Commento is used here): 26 |
27 | 28 | 29 | 30 | ## **Examples** 31 | - [A demo WWN web-page](https://jimyuill.com/software/www/WordWebNav/demo.html) was created from a Word-doc with typical features for recording technical info. 32 | - [The WWN author's web-site](https://jimyuill.com) is created mostly from Word documents and their WWN web-pages. 33 | 34 | ## **Description** 35 | Word is a powerful tool for recording technical info. Word can save a document in HTML format, but, for the web-page to be usable, additional features are needed, as well as fixes for bugs in Word's HTML. 36 | 37 | WWN can be used to create a personal web-site from Word documents. The WWN web-pages' user-interface is simple, and it provides the features needed for navigation and user-comments. And, of course, WWN web-pages can be used on any web-site, not just a personal web-site. 38 | 39 | WWN is relatively easy to use. First, a copy of the Word document is saved in Word HTML-format. Next, the user creates a parameter-file to specify the WWN web-page's files, header-bar contents, etc. WWN is then run to generate the WWN web-page. 40 | 41 | ## **Quality and support** 42 | WWN was created by a software professional, with over 25 years of R&D experience. Background research was performed to investigate Word's limitations and bugs, and to survey alternative solutions. The program code is heavily commented, to aid future development. Extensive testing was performed, including unit and function tests. System-tests were performed using a variety of technical Word-docs, downloaded from the Internet. The created web-pages were tested on the major desk-top browsers and OS's. The end-user documentation is intended to be complete and easy to use. 43 | 44 | WWN's author uses it to construct his personal web-site. He plans to fix the bugs he finds, and those reported by users, through 2023, and hopefully longer. 
45 | - [Report bugs and ideas for improvement](https://github.com/jimyuill/word-web-nav/issues) 46 | - [Contact the WWN author](https://jimyuill.com/about.html) -------------------------------------------------------------------------------- /assets/word_web_nav.css: -------------------------------------------------------------------------------- 1 | /* 2 | DESCRIPTION: CSS file for the WordWebNav web-page 3 | 4 | There are two parts to this file: 5 | * CSS IDs (e.g., #header-bar) for the web-page's 4 sections. 6 | * CSS classes for the text in these two web-page sections: header-bar and table-of-contents 7 | 8 | The CSS uses the jQuery and jQuery UI libraries. 9 | * https://jquery.com/ 10 | * https://jqueryui.com/ 11 | 12 | * The system documentation has additional info on its: installation, use, design 13 | and implementation (code). 14 | 15 | * Coding issue to look into: 16 | * See below, "Using CSS pixel-inches to size the screen:" 17 | 18 | MIT License, Copyright (c) 2021-present Jim Yuill 19 | 20 | */ 21 | 22 | /****************************************************************** 23 | This section specifies the CSS IDs (e.g., #header-bar) for the web-page's 4 sections: 24 | * header-bar, table-of-contents, splitter-bar, document 25 | 26 | Each section is defined in the HTML by a div. 27 | * The "header-bar" is the top section, and it's the width of the browser-window. 
28 | * The bottom three sections are adjacent to each other: 29 | * table-of-contents, splitter-bar, document 30 | * The bottom three sections are within a "container" div 31 | 32 | * An HTML div is made-up of these 4 parts, in this order, from inside to outside: 33 | * Content-area : padding : border : margin 34 | https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/The_box_model 35 | https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Box_Model 36 | https://developer.mozilla.org/en-US/docs/Web/CSS/Containing_block 37 | 38 | *******************************************************************/ 39 | 40 | /* Specify the header-bar section 41 | * It contains the header-bar text, e.g., the bread-crumb links, and the comments link 42 | */ 43 | #header-bar { 44 | /* calc() 45 | * In calc(), spaces are required around the "+" and "-" operators 46 | * calc() is not in older browsers 47 | https://stackoverflow.com/questions/16164736/height-calc100-not-working-correctly-in-css 48 | */ 49 | 50 | /* Specify the header-bar div's position 51 | https://developer.mozilla.org/en-US/docs/Web/CSS/position 52 | https://www.internetingishard.com/html-and-css/advanced-positioning/ 53 | https://www.w3schools.com/css/css_positioning.asp 54 | https://www.freecodecamp.org/news/how-to-use-the-position-property-in-css-to-align-elements-d8f49c403a26/ 55 | */ 56 | position: absolute; 57 | left:0; 58 | top:0; 59 | 60 | /* The div's background-color is the same as the background-color 61 | for MS-Word's ribbon */ 62 | background-color: #f3f2f1; 63 | 64 | /* 65 | Specify the header-bar div's parts and their sizes 66 | 67 | The header-bar div's size is: 68 | * Overall width: 100vw, the browser's full width 69 | * Overall height: 42px = 70 | 20px (content-area height) + 71 | 20px (top and bottom padding's height) + 72 | 2px (bottom-border height) 73 | */ 74 | 75 | /* Content area: 76 | * Contains the header-bar text 77 | * The content-area's size is specified by "height:" and 
"width:" 78 | */ 79 | height: 20px; /* Use the height of the header-bar's font (specified below) */ 80 | width: calc(100vw - 40px); /* The browser's full width, minus the width of the left and right padding*/ 81 | 82 | /* Padding: 83 | * Space between the content area and the margin 84 | * Specified by the "padding-*" property/value pairs. 85 | */ 86 | padding-top: 10px; 87 | padding-bottom: 10px; 88 | padding-left: 20px; 89 | padding-right: 20px; 90 | 91 | /* Only a bottom border is used. It's a thin black line. */ 92 | border-width: 0px; 93 | border-bottom-width: 2px; 94 | border-bottom-style: solid; 95 | border-bottom-color: black; 96 | 97 | /* The margin isn't used */ 98 | margin: 0px; 99 | 100 | } 101 | 102 | /* Specify the "container" section. 103 | * It contains three sub-sections: "table-of-contents", "splitter-bar", "document" 104 | */ 105 | #container { 106 | position: absolute; 107 | /* The div's position 108 | * The container is just below the header-bar. 109 | * So, "top:" specifies the header-bar's total height. 110 | */ 111 | top: 42px; 112 | left:0; 113 | 114 | width: 100vw; /* The browser's full width */ 115 | height: calc(100vh - 42px); /* The browser's full height, minus the header-bar's height */ 116 | 117 | /* The container does not have padding, margin, or a border */ 118 | padding: 0px; 119 | margin: 0px; 120 | border-width: 0px; 121 | 122 | background-color: white; 123 | } 124 | 125 | /* 126 | Specify the table-of-contents section 127 | */ 128 | #table-of-contents { 129 | /* The div's position within the container */ 130 | position: absolute; 131 | top: 0px; 132 | left: 0; 133 | 134 | /* The alignment process for the lower 3 sections: table-of-contents, splitter-bar, document 135 | 136 | For the lower-sections' div's, specifying the width as a percentage can result in 137 | misalignment of the sections. 
For example, "width: calc(25% - 20px);" 138 | 139 | In particular, instead of two sections being adjacent, they can be separated by an 140 | unintended 1px vertical line. 141 | * For example, between the table-of-contents and splitter-bar 142 | * Although small, the line can be noticeable, especially when it has a conspicuous color. 143 | 144 | The misalignment and how it can occur are described below. 145 | * It's not possible to fix the misalignment here in the CSS, 146 | unless a different layout technique, such as flexbox, is used. 147 | https://cruft.io/posts/percentage-calculations-in-ie/ 148 | https://stackoverflow.com/questions/49957813/percents-rounding-for-element-width-in-css 149 | * Instead, the misalignment is fixed in the web-page's JavaScript. 150 | 151 | The browser-window's lay-out is specified for each of its sections, e.g., table-of-contents. 152 | * For a section, its location is specified (e.g., "top: 0;", and "left: 40px;"). 153 | Also, the section's width and height are specified. 154 | * Ultimately those specifications are in units of px's, e.g., 20px. 155 | * A px is the width of the narrowest line that can be displayed with clean edges. 156 | 157 | Instead of specifying px's, a percentage of the screen can be specified. 158 | * For example, "width: 25%;" 159 | * For the percentage calculation, the result can have a fractional part, 160 | * e.g., 250.125 px 161 | * For fractional results, info on how browsers perform rounding is not readily available. 162 | * Further, a browser is not necessarily consistent in how it performs rounding, 163 | for the various CSS property/value pairs, 164 | * e.g., in "width: 25%;", the 25% might be rounded differently than in "left: 25%;" 165 | * Also, CSS provides no functions for rounding, e.g., no round or floor functions. 166 | 167 | * This section provides a hypothetical example of how px percentage-calculations can 168 | cause alignment problems. 
169 | * In the present file, for the div "table-of-contents", its width is specified as: 170 | "width: 25%;" 171 | * In this example, the 25% is calculated as 100.250px, and the browser could round to 172 | either 100px or 101px. 173 | * The splitter-bar section is just to the right of the table-of-contents section. 174 | * The splitter-bar's left coordinate is specified as: "left: 25%;" 175 | * The 25% is calculated as 100.250 px, and the browser could round to 176 | either 100px or 101px. 177 | * So, there could be a 1px gap between the table-of-contents and splitter-bar 178 | * For the table-of-contents, a width of 100px could be used, so the table-of-contents 179 | goes from 0px to 99px. 180 | * The splitter-bar could start at 101px 181 | * There would be a 1px gap at 100px, and it would have the default background-color. 182 | * If that color differs from the colors of table-of-contents and splitter-bar, 183 | then the gap can be noticeable, and look sloppy. 184 | 185 | * How this misalignment problem is solved here: 186 | * The CSS is written as if the px percentage-calculations always round to the same value. 187 | * Consequently, there could be 1px gaps between the lower three sections. 188 | * The sections' alignments are then fixed in the web-page's JavaScript code. 189 | * In particular, the alignment is fixed when the web-page is loaded in the browser. 190 | * Unlike CSS, JavaScript has math functions that can be used to get proper alignment. 191 | * Also, in the CSS, some percentage calculations are for 100%, e.g., "height: calc(100% - 10px);" 192 | * When 100% is used, no rounding is performed, so there are no alignment problems 193 | */ 194 | 195 | /* The div's size is: 196 | * Width: 197 | * 25% of the container's full width, minus the padding-left width. 198 | * This will be the width when the page is loaded. 199 | After the page is loaded, the width will be changed when the splitter-bar is moved. 
200 | * A negative width is possible, for a browser window with width less than 80px. 201 | * If this occurs, the web-page's javascript will set this width to 0px. 202 | * Height: The container's full height, minus the padding-top width 203 | */ 204 | width: calc(25% - 20px); 205 | height: calc(100% - 24px); 206 | 207 | /* The padding-area is used to create a space on the left and top */ 208 | padding-top: 24px; /* Use 1/4" top margin, based on CSS pixel-inch of 96px. 96/4=24 */ 209 | padding-left: 20px; 210 | padding-bottom: 0px; 211 | padding-right: 0px; 212 | 213 | /* The border and margin aren't used */ 214 | border-width: 0px; 215 | margin: 0px; 216 | 217 | /* Specify the scroll-bars 218 | * auto: the scroll-bar is provided if needed 219 | */ 220 | /* How the vertical scroll-bar is placed 221 | * This is what I could determine, from my experiments: 222 | * The scroll-bar is inserted between the right-border and the right-padding. 223 | * Space for the scroll-bar is obtained by reducing the content-area 224 | */ 225 | overflow-y: auto; 226 | /* If the text is wider than the section's width: 227 | * The x-axis scroll-bar will be inserted, and 228 | the text will not wrap. 229 | * The overflowing text can be seen by scrolling to the right. 
230 | https://developer.mozilla.org/en-US/docs/Web/CSS/white-space 231 | https://developer.mozilla.org/en-US/docs/Web/CSS/overflow 232 | */ 233 | white-space: nowrap; 234 | overflow-x: auto; 235 | 236 | /* The div's background-color is the same as the background-color in MS-Word's 237 | table-of-contents */ 238 | background-color: #e6e6e6; 239 | 240 | } 241 | 242 | /* Specify the splitter-bar section 243 | */ 244 | #splitter-bar { 245 | /* The div's position within the container 246 | * This is the splitter-bar's position when the page is loaded, before being moved by the user 247 | */ 248 | position: absolute; 249 | top: 0px; 250 | left: 25%; 251 | 252 | /* The splitter-bar's size */ 253 | height: 100%; 254 | width: 12px; 255 | 256 | padding: 0px; 257 | margin: 0px; 258 | border-width: 0px; 259 | 260 | /* When the pointer is over the splitter-bar, use a move-type pointer */ 261 | cursor: move; 262 | 263 | /* Specify the splitter-bar's appearance: 264 | * It's light black, with 2 white bars in the center. 265 | * The white bars are provided via word_web_nav_splitter_bar_icon.png 266 | */ 267 | background: url("./word_web_nav_splitter_bar_icon.png") center center no-repeat #444444; 268 | } 269 | 270 | /* Specify the document section. 271 | * This section contains the document text. 272 | */ 273 | #document { 274 | /* The div's position within the container */ 275 | position: absolute; 276 | top: 0px; 277 | 278 | /* Make the left edge adjacent to the splitter-bar: 279 | * Total width of the table-of-contents: 25% 280 | * Total width of the splitter-bar: 12px 281 | */ 282 | left: calc(25% + 12px); 283 | 284 | /* Specify the content-area's width: 285 | * 75%: the container's total-width minus the table-of-contents' total width. 
286 | * 12px: the splitter-bar's total-width 287 | * 48px + 48px: the width of the document's padding-right and padding-left 288 | * 2px: the document's border-right-width 289 | 290 | A negative width is possible, for a very skinny browser window. 291 | * If this occurs, the web-page's javascript will set this width to 0px. 292 | */ 293 | width: calc(75% - (12px + 48px + 48px + 2px)); 294 | 295 | /* Specify the content-area's height. */ 296 | height: calc(100% - 48px); /* 48px gives a half-inch (in CSS pixel-inches) top margin */ 297 | 298 | padding-top: 48px; 299 | padding-right: 48px; /* 48px gives a half-inch (in CSS pixel-inches) margin */ 300 | padding-bottom: 0px; 301 | padding-left: 48px; 302 | 303 | border-width: 0px; 304 | border-right-width: 2px; /* Thin black vertical line */ 305 | border-right-style: solid; 306 | border-right-color: black; 307 | 308 | margin: 0; 309 | 310 | /* Limit the content-area's width to a reasonable length for reading text. 311 | * MS Word's HTML does not limit the text's line-length. 312 | * So, by default, text lines can be as wide as the whole content-area 313 | * For a wide browser-window, such long text-lines are difficult to read. 314 | * max-width is used here to limit the text's line-length to be no more than 720px. 315 | * For pictures, their displayed width is not limited by max-width. 316 | * However, if a picture is more than 720px, a horizontal scroll-bar is provided 317 | to see the whole image. 318 | 319 | * Use a width comparable to 8.5"-wide paper. 320 | * The calculations here are in CSS pixel-inches, which is 96px. 321 | * The side margins are 1/2" (see above). 322 | 8.5"-.5"-.5" = 7.5" 323 | 7.5 * 96 = 720px 324 | 325 | * Using CSS pixel-inches to size the screen: 326 | * I'm not sure if my use of CSS pixel-inches is the correct way to size the screen 327 | and this section. 
328 | * Sources to look into about this: 329 | * https://hacks.mozilla.org/2013/09/css-length-explained/ 330 | * https://www.freecodecamp.org/news/css-unit-guide 331 | * https://stackoverflow.com/questions/3341485/how-to-make-a-html-page-in-a4-paper-size-pages 332 | * https://developer.mozilla.org/en-US/docs/Web/CSS/@page/size 333 | 334 | */ 335 | max-width: 720px; 336 | 337 | /* Scroll-bars 338 | */ 339 | overflow-y: auto; 340 | overflow-x: auto; 341 | 342 | background-color: white; 343 | } 344 | 345 | 346 | /*********************************************************** 347 | CSS classes for formatting text in the sections: header-bar, table-of-contents 348 | 349 | ************************************************************/ 350 | 351 | /* For the header-bar's text, the font used is similar to typical h3 headings. 352 | 353 | ** font-size: 354 | * The font size is set to be similar to the typical h3. 355 | * Typical h3: font-size: 1.17em 356 | * Typical em: 16px 357 | * 1.17em = 1.17*16px = 18.72px 358 | * The font-size is specified here in px units. 359 | * This will also be the font height. 360 | * px units are used here because px units are used in positioning 361 | the div's, e.g., in specifying the div's height and width. 362 | https://www.w3schools.com/tags/tag_hn.asp 363 | https://developer.mozilla.org/en-US/docs/Web/CSS/font-size 364 | https://stackoverflow.com/questions/5410066/what-are-the-default-font-sizes-in-pixels-for-the-html-heading-tags-h1-h2 365 | 366 | ** font-family: 367 | * A web-safe font-family is used. 368 | https://www.w3schools.com/w3css/w3css_fonts.asp 369 | https://developer.mozilla.org/en-US/docs/Web/CSS/font-family 370 | 371 | ** line-height: 372 | * The line-height is set to be the same as the font-size. 373 | * The header-bar section includes top and bottom padding, and 374 | that padding provides spacing for the header-bar's text-line. 
375 | https://developer.mozilla.org/en-US/docs/Web/CSS/line-height 376 | 377 | */ 378 | 379 | /* Declarations common to all text in the header-bar 380 | */ 381 | .headerBarText { 382 | font-size: 20px; 383 | /* font-weight: 400 is the same as normal, and 700 is the same as bold 384 | https://www.w3schools.com/cssref/pr_font_weight.asp 385 | */ 386 | font-weight: 450; 387 | font-family: Arial, Helvetica, sans-serif; 388 | line-height: 1; 389 | color: black; /* The default color, used for the breadcrumb "/" characters */ 390 | } 391 | 392 | /* CSS declarations for <a> tags in the header-bar 393 | */ 394 | a.headerBarHref { 395 | color: blue; /* Link color does not change if it's been clicked on */ 396 | text-decoration: none; /* Underlines are not specified on the links */ 397 | } 398 | 399 | /* These classes are used for the links displayed in the table-of-contents. 400 | * Those links are formatted by MS Word, by using these MS-Word CSS classes: 401 | * a:link and span.MsoHyperlink 402 | * That formatting makes the links underlined, 403 | and the links turn purple after they've been clicked-on. 404 | * The following classes are used to format the links so that they are not 405 | underlined, and so that they are always blue. 406 | * In the HTML for the table-of-contents, the links need to be updated 407 | to reference these classes. How the links are updated is described 408 | in create_web_page.py. 
409 | 410 | https://developer.mozilla.org/en-US/docs/Learn/CSS/Styling_text/Styling_links 411 | https://www.w3schools.com/css/css_link.asp 412 | */ 413 | a.tocAnchor, span.tocAnchor { 414 | color: blue; /* Link color does not change if it's been clicked on */ 415 | text-decoration: none; /* Underlines are not specified on the links */ 416 | } 417 | -------------------------------------------------------------------------------- /assets/word_web_nav.js: -------------------------------------------------------------------------------- 1 | /* 2 | DESCRIPTION: JavaScript functions for WordWebNav's web-page 3 | 4 | There are three functions. They are run, respectively, when: 5 | * The HTML file is loaded 6 | * The splitter-bar is dragged 7 | * The web-browser is resized 8 | 9 | The code uses the jQuery and jQuery UI libraries. 10 | * https://jquery.com/ 11 | * https://jqueryui.com/ 12 | 13 | The system documentation has additional info on its: installation, use, design 14 | and implementation (code). 15 | 16 | MIT License, Copyright (c) 2021-present Jim Yuill 17 | 18 | */ 19 | 20 | 21 | /* 22 | This function runs when the HTML file is loaded. 23 | 24 | This script fixes possible alignment problems in the web-page's div's. 25 | * The div's sizing and alignment are specified in an accompanying CSS file. 26 | * There are limitations in CSS that can result in alignment problems for the div's. 27 | * Those limitations are described in the accompanying CSS file. 28 | 29 | The widths calculated here are the same amounts as those calculated in the CSS file. 
30 | * There are accuracy limitations in CSS, and it may have calculated the 31 | widths inaccurately 32 | 33 | https://stackoverflow.com/questions/2926227/how-to-do-jquery-code-after-page-loading 34 | https://stackoverflow.com/questions/8396407/jquery-what-are-differences-between-document-ready-and-window-load 35 | */ 36 | $(window).on('load', function() { 37 | 38 | var // Width of the container div 39 | 40 | /* 41 | * parseInt(): second parameter is the radix (base) for the number returned, i.e., base 10 42 | * https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt 43 | 44 | * .width(): 45 | * Does not include the margin nor border widths (confirmed). 46 | * It probably does not include the padding width (not confirmed; it is 0 in the container div). 47 | * It includes the content width (confirmed). 48 | * https://api.jquery.com/width/ 49 | */ 50 | totalWidth = parseInt($('#container').width(), 10), 51 | 52 | /* For the table-of-contents div, this is comparable to the CSS calculation in: 53 | * width: calc(25% - 20px); 54 | In addition, the amount is corrected to not be negative 55 | */ 56 | tocWidth = Math.max(0, (Math.floor(totalWidth * .25) - 20)), 57 | 58 | /* For the table-of-contents div, its total-width is: 59 | * (content-area width) + (padding-left width) 60 | */ 61 | tocTotalWidth = tocWidth + 20, 62 | 63 | /* For the splitter-bar div, this is comparable to the CSS calculation in: 64 | * left: 25%; 65 | */ 66 | splitterBarLeft = tocTotalWidth, 67 | 68 | /* For the document div, this is comparable to the CSS calculation in: 69 | * left: calc(25% + 12px); 70 | */ 71 | documentLeft = tocTotalWidth + 12, 72 | 73 | /* For the document div, this is comparable to the CSS calculation in: 74 | * width: calc(75% - (12px + 48px + 48px + 2px)); 75 | In addition, the size is corrected to not be negative 76 | */ 77 | documentWidth = Math.max(0, ((totalWidth - tocTotalWidth) - (12 + 48 + 48 + 2))); 78 | 79 | /* Set the new values, 
for the CSS IDs and declarations 80 | */ 81 | $('#table-of-contents').css({width : tocWidth}); 82 | $('#splitter-bar').css({left : tocTotalWidth}); 83 | $('#document').css({left : documentLeft}); 84 | $('#document').css({width : documentWidth}); 85 | }); 86 | 87 | 88 | /* This function enables the splitter-bar to be dragged, to resize the 89 | table-of-contents and document sections. 90 | */ 91 | $(function(){ 92 | var // Width of the container div 93 | totalWidth = parseInt($('#container').width(), 10), 94 | 95 | /* For the container div, gets the left and top positions 96 | */ 97 | /* .offset(): 98 | * From the API doc: "Gets the current coordinates of the first element in the set of 99 | matched elements, relative to the document." 100 | * Returns 2 values: .left, .top 101 | * These positions are relative to the whole browser window 102 | * offset.left is 0 103 | * offset.top is 42 (height of the header-bar div) 104 | * https://api.jquery.com/offset/ 105 | */ 106 | offset = $('#container').offset(), 107 | 108 | /* This function is called after the splitter-bar has been moved 109 | */ 110 | splitter = function(event, ui){ 111 | /* After the splitter-bar is moved, ui.position.left contains the location 112 | of the splitter-bar's left edge 113 | */ 114 | var splitterBarLeft = parseInt(ui.position.left, 10), 115 | 116 | // The table-of-contents total-width is the same as splitterBarLeft 117 | tocTotalWidth = splitterBarLeft, 118 | 119 | /* For the table-of-contents div, its total-width is: 120 | * (content-area width) + (padding-left width) 121 | * The padding-left width is 20 122 | tocWidth is the content-area width 123 | * Math.max(0, ...) 
ensures the tocWidth is not negative 124 | */ 125 | tocWidth = Math.max(0, (tocTotalWidth - 20)), 126 | 127 | /* For the document div, its left position is: 128 | * (table-of-contents total-width) + (splitter-bar width) 129 | * The splitter-bar width is 12 130 | */ 131 | documentLeft = tocTotalWidth + 12, 132 | 133 | /* For the document div, its content-area width is specified 134 | in the CSS file 135 | * Math.max(0, ...) ensures the documentWidth is not negative 136 | */ 137 | documentWidth = Math.max(0, ((totalWidth - tocTotalWidth) - (12 + 48 + 48 + 2))) 138 | ; // END OF: var section 139 | 140 | /* Set the new values, for the CSS IDs and declarations 141 | */ 142 | $('#table-of-contents').css({width : tocWidth}); 143 | $('#document').css({left : documentLeft}); 144 | $('#document').css({width : documentWidth}); 145 | } // END OF: function(event, ui) 146 | ; // END OF: var section 147 | 148 | // https://jqueryui.com/draggable/ 149 | $('#splitter-bar').draggable({ 150 | // Constrains the splitter-bar's movement to the x-axis 151 | axis : 'x', 152 | 153 | // Specifies the left and right boundaries for the splitter-bar 154 | // * A boundary is specified for the splitter-bar's left-edge 155 | // * A boundary is specified by the pair: left-position and top-position 156 | containment : [ 157 | // Left-side boundary for the splitter-bar 158 | 20, // Left-position. Prevent tocWidth from becoming negative 159 | offset.top, 160 | 161 | // Right-side boundary for the splitter-bar 162 | (totalWidth - (12 + 48 + 48 + 2)), // Left-position. Prevent documentWidth from becoming negative 163 | offset.top 164 | ], 165 | 166 | // Specifies the function called for dragging 167 | drag : splitter 168 | }); // END OF: ('#splitter-bar').draggable 169 | 170 | }); // END OF: $(function(){ 171 | 172 | 173 | /* This function is run when the browser-window is resized. 174 | * It reloads the web-page. 175 | * Reloading resets the sizing and the alignment, for the web-page's sections. 
176 | 177 | https://stackoverflow.com/questions/14915653/refresh-page-on-resize-with-javascript-or-jquery 178 | https://stackoverflow.com/questions/5836779/how-can-i-refresh-the-screen-on-browser-resize 179 | https://stackoverflow.com/questions/29546539/refresh-page-when-container-div-is-resized 180 | */ 181 | $(window).bind('resize', function(e) 182 | { 183 | if (window.RT) clearTimeout(window.RT); 184 | window.RT = setTimeout(function() 185 | { 186 | this.location.reload(false); /* false to get page from cache */ 187 | }, 100); 188 | }); -------------------------------------------------------------------------------- /assets/word_web_nav_splitter_bar_icon.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/assets/word_web_nav_splitter_bar_icon.png -------------------------------------------------------------------------------- /createwebpage/$package-contents.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/createwebpage/$package-contents.docx -------------------------------------------------------------------------------- /createwebpage/__init__.py: -------------------------------------------------------------------------------- 1 | # This code is needed to import functions from another file in the same directory 2 | # * Example, from create_web_page.py: 3 | # > from load_input_parameter_file import load_input_parameter_file 4 | # * See: https://stackoverflow.com/a/49375740 5 | import os, sys; sys.path.append(os.path.dirname(os.path.realpath(__file__))) -------------------------------------------------------------------------------- /createwebpage/construct_html_sections.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ################ 3 | This file 
contains the function: construct_html_sections() 4 | 5 | The function is called by: create_web_page_core() in create_web_page.py 6 | 7 | MIT License, Copyright (c) 2021-present Jim Yuill 8 | ################ 9 | ''' 10 | 11 | import re 12 | import sys 13 | import os 14 | # Specifies the YAML keys for the input parameter-file 15 | from input_parameter_file_keys import * 16 | 17 | # This library needs to have been installed by the user 18 | try: 19 | # BeautifulSoup: pip install beautifulsoup4 20 | from bs4 import BeautifulSoup, NavigableString 21 | except ImportError as e: 22 | print("") 23 | print("ERROR. Could not import a required Python module.") 24 | print(" The installation instructions specify the required modules.") 25 | print(" Import-error description:") 26 | print("") 27 | print(e) 28 | print("") 29 | sys.exit() 30 | 31 | def construct_html_sections(loaded_parms, jinja_template_variables, head, body): 32 | ''' 33 | Description: 34 | * For the WWN web-page, constructs these output HTML-sections: 35 | * The HTML <head>-section's tags and attributes 36 | * The HTML for the web-page header-bar 37 | * The HTML for the document-text's trailer 38 | * The WWN table-of-contents (TOC), if the input HTML has a TOC 39 | 40 | Parameters: 41 | * loaded_parms : a dictionary with the input parameter-file's contents 42 | * jinja_template_variables : the dictionary of variables for the jinja template 43 | * head : a BeautifulSoup object that holds the <head> element from the input HTML-file 44 | * body : a BeautifulSoup object that holds the <body> element from the input HTML-file 45 | 46 | Return: 47 | * 1, None : error 48 | 49 | * 0, body_inner_html : OK 50 | * The objects returned 51 | * Objects passed as parameters: 52 | * loaded_parms : not changed 53 | * jinja_template_variables : values added for about 11 keys 54 | * head : not changed 55 | * body : it just has the <body> opening and closing tags. 56 | * The HTML between those tags is removed.
57 | 58 | * body_inner_html 59 | * The HTML from the <body> section in Word's HTML, 60 | but with these parts removed: 61 | * <body> opening and closing tags 62 | * The table-of-contents 63 | ''' 64 | 65 | ''' 66 | ################## 67 | Constants 68 | ################## 69 | ''' 70 | 71 | # Text added to the generated HTML 72 | # 73 | # * For the web-page header-bar, specifies the separator between 74 | # breadcrumbs, e.g., the " / " in: Home / Topic-1 / Topic-1.1 75 | BREAD_CRUMB_SEPARATOR = " / " 76 | # For the document-text trailer, specifies the anchor name. 77 | # * The anchor name can be linked-to from the web-page header-bar. 78 | DOCUMENT_TEXT_TRAILER_ANCHOR_NAME = "word_web_nav_document_text_trailer" 79 | 80 | # Names of WordWebNav files that are opened or referenced 81 | PAGE_STRUCTURE_CSS_FILE_NAME = "word_web_nav.css" 82 | JS_FILE_NAME = "word_web_nav.js" 83 | 84 | ''' 85 | CSS classes that are referenced in some of the HTML that is generated. 86 | * The classes are defined in the CSS-file whose name is specified above, 87 | in the variable: PAGE_STRUCTURE_CSS_FILE_NAME 88 | ''' 89 | CSS_HEADER_BAR_TEXT = "headerBarText" 90 | CSS_HEADER_BAR_HREF = "headerBarHref" 91 | 92 | ''' 93 | ############################# 94 | For the WWN web-page, construct the HTML <head>-section's tags and attributes 95 | * This data includes whole HTML-tags, and attributes used in HTML tags. 96 | * This data is put in the dictionary jinja_template_variables[] 97 | * The variables will be used later in the jinja-template, in its HTML <head>-section. 98 | 99 | The WWN web-page's HTML <head>-section is constructed from two sources: 100 | * The <head> section in the input Word-HTML file 101 | * Data provided in the input parameter-file 102 | ############################# 103 | ''' 104 | 105 | ''' 106 | Get the HTML within the <head> section of the input Word-HTML 107 | 108 | The elements within that section include: 109 | * tags 110 | * An optional tag 111 | * A <style> section with CSS statements: <style><!-- ...
--></style> 112 | * An optional <script> section: <script><!-- ... --></script> 113 | ''' 114 | 115 | # * Get the <head> section in the BeautifulSoup object head, and 116 | # convert the <head> section to HTML text format 117 | html_string = head.decode(formatter='html') 118 | 119 | # Remove the tags <head> and </head> 120 | # regexp "\A" matches beginning of the whole string 121 | regex = r"\A(^\s*<head>\s*)" 122 | substitution = "" 123 | html_string, substitution_count = re.subn(regex, substitution, html_string, 0, re.M) 124 | if (substitution_count != 1): 125 | print("") 126 | print("ERROR. The expected HTML <head> tag was not found.") 127 | return 1, None 128 | 129 | # regexp "\Z" matches end of the whole string 130 | regex = r"(\s*<\/head>\s*$)\Z" 131 | html_string, substitution_count = re.subn(regex, substitution, html_string, 0, re.M) 132 | if (substitution_count != 1): 133 | print("") 134 | print("ERROR. The expected HTML </head> tag was not found.") 135 | return 1, None 136 | html_string += "\n" 137 | 138 | # Jinja will be used to put the head-section contents in the output HTML 139 | jinja_template_variables['word_head_section_contents'] = html_string 140 | 141 | 142 | ''' 143 | Get the data for the output HTML <head> section, from the input parameter-file 144 | 145 | The data is specified under the key "html_head_section:" 146 | * An example: 147 | 148 | html_head_section: 149 | title: Sys-Admin How-To Info 150 | description: Solutions for my various sys-admin tasks 151 | additional_html: <link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32" /> 152 | ''' 153 | 154 | # From the input parameter-file, get the WWN version that is specified 155 | # * Jinja will be used to put the WWN version in the output HTML <head> section 156 | jinja_template_variables['version'] = loaded_parms[YML_KEY_REQUIRED][YML_KEY_VERSION] 157 | 158 | # Construct the HTML tag: <title>. . .</title>
159 | if ( (YML_KEY_HTML_HEAD_SECTION in loaded_parms) and 160 | (YML_KEY_TITLE in loaded_parms[YML_KEY_HTML_HEAD_SECTION]) ): 161 | key_title_value = loaded_parms[YML_KEY_HTML_HEAD_SECTION][YML_KEY_TITLE] 162 | title_tag = "<title>" + key_title_value + "</title>" 163 | else: 164 | title_tag = "" 165 | # Jinja will be used to put title_tag in the output HTML 166 | jinja_template_variables['title_tag'] = title_tag 167 | 168 | # Construct the HTML tag: " 174 | else: 175 | meta_description_tag = "" 176 | # Jinja will be used to put meta_description_tag in the output HTML 177 | jinja_template_variables['meta_tag_with_description'] = meta_description_tag 178 | 179 | # From the input parameter-file, get the additional_html 180 | if ( (YML_KEY_HTML_HEAD_SECTION in loaded_parms) and 181 | (YML_KEY_ADDITIONAL_HTML in loaded_parms[YML_KEY_HTML_HEAD_SECTION]) ): 182 | key_additional_html_value = loaded_parms[YML_KEY_HTML_HEAD_SECTION][YML_KEY_ADDITIONAL_HTML] 183 | else: 184 | key_additional_html_value = "" 185 | # Jinja will be used to put key_additional_html_value in the output HTML 186 | jinja_template_variables['additional_html'] = key_additional_html_value 187 | 188 | # From the input parameter-file, get page_structure_css_file_path 189 | key_scripts_directory_url_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_SCRIPTS_DIRECTORY_URL] 190 | page_structure_css_file_path = os.path.join(key_scripts_directory_url_value, PAGE_STRUCTURE_CSS_FILE_NAME) 191 | # Jinja will be used to put page_structure_css_file_path in the output HTML 192 | jinja_template_variables['page_structure_css_file_path'] = page_structure_css_file_path 193 | 194 | # From the input parameter-file, get web_page_js_file_path 195 | js_file_path = os.path.join(key_scripts_directory_url_value, JS_FILE_NAME) 196 | # Jinja will be used to put js_file_path in the output HTML 197 | jinja_template_variables['web_page_js_file_path'] = js_file_path 198 | 199 | ''' 200 | ############################# 201 | Process the
 Word-HTML's <body> tag: 202 | 203 | Get the opening-tag: 204 | * For the <body> tag, the opening-tag is the part 205 | * For the input Word-HTML, its opening-tag will be used in the output HTML. 206 | * Get that opening-tag, in HTML text format, and put it in jinja_template_variables, 207 | for use later in the jinja template. 208 | 209 | Get the HTML within the <body>'s opening and closing tags 210 | * This does not include the opening and closing tags. 211 | * Get the HTML in a BeautifulSoup object, for use later in creating the 212 | output HTML. 213 | ############################# 214 | ''' 215 | 216 | # The variable "body" is a BeautifulSoup object that contains the input HTML's <body> section. 217 | # The following code will: 218 | # * Create a BeautifulSoup object "body_inner_html" 219 | # * For "body", the HTML within the opening and closing tags will be moved to "body_inner_html" 220 | # * "body" will then have just the opening and closing tags 221 | body_inner_html = BeautifulSoup("", 'html.parser') 222 | # From the BS docs, "A tag’s children are available in a list called .contents" 223 | body_contents_list = body.contents 224 | for i in range(0, len(body_contents_list)): 225 | # .append moves the HTML element from body to body_inner_html 226 | # * The BeautifulSoup doc does not make it clear that a move occurs here. 227 | # * Note body_inner_html and body are two different trees 228 | body_inner_html.append(body_contents_list[0]) 229 | 230 | # Convert the opening and closing tags to HTML text format 231 | body_tags = body.decode(formatter='html') 232 | # Extract the opening tag, by removing the closing tag 233 | regex = r"(\s*<\/body\s*>\s*$)\Z" 234 | substitution = "" 235 | body_opening_tag, substitution_count = re.subn(regex, substitution, body_tags, 0, re.M) 236 | if (substitution_count != 1): 237 | print("") 238 | print("ERROR.
 The expected HTML </body> tag was not found.") 239 | return 1, None 240 | 241 | # Jinja will be used to put body_opening_tag in the output HTML 242 | jinja_template_variables['body_opening_tag'] = body_opening_tag 243 | 244 | ''' 245 | ################### 246 | Construct the HTML for the web-page header-bar 247 | 248 | * The web-page header-bar can be used for navigation breadcrumbs and for other text or URLs. 249 | * The web-page header-bar is different from the HTML <head> section. 250 | 251 | The HTML is put in jinja_template_variables, for use later in the jinja template. 252 | ################### 253 | ''' 254 | ''' 255 | The input parameter-file specifies the contents of the web-page header-bar. 256 | * The contents are put in an HTML table, which is put in the "header-bar" div. 257 | * The table has no borders and one row. 258 | * There is a table-cell for each "section" specified in the input parameter-file 259 | * The WWN user-guide describes the header-bar's sections 260 | * The sections' table-cells are of equal size. 261 | * The text within a cell is aligned as specified by the "contents_alignment" key, 262 | in the input parameter-file 263 | ''' 264 | header_bar_table = "" 265 | # Test if the parameter-file has the key "header_bar:" 266 | if (YML_KEY_HEADER_BAR in loaded_parms): 267 | 268 | # Generate the opening-tag for the table 269 | # * By default, the columns are made equal-width.
270 | # * Table attribute for the table to span the whole page-width: width="100%" 271 | # * https://stackoverflow.com/questions/539309/html-table-span-entire-width 272 | # * Table style value for table to be centered on page: 273 | # * "margin-left:auto;margin-right:auto;" 274 | # * https://www.w3schools.com/howto/howto_css_table_center.asp 275 | # * Table style value to create a table with no border nor margin: 276 | # * "border-collapse:collapse;" 277 | # * https://stackoverflow.com/questions/16427903/remove-all-padding-and-margin-table-html-and-css 278 | # * Table style values to format over-flow text as hidden, with ellipses displayed (...) 279 | # * "table-layout:fixed;" 280 | # * Also need this table attribute: width="100%" 281 | # * https://stackoverflow.com/questions/43561602/add-text-overflow-ellipsis-to-table-cell 282 | # * https://developer.mozilla.org/en-US/docs/Web/CSS/table-layout 283 | header_bar_table += '\n' 285 | # Generate table-row opening tag 286 | header_bar_table += '\n' 287 | 288 | # Construct the opening tag for the table-cells (\n" 364 | # Generate the closing-tags for the table-row and table 365 | header_bar_table += "\n
) 289 | # * The table-cell styles include "text-align". 290 | # * The "text-align" value is specified in the parameter-file, e.g., "left", "centered", etc. 291 | # * Table-cell style values that create table-cells with no margin nor padding: 292 | # * "padding:0;margin:0;" 293 | # * https://stackoverflow.com/questions/16427903/remove-all-padding-and-margin-table-html-and-css 294 | # * Table-cell style values needed to format over-flow text as hidden, with ellipses displayed (...) 295 | # * "text-overflow:ellipsis;overflow:hidden;white-space:nowrap;" 296 | td_opening_tag = '' 298 | 299 | # The header-bar contents are specified in the parameter-file, under the key YML_KEY_HEADER_BAR 300 | # * Under YML_KEY_HEADER_BAR, there are one or more "sections", e.g., YML_KEY_BREADCRUMBS 301 | # * A table-cell is created for each section. 302 | # * The table-cell's contents are specified in the parameter-file. 303 | # 304 | # Loop for each section under the key YML_KEY_HEADER_BAR 305 | for section in loaded_parms[YML_KEY_HEADER_BAR]: 306 | # Generate the opening-tag for the table cell 307 | if (YML_KEY_CONTENTS_ALIGNMENT in section[YML_KEY_SECTION]): 308 | table_cell_alignment = section[YML_KEY_SECTION][YML_KEY_CONTENTS_ALIGNMENT] 309 | else: 310 | table_cell_alignment = "left" 311 | header_bar_table += td_opening_tag.format(table_cell_alignment) 312 | 313 | # * For the section, determine its content's data-type, e.g., "breadcrumbs".
314 | # * And, generate the content's HTML 315 | contents = section[YML_KEY_SECTION][YML_KEY_CONTENTS] 316 | # Data-type "breadcrumbs" 317 | if YML_KEY_BREADCRUMBS in contents: 318 | breadcrumbs_html = "" 319 | # Loop for each "hyperlink" 320 | for breadcrumb in section[YML_KEY_SECTION][YML_KEY_CONTENTS][YML_KEY_BREADCRUMBS]: 321 | # Construct the breadcrumb: the anchor tag () and the breadcrumb-separator 322 | breadcrumbs_html += f"" 324 | breadcrumbs_html += f"{breadcrumb[YML_KEY_HYPERLINK][YML_KEY_TEXT]}" 325 | breadcrumbs_html += BREAD_CRUMB_SEPARATOR 326 | 327 | # Remove the last separator 328 | separator_length = len(BREAD_CRUMB_SEPARATOR) 329 | breadcrumbs_html = breadcrumbs_html[0:-separator_length] 330 | header_bar_table += breadcrumbs_html 331 | 332 | # Data-type "hyperlink" 333 | elif YML_KEY_HYPERLINK in contents: 334 | # Construct the anchor tag () 335 | hyperlink_dict = section[YML_KEY_SECTION][YML_KEY_CONTENTS][YML_KEY_HYPERLINK] 336 | header_bar_table += f"" 338 | header_bar_table += f"{hyperlink_dict[YML_KEY_TEXT]}" 339 | 340 | # Data-type "html" 341 | elif YML_KEY_HTML in contents: 342 | header_bar_table += section[YML_KEY_SECTION][YML_KEY_CONTENTS][YML_KEY_HTML] 343 | 344 | # Data-type "text" 345 | elif YML_KEY_TEXT in contents: 346 | header_bar_table += section[YML_KEY_SECTION][YML_KEY_CONTENTS][YML_KEY_TEXT] 347 | 348 | # Data-type "empty" 349 | elif YML_KEY_EMPTY in contents: 350 | pass 351 | 352 | else: 353 | # * This case is a system error. 354 | # * The parameter-file's schema-definitions should not have allowed this data-type. 355 | # Cerberus should have flagged the data-type as an error. 356 | print("") 357 | print("ERROR. Input parameter-file has an unrecognized key under:") 358 | print(f" {YML_KEY_HEADER_BAR}: {YML_KEY_SECTION}: {YML_KEY_CONTENTS}:") 359 | print("") 360 | return 1, None 361 | 362 | # Generate the closing-tag for the table cell 363 | header_bar_table += "
\n" 366 | 367 | # Jinja will be used to put header_bar_table in the output HTML 368 | jinja_template_variables['header_bar'] = header_bar_table 369 | 370 | ''' 371 | ################### 372 | Construct the HTML for the document-text's trailer 373 | ################### 374 | ''' 375 | # Test if the trailer was specified in the parameter-file 376 | trailer_html = "" 377 | if (YML_KEY_DOCUMENT_TEXT_TRAILER in loaded_parms): 378 | # Create horizontal line 379 | trailer_html += "\n" 380 | trailer_html += "



\n" 381 | # * Create an anchor tag. Use the name attribute, with the value in DOCUMENT_TEXT_TRAILER_ANCHOR_NAME. 382 | # * A hyperlink in the web-page header-bar can use this name to link to the document-text-trailer. 383 | trailer_html += "
\n" 384 | # Get the document-text-trailer's HTML that was specified in the parameter file 385 | trailer_html += f"\n" 386 | trailer_html += loaded_parms[YML_KEY_DOCUMENT_TEXT_TRAILER] 387 | # Jinja will be used to put trailer_html in the output HTML 388 | jinja_template_variables['document_text_trailer'] = trailer_html 389 | 390 | ''' 391 | ####################################################### 392 | Creates the WWN table-of-contents (TOC), if the input HTML has a TOC 393 | 394 | * A TOC will be processed only if it's at the beginning of the web-page. 395 | * Each TOC entry is an HTML paragraph (

) 396 | 397 | The TOC entries are initially in the BeautifulSoup object "body_inner_html". 398 | * The TOC will be displayed in a different web-page frame than the document body. 399 | * So, the TOC entries are removed from "body_inner_html". 400 | 401 | For each TOC entry: 402 | * The TOC entry's HTML is edited to use WordWebNav's TOC style 403 | * The TOC entry's HTML is appended to a string variable that holds the TOC. 404 | * That string variable will be used later in the jinja template, to 405 | put the TOC HTML in the web-page section:

406 | * That section is the web-page frame that displays the TOC 407 | ####################################################### 408 | ''' 409 | # For each TOC entry, the Word-HTML looks like this: 410 | # *

Heading Name Goes Here

411 | 412 | # For the input Word-HTML, in its section, get all of the HTML paragraphs. 413 | # * .find_all('p') returns a list of paragraphs, and each paragraph is a BeautifulSoup object. 414 | all_paragraphs = body_inner_html.find_all('p') 415 | 416 | ''' 417 | Skip any initial paragraphs that only contain white-space 418 | ''' 419 | # * Note: 420 | # * There can be a TOC entry with just white-space, e.g., 421 | #

 

422 | # * Such a TOC entry is not useful. It would almost certainly have been created by mistake. 423 | # * If a TOC starts with such TOC entries, they will also be removed by this code. 424 | regex = r'^\s*$' 425 | empty_paragraphs = [] 426 | for paragraph in all_paragraphs: 427 | # Test if empty paragraph 428 | # * paragraph.string will convert HTML entities to Unicode, e.g.,   is converted to \xa0 429 | # * paragraph.string returns None if the element has children (in which case, the paragraph isn't empty) 430 | paragraph_text = paragraph.string 431 | if (paragraph_text == None): 432 | break 433 | else: 434 | # The regex pattern specifies a string of all whitespace 435 | result = re.search(regex, paragraph_text , re.M) 436 | if (result == None): 437 | # Paragraph text is not all whitespace 438 | break 439 | empty_paragraphs.append(paragraph) 440 | 441 | num_empty_paragraphs = len(empty_paragraphs) 442 | 443 | ''' 444 | Get the TOC-entries' paragraphs 445 | ''' 446 | toc_paragraphs = [] 447 | regex = r'^MsoToc[1-9]$' 448 | # * Loop for each TOC-entry paragraph 449 | # * Start with the paragraph just after the last empty paragraph 450 | # * break for a paragraph that is not a TOC-entry 451 | for i in range(num_empty_paragraphs, len(all_paragraphs)): 452 | paragraph = all_paragraphs[i] 453 | if ( (not 'class' in paragraph.attrs) or (len(paragraph['class']) != 1) ): 454 | break 455 | # Check if the class is "MsoToc" followed by a single digit 456 | paragraph_class = paragraph['class'][0] 457 | result = re.search(regex, paragraph_class, re.M) 458 | if (result == None): 459 | break 460 | toc_paragraphs.append(paragraph) 461 | 462 | num_toc_paragraphs = len(toc_paragraphs) 463 | print("INFO. 
 Table-of-contents entries found, for use in the navigation pane: " + str(num_toc_paragraphs)) 464 | 465 | ''' 466 | * For the TOC-entry paragraphs found, move them from the "body_inner_html" BeautifulSoup object, 467 | to the "soup_toc" BeautifulSoup object 468 | ''' 469 | soup_toc = BeautifulSoup("", 'html.parser') 470 | if (num_toc_paragraphs > 0): 471 | # Delete empty paragraphs from object "body_inner_html" 472 | # * From BS docs: "Tag.decompose() removes a tag from the tree, 473 | # then completely destroys it and its contents" 474 | for i in range(0, num_empty_paragraphs): 475 | empty_paragraphs[i].decompose() 476 | 477 | # Move TOC paragraphs from the object body_inner_html to the object soup_toc 478 | # * .append() moves an HTML tag in this case 479 | for i in range(0, num_toc_paragraphs): 480 | # Also move any newlines after the TOC paragraph 481 | # * These newlines just affect the formatting of the HTML source, 482 | # and not what is displayed on the web-page. 483 | newline_after_toc_paragraph = False 484 | element_after_toc_paragarph = toc_paragraphs[i].next_sibling 485 | if isinstance(element_after_toc_paragarph, NavigableString): 486 | text = element_after_toc_paragarph.string 487 | regex = r'^\n+$' 488 | result = re.search(regex, text, re.M) 489 | if result != None: 490 | newline_after_toc_paragraph = True 491 | soup_toc.append(toc_paragraphs[i]) 492 | if newline_after_toc_paragraph == True: 493 | soup_toc.append(element_after_toc_paragarph) 494 | 495 | ''' 496 | For each TOC paragraph, edit the HTML to use WordWebNav's CSS classes. 497 | * These classes implement WordWebNav's hyperlink style: 498 | * Hyperlinks are not underlined, and clicking on a link does not change its color 499 | ''' 500 | 501 | # Note: This processing does not confirm that the HTML conforms with the expected tag 502 | # structure, e.g., that the anchor (<a>) is within the expected span (<span>).
503 | toc_entries_without_hyperlink = 0 504 | toc_paragraphs = soup_toc.find_all('p') 505 | for paragraph in toc_paragraphs: 506 | # Test if the paragraph has a <span> tag with the attribute class=MsoHyperlink 507 | span = paragraph.find('span', class_="MsoHyperlink") 508 | if (span != None): 509 | span['class'].append('tocAnchor') 510 | 511 | # * Usually there is just one anchor tag, but there can be more. 512 | # * When there are multiple anchor-tags in a TOC entry: 513 | # * The anchor tags are not nested 514 | # * The last anchor tag has text, and the others do not 515 | # * The class 'tocAnchor' is added to the first anchor-tag 516 | # which is an ancestor to the paragraph's first string. 517 | 518 | paragraph_string = paragraph.find(string=True) 519 | if (paragraph_string == None): 520 | toc_entries_without_hyperlink += 1 521 | continue 522 | 523 | for parent in paragraph_string.parents: 524 | if (parent.name == "a"): 525 | if not ('class' in parent.attrs): 526 | parent['class'] = [] 527 | parent['class'].append('tocAnchor') 528 | break 529 | elif (parent.name == "p"): 530 | toc_entries_without_hyperlink += 1 531 | break 532 | 533 | if toc_entries_without_hyperlink > 0: 534 | print("INFO. Table-of-contents entries without a hyperlink: " + str(toc_entries_without_hyperlink)) 535 | 536 | # Create a string variable with the TOC entries, in HTML text-format.
537 | toc_html = soup_toc.decode(formatter="html") 538 | 539 | # Jinja will be used to put toc_html in the output HTML 540 | jinja_template_variables['table_of_contents'] = toc_html 541 | 542 | return 0, body_inner_html 543 | 544 | # END OF: construct_html_sections() -------------------------------------------------------------------------------- /createwebpage/copy_words_embedded_files.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ################ 3 | This file contains the function: copy_words_embedded_files() 4 | 5 | The function is called by: create_web_page_core() in create_web_page.py 6 | 7 | MIT License, Copyright (c) 2021-present Jim Yuill 8 | ################ 9 | ''' 10 | 11 | import shutil 12 | import os 13 | # Specifies the YAML keys for the input parameter-file 14 | from input_parameter_file_keys import * 15 | 16 | 17 | def copy_words_embedded_files(loaded_parms): 18 | ''' 19 | Description: 20 | * Copy Word's embedded-files-directory to the output directory 21 | * For a Word-HTML-file, Word will create a directory to hold embedded files, such as pictures. 22 | * That directory will be referred to as the "embedded-files directory". 23 | * For the embedded-files directory, its name is the same as the Word-HTML-file, but with a suffix "_files" 24 | * e.g., for index.html, the embedded-files directory is index_files 25 | 26 | Parameter: loaded_parms, contains the input parameter-file, in dictionary format 27 | 28 | Return: 29 | 0 : OK 30 | 1 : Error 31 | ''' 32 | 33 | # Get the relevant values specified in the input parameter-file 34 | input_html_path_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_INPUT_HTML_PATH] 35 | output_directory_path_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_OUTPUT_DIRECTORY_PATH] 36 | 37 | # For the input Word-HTML-file, if an embedded-files directory exists, copy it to the output directory.
38 | input_html_file_directory, input_html_file_name = os.path.split(input_html_path_value) 39 | input_html_file_name_without_extension, input_html_file_name_extension = os.path.splitext(input_html_file_name) 40 | embedded_files_directory_name = input_html_file_name_without_extension + "_files" 41 | input_embedded_files_directory_path = os.path.join(input_html_file_directory, embedded_files_directory_name) 42 | output_embedded_files_directory_path = os.path.join(output_directory_path_value, 43 | embedded_files_directory_name) 44 | if not os.path.exists(input_embedded_files_directory_path): 45 | print("INFO. For the input Word-HTML-file, an embedded-files-directory was not found. " + 46 | "It is optional:") 47 | print(" " + input_embedded_files_directory_path) 48 | else: 49 | print("INFO. For the input Word-HTML-file, an embedded-files-directory was found, at:") 50 | print(" " + input_embedded_files_directory_path) 51 | print("INFO. Copying the embedded-files-directory to the output directory, at:") 52 | print(" " + output_embedded_files_directory_path) 53 | 54 | # Test if the embedded-files directory already exists in the output directory 55 | if os.path.exists(output_embedded_files_directory_path): 56 | print("INFO. An embedded-files-directory already exists in the output directory. " + \ 57 | "It will be deleted. ") 58 | # How to delete a directory in Windows: 59 | # * https://stackoverflow.com/questions/6996603/how-to-delete-a-file-or-folder 60 | try: 61 | shutil.rmtree(output_embedded_files_directory_path) 62 | except OSError as e: 63 | print("") 64 | print("ERROR. Could not delete the existing output directory:") 65 | print(" " + output_embedded_files_directory_path) 66 | print(" %s - %s." 
% (e.strerror, e.filename)) 67 | return 1 68 | 69 | # shutil.copytree() requires that the destination does not exist 70 | # * https://stackoverflow.com/questions/1868714/how-do-i-copy-an-entire-directory-of-files-into-an-existing-directory-using-pyth 71 | try: 72 | shutil.copytree(input_embedded_files_directory_path, output_embedded_files_directory_path) 73 | except OSError as e: 74 | print("") 75 | print("ERROR. Could not copy the embedded-files-directory.") 76 | print(" %s - %s." % (e.strerror, e.filename)) 77 | return 1 78 | 79 | return 0 -------------------------------------------------------------------------------- /createwebpage/create_web_page.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | ''' 3 | ######################################################################## 4 | 5 | DESCRIPTION: Converts a Word HTML-file into a WordWebNav (WWN) web-page 6 | 7 | USAGE: 8 | * Calling from the Windows command-line: 9 | > cd 10 | > python create_web_page.py 11 | 12 | * If is omitted, the user is prompted for it. 13 | 14 | * The parameter-file includes specification of the input Word HTML-file, 15 | and the directory for the output WWN web-page. 16 | * Parameter-file templates are provided in the WWN repo. 17 | 18 | USAGE: 19 | * Calling from a Python module: 20 | * Call: create_web_page(parameter_file_path) 21 | * parameter_file_path: full-path of the input parameter-file 22 | 23 | DOCUMENTATION: 24 | * The call-graph for this Python package is in the repo at: 25 | * createwebpage\$package-contents.docx 26 | * WWN's documentation is on the WWN web-page, and in the repo under /docs. 27 | The documentation includes: 28 | * WWN's introduction, installation and use 29 | * Development documents, describing WWN's R&D, including the code. 
30 | 31 | MIT License, Copyright (c) 2021-present Jim Yuill 32 | 33 | ######################################################################## 34 | ''' 35 | 36 | # Python pre-installed libraries 37 | import argparse 38 | import sys 39 | 40 | # Modules in this package 41 | # * To import modules from this package, code was added to __init__.py 42 | from load_input_parameter_file import load_input_parameter_file 43 | from load_html_files import load_html_files 44 | from copy_words_embedded_files import copy_words_embedded_files 45 | from construct_html_sections import construct_html_sections 46 | import fix_word_html 47 | 48 | try: 49 | # Jinja2: pip install jinja2 50 | from jinja2 import Template 51 | except ImportError as e: 52 | print("") 53 | print("ERROR. Could not import a required Python module.") 54 | print(" The installation instructions specify the required modules.") 55 | print(" Import-error description:") 56 | print("") 57 | print(e) 58 | print("") 59 | sys.exit() 60 | 61 | 62 | def create_web_page_core(parameter_file_path: str, 63 | num_warning_messages: int): 64 | ''' 65 | * Description: 66 | * This is WWN's primary function: 67 | * It reads the input parameter-file, and creates the WWN web-page. 68 | * It is called by the wrapper-function create_web_page(). 69 | 70 | * Input: 71 | * parameter_file_path: the input parameter-file's full-path 72 | * num_warning_messages: the number of warning messages, thus far 73 | 74 | * Return: 75 | * 1, num_warning_messages : An error occurred 76 | * 0, num_warning_messages : No errors occurred 77 | ''' 78 | 79 | # Check Python is at least version 3 80 | # * This check works with Python 2.6 and below: 81 | # * https://stackoverflow.com/questions/446052/how-can-i-check-for-python-version-in-a-program-that-uses-new-language-features 82 | if sys.version_info[0] < 3: 83 | print("") 84 | print("ERROR. 
Python version 3 is required.") 85 | print(" WordWebNav has been tested with Python 3.7.0 and 3.9.6") 86 | print("") 87 | sys.exit() 88 | 89 | # Check Python version is at least 3.7 90 | if sys.version_info[1] < 7: 91 | num_warning_messages += 1 92 | print("") 93 | print("WARNING. Python version is less than 3.7.") 94 | print(" WordWebNav has been tested with Python 3.7.0 and 3.9.6") 95 | print("") 96 | 97 | # This dictionary holds the "variables" that will be copied into the Jinja template (jinja_template) 98 | # * The Jinja template is used to create the output HTML. 99 | # * This dictionary is used in calling the Jinja method render(), to create the output HTML: 100 | # * generated_html = jinja_template.render(jinja_template_variables) 101 | jinja_template_variables = { 102 | 'version': "", 103 | 'title_tag': "", 104 | 'meta_tag_with_description': "", 105 | 'page_structure_css_file_path': "", 106 | 'web_page_js_file_path': "", 107 | 'word_head_section_contents': "", 108 | 'additional_html': "", 109 | 'body': "", 110 | 'header_bar': "", 111 | 'table_of_contents': "", 112 | 'document_text': "", 113 | 'document_text_trailer': "" 114 | } 115 | 116 | #################### 117 | # Call the functions that process the input files 118 | # and that construct the WWN web-page's HTML-sections 119 | # * The functions are in separate files 120 | #################### 121 | 122 | # load_input_parameter_file(): 123 | # * The input parameter-file is in YAML format 124 | # * The file's contents are verified and loaded into a Python 125 | # object, made-up of dictionaries and lists. 
126 | # * That object is returned in the variable loaded_parms 127 | return_value, loaded_parms = load_input_parameter_file(parameter_file_path) 128 | if (return_value == 1): 129 | return 1, num_warning_messages 130 | 131 | # load_html_files(): 132 | # * Verifies the HTML-related input-files, and the output directory 133 | return_value, returned_objects_dict = load_html_files(loaded_parms) 134 | if (return_value == 1): 135 | return 1, num_warning_messages 136 | else: 137 | # Data returned in returned_objects_dict: 138 | 139 | # The Jinja template is loaded from a file, and returned in jinja_template 140 | # * jinja_template is a jinja2 object, created by calling jinja2.Template() 141 | jinja_template = returned_objects_dict["jinja_template"] 142 | 143 | # The input Word HTML-file was loaded, and it is returned in BeautifulSoup objects: 144 | # 145 | # head is a BeautifulSoup object 146 | # * It holds the <head> element from the input HTML-file 147 | head = returned_objects_dict["head"] 148 | # body is a BeautifulSoup object 149 | # * It holds the <body> element from the input HTML-file 150 | body = returned_objects_dict["body"] 151 | 152 | # The output HTML-file was created, and it is currently empty: 153 | output_html_file_path = returned_objects_dict["output_html_file_path"] 154 | output_html_file_handle = returned_objects_dict["output_html_file_handle"] 155 | 156 | # copy_words_embedded_files(): 157 | # * Copies Word's embedded files 158 | # * If the input Word HTML-file has embedded-files, they are copied to 159 | # the output directory 160 | return_value = copy_words_embedded_files(loaded_parms) 161 | if (return_value == 1): 162 | return 1, num_warning_messages 163 | 164 | # construct_html_sections(): 165 | # * Constructs the output HTML-sections, and puts them in: 166 | # * BeautifulSoup object: body_inner_html 167 | # * The dictionary jinja_template_variables 168 | return_value, body_inner_html = construct_html_sections(loaded_parms, 169 | jinja_template_variables, head,
body) 170 | if (return_value == 1): 171 | return 1, num_warning_messages 172 | 173 | # fix_word_html(): 174 | # * Fixes a set of known bugs in Word's HTML 175 | # * Output: 176 | # * jinja_template_variables : the document-text's HTML is added, with the fixes applied 177 | return_value, num_warning_messages = fix_word_html.fix_word_html(loaded_parms, 178 | jinja_template_variables, body_inner_html, 179 | num_warning_messages) 180 | if (return_value == 1): 181 | return 1, num_warning_messages 182 | 183 | 184 | ################ 185 | # Create the output WWN web-page 186 | ################ 187 | # The web-page is created from the Jinja template and the template-variables. 188 | print("INFO. Generating the output HTML, using the HTML-template.") 189 | generated_html = jinja_template.render(jinja_template_variables) 190 | 191 | # Write the web-page to the output HTML-file 192 | print("INFO. Writing the output HTML, to the output HTML-file:") 193 | print(" " + output_html_file_path) 194 | try: 195 | output_html_file_handle.write(generated_html) 196 | output_html_file_handle.close() 197 | except IOError as e: 198 | print("") 199 | print("ERROR. Could not write-to or close the output HTML-file.") 200 | print(" %s - %s." % (e.strerror, e.filename)) 201 | return 1, num_warning_messages 202 | 203 | print("") 204 | print("INFO. Processing completed. No errors. Warning messages: " + 205 | str(num_warning_messages)) 206 | 207 | return 0, num_warning_messages 208 | 209 | # END of: create_web_page_core() 210 | 211 | 212 | def create_web_page(parameter_file_path): 213 | ''' 214 | * Description: 215 | * A wrapper function for calling create_web_page_core(). 216 | * This wrapper makes it possible for create_web_page_core() 217 | to simply return if it encounters any errors. 
218 | ''' 219 | num_warning_messages = 0 220 | 221 | # Check Python version 3 222 | # * https://docs.python.org/2.7/library/sys.html#sys.version_info 223 | if sys.version_info[0] < 3: 224 | # Pylance incorrectly flags this code as unreachable 225 | # * https://github.com/microsoft/pylance-release/issues/470 226 | print("") 227 | print("ERROR. Python version 3 is required.") 228 | print(" WordWebNav has been tested with Python 3.7.0 and 3.9.6") 229 | print("") 230 | sys.exit() 231 | 232 | # Check Python version is at least 3.7 233 | if sys.version_info[1] < 7: 234 | num_warning_messages += 1 235 | print("") 236 | print("WARNING. Python version is less than 3.7.") 237 | print(" WordWebNav has been tested with Python 3.7.0 and 3.9.6") 238 | print("") 239 | 240 | # Call create_web_page_core(), to create the WWN web-page for the input Word HTML-file 241 | return_value, num_warning_messages = create_web_page_core(parameter_file_path, num_warning_messages) 242 | if return_value == 1: 243 | print("") 244 | print("INFO. Error encountered, processing not completed. Error messages: 1. 
Warning messages: " + str(num_warning_messages)) 245 | print("") 246 | 247 | return return_value, num_warning_messages 248 | # END OF: create_web_page() 249 | 250 | 251 | def create_arg_parser(): 252 | ''' 253 | * Description: 254 | * Argparse is used to process the command-line argument 255 | * Creates and returns the ArgumentParser object 256 | 257 | * References 258 | * Argparse docs: https://docs.python.org/3/library/argparse.html 259 | * The argparse code here is from: https://stackoverflow.com/questions/14360389/getting-file-path-from-command-line-argument-in-python/47324233 260 | ''' 261 | parser = argparse.ArgumentParser(description= 262 | 'Converts a Word HTML-file to a usable web-page.') 263 | # * One positional argument, and it is optional 264 | # * https://stackoverflow.com/questions/4480075/argparse-optional-positional-arguments/31243133 265 | parser.add_argument('parameter_file_path', nargs='?', metavar="", 266 | help='Full-path to the parameter-file.') 267 | return parser 268 | # END OF: create_arg_parser() 269 | 270 | ''' 271 | ############# 272 | Command-line interface 273 | ############# 274 | ''' 275 | if __name__ == "__main__": 276 | # Get the parameter-file path from the command-line 277 | # * argparser also verifies the command-line syntax 278 | arg_parser = create_arg_parser() 279 | parsed_args = arg_parser.parse_args() 280 | parameter_file_path = parsed_args.parameter_file_path 281 | 282 | # * If a parameter-file path was not provided on the command-line, 283 | # then prompt for it 284 | if parameter_file_path == None: 285 | prompted_for_parameter_file = True 286 | parameter_file_path = input("Enter parameter-file path: ").strip() 287 | if parameter_file_path == "": 288 | print("") 289 | print("ERROR. 
Parameter-file path not provided.") 290 | print("") 291 | sys.exit() 292 | else: 293 | prompted_for_parameter_file = False 294 | 295 | # Create the WWN web-page 296 | create_web_page(parameter_file_path) 297 | 298 | # If the command-window will close at program exit, then prompt for a key-press: 299 | # * If the program was called by clicking on it, the command window will close 300 | # when the program exits, and the program's messages will not be viewable. 301 | # * Prompting for a key-press will ensure the program's messages are viewable. 302 | # * However, it is difficult to determine if the program was called by clicking on it. 303 | # * If the program was called by clicking on it, the user will have been prompted for 304 | # the parameter-file. 305 | # So, if the user was prompted for the parameter-file, then prompt for a key-press: 306 | if prompted_for_parameter_file == True: 307 | print("") 308 | input("Press any key to exit.") 309 | print("") 310 | -------------------------------------------------------------------------------- /createwebpage/fix_word_html.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ################ 3 | * Module with function: fix_word_html() 4 | * Called by: create_web_page_core() in create_web_page.py 5 | 6 | Local functions: 7 | * fix_unordered_list_items() 8 | * test_if_span_with_only_spaces() 9 | 10 | Call graph: 11 | create_web_page_core() 12 | --> fix_word_html() 13 | --> fix_unordered_list_items() 14 | --> test_if_span_with_only_spaces() 15 | --> test_if_span_with_only_spaces() 16 | 17 | MIT License, Copyright (c) 2021-present Jim Yuill 18 | ################ 19 | ''' 20 | 21 | import re 22 | import sys 23 | import os 24 | # Specifies the YAML keys for the input parameter-file 25 | from input_parameter_file_keys import * 26 | 27 | try: 28 | # BeautifulSoup: pip install beautifulsoup4 29 | from bs4 import BeautifulSoup, NavigableString, Tag 30 | except ImportError as e: 31 | 
print("") 32 | print("ERROR. Could not import a required Python module.") 33 | print(" The installation instructions specify the required modules.") 34 | print(" Import-error description:") 35 | print("") 36 | print(e) 37 | print("") 38 | sys.exit() 39 | 40 | """ 41 | ################## 42 | * Global constants, used in the dictionary html_entity_encodings 43 | ################## 44 | """ 45 | 46 | # Defines the keys used in the dictionary HTML_ENTITY_ENCODING_SPECS 47 | FIX_KEY_DESCRIPTION = "description" 48 | FIX_KEY_ENCODE = "encode" 49 | FIX_KEY_DECODE = "decode" 50 | FIX_KEY_NUMBER_ENCODED = "number_encoded" 51 | 52 | # A dictionary used to store the data used to fix a particular bug 53 | HTML_ENTITY_ENCODING_SPECS = { 54 | FIX_KEY_DESCRIPTION: "", # Constant, used in console messages 55 | FIX_KEY_ENCODE: "", # Constant, specifies the new HTML, within a script tag 56 | FIX_KEY_DECODE: "", # Constant, specifies the new HTML, without the script tag 57 | FIX_KEY_NUMBER_ENCODED: 0 # Variable, specifies the number of instances of the fix 58 | } 59 | 60 | # Defines the top-level keys used in the dictionary html_entity_encodings 61 | FIX_KEY_SOLID_DOT_BULLET = "solid_dot_bullet" 62 | FIX_KEY_SOLID_SQUARE_BULLET = "solid_square_bullet" 63 | FIX_KEY_LETTER_O_BULLET = "letter_o_bullet" 64 | FIX_KEY_SIX_NBSPS = "six_nbsps" 65 | 66 | # Defines HTML tags that are used in the dictionary html_entity_encodings 67 | # * The fixes to the Word-HTML involves replacing HTML strings with particular 68 | # HTML entities. 69 | # * Those replacement HTML-entities are put inside script opening and closing tags, 70 | # which are defined here. 
71 | FIX_SCRIPT_OPENING_TAG = "<script>" 72 | FIX_SCRIPT_CLOSING_TAG = "</script>" 73 | 74 | 75 | def test_if_span_with_only_spaces(element): 76 | ''' 77 | For the input HTML-element, determine if it is a span-tag with 78 | a single string made-up of only spaces 79 | 80 | Local function, called by: 81 | * fix_word_html() 82 | * fix_unordered_list_items() 83 | 84 | Returns True or False 85 | ''' 86 | # Have to use .decode() to get the "&nbsp;"s 87 | if ( isinstance(element, Tag) and (element.name == "span") and 88 | ('style' in element.attrs) and (len(element.contents) == 1) ): 89 | 90 | span_html_str = element.decode(formatter='html') 91 | regex = r"^<span[^>]*?>(?:\s*?(?:&nbsp;)\s*?)+</span>$" 92 | search_result = re.search(regex, span_html_str, re.M) 93 | if search_result is not None: 94 | return True 95 | 96 | return False 97 | # END OF: test_if_span_with_only_spaces() 98 | 99 | 100 | def fix_unordered_list_items(html_entity_encodings, 101 | word_list_item_list, 102 | font_family: str, 103 | symbol: str, 104 | html_entity_encodings_key: str): 105 | ''' 106 | Fixes commonly-found bugs in Word-HTML, for unordered-lists. 107 | 108 | Local function, called by: fix_word_html() 109 | 110 | Parameters: 111 | * html_entity_encodings : dictionary, each entry specifies the HTML used to fix 112 | a particular bug in the Word-HTML 113 | * word_list_item_list : a Python list, holds candidate Word list-items 114 | * font_family : used in specifying the bullet-symbol of the list-items to be fixed 115 | * symbol : used in specifying the bullet-symbol of the list-items to be fixed 116 | * html_entity_encodings_key : the key, in html_entity_encodings, for the HTML that fixes the bullet-symbol, if it needs to be fixed 117 | 118 | Return: 119 | * html_entity_encodings : Stats are recorded for fixed list-items 120 | * word_list_item_list : Fixed list-items are removed from word_list_item_list. 121 | ''' 122 | 123 | ''' 124 | * Terminology 125 | * An unordered list is made-up of list-items.
126 | * For the list-items, three types of bullet-symbols are typically used: 127 | * solid-dot, solid-square, and the letter "o" 128 | 129 | * The function's parameters specify: 130 | * The bullet-symbol of the list-item to be fixed, e.g., a round dot 131 | * It's specified by the parameters "font_family" and "symbol" 132 | * The HTML needed to fix the bullet-symbol, if it needs to be fixed. 133 | * The parameter "html_entity_encodings_key" is a key for the dictionary html_entity_encodings. 134 | * In the dictionary, that key's value has the HTML for fixing the bug. 135 | If no fix is needed, the empty string is specified. 136 | 137 | * The function examines each candidate Word list-item in the Python list "word_list_item_list". 138 | * A candidate list-item could be a list-item, but additional examination is needed to confirm that. 139 | * In HTML, a list-item is specified as a paragraph (
<p>...</p>
) 140 | * If a list-item is for an unordered-list, and it has the specified bullet-symbol, 141 | then the list-item's HTML is edited, to fix its bugs. 142 | 143 | * The bug-fixes just described involve replacing HTML strings with HTML entities. 144 | * The replacements are done in the BeautifulSoup HTML. 145 | * During the replacement, BeautifulSoup itself can potentially alter those HTML entities, 146 | in undesirable ways. 147 | * To prevent BeautifulSoup from making those alterations, the replacement HTML entities 148 | are put inside an HTML <script> element. 149 | * Later in fix_word_html(), the BeautifulSoup HTML will be converted to HTML text. 150 | Those added opening and closing script tags will be removed then, from the HTML text. 151 | 152 | * The Word-HTML bugs, and their fixes, are further described in the WWN development-documents. 153 | ''' 154 | 155 | list_elements_to_remove = [] 156 | 157 | symbol_replace_count = 0 158 | bullets_fixed_count = 0 159 | # Loop for each HTML paragraph in word_list_item_list 160 | for word_list_item_list_index in range(0, len(word_list_item_list)): 161 | ''' 162 | * Determine if the paragraph is an unordered-list list-item, 163 | and if it has the required bullet-symbol.
164 | ''' 165 | 166 | # A list-item paragraph will have at least two strings 167 | paragraph = word_list_item_list[word_list_item_list_index] 168 | paragraph_strings = paragraph.find_all(string=True) 169 | if len(paragraph_strings) < 2: 170 | continue 171 | 172 | # Test that the first string is the required bullet-symbol 173 | first_string = paragraph_strings[0] 174 | if (first_string.string != symbol): 175 | continue 176 | 177 | # For the bullet-symbol, test if its parent is an HTML span tag with the attribute 'style' 178 | first_string_parent = first_string.parent 179 | if not ( isinstance(first_string_parent, Tag) and (first_string_parent.name == 'span') and 180 | ('style' in first_string_parent.attrs) ): 181 | continue 182 | 183 | # For the style attribute, test that it contains the required font-family specification 184 | style = first_string_parent.attrs['style'] 185 | regex = r"(?:^|;)" + r"font-family:" + font_family + r"(?:$|;)" 186 | result = re.search(regex, style, re.M) 187 | if result is None: 188 | continue 189 | 190 | # Test if the second string is all spaces ("&nbsp;"), within an enclosing span tag 191 | second_string = paragraph_strings[1] 192 | second_string_parent = second_string.parent 193 | if not test_if_span_with_only_spaces(second_string_parent): 194 | continue 195 | 196 | # Test if the spaces' enclosing span tag is a child of the bullet-symbol's parent 197 | if not (first_string_parent is second_string_parent.parent): 198 | continue 199 | 200 | ''' 201 | Make needed fixes to the HTML 202 | ''' 203 | 204 | # Replace the bullet symbol, if needed, using the HTML defined in html_entity_encodings 205 | if (html_entity_encodings[html_entity_encodings_key][FIX_KEY_ENCODE] != ""): 206 | first_string.replace_with( 207 | BeautifulSoup(html_entity_encodings[html_entity_encodings_key][FIX_KEY_ENCODE], 208 | 'html.parser') ) 209 | symbol_replace_count += 1 210 | 211 | # Replace the "&nbsp;"s, using the HTML defined in html_entity_encodings 212 | ''' 213 | For
list-items that use common bullet-symbols, there is an additional bug in the Word HTML. 214 | * Word puts "&nbsp;" entities after the bullet-symbol, to provide spacing between the 215 | bullet-symbol and the list-item's text. 216 | * However, the number of "&nbsp;" entities is often incorrect, and it often results in 217 | a visibly misaligned unordered-list. 218 | * The problem is fixed by replacing the "&nbsp;"'s with six "&nbsp;"'s. 219 | ''' 220 | second_string.replace_with(BeautifulSoup(html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_ENCODE], 221 | 'html.parser') ) 222 | 223 | # The HTML paragraph has been fixed. Its index in word_list_item_list is recorded. 224 | list_elements_to_remove.append(word_list_item_list_index) 225 | bullets_fixed_count += 1 226 | 227 | # * For the HTML paragraphs that have been fixed, remove them from the list 228 | # word_list_item_list 229 | # * Removing them from the list does not remove them from the BeautifulSoup HTML 230 | # (i.e., they are not removed from the BeautifulSoup object "soup") 231 | for i in sorted(list_elements_to_remove, reverse=True): 232 | del word_list_item_list[i] 233 | 234 | print("INFO. Editing the Word-HTML.
Fixing list-items with " + 235 | html_entity_encodings[html_entity_encodings_key][FIX_KEY_DESCRIPTION] + 236 | " Number of list-items found: " + str(bullets_fixed_count) ) 237 | # Record stats for fixes 238 | html_entity_encodings[html_entity_encodings_key][FIX_KEY_NUMBER_ENCODED] = symbol_replace_count 239 | html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_NUMBER_ENCODED] += bullets_fixed_count 240 | # END OF: fix_unordered_list_items() 241 | 242 | 243 | def fix_word_html(loaded_parms, jinja_template_variables, body_inner_html, num_warning_messages): 244 | ''' 245 | Fix bugs in Word's HTML 246 | 247 | Parameters: 248 | * loaded_parms : a dictionary with the input parameter-file's contents 249 | * jinja_template_variables : a dictionary with the variables that are copied into the Jinja template 250 | * body_inner_html : The HTML from the <body> section in Word's HTML, 251 | but with these parts removed: 252 | * <body> opening and closing tags 253 | * The table-of-contents 254 | * num_warning_messages : counter 255 | 256 | Calls local functions: 257 | * fix_unordered_list_items() 258 | * test_if_span_with_only_spaces() 259 | 260 | Return: 261 | * 1, None : error 262 | 263 | * 0, num_warning_messages : OK 264 | * The objects returned 265 | * Objects passed as parameters: 266 | * loaded_parms : not changed 267 | * jinja_template_variables : added the document-text's HTML, with the fixes applied 268 | * body_inner_html : contents not specified (no longer used) 269 | 270 | The Word-HTML bugs, and their fixes, are further described in the WWN development-documents. 271 | The documents are: 272 | * In the repo, under /docs/development-docs 273 | * On the WWN web-site 274 | ############# 275 | ''' 276 | 277 | ''' 278 | * Word's HTML has several bugs. This function fixes those bugs, if present. 279 | * The bugs are fixed by editing the HTML. 280 | * The HTML is in the BeautifulSoup object "body_inner_html".
281 | * body_inner_html has the HTML from the <body> section in Word's HTML, 282 | but with the outer <body> tags removed, and the table-of-contents removed 283 | 284 | * The Word-HTML bugs fixed are: 285 | * Formatting problems in bulleted lists (unordered lists) 286 | * Formatting problems in ordered lists 287 | * Text whose color is incorrectly set to be white 288 | 289 | * Among the Word-HTML bugs that are fixed, some of the bugs are fixed by replacing a 290 | particular string within an HTML paragraph. 291 | * The replacement process is the same for each of those bugs, but the data differs. 292 | * The replacement process is coded within the present function and within fix_unordered_list_items(). 293 | * The data used by those replacement-processes is defined here. 294 | ''' 295 | 296 | # * html_entity_encodings: 297 | # * Each entry specifies the HTML used to fix a particular bug in 298 | # the Word-HTML. 299 | # * Each entry's value is itself a dictionary. 300 | # * The value is initialized to be a copy of HTML_ENTITY_ENCODING_SPECS.
301 | html_entity_encodings = { 302 | FIX_KEY_SOLID_DOT_BULLET: HTML_ENTITY_ENCODING_SPECS.copy(), 303 | FIX_KEY_SOLID_SQUARE_BULLET: HTML_ENTITY_ENCODING_SPECS.copy(), 304 | FIX_KEY_LETTER_O_BULLET: HTML_ENTITY_ENCODING_SPECS.copy(), 305 | FIX_KEY_SIX_NBSPS: HTML_ENTITY_ENCODING_SPECS.copy() 306 | } 307 | 308 | # Fix for solid-dot bullet-symbols 309 | html_entity_encodings[FIX_KEY_SOLID_DOT_BULLET][FIX_KEY_DESCRIPTION] = \ 310 | "solid-dot bullet-symbols (used in levels 1,4,7)" 311 | html_entity_encodings[FIX_KEY_SOLID_DOT_BULLET][FIX_KEY_ENCODE] = \ 312 | FIX_SCRIPT_OPENING_TAG + "●" + FIX_SCRIPT_CLOSING_TAG 313 | html_entity_encodings[FIX_KEY_SOLID_DOT_BULLET][FIX_KEY_DECODE] = "●" 314 | 315 | # Fix for solid-square bullet-symbols 316 | html_entity_encodings[FIX_KEY_SOLID_SQUARE_BULLET][FIX_KEY_DESCRIPTION] = \ 317 | "solid-square bullet-symbols (used in levels 3,6,9)" 318 | html_entity_encodings[FIX_KEY_SOLID_SQUARE_BULLET][FIX_KEY_ENCODE] = \ 319 | FIX_SCRIPT_OPENING_TAG + "■" + FIX_SCRIPT_CLOSING_TAG 320 | html_entity_encodings[FIX_KEY_SOLID_SQUARE_BULLET][FIX_KEY_DECODE] = "■" 321 | 322 | # For letter "o" bullet-symbols, specify no fix is needed for the bullet-symbol 323 | html_entity_encodings[FIX_KEY_LETTER_O_BULLET][FIX_KEY_DESCRIPTION] = \ 324 | "letter \"o\" bullet-symbols (used in levels 2,5,8)" 325 | html_entity_encodings[FIX_KEY_LETTER_O_BULLET][FIX_KEY_ENCODE] = "" 326 | html_entity_encodings[FIX_KEY_LETTER_O_BULLET][FIX_KEY_DECODE] = "" 327 | 328 | # Fix for spacing after a bullet-symbol (unordered list), or list-item symbol (ordered-list). 329 | html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_DESCRIPTION] = "list-item spacing" 330 | html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_ENCODE] = \ 331 | FIX_SCRIPT_OPENING_TAG + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + FIX_SCRIPT_CLOSING_TAG 332 | html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_DECODE] = "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" 333 | 334 | 335 | ''' 336 | ##################### 337 | Get the HTML paragraphs that are candidate list-items.
338 | ##################### 339 | ''' 340 | 341 | ''' 342 | * List-item paragraphs: 343 | * For ordered and unordered lists, Word usually implements the list-items as HTML paragraphs
<p>. 344 | * An example of the opening tag for a typical list-item paragraph: 345 | *

346 | * For typical list-item paragraphs, their opening tag has these distinctive features: 347 | * A class attribute, with a known set of class names, e.g., MsoNormal, MsoListParagraphCxSpFirst, etc. 348 | * A style attribute, with a "text-indent" value 349 | * Paragraphs with these distinctive features are not necessarily list-items. 350 | * Thus, such paragraphs are "candidate" list-items. 351 | * There are other ways that Word-HTML implements list-items. 352 | * They are described in the WWN development-documents, but their HTML is not edited by the system. 353 | 354 | * body_inner_html.find_all() creates a Python list with the HTML paragraphs that have those distinctive features 355 | * Each element in the Python list is an HTML paragraph, stored as a BeautifulSoup object 356 | * That BeautifulSoup object is a pointer into the original BeautifulSoup object "soup" 357 | * In creating the paragraph's BeautifulSoup object, the paragraph was not removed from "soup" 358 | ''' 359 | 360 | # word_list_item_list contains candidate list-items 361 | # * A candidate list-item could be a Word list-item, but additional examination is needed to confirm. 362 | word_list_item_list = [] 363 | 364 | # "regex" is a reg-ex pattern used to match known class names, e.g., class=MsoNormal 365 | # * These names have been observed in both ordered and unordered lists. 366 | # * The only exception is the name "MsoListParagraph", which has only been observed in ordered lists. 
367 | regex = r'(^MsoListParagraph(CxSp(First|Middle|Last))?$)|(^MsoNormal$)|(^MsoBodyText$)' 368 | word_list_item_list = body_inner_html.find_all('p', 369 | class_=re.compile(regex, re.M), 370 | attrs={'style': re.compile('text-indent:')}) 371 | 372 | 373 | ''' 374 | ###################### 375 | Fix the list-items in unordered-lists 376 | ###################### 377 | ''' 378 | ''' 379 | Encoding for symbols: 380 | * Apparently, my editor changes the following symbols from 1-byte encoding to 2-byte encoding: 381 | * Solid-dot bullet symbols, changed to "·" 382 | * Solid-square bullet symbols, changed to "§" 383 | 384 | * In both cases, the "Â" character was added. 385 | * Perhaps the change was due to the editor using UTF-8 encoding. 386 | * In the call to fix_unordered_list_items(), the 2-byte encoding for the symbol 387 | does not match the intended symbol in the HTML file. 388 | 389 | * So, in the calls to fix_unordered_list_items(), the symbol's character code is specified, using chr(). 390 | * For example: symbol=chr(183), 391 | ''' 392 | 393 | # Fix solid-dot bullet symbols, and their spacing 394 | # * Word's solid-dot bullet is not displayed properly by Firefox. 395 | # * It is replaced here by the HTML solid-dot symbol "●" 396 | fix_unordered_list_items(html_entity_encodings, 397 | word_list_item_list, 398 | font_family="Symbol", 399 | symbol=chr(183), 400 | html_entity_encodings_key=FIX_KEY_SOLID_DOT_BULLET) 401 | 402 | # Fix solid-square bullet symbols, and their spacing 403 | # * Word's solid-square bullet is not displayed properly by Firefox.
404 | # * It is replaced here by the HTML solid-square symbol "■" 405 | fix_unordered_list_items(html_entity_encodings, 406 | word_list_item_list, 407 | font_family="Wingdings", 408 | symbol=chr(167), 409 | html_entity_encodings_key=FIX_KEY_SOLID_SQUARE_BULLET) 410 | 411 | # Fix the spacing for the letter-"o" bullet symbols 412 | fix_unordered_list_items(html_entity_encodings, 413 | word_list_item_list, 414 | font_family='"Courier New"', 415 | symbol="o", 416 | html_entity_encodings_key=FIX_KEY_LETTER_O_BULLET) 417 | 418 | ''' 419 | ###################### 420 | Fix the list-items in ordered-lists 421 | ###################### 422 | ''' 423 | 424 | ''' 425 | This code fixes commonly-found bugs in Word-HTML, for ordered-lists. 426 | 427 | * Terminology 428 | * An ordered list is made-up of list-items. 429 | * The list-item symbols are typically: integers, Roman-numerals, and letters. 430 | * Examples of the typical formatting for the list-item symbols is: 431 | 1., 1), and [1] 432 | 433 | * The function examines each candidate list-item in the Python list "word_list_item_list". 434 | * In HTML, a list-item is specified as a paragraph (
<p>...</p>
) 435 | * If a list-item is for an ordered-list, then the list-item's HTML is edited, to fix its bugs. 436 | 437 | * There are two commonly-found bugs in these list-items: 438 | * The list-item symbol is often not properly indented 439 | * The text after the list-item symbol is often not properly indented 440 | * (It does not have the proper number of spaces between the symbol and the start of the text) 441 | 442 | * An example of a typical list-item paragraph, for an ordered-list 443 |

444 |     i.  List-item text

446 | ''' 447 | 448 | unrecognized_indentation_units_list = [] 449 | num_text_indent_unrecognized = 0 450 | num_list_items_fixed = 0 451 | # Loop for each HTML paragraph in word_list_item_list 452 | for paragraph in word_list_item_list: 453 | ''' 454 | * Determine if the paragraph is an ordered-list list-item. 455 | * Also, determine the structure of the relevant HTML tags and strings within 456 | the paragraph. 457 | * In this code-section, references to "list-item" are for an ordered-list. 458 | ''' 459 | 460 | # Strings present in a list-item paragraph: 461 | # * In BeautifulSoup format, strings are stored in an object of 462 | # type NavigableString. 463 | # * In a list-item paragraph, the first string can be either: 464 | # * All spaces ("&nbsp;"), within a span tag, or 465 | # * The list-item symbol, e.g., "1." 466 | # * If the first string is all spaces: 467 | # * The second string is the list-item symbol. 468 | # * The first string is referred to as the "pre-symbol spaces". 469 | # * There's a separate string after the list-item symbol. 470 | # * It is all spaces ("&nbsp;"), within a span tag. 471 | # * These spaces are referred to as the "post-symbol spaces". 472 | 473 | # A list-item paragraph will have at least two strings 474 | paragraph_strings = paragraph.find_all(string=True) 475 | if len(paragraph_strings) < 2: 476 | continue 477 | 478 | # Determine if the paragraph's first string is all spaces, within a span tag 479 | first_string = paragraph_strings[0] 480 | first_string_parent = first_string.parent 481 | if test_if_span_with_only_spaces(first_string_parent): 482 | first_string_is_all_spaces = True 483 | else: 484 | first_string_is_all_spaces = False 485 | 486 | # * Assume the paragraph is a list-item, and set variables 487 | # to "point" to these strings: the list-item symbol, and the post-symbol spaces.
488 | if first_string_is_all_spaces: 489 | if len(paragraph_strings) < 3: 490 | continue 491 | list_item_symbol_navigable_string = paragraph_strings[1] 492 | ending_span_with_only_spaces = paragraph_strings[2].parent 493 | else: 494 | list_item_symbol_navigable_string = paragraph_strings[0] 495 | ending_span_with_only_spaces = paragraph_strings[1].parent 496 | 497 | # * Test if there is a list-item symbol, in the string where the list-item symbol is 498 | # expected to be. 499 | # * 500 | if not isinstance(list_item_symbol_navigable_string, NavigableString): 501 | continue 502 | text = list_item_symbol_navigable_string.string 503 | # * This reg-ex matches the typical types of list-item symbols, and symbol formatting: 504 | # [1], 1), and 1. 505 | # * \S matches everything except whitespace, e.g., numbers, letters 506 | regex = r'^[\[]?\S+[\)\.\]]$' 507 | result = re.search(regex, text, re.M) 508 | if result == None: 509 | continue 510 | 511 | # * Test if the list-item symbol is followed by the post-symbol spaces 512 | if not test_if_span_with_only_spaces(ending_span_with_only_spaces): 513 | continue 514 | 515 | # * Test the parents of the two strings: the list-item symbol, and the post-symbol spaces. 516 | # * The strings should have the same HTML parent. 517 | if not (list_item_symbol_navigable_string.parent is ending_span_with_only_spaces.parent): 518 | continue 519 | # * If there is a pre-symbol-spaces string, test that it and the list-item-symbol have 520 | # the same HTML parent. 521 | if (first_string_is_all_spaces and 522 | not (first_string_parent.parent is list_item_symbol_navigable_string.parent)): 523 | continue 524 | 525 | # Test if the paragraph's text-indent value is in the expected form. 
526 | style = paragraph['style'] 527 | regex = r"(?:^|;)(text-indent:[+-]?[\.0-9]+)([a-zA-Z]+)(?:$|;)" 528 | result = re.search(regex, style, re.M) 529 | if (result is None): 530 | continue 531 | # If the text-indent length-units are not in inches, a warning message will be displayed, later. 532 | # * Length-units consist of letters (upper and/or lower-case), e.g., cm, mm, px, in, etc. 533 | # * https://developer.mozilla.org/en-US/docs/Web/CSS/text-indent 534 | # * https://developer.mozilla.org/en-US/docs/Web/CSS/length 535 | elif (result.group(2) != 'in'): 536 | num_text_indent_unrecognized += 1 537 | text_indent_specs = result.group(1) + result.group(2) 538 | unrecognized_indentation_units_list.append(text_indent_specs) 539 | 540 | # * Replace the post-symbol spaces with the correct number of spaces (&nbsp;) 541 | # * The spaces are enclosed in a script tag, as was done in fixing unordered-lists. 542 | ending_span_with_only_spaces.contents[0].replace_with( 543 | BeautifulSoup(html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_ENCODE], 544 | 'html.parser') ) 545 | html_entity_encodings[FIX_KEY_SIX_NBSPS][FIX_KEY_NUMBER_ENCODED] += 1 546 | 547 | # * Pre-symbol-spaces are within a span tag. 548 | # * If there are pre-symbol-spaces, delete the whole span tag. 549 | # * From BS docs: "Tag.decompose() removes a tag from the tree, 550 | # then completely destroys it and its contents" 551 | if first_string_is_all_spaces: 552 | first_string_parent.decompose() 553 | 554 | # Fix the paragraph's text-indent field by setting it to -.25in 555 | regex = r"((?:^|;)text-indent:)([+-]?[\.0-9]+[a-zA-Z]+)($|;)" 556 | substitution = "\\1-.25in\\3" 557 | new_style, substitution_count = re.subn(regex, substitution, style, 0, re.M) 558 | if (substitution_count != 1): 559 | print("") 560 | print("ERROR.
Unexpected error, while fixing ordered-list list-items.") 561 | print(" Regular-expression substitution failed, in setting text-indent to \"-.25in\"") 562 | print(" List-item HTML:") 563 | print(paragraph.decode(formatter='html')) 564 | return 1, num_warning_messages 565 | paragraph['style'] = new_style 566 | 567 | num_list_items_fixed += 1 568 | 569 | print("INFO. Editing the Word-HTML. Fixing ordered-list list-items." + \ 570 | " %s list-items were fixed." % num_list_items_fixed) 571 | 572 | # * For a list-item, the text-indent value can have units other than inches. 573 | # * If such text-indent units were encountered, display a warning message. 574 | if (num_text_indent_unrecognized > 0): 575 | num_warning_messages += 1 576 | # Remove duplicates from the list: unrecognized_indentation_units_list 577 | unrecognized_indentation_units_list = list(set(unrecognized_indentation_units_list)) 578 | # Create a string with the unrecognized units, separated by commas. 579 | string = "" 580 | for element in unrecognized_indentation_units_list: 581 | string += (element + ", ") 582 | string = string[0:-2] 583 | print("") 584 | print("WARNING. 
Ordered-list list-item(s) were fixed.") 585 | print(f" There were {num_text_indent_unrecognized} list-items whose " + \ 586 | "text-indent values were not \"in\".") 587 | print(" In fixing their text-indent values, they were changed to use \"in\" units.") 588 | print(" It's possible that these list-items are not properly indented.") 589 | print(" The text-indent units that were not \"in\" are: " + string) 590 | print("") 591 | 592 | 593 | ''' 594 | ###################### 595 | For the Word-HTML, fix text that is incorrectly set to "color:white" 596 | ###################### 597 | ''' 598 | 599 | # * The input parameter-file has a key that's used to specify what types of Word-HTML are to be 600 | # fixed, for "color:white" 601 | # * The key's name is specified in the constant YML_KEY_WHITE_COLORED_TEXT 602 | if ( (YML_KEY_WORD_HTML_EDITS in loaded_parms) and 603 | (YML_KEY_WHITE_COLORED_TEXT in loaded_parms[YML_KEY_WORD_HTML_EDITS]) ): 604 | key_white_text_value = loaded_parms[YML_KEY_WORD_HTML_EDITS][YML_KEY_WHITE_COLORED_TEXT] 605 | else: 606 | # If the key isn't specified in the parameter-file, use the default value 607 | key_white_text_value = YML_DO_NOT_REMOVE 608 | 609 | print("INFO. Processing the span tags with attribute \"style\" and value \"color:white\". 
") 610 | print(" Processing-type used (specified via the parameter-file key \"white_colored_text\"): " + key_white_text_value) 611 | 612 | # * Get all of the span tags that have a style attribute, and the style attribute 613 | # includes the value "color:white" 614 | style_color_white = "color:white" 615 | regex = r"(?:^|;)(?:" + style_color_white + ")(?:$|;)" 616 | spans = body_inner_html.find_all('span', attrs={"style": re.compile(regex, re.M)}) 617 | 618 | num_spans_under_paragraph = 0 619 | num_spans_not_under_paragraph = 0 620 | # Loop for each span tag 621 | for span in spans: 622 | paragraph_ancestor_found = False 623 | # Determine if the span tag is within a paragraph 624 | for span_parent in span.parents: 625 | if span_parent.name == 'p': 626 | paragraph_ancestor_found = True 627 | break 628 | 629 | if paragraph_ancestor_found == True: 630 | num_spans_under_paragraph += 1 631 | else: 632 | num_spans_not_under_paragraph += 1 633 | 634 | # Remove "color:white" from the span tag's style attribute, as required 635 | if ((key_white_text_value == YML_REMOVE_IN_PARAGRAPHS) and paragraph_ancestor_found) or \ 636 | (key_white_text_value == YML_REMOVE_ALL): 637 | style = span['style'] 638 | # If the style-attribute's only value is "color:white", delete the style attribute 639 | if (style == style_color_white): 640 | del span['style'] 641 | else: 642 | regex = r"((?:^|;)(?:" + style_color_white + ")(?:$|;))" 643 | substitution = "" 644 | new_style, substitution_count = re.subn(regex, substitution, style, 0, re.M) 645 | if (substitution_count != 1): 646 | print("") 647 | print("ERROR. 
Editing HTML span tags within a paragraph, having span attribute \"style\" and value \"color:white\".") 648 | print(" For a span-tag, the regular-expression substitution failed in removing \"color:white\".") 649 | print(" span-tag HTML:") 650 | print("") 651 | print(span.decode(formatter='html')) 652 | return 1, num_warning_messages 653 | span['style'] = new_style 654 | 655 | print("INFO. Checking for span tags with attribute \"style\" and value \"color:white\".") 656 | print(" The number of such span-tags: Within an HTML paragraph: %s; Not within an HTML paragraph: %s" % 657 | (num_spans_under_paragraph, num_spans_not_under_paragraph)) 658 | 659 | if ((num_spans_not_under_paragraph + num_spans_under_paragraph) > 0): 660 | num_warning_messages += 1 661 | print("") 662 | print("WARNING. Span tag(s) found, with attribute \"style\" and value \"color:white\".") 663 | print(" INFO messages provide details. Further info is in the WWN docs.") 664 | print("") 665 | 666 | 667 | ''' 668 | ################################ 669 | Generate the document-body's HTML, with the fixes applied 670 | ################################ 671 | ''' 672 | 673 | ''' 674 | Generate the HTML, from BeautifulSoup format, into text format 675 | ''' 676 | # generated_html is a string variable 677 | generated_html = body_inner_html.decode(formatter='html') 678 | 679 | ''' 680 | * If any HTML fixes are within a script tag, the script opening-tag and closing-tag 681 | is removed from the HTML text 682 | ''' 683 | # * When the HTML was edited, if HTML entities were added (e.g.,  ), they were put inside of 684 | # a script tag. This prevented BeautifulSoup from altering the HTML entities. 685 | # * The script opening-tags and closing-tags are removed, using reg-ex substitution. 686 | # * html_entity_encodings was described earlier. 
687 | 688 | # Loop for each entry in html_entity_encodings 689 | for key in html_entity_encodings: 690 | # Test if new HTML was added for this HTML-fix 691 | if html_entity_encodings[key][FIX_KEY_NUMBER_ENCODED] != 0: 692 | # The new-HTML is specified by the entry FIX_KEY_ENCODE. 693 | # * This HTML includes the script opening-tag and closing-tag 694 | regex = html_entity_encodings[key][FIX_KEY_ENCODE] 695 | # * The new-HTML, without the script opening-tag and closing-tag, is specified by the 696 | # entry FIX_KEY_DECODE 697 | substitution = html_entity_encodings[key][FIX_KEY_DECODE] 698 | # Use a reg-ex to replace the new-HTML, and remove the script opening-tag and closing-tag. 699 | generated_html, substitution_count = re.subn(regex, substitution, generated_html, 0) 700 | if substitution_count != html_entity_encodings[key][FIX_KEY_NUMBER_ENCODED]: 701 | print("") 702 | print("ERROR. Decoding the " + html_entity_encodings[key][FIX_KEY_DESCRIPTION] + 703 | ". Number decoded (%s) is not equal to the number encoded (%s)." 704 | % (substitution_count, html_entity_encodings[key][FIX_KEY_NUMBER_ENCODED])) 705 | return 1, num_warning_messages 706 | print("INFO. Decoding the " + html_entity_encodings[key][FIX_KEY_DESCRIPTION]) 707 | 708 | # * generated_html has the document-text's HTML, with the fixes applied. 
709 | # * Later, Jinja will be used to put generated_html in the output HTML 710 | jinja_template_variables['document_text'] = generated_html 711 | 712 | return 0, num_warning_messages 713 | 714 | # END OF: fix_word_html() -------------------------------------------------------------------------------- /createwebpage/input_parameter_file_keys.py: -------------------------------------------------------------------------------- 1 | # Specifies the YAML keys for the input parameter-file 2 | 3 | YML_KEY_REQUIRED = "required" 4 | YML_KEY_VERSION = "version" 5 | YML_KEY_INPUT_HTML_PATH = "input_html_path" 6 | YML_KEY_OUTPUT_DIRECTORY_PATH = "output_directory_path" 7 | YML_KEY_SCRIPTS_DIRECTORY_URL = "scripts_directory_url" 8 | 9 | YML_KEY_HTML_HEAD_SECTION = "html_head_section" 10 | YML_KEY_TITLE = "title" 11 | YML_KEY_DESCRIPTION = "description" 12 | YML_KEY_ADDITIONAL_HTML = "additional_html" 13 | 14 | YML_KEY_HEADER_BAR = "header_bar" 15 | YML_KEY_SECTION = "section" 16 | YML_KEY_CONTENTS = "contents" 17 | 18 | YML_KEY_BREADCRUMBS = "breadcrumbs" 19 | YML_KEY_TEXT = "text" 20 | YML_KEY_URL = "url" 21 | 22 | YML_KEY_HYPERLINK = "hyperlink" 23 | YML_KEY_HTML = "html" 24 | YML_KEY_EMPTY = "empty" 25 | 26 | YML_KEY_CONTENTS_ALIGNMENT = "contents_alignment" 27 | YML_LEFT = "left" 28 | YML_RIGHT = "right" 29 | YML_CENTER = "center" 30 | YML_JUSTIFY = "justify" 31 | 32 | YML_KEY_DOCUMENT_TEXT_TRAILER = "document_text_trailer" 33 | 34 | YML_KEY_WORD_HTML_EDITS = "word_html_edits" 35 | YML_KEY_WHITE_COLORED_TEXT = "white_colored_text" 36 | # The allowable values for the key "white_colored_text" 37 | YML_DO_NOT_REMOVE = "doNotRemove" 38 | YML_REMOVE_IN_PARAGRAPHS = "removeInParagraphs" 39 | YML_REMOVE_ALL = "removeAll" -------------------------------------------------------------------------------- /createwebpage/jinja_template.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 9 | 10 | 17 | 21 | {{word_head_section_contents}} 22 
| 23 | 26 | 27 | 28 | 29 | 33 | {{title_tag}} 34 | 35 | 40 | {{meta_tag_with_description}} 41 | 42 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 57 | 58 | 59 | 62 | 63 | 64 | 68 | {{additional_html}} 69 | 70 | 71 | 72 | 76 | {{bodyOpeningTag}} 77 | 78 | 82 |
83 | 84 | 88 | {{header_bar}} 89 | 90 |
91 | 92 |
93 | 94 |
95 | 96 | 99 | {{table_of_contents}} 100 | 101 |
102 | 103 |
104 | 105 |
106 | 107 | 112 | {{document_text}} 113 | 114 | 119 | {{document_text_trailer}} 120 | 121 |
122 | 123 |
124 | 125 | 126 | 127 | -------------------------------------------------------------------------------- /createwebpage/load_html_files.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ################ 3 | This file contains the function: load_html_files() 4 | 5 | The function is called by: create_web_page_core() in create_web_page.py 6 | 7 | MIT License, Copyright (c) 2021-present Jim Yuill 8 | ################ 9 | ''' 10 | 11 | import re 12 | import sys 13 | import os 14 | # Specifies the YAML keys for the input parameter-file 15 | from input_parameter_file_keys import * 16 | 17 | # These libraries need to have been installed by the user 18 | try: 19 | ''' 20 | HTML-related libraries 21 | ''' 22 | # BeautifulSoup: pip install beautifulsoup4 23 | from bs4 import BeautifulSoup 24 | # Jinja2: pip install jinja2 25 | from jinja2 import Template 26 | except ImportError as e: 27 | print("") 28 | print("ERROR. Could not import a required Python module.") 29 | print(" The installation instructions specify the required modules.") 30 | print(" Import-error description:") 31 | print("") 32 | print(e) 33 | print("") 34 | sys.exit() 35 | 36 | 37 | def load_html_files(loaded_parms): 38 | ''' 39 | Description: 40 | * Open the jinja template-file, and create a jinja-template object 41 | * Open the input Word-HTML-file, and load it into Beautiful Soup objects 42 | * Create and open the output HTML-file 43 | 44 | Return: 45 | * 1, None : Error 46 | * 1, returned_objects_dict : Returns a dictionary with the objects that were created. 47 | Specified at the end of this function. 48 | ################ 49 | ''' 50 | 51 | # Name of WWN's jinja-template file 52 | # * It's in the same directory as the present file. 
53 | JINJA_TEMPLATE_FILE_NAME = "jinja_template.html" 54 | 55 | ################### 56 | # Open the jinja template-file, and use Jinja2 to create a jinja-template object 57 | ################### 58 | 59 | # Get the path to the WWN jinja template-file. 60 | scripts_directory = os.path.dirname(os.path.realpath(__file__)) 61 | jinja_template_file_path = os.path.join(scripts_directory, JINJA_TEMPLATE_FILE_NAME) 62 | 63 | print("INFO. Opening the jinja template-file, and loading it:") 64 | print(" " + jinja_template_file_path) 65 | try: 66 | jinja_template_file_handle = open(jinja_template_file_path) 67 | except IOError as e: 68 | print("") 69 | print("ERROR. Could not open the expected jinja template-file.") 70 | print(" %s - %s." % (e.strerror, e.filename)) 71 | return 1, None 72 | 73 | jinja_template_file_data = jinja_template_file_handle.read() 74 | jinja_template_file_handle.close() 75 | 76 | # Verify the jinja template-file has the expected signature: 77 | # Regex pattern tested and explained: https://regex101.com/r/EZTpUo/2/ 78 | jinja_template_file_signature = "" 79 | regex = r"" 80 | search_result = re.search(regex, jinja_template_file_data, (re.M | re.S)) 81 | if search_result == None: 82 | print("") 83 | print("ERROR. The jinja template-file does not have the expected signature:") 84 | print(" " + jinja_template_file_signature) 85 | return 1, None 86 | 87 | # Load the jinja template-file, as a Jinja-template object 88 | # * This load technique will break if there is template inheritance, but it's not used here. 89 | # * There are techniques that don't break, but they are more complicated. 90 | # * https://stackoverflow.com/questions/38642557/how-to-load-jinja-template-directly-from-filesystem 91 | jinja_template = Template(jinja_template_file_data) 92 | 93 | 94 | #################### 95 | # Open the input Word-HTML-file, and load it into BeautifulSoup objects.
96 | #################### 97 | 98 | input_html_path_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_INPUT_HTML_PATH] 99 | print("INFO. Opening the input Word-HTML-file:") 100 | print(" " + input_html_path_value) 101 | try: 102 | input_html_handle = open(input_html_path_value) 103 | except IOError as e: 104 | print("") 105 | print("ERROR. Could not open the input Word-HTML-file.") 106 | print(" %s - %s." % (e.strerror, e.filename)) 107 | return 1, None 108 | 109 | # * BeautifulSoup may fail to load the HTML due to the HTML using an 110 | # unrecognized encoding. 111 | try: 112 | soup = BeautifulSoup(input_html_handle, 'html.parser') 113 | except Exception as error: 114 | print("") 115 | print("ERROR. Could not load the input Word HTML-file.") 116 | print(" Exception in call to BeautifulSoup:") 117 | print("") 118 | print(str(error)) 119 | return 1, None 120 | finally: 121 | input_html_handle.close() 122 | 123 | 124 | ######################### 125 | # * Verify the Word-HTML has the expected HTML elements 126 | # * Also, get the HTML head and body sections, as BeautifulSoup objects 127 | ########################## 128 | 129 | # Verify that the HTML has exactly one <head> element 130 | heads = soup.find_all('head') 131 | if len(heads) != 1: 132 | print("") 133 | print("ERROR. The input Word-HTML does not have exactly one <head> element.") 134 | return 1, None 135 | head = heads[0] 136 | 137 | # There should be exactly one <body> element 138 | bodys = soup.find_all('body') 139 | if len(bodys) != 1: 140 | print("") 141 | print("ERROR. The input Word-HTML does not have exactly one <body> element.") 142 | return 1, None 143 | body = bodys[0] 144 | 145 | # Check for expected
<div> sections 146 | # * There can be multiple <div>
sections 147 | # * If there are none, it's just reported as an INFO message 148 | divs = soup.find_all('div') 149 | if len(divs) == 0: 150 | print("") 151 | print("INFO. The input Word-HTML does not have <div>
sections. ") 152 | else: 153 | # Check for expected <div class=WordSection1>
section 154 | div = soup.find('div', class_="WordSection1") 155 | if (div == None): 156 | print("INFO. The input Word-HTML does not have the div section: " + 157 | "<div class=WordSection1>
") 158 | 159 | # Check the HTML section for this MS Word signature: 160 | # * 161 | # The signature appears to be used back to at least Word 2007: 162 | # * https://answers.microsoft.com/en-us/msoffice/forum/msoffice_word-mso_winother-msoversion_other/creating-html-with-word-2007/5d344731-d2f3-4568-b504-45256567f782 163 | signature_found = False 164 | meta_found = soup.head.find('meta', attrs={'name': 'Generator'}) 165 | if (meta_found != None) and ('content' in meta_found.attrs): 166 | meta_content = meta_found['content'] 167 | regex = r"^Microsoft Word [0-9]+ \(filtered\)$" 168 | search_result = re.search(regex, meta_content, re.M) 169 | if search_result != None: 170 | signature_found = True 171 | 172 | if signature_found == False: 173 | print("") 174 | print("ERROR. The input Word-HTML does not have the expected MS-Word signature:") 175 | print(" ") 176 | print("") 177 | return 1, None 178 | 179 | 180 | ################# 181 | # Test that the output directory exists 182 | ################# 183 | 184 | output_directory_path_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_OUTPUT_DIRECTORY_PATH] 185 | # * os.path.normpath() will remove any trailing "/". 186 | # This is needed later, e.g., for os.path.dirname 187 | output_directory_path = os.path.normpath(output_directory_path_value) 188 | if not os.path.exists(output_directory_path): 189 | print("") 190 | message = "ERROR. The specified output-directory does not exist: " + output_directory_path_value 191 | print(message) 192 | return 1, None 193 | 194 | 195 | ############### 196 | # Create and open the output HTML-file 197 | ############### 198 | 199 | key_input_html_file_path_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_INPUT_HTML_PATH] 200 | file_name = os.path.basename(key_input_html_file_path_value) 201 | output_html_file_path = os.path.join(output_directory_path_value, file_name) 202 | print("INFO. 
Creating the output HTML-file:") 203 | print(" " + output_html_file_path) 204 | # Test if the output HTML-file already exists in the output directory 205 | if os.path.exists(output_html_file_path): 206 | print("INFO. For the output HTML-file, there is an existing file with the same name. " + 207 | "It will be overwritten:") 208 | print(" " + output_html_file_path) 209 | 210 | try: 211 | # If there is an existing file, it will be overwritten 212 | # * encoding="utf-8" 213 | # * Use from: 214 | # https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters 215 | # * Fixes exception in processing test file: 216 | # * MS-tutorial--Deep Learning for Signal and Information Processing.htm 217 | # * https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Final-DengYu-NOW-Book-DeepLearn2013-ForLecturesJuly2.docx 218 | # * Error when writing final HTML to file: 219 | # * UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 131432: character maps to <undefined> 220 | output_html_file_handle = open(output_html_file_path, "w", encoding="utf-8") 221 | except OSError as e: 222 | print("") 223 | print("ERROR. Could not open the output HTML-file.") 224 | print(" %s - %s." 
% (e.strerror, e.filename)) 225 | return 1, None 226 | 227 | 228 | ################ 229 | # Return the objects that were created 230 | ################ 231 | 232 | returned_objects_dict = { 233 | # jinja_template is a jinja2 object, created by calling jinja2.Template() 234 | "jinja_template" : jinja_template, 235 | # head is a BeautifulSoup object 236 | # * It holds the element from the input HTML-file 237 | "head" : head, 238 | # body is a BeautifulSoup object 239 | # * It holds the element from the input HTML-file 240 | "body" : body, 241 | # The output HTML-file: 242 | "output_html_file_path" : output_html_file_path, 243 | "output_html_file_handle" : output_html_file_handle } 244 | 245 | return 0, returned_objects_dict 246 | 247 | # END of: load_html_files() -------------------------------------------------------------------------------- /createwebpage/load_input_parameter_file.py: -------------------------------------------------------------------------------- 1 | ''' 2 | ################ 3 | This file contains the function: load_input_parameter_file() 4 | 5 | The function is called by: create_web_page_core() in create_web_page.py 6 | 7 | MIT License, Copyright (c) 2021-present Jim Yuill 8 | ################ 9 | ''' 10 | 11 | import sys 12 | import os 13 | # Specifies the YAML keys for the input parameter-file 14 | from input_parameter_file_keys import * 15 | 16 | # These libraries need to have been installed by the user 17 | try: 18 | ''' 19 | YAML-related libraries 20 | ''' 21 | # Cerberus: pip install cerberus 22 | from cerberus import Validator 23 | # pprint++: pip install pprintpp 24 | import pprintpp 25 | # PyYAML: pip install PyYAML 26 | import yaml 27 | # yamllint: pip install yamllint 28 | import yamllint 29 | from yamllint.config import YamlLintConfig 30 | except ImportError as e: 31 | print("") 32 | print("ERROR. 
Could not import a required Python module.") 33 | print(" The installation instructions specify the required modules.") 34 | print(" Import-error description:") 35 | print("") 36 | print(e) 37 | print("") 38 | sys.exit() 39 | 40 | def load_input_parameter_file(parameter_file_path : str): 41 | ''' 42 | Description: 43 | * Loads the WWN input parameter-file, and verifies it 44 | 45 | Operation: 46 | * Verifies the input parameter-file's syntax 47 | * yamllint is used to verify the parameter-file's YAML syntax. 48 | * PyYAML's YAML-loader is used to load the parameter-file into a Python 49 | object, made up of dictionaries and lists. 50 | * Cerberus is used to verify the parameter-file's syntax, using a schema. 51 | * Verifies the WordWebNav version that is specified in the parameter-file 52 | * The syntax verification is further described in the WWN development-documents. 53 | The documents are: 54 | * In the repo, under /docs/development-docs 55 | * On the WWN web-site 56 | 57 | Parameter: parameter_file_path, specifies the input parameter-file's path 58 | 59 | Return 60 | * 1, None : Error 61 | * 0, loaded_parms : loaded_parms is a dictionary with the input parameter-file's 62 | contents 63 | ''' 64 | 65 | ''' 66 | ################## 67 | Constants 68 | ################## 69 | ''' 70 | # Name for the yamllint config-file 71 | YAMLLINT_CONFIG_FILE_NAME = "yamllint_config_file.yml" 72 | 73 | ''' 74 | # The following are Cerberus schema-definitions, for the input parameter-file. 75 | * The parameter-file is in YAML format. 76 | * The parameter-file is validated by calling Cerberus. 77 | * These schema-definitions are used by Cerberus, to validate the parameter-file. 78 | 79 | * These schema-definitions specify the parameter-file's structure, keys, and values. 80 | * The parameter-file is described in the system documentation. 81 | * These schema-definitions specify the parameter-file syntax that is described in the 82 | system documentation. 
83 | * Since Cerberus is used to validate the parameter-file, the WWN code that processes the parameter-file 84 | assumes the parameter-file has valid syntax, e.g., it assumes that required keys are present. 85 | ''' 86 | 87 | # The Cerberus schema-definition 88 | PARAMETER_FILE_SCHEMA = { 89 | 90 | # Parameter-file section. Name: "required": 91 | # Example contents for the section: 92 | # 93 | # required: 94 | # version: "1.0" 95 | # input_html_path: D:\Documents\Professional-projects\My-web-site-development\Word-to-HTML\automation-dev\testing\test-Word-files\test-Word-files\tests-for-create_web_page_py\WordWebNav--Word-HTML\all-primary-Word-features.html 96 | # output_directory_path: D:\Documents\Professional-projects\My-web-site-development\Word-to-HTML\automation-dev\testing\test-Word-files\test-Word-files\tests-for-create_web_page_py\WordWebNav--HTML 97 | # scripts_directory_url: D:\Documents\Professional-projects\My-web-site-development\Word-to-HTML\WordWebNav\word_web_nav\assets 98 | # 99 | YML_KEY_REQUIRED: { 100 | "type": "dict", 101 | "required": True, 102 | "schema": { 103 | YML_KEY_VERSION: { 104 | "type": "string", 105 | "required": True, 106 | "minlength": 1 107 | }, 108 | YML_KEY_INPUT_HTML_PATH: { 109 | "type": "string", 110 | "required": True, 111 | "minlength": 1 112 | }, 113 | YML_KEY_OUTPUT_DIRECTORY_PATH: { 114 | "type": "string", 115 | "required": True, 116 | "minlength": 1 117 | }, 118 | YML_KEY_SCRIPTS_DIRECTORY_URL: { 119 | "type": "string", 120 | "required": True, 121 | "minlength": 1 122 | } 123 | } 124 | }, 125 | 126 | # Parameter-file section. 
Name: "html_head_section": 127 | # Example contents for the section: 128 | # 129 | # html_head_section: 130 | # title: Sys-Admin How-To Info 131 | # description: Solutions for my various sys-admin tasks 132 | # additional_html: 133 | # 134 | YML_KEY_HTML_HEAD_SECTION: { 135 | "type": "dict", 136 | "required": False, 137 | "schema": { 138 | YML_KEY_TITLE: { 139 | "type": "string", 140 | "required": False, 141 | "minlength": 1 142 | }, 143 | YML_KEY_DESCRIPTION: { 144 | "type": "string", 145 | "required": False, 146 | "minlength": 1 147 | }, 148 | YML_KEY_ADDITIONAL_HTML: { 149 | "type": "string", 150 | "required": False, 151 | "minlength": 1 152 | } 153 | } 154 | }, 155 | 156 | # Parameter-file section. Name: "header_bar": 157 | # Example contents for the section: 158 | # 159 | # header_bar: 160 | # # One or more sections 161 | # - section: 162 | # contents: 163 | # # Exactly one of the following keys: 164 | # breadcrumbs: 165 | # # One or more hyperlinks 166 | # - hyperlink: 167 | # text: Home 168 | # url: http://jimyuill.com 169 | # text: 170 | # html: 171 | # empty: 172 | # contents_alignment: 173 | # 174 | YML_KEY_HEADER_BAR: { 175 | "type": "list", 176 | "required": False, 177 | "schema": { 178 | "type": "dict", 179 | "schema": { 180 | YML_KEY_SECTION: { 181 | "type": "dict", 182 | "schema": { 183 | YML_KEY_CONTENTS: { 184 | "type": "dict", 185 | "required": True, 186 | # Only one entry allowed in dict 187 | "maxlength" : 1, 188 | "schema" : { 189 | YML_KEY_BREADCRUMBS: { 190 | # * This entry is a list, and 191 | # each list-member specifies a hyperlink 192 | "type": "list", 193 | "required": False, 194 | "schema": { 195 | "type": "dict", 196 | "schema": { 197 | YML_KEY_HYPERLINK : { 198 | "type": "dict", 199 | "schema": { 200 | YML_KEY_TEXT: { 201 | "type": "string", 202 | "required": True, 203 | "minlength": 1 204 | }, 205 | YML_KEY_URL: { 206 | "type": "string", 207 | "required": True, 208 | "minlength": 1 209 | } 210 | } 211 | } 212 | } 213 | } 214 | }, 215 
| 216 | YML_KEY_HYPERLINK : { 217 | "type": "dict", 218 | "required": False, 219 | "schema": { 220 | YML_KEY_TEXT: { 221 | "type": "string", 222 | "required": True, 223 | "minlength": 1 224 | }, 225 | YML_KEY_URL: { 226 | "type": "string", 227 | "required": True, 228 | "minlength": 1 229 | } 230 | } 231 | }, 232 | 233 | YML_KEY_TEXT : { 234 | "type": "string", 235 | "required": False, 236 | "minlength": 1 237 | }, 238 | 239 | YML_KEY_HTML : { 240 | "type": "string", 241 | "required": False, 242 | "minlength": 1 243 | }, 244 | 245 | YML_KEY_EMPTY : { 246 | "type": "string", 247 | "required": False, 248 | "maxlength" : 0, # Ensures no value is specified 249 | "nullable": True # Allows key to have no value 250 | } 251 | } 252 | }, 253 | 254 | YML_KEY_CONTENTS_ALIGNMENT: { 255 | "type": "string", 256 | "required": False, 257 | "allowed": ["left", "right", "center", "justify"] 258 | } 259 | } 260 | } 261 | } 262 | } 263 | }, 264 | 265 | # Parameter-file section. Name: "document_text_trailer": 266 | # Example contents for the section: 267 | # 268 | # document_text_trailer: | 269 | #
270 | # 273 | # 274 | YML_KEY_DOCUMENT_TEXT_TRAILER: { 275 | "type": "string", 276 | "required": False, 277 | "minlength" : 1 278 | }, 279 | 280 | # Parameter-file section. Name: "word_html_edits": 281 | # Example contents for the section: 282 | # 283 | # word_html_edits: 284 | # white_colored_text: removeAll 285 | # 286 | YML_KEY_WORD_HTML_EDITS: { 287 | "type": "dict", 288 | "required": False, 289 | "schema": { 290 | YML_KEY_WHITE_COLORED_TEXT: { 291 | "type": "string", 292 | "required": False, 293 | "allowed": [YML_DO_NOT_REMOVE, YML_REMOVE_IN_PARAGRAPHS, YML_REMOVE_ALL] 294 | } 295 | } 296 | } 297 | } 298 | # END of: PARAMETER_FILE_SCHEMA = { 299 | 300 | 301 | ''' 302 | ################## 303 | Function body 304 | ################## 305 | ''' 306 | 307 | # Open the input parameter-file 308 | print("INFO. Processing the parameter-file:") 309 | print(" " + parameter_file_path) 310 | try: 311 | parameter_file_handle = open(parameter_file_path) 312 | except IOError as e: 313 | print("") 314 | print("ERROR. Could not open the parameter-file.") 315 | print(" %s - %s." % (e.strerror, e.filename)) 316 | print("") 317 | return 1, None 318 | 319 | ''' 320 | ############ 321 | # yamllint is used to verify the parameter-file's YAML syntax. 322 | ############ 323 | ''' 324 | print("INFO. Verifying the parameter-file's YAML syntax. (Calling yamllint.)") 325 | # Call yamllint 326 | # * A yamllint config-file is used, which is distributed with WordWebNav. 327 | # * The config-file's name is defined above in: YAMLLINT_CONFIG_FILE_NAME 328 | program_directory = os.path.dirname(os.path.realpath(__file__)) 329 | yamllint_config_file_path = os.path.join(program_directory, YAMLLINT_CONFIG_FILE_NAME) 330 | print("INFO. 
Opening the config-file for yamllint:") 331 | print(" " + yamllint_config_file_path) 332 | try: 333 | # YamlLintConfig(): https://github.com/adrienverge/yamllint/blob/master/yamllint/config.py 334 | yamllint_configuration = YamlLintConfig(file=yamllint_config_file_path) 335 | except IOError as e: 336 | print("") 337 | print("ERROR. Could not open the config-file for yamllint.") 338 | print(" %s - %s." % (e.strerror, e.filename)) 339 | print("") 340 | parameter_file_handle.close() 341 | return 1, None 342 | 343 | # If yamllint finds errors, write them to the console and exit. 344 | # * The linter's error output: https://yamllint.readthedocs.io/en/stable/development.html 345 | 346 | # The linter returns a generator, for the errors found. 347 | # * Convert the generator to a list. 348 | yaml_error_list = list(yamllint.linter.run(parameter_file_handle, yamllint_configuration)) 349 | if len(yaml_error_list) > 0: 350 | print("") 351 | print("ERROR. yamllint found syntax errors in the parameter-file.") 352 | print(" A reported error may be caused by a problem earlier in the file.") 353 | for message in yaml_error_list: 354 | if message.rule != None: 355 | yamllint_rule = str(message.rule) 356 | else: 357 | yamllint_rule = "[not specified]" 358 | if message.line != None: 359 | yaml_error_line = str(message.line) 360 | else: 361 | yaml_error_line = "[not specified]" 362 | if message.desc != None: 363 | yaml_error_description = message.desc 364 | else: 365 | yaml_error_description = "[not specified]" 366 | print("") 367 | print("ERROR. Error on line: " + yaml_error_line) 368 | print(" yamllint rule-type: " + yamllint_rule) 369 | print(" yamllint error-description:") 370 | print(yaml_error_description) 371 | 372 | print("") 373 | print("INFO. 
yamllint documentation: https://yamllint.readthedocs.io/en/stable/") 374 | print(" yamllint rule-types: https://yamllint.readthedocs.io/en/stable/rules.html") 375 | parameter_file_handle.close() 376 | return 1, None 377 | 378 | ''' 379 | ################ 380 | # PyYAML's YAML-loader is used to load the parameter-file 381 | ################ 382 | ''' 383 | # Read in the parameter-file, for use by the YAML-loader. 384 | # * The file-pointer is first reset to the beginning 385 | parameter_file_handle.seek(0) 386 | parameter_file_text = parameter_file_handle.read() 387 | parameter_file_handle.close() 388 | 389 | print("INFO. Loading the parameter-file, using PyYAML's YAML-loader.") 390 | try: 391 | # Call the YAML-loader 392 | # * yaml.load's exceptions: https://pyyaml.org/wiki/PyYAMLDocumentation 393 | loaded_parms = yaml.load(parameter_file_text, Loader=yaml.FullLoader) 394 | except yaml.YAMLError as e: 395 | print("") 396 | print("ERROR. The YAML-loader was not able to load the parameter-file.") 397 | if hasattr(e, 'problem_mark'): 398 | mark = e.problem_mark 399 | print(" Error in the parameter-file on, or near, line: " + str(mark.line+1)) 400 | print(" Error-message from the YAML-loader:") 401 | print(e) 402 | print("") 403 | print("INFO. PyYAML documentation: https://pyyaml.org/wiki/PyYAMLDocumentation") 404 | return 1, None 405 | 406 | # This can happen if the input file just has a line "---" 407 | if (loaded_parms == None): 408 | print("") 409 | print("ERROR. The YAML-loader did not load anything.") 410 | print(" The parameter-file appears to be in error, e.g., has no keys.") 411 | return 1, None 412 | 413 | ''' 414 | ################### 415 | # Cerberus is used to verify the parameter-file's syntax, using a schema. 416 | # * Schemas are defined (above) for the parameter-file's structure, keys and values. 417 | ################### 418 | ''' 419 | print("INFO. 
Using Cerberus to verify the parameter-file's syntax.") 420 | # Create an instance of the Cerberus Validator 421 | cerberus_validator = Validator() 422 | # Validate the parameter-file, using the schema 423 | # * By default, Cerberus will flag keys that are not defined in the schema. 424 | # * Cerberus can crash with some invalid inputs (e.g., loaded_parms == None), so use try/except. 425 | try: 426 | validation_result = cerberus_validator.validate(loaded_parms, PARAMETER_FILE_SCHEMA) 427 | except Exception: 428 | print("") 429 | print("ERROR. An exception was raised in Cerberus.") 430 | print(" The parameter-file is likely to be in error.") 431 | return 1, None 432 | 433 | # Check if errors were found 434 | if not validation_result: 435 | print("") 436 | print("ERROR. An error was found in the parameter-file.") 437 | print(" The error message is below. It is from Cerberus.") 438 | print(" Cerberus's error messages can be difficult to read.") 439 | print(" * The error-message typically includes:") 440 | print(" * Specification of the relevant key-name(s), e.g., {'web_page_files': [{'output_directory_path': ...") 441 | print(" * Followed by an error description, e.g., ['null value not allowed']") 442 | print(" * The docs have more info on the Cerberus error messages.") 443 | print("") 444 | # The Cerberus error-message is formatted using pprint++, a pretty-printer app. 445 | # * pprint++ docs: https://github.com/wolever/pprintpp 446 | pretty_printer = pprintpp.PrettyPrinter() 447 | pretty_printer.pprint(cerberus_validator.errors) 448 | return 1, None 449 | 450 | ''' 451 | ############## 452 | # Verify the WordWebNav version that is specified in the parameter-file 453 | ############## 454 | ''' 455 | key_version_value = loaded_parms[YML_KEY_REQUIRED][YML_KEY_VERSION] 456 | if key_version_value != "1.0": 457 | print("") 458 | print( "ERROR. Error in the parameter-file. 
In section \"" + YML_KEY_REQUIRED + \ 459 | "\", the key \"" + YML_KEY_VERSION + "\" has an incorrect value: " + key_version_value ) 460 | return 1, None 461 | 462 | 463 | # Return loaded_parms 464 | return 0, loaded_parms 465 | 466 | # END OF: load_input_parameter_file() -------------------------------------------------------------------------------- /createwebpage/yamllint_config_file.yml: -------------------------------------------------------------------------------- 1 | --- 2 | # DESCRIPTION: 3 | # * A yamllint config-file, used by create_web_page.py. 4 | # 5 | # * When create_web_page is called, it is passed a parameter-file that is in YAML format. 6 | # * create_web_page uses yamllint to verify that parameter-file's YAML syntax. 7 | # 8 | # * yamllint itself uses a config-file, and 9 | # it specifies the YAML syntax-checking to be performed. 10 | # * The present file is the yamllint config-file that is used by create_web_page, 11 | # when it calls yamllint. 12 | # 13 | # * yamllint documentation: 14 | # * Configuration: https://yamllint.readthedocs.io/en/stable/configuration.html 15 | # * Rules: https://yamllint.readthedocs.io/en/stable/rules.html 16 | # 17 | # MIT License, Copyright (c) 2021-present Jim Yuill 18 | # 19 | extends: default 20 | 21 | rules: 22 | line-length: disable 23 | new-line-at-end-of-file: disable 24 | trailing-spaces: disable 25 | empty-lines: disable 26 | -------------------------------------------------------------------------------- /docs/development-docs/WWN--input-parameter-file--YAML-use.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/WWN--input-parameter-file--YAML-use.docx -------------------------------------------------------------------------------- /docs/development-docs/WWN--testing--regression-tests.docx: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/WWN--testing--regression-tests.docx -------------------------------------------------------------------------------- /docs/development-docs/WWN--testing.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/WWN--testing.docx -------------------------------------------------------------------------------- /docs/development-docs/index.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/index.docx -------------------------------------------------------------------------------- /docs/development-docs/web-page--construction--beautiful-soup--jinja.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/web-page--construction--beautiful-soup--jinja.docx -------------------------------------------------------------------------------- /docs/development-docs/web-page--html-tutorials.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/web-page--html-tutorials.docx -------------------------------------------------------------------------------- /docs/development-docs/web-page--structure--css--jquery.docx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/web-page--structure--css--jquery.docx -------------------------------------------------------------------------------- /docs/development-docs/web-page--structure--css-tutorials.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/web-page--structure--css-tutorials.docx -------------------------------------------------------------------------------- /docs/development-docs/web-page--structure--design.vsd: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/web-page--structure--design.vsd -------------------------------------------------------------------------------- /docs/development-docs/word-html--bugs-and-fixes.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/word-html--bugs-and-fixes.docx -------------------------------------------------------------------------------- /docs/development-docs/word-html--references-and-info.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/word-html--references-and-info.docx -------------------------------------------------------------------------------- /docs/development-docs/word-html--table-of-contents.docx: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/development-docs/word-html--table-of-contents.docx -------------------------------------------------------------------------------- /docs/index.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/index.docx -------------------------------------------------------------------------------- /docs/installation.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/installation.docx -------------------------------------------------------------------------------- /docs/users-guide.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/docs/users-guide.docx -------------------------------------------------------------------------------- /readme-figure-1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/readme-figure-1.png -------------------------------------------------------------------------------- /readme-figure-2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/readme-figure-2.png -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # * List of required Python packages. 2 | # * The system-installation doc provides more info. 
3 | beautifulsoup4 4 | cerberus 5 | jinja2 6 | pprintpp 7 | PyYAML 8 | yamllint 9 | -------------------------------------------------------------------------------- /templates/web_page_create--parameters--all.yml: -------------------------------------------------------------------------------- 1 | # DESCRIPTION: * Template for the parameter-file used by create_web_page.py, 2 | # * All supported keys are described, and examples are given for many of them. 3 | # 4 | # * The section "required:" is required. The other sections are optional, and their keys. 5 | # 6 | # * The parameter-file is in YAML format, and must conform to standard YAML syntax rules, e.g., 7 | # * Keys must be indented relative to their position in the hierarchy, e.g., 2 spaces. 8 | # * If a value contains a ":", the value must be in double-quotes. 9 | # 10 | # * YAML references 11 | # * YAML Syntax: https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html 12 | # * YAML Tutorial: https://www.tutorialspoint.com/yaml/index.htm 13 | # * How to specify multi-line values, e.g., when specifying HTML 14 | # * https://stackoverflow.com/questions/3790454/how-do-i-break-a-string-in-yaml-over-multiple-lines 15 | # 16 | # * The WordWebNav(WWN) Users' Guide has more info on this parameter file. 17 | # 18 | # MIT License, Copyright (c) 2021-present Jim Yuill 19 | # 20 | 21 | # Specifies start of YAML data: 22 | --- 23 | # Required parameters 24 | required: 25 | # Parameter-file version 26 | version: "1.0" 27 | # Full path for the input Word HTML-file. (Replace the example path.) 28 | input_html_path: D:\word-web-nav\tests\tests-for-create_web_page_py\WordWebNav--Word-HTML\demo.html 29 | # Full path for the directory for the output WordWebNav HTML-file. (Replace the example path.) 30 | output_directory_path: D:\jimyuill-com\deploy\software\www\WordWebNav 31 | # URL for the directory containing WordWebNav's CSS file and Javascript file (Replace the example URL.) 
32 | scripts_directory_url: /assets/WordWebNav 33 | 34 | 35 | # Specifies contents for the HTML head-section (...), in the WWN web-page. (Optional) 36 | html_head_section: 37 | # The web-page's HTML title 38 | title: WordWebNav Demo Page 39 | # The web-page's HTML description 40 | # * YAML syntax-rules require that the value be in quotes because the string contains a ":" 41 | description: "WordWebNav web-page, created from: demo.docx" 42 | # User-written HTML, to be included in the HTML head-section, just before the closing-tag "". 43 | # Example value: a link to a web-page icon (favicon), e.g., the icon shown in the browser's tab 44 | additional_html: 45 | 46 | 47 | # Specifies contents for the output web-page's header-bar (optional) 48 | # * The header-bar is at the top of the web page. 49 | # * (It's different than the HTML head-section.) 50 | # * The header-bar's layout is divided into sections, each of which is specified by a key "- section:". 51 | # * This example has two sections. 52 | # * The sections are adjacent and of equal width. 53 | # * Each section is intended to have a single line of text. 54 | header_bar: 55 | # First section 56 | - section: 57 | # * Contents for the section 58 | contents: 59 | # * Five types of contents are supported: breadcrumbs, hyperlink, html, text, empty 60 | # * The contents-type is specified as a key under "contents:" 61 | # * The following is an example of navigation breadcrumbs, displayed as: 62 | # Home / WordWebNav 63 | breadcrumbs: 64 | - hyperlink: 65 | text: Home 66 | url: / 67 | - hyperlink: 68 | text: WordWebNav 69 | url: /software/www/WordWebNav 70 | # * Alignment of the contents within the section 71 | # * Permissible values are: left, right, center, justify 72 | # * The values' effects are those defined for HTML table-cells (). 
73 | # * The default value is: left 74 | contents_alignment: left 75 | # Second section 76 | - section: 77 | contents: 78 | # Hyperlink, e.g., a link to the Comments section at the end of the document text 79 | hyperlink: 80 | text: Comments 81 | url: "#word_web_nav_document_text_trailer" 82 | contents_alignment: right 83 | 84 | # Specifies HTML to be added just after the document-text (Optional) 85 | # * This feature is primarily intended for adding a comments section to the document, e.g., using Commento. 86 | # * The header-bar can have a link to this added HTML, as shown in the example above. 87 | # * The URL for the link should be "#word_web_nav_document_text_trailer" 88 | # * The example value here is HTML for comments supported by Commento. 89 | # * The YAML syntax for multi-line values is used. 90 | document_text_trailer: | 91 |
92 | 95 | 96 | # Specifies how the Word HTML is to be edited (Optional) 97 | word_html_edits: 98 | # Specifies how the style "color:white" is to be removed from the Word HTML (optional). 99 | # * Permissible values are: doNotRemove, removeInParagraphs, removeAll 100 | # * The default value is doNotRemove 101 | white_colored_text: removeAll 102 | -------------------------------------------------------------------------------- /templates/web_page_create--parameters--minimum.yml: -------------------------------------------------------------------------------- 1 | # DESCRIPTION: Template for the parameter-file used by create_web_page.py, 2 | # showing the minimum set of parameters and example values. 3 | # 4 | # * The parameter-file is in YAML format, and must conform to standard YAML syntax rules. 5 | # 6 | # * The system documentation has additional info on its: installation, use, design 7 | # and implementation (code). 8 | # 9 | # * YAML references 10 | # * YAML Syntax: https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html 11 | # * YAML Tutorial: https://www.tutorialspoint.com/yaml/index.htm 12 | # * How to specify multi-line values, e.g., when specifying HTML 13 | # * https://stackoverflow.com/questions/3790454/how-do-i-break-a-string-in-yaml-over-multiple-lines 14 | # 15 | # MIT License, Copyright (c) 2021-present Jim Yuill 16 | # 17 | 18 | # Specifies start of YAML data: 19 | --- 20 | # Required parameters 21 | required: 22 | # Parameter-file version 23 | version: "1.0" 24 | # Full path for the input Word HTML-file. (Replace the example path.) 25 | input_html_path: D:\quick-start\WordWebNav--Word-HTML\demo.html 26 | # Full path for the directory for the output WordWebNav HTML-file. (Replace the example path.) 27 | output_directory_path: D:\quick-start\WordWebNav--HTML 28 | # URL for the directory containing WordWebNav's CSS file and Javascript file. (Replace the example URL.) 
29 | scripts_directory_url: D:\word-web-nav\assets 30 | -------------------------------------------------------------------------------- /tests/tests-for-create_web_page_py/demo.docx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/tests/tests-for-create_web_page_py/demo.docx -------------------------------------------------------------------------------- /tests/tests-for-create_web_page_py/included-photo.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/tests/tests-for-create_web_page_py/included-photo.jpg -------------------------------------------------------------------------------- /tools/batch_create_web_page.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 3 | Description: 4 | * This program is used to create WWN web-pages for all of the Word HTML files in a directory 5 | * This program was created for running regression-tests 6 | * See: docs\development-docs\WWN--testing--regression-tests.docx 7 | 8 | * For each .htm* file at the specified directory: 9 | * Create a WWN input parameter-file (.yml), for the .htm* file 10 | * To create the file, the following are used: 11 | * The input config-file, the path to the input .htm* file, and Jinja2 12 | * Call create_web_page.py, and provide the path to the WWN input parameter-file 13 | 14 | Inputs: 15 | * Command-line input: Path to the directory containing the Word HTML-files. 16 | * Input config-file: batch_create_web_page.yml, in the input directory. 17 | * The config-file is provided by the caller 18 | * It is a Jinja template, used to create the .yml files 19 | * An example config-file is provided in the repo. (The file-paths will need to be changed.) 20 | * The .html and .htm files in the input directory. 
21 | 22 | ''' 23 | 24 | 25 | import argparse 26 | import sys 27 | import os 28 | 29 | from os import listdir 30 | from os.path import isfile, join 31 | 32 | from importlib import reload 33 | 34 | from jinja2 import Template 35 | 36 | sys.path.append(r'..\createwebpage') 37 | import create_web_page 38 | 39 | CONFIG_FILE_NAME_ROOT = "batch_create_web_page" 40 | 41 | ''' 42 | Argparse is used to process the command-line argument 43 | * Argparse docs: https://docs.python.org/3/library/argparse.html 44 | * The argparse code here is from: https://stackoverflow.com/questions/14360389/getting-file-path-from-command-line-argument-in-python/47324233 45 | ''' 46 | # Creates and returns the ArgumentParser object 47 | def create_arg_parser(): 48 | parser = argparse.ArgumentParser(description= 49 | 'Calls create_web_page.py, for the Word HTML-files at the specified directory.') 50 | parser.add_argument('input_dir_path', metavar="", 51 | help='Path to the directory containing the Word HTML-files.') 52 | return parser 53 | 54 | ''' 55 | ######### 56 | Main 57 | ######### 58 | ''' 59 | if __name__ == "__main__": 60 | # Get the directory path from the command-line 61 | # * argparser also verifies the command-line syntax 62 | arg_parser = create_arg_parser() 63 | parsed_args = arg_parser.parse_args() 64 | input_dir_path = parsed_args.input_dir_path 65 | 66 | ''' 67 | * A config-file is expected, in the input directory. 68 | * Open the config-file, and use it as a Jinja2 template 69 | ''' 70 | 71 | # Verify the expected config-file exists 72 | config_file_name = CONFIG_FILE_NAME_ROOT + ".yml" 73 | print("INFO. Opening the config-file, and loading it as a Jinja2 template: " + config_file_name) 74 | config_file_path = os.path.join(input_dir_path, config_file_name) 75 | if not (os.path.exists(config_file_path)): 76 | print("") 77 | print("ERROR. 
The input directory must have the config file: " + config_file_path) 78 | print("") 79 | sys.exit() 80 | 81 | # Verify no HTML files have the root-name specified in CONFIG_FILE_NAME_ROOT 82 | disallowed_file_path1 = os.path.join(input_dir_path, (CONFIG_FILE_NAME_ROOT + ".htm")) 83 | disallowed_file_path2 = os.path.join(input_dir_path, (CONFIG_FILE_NAME_ROOT + ".html")) 84 | if os.path.exists(disallowed_file_path1) or os.path.exists(disallowed_file_path2): 85 | print("") 86 | print("ERROR. The input directory cannot have HTML files with these names:") 87 | print(" " + CONFIG_FILE_NAME_ROOT + ".htm or " + CONFIG_FILE_NAME_ROOT +".html") 88 | print("") 89 | sys.exit() 90 | 91 | # Open the expected config-file 92 | try: 93 | config_file_handle = open(config_file_path) 94 | except IOError as e: 95 | print("") 96 | print("ERROR. Could not open the config-file:") 97 | print("%s - %s." % (e.strerror, e.filename)) 98 | print("") 99 | sys.exit() 100 | config_file_data = config_file_handle.read() 101 | config_file_handle.close() 102 | config_file_template = Template(config_file_data) 103 | 104 | 105 | # For the input directory, create a list of files with extension .htm or .html 106 | all_files = [item for item in listdir(input_dir_path) if isfile(join(input_dir_path, item))] 107 | word_html_files = [] 108 | word_html_file_roots = [] 109 | for file_name in all_files: 110 | file_name_root, file_name_extension = os.path.splitext(file_name) 111 | if (file_name_extension.lower() == ".html") or (file_name_extension.lower() == ".htm"): 112 | word_html_files.append(file_name) 113 | word_html_file_roots.append(file_name_root) 114 | 115 | # * Test if there's two files whose names have the same root-name, and the extensions .htm and .html, 116 | # e.g., "foo.htm" and "foo.html". 117 | # * This isn't allowed because a config-file is created for each HTML file, for calling create_web_page. 118 | # The config-file's name is derived from the HTML-file's root-name: .yml. 
119 | # So, two HTML files cannot have the same root-name. 120 | # * set() converts the list to a set, and thus removes duplicates 121 | if len(word_html_file_roots) != len(set(word_html_file_roots)): 122 | print("") 123 | print("ERROR. The input directory has two files with the same root-name and with " + 124 | " extensions \".htm\" and \".html\"") 125 | print("") 126 | sys.exit() 127 | 128 | # Process each input Word HTML-file 129 | file_count = 0 130 | failed_file_count = 0 131 | failed_file_names = "" 132 | total_num_warning_messages = 0 133 | for file_name in word_html_files: 134 | file_count += 1 135 | print("\nINFO. Processing the Word HTML-file: " + file_name) 136 | 137 | ''' 138 | For this Word HTML-file, create the WWN input parameter-file that will be used by create_web_page.py, 139 | ''' 140 | 141 | # Create the file-name for the WWN input parameter-file 142 | file_name_root, file_name_extension = os.path.splitext(file_name) 143 | wwn_input_parameter_file_name = file_name_root + ".yml" 144 | wwn_input_parameter_file_path = join(input_dir_path, wwn_input_parameter_file_name) 145 | 146 | # Use Jinja to generate the data for the WWN input parameter-file 147 | file_path = join(input_dir_path, file_name) 148 | wwn_input_parameter_file_data = config_file_template.render({'inputHtmlPath':file_path}) 149 | 150 | # Write the WWN input parameter-file to disk 151 | print("INFO. 
Creating the WWN input parameter-file used by create_web_page.py: " + \ 152 | wwn_input_parameter_file_name) 153 | wwn_input_parameter_file_handle = open(wwn_input_parameter_file_path, 'w') 154 | wwn_input_parameter_file_handle.write(wwn_input_parameter_file_data) 155 | wwn_input_parameter_file_handle.close() 156 | 157 | ''' 158 | Call create_web_page.create_web_page(), passing it the path to the WWN input parameter-file 159 | ''' 160 | 161 | # Reload needed to refresh the global variables 162 | reload(create_web_page) 163 | return_value, num_warning_messages = create_web_page.create_web_page(wwn_input_parameter_file_path) 164 | if return_value == 1: 165 | failed_file_count += 1 166 | failed_file_names += file_name + ", " 167 | total_num_warning_messages += num_warning_messages 168 | 169 | print("\nBatch processing completed.") 170 | print("Files processed: " + str(file_count)) 171 | print("Number of warning messages: " + str(total_num_warning_messages)) 172 | print("Files processed unsuccessfully: Count:" + str(failed_file_count) + \ 173 | " File-names: " + failed_file_names) 174 | print("") -------------------------------------------------------------------------------- /tools/batch_create_web_page.yml: -------------------------------------------------------------------------------- 1 | # This file is used by batch_create_web_page.py 2 | # This is an example config-file. The file paths will need to be changed. 3 | # 4 | # Specifies start of YAML data: 5 | --- 6 | # Required parameters 7 | required: 8 | # Parameter-file version 9 | version: "1.0" 10 | # Full path for the input Word HTML-file. (Replace the example path.) 11 | # 12 | input_html_path: {{inputHtmlPath}} 13 | # Full path for the directory for the output WordWebNav HTML-file. (Replace the example path.) 
14 | output_directory_path: D:\Documents\Professional-projects\Web-site-development\Word-to-HTML\automation-dev--and--WWN-testing\testing\test-Word-files\Other-Internet-Word-files\WordWebNav--HTML 15 | # URL for the directory containing WordWebNav's CSS file and Javascript file. (Replace the example URL.) 16 | scripts_directory_url: D:\Documents\Professional-projects\Web-site-development\word-web-nav\word-web-nav_repo\assets 17 | 18 | # Specifies how the Word HTML is to be edited (Optional) 19 | word_html_edits: 20 | # Specifies how the style "color:white" is to be removed from the Word HTML (optional). 21 | # * Permissible values are: doNotRemove, removeInParagraphs, removeAll 22 | # * The default value is doNotRemove 23 | white_colored_text: removeAll 24 | -------------------------------------------------------------------------------- /tools/create_web_page_for_all_yml_files.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 3 | Description: 4 | * Calls create_web_page.py, for the all of the *.yml files at the specified directory. 
5 | * The *.yml files must be WWN input parameter-files 6 | * This program was created for running regression-tests 7 | * See: docs\development-docs\WWN--testing--regression-tests.docx 8 | 9 | Command-line input: the full path to a directory with WWN input parameter-files 10 | 11 | ''' 12 | 13 | import argparse 14 | import sys 15 | import os 16 | 17 | from os import listdir 18 | from os.path import isfile, join 19 | 20 | sys.path.append(r'..\createwebpage') 21 | import create_web_page 22 | 23 | ''' 24 | Argparse is used to process the command-line argument 25 | * Argparse docs: https://docs.python.org/3/library/argparse.html 26 | * The argparse code here is from: https://stackoverflow.com/questions/14360389/getting-file-path-from-command-line-argument-in-python/47324233 27 | ''' 28 | # Creates and returns the ArgumentParser object 29 | def create_arg_parser(): 30 | parser = argparse.ArgumentParser(description= 31 | 'Calls create_web_page.py, for all of the *.yml files at the specified directory.') 32 | parser.add_argument('input_dir_path', metavar="", 33 | help='Path to the directory containing the *.yml files.') 34 | return parser 35 | 36 | MAX_FILES_TO_PROCESS = sys.maxsize 37 | #MAX_FILES_TO_PROCESS = 2 38 | 39 | ''' 40 | ######### 41 | Main 42 | ######### 43 | ''' 44 | if __name__ == "__main__": 45 | # Get the directory path from the command-line 46 | # * argparse also verifies the command-line syntax 47 | arg_parser = create_arg_parser() 48 | parsed_args = arg_parser.parse_args() 49 | input_dir_path = parsed_args.input_dir_path 50 | 51 | # For the input directory, create a list of files with extension .yml 52 | all_files = [item for item in listdir(input_dir_path) if isfile(join(input_dir_path, item))] 53 | yml_files = [] 54 | for file_name in all_files: 55 | file_name_root, file_name_extension = os.path.splitext(file_name) 56 | if (file_name_extension.lower() == ".yml"): 57 | yml_files.append(file_name) 58 | 59 | # Process each input .yml file 60 | 
file_count = 0 61 | failed_file_count = 0 62 | failed_file_names = "" 63 | files_with_warning_messages = 0 64 | total_num_warning_messages = 0 65 | for file_name in yml_files: 66 | file_count += 1 67 | print("\nINFO. Processing the .yml file: " + file_name) 68 | 69 | # Create the file-path for the yml-file 70 | yml_file_path = join(input_dir_path, file_name) 71 | 72 | ''' 73 | Call create_web_page.create_web_page(), passing it the path to the yml-file 74 | ''' 75 | return_value, num_warning_messages = create_web_page.create_web_page(yml_file_path) 76 | if return_value == 1: 77 | failed_file_count += 1 78 | failed_file_names += file_name + ", " 79 | if (num_warning_messages > 0): 80 | files_with_warning_messages += 1 81 | total_num_warning_messages += num_warning_messages 82 | 83 | if (file_count == MAX_FILES_TO_PROCESS): 84 | break 85 | 86 | print("\nBatch processing completed.") 87 | print("Files processed: " + str(file_count)) 88 | print("Files processed successfully: " + str(file_count - failed_file_count)) 89 | if (failed_file_count != 0): 90 | failed_files_string = " File-names: " + failed_file_names 91 | else: 92 | failed_files_string = "" 93 | print("Files processed unsuccessfully: Count:" + str(failed_file_count) + \ 94 | failed_files_string) 95 | print("Number of files with warning messages: " + str(files_with_warning_messages)) 96 | print("Number of warning messages: " + str(total_num_warning_messages)) 97 | print("") -------------------------------------------------------------------------------- /tools/generate_word_html.docm: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/jimyuill/word-web-nav/ea95972eeea4331b88a03badf2493219fee374d9/tools/generate_word_html.docm --------------------------------------------------------------------------------
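Both batch tools in tools/ follow the same pattern: select files by extension, call create_web_page.create_web_page() on each one, and tally failures and warnings. The sketch below restates that pattern using only the standard library. The names list_files_with_extension, run_batch and the process callable are hypothetical and are not part of the repo. process stands in for create_web_page.create_web_page, and is assumed to return (return_value, num_warning_messages), with return_value == 1 on failure, as the tools above expect.

```python
import os
from typing import Callable, Dict, List, Tuple


def list_files_with_extension(dir_path: str, extensions: Tuple[str, ...]) -> List[str]:
    """Return the sorted names of regular files in dir_path whose
    extension, compared case-insensitively, is in extensions."""
    selected = []
    for name in os.listdir(dir_path):
        _, extension = os.path.splitext(name)
        is_file = os.path.isfile(os.path.join(dir_path, name))
        if is_file and extension.lower() in extensions:
            selected.append(name)
    return sorted(selected)


def run_batch(dir_path: str,
              file_names: List[str],
              process: Callable[[str], Tuple[int, int]]) -> Dict[str, object]:
    """Call process(path) for each file and collect a summary.

    process is a hypothetical stand-in for create_web_page.create_web_page.
    """
    failed_files = []
    files_with_warnings = 0
    total_warnings = 0
    for name in file_names:
        return_value, num_warnings = process(os.path.join(dir_path, name))
        if return_value == 1:
            failed_files.append(name)
        if num_warnings > 0:
            files_with_warnings += 1
        total_warnings += num_warnings
    return {
        "file_count": len(file_names),
        "failed_files": failed_files,
        "files_with_warnings": files_with_warnings,
        "total_warnings": total_warnings,
    }


if __name__ == "__main__":
    # Demo with a fake processor that always succeeds and reports no warnings.
    def fake_process(path: str) -> Tuple[int, int]:
        return 0, 0

    directory = os.getcwd()
    yml_names = list_files_with_extension(directory, (".yml",))
    print(run_batch(directory, yml_names, fake_process))
```

Collecting the failed names in a list, rather than concatenating them into a string, avoids the trailing ", " in the summary and leaves the formatting to the caller, e.g. ", ".join(summary["failed_files"]).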