├── stack.yaml ├── .github ├── FUNDING.yml └── workflows │ └── test-actions.yaml ├── scripts ├── showcase-svg │ ├── .gitignore │ ├── extract.xq │ ├── replace.xq │ ├── Makefile │ └── template.svg ├── ko-kr-stdict │ ├── .gitignore │ ├── README.rst │ └── main.py ├── haddock-prologue │ ├── omit-rich-elements.lua │ └── build.sh ├── deno │ ├── README.md │ └── test.ts └── Bundle-RequiredDlls.ps1 ├── demo ├── .gitignore ├── README.md ├── elm.json ├── github-corner.html └── src │ └── Markdown │ └── HtmlString.elm ├── test ├── data │ ├── 習慣音.ko-KR.html │ ├── 習慣音.ko-Kore.html │ ├── initial-sound-raw.ko-KR.html │ ├── initial-sound-raw.ko-Kore.html │ ├── 이런날.ko-Kore.html │ ├── 이런날.ko-KR.html │ ├── ellipsis.ko-Kore.html │ ├── ellipsis.ko-KR.html │ ├── 大韓民國憲法第十號前文.ko-KP.html │ ├── 大韓民國憲法第十號前文.ko-KR.html │ ├── 大韓民國憲法第十號前文.ko-Kore.html │ ├── preservation.ko-Kore.html │ └── preservation.ko-KR.html ├── Spec.hs ├── doctest.json ├── hlint.hs ├── hspec.hs ├── doctest.hs └── Text │ └── Seonbi │ ├── HangulSpec.hs │ ├── Html │ ├── PrinterSpec.hs │ ├── WrapperSpec.hs │ ├── PreservationSpec.hs │ ├── TextNormalizerSpec.hs │ ├── LangSpec.hs │ ├── ClipperSpec.hs │ └── ScannerSpec.hs │ ├── Unihan │ └── KHangulSpec.hs │ ├── ContentTypesSpec.hs │ └── FacadeSpec.hs ├── .gitattributes ├── .dockerignore ├── .gitignore ├── data └── ko-kr-stdict.tsv ├── .vscode ├── extensions.json └── settings.json ├── en.utf-8.add ├── src └── Text │ └── Seonbi │ ├── Unihan │ ├── README │ └── KHangul.hs │ ├── Html.hs │ ├── Html │ ├── Preservation.hs │ ├── Wrapper.hs │ ├── Entity.hs │ ├── TextNormalizer.hs │ ├── Printer.hs │ ├── Clipper.hs │ ├── TagStack.hs │ ├── Lang.hs │ ├── Scanner.hs │ └── Tag.hs │ ├── Hangul.hs │ ├── Trie.hs │ └── PairedTransformer.hs ├── .editorconfig ├── stack-ghc-8.10.yaml ├── stack-ghc-8.8.yaml ├── stack-ghc-9.2.yaml ├── stack-ghc-9.4.yaml ├── stack-ghc-9.0.yaml ├── bucket └── seonbi.json ├── fly.toml ├── Dockerfile ├── setup ├── README.md └── action.yaml ├── Setup.hs ├── package.yaml ├── README.md ├── app └── seonbi-api.hs └── CHANGES.md /stack.yaml: -------------------------------------------------------------------------------- 1 | stack-ghc-8.8.yaml -------------------------------------------------------------------------------- /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | github: dahlia 2 | -------------------------------------------------------------------------------- /scripts/showcase-svg/.gitignore: -------------------------------------------------------------------------------- 1 | build/ 2 | -------------------------------------------------------------------------------- /demo/.gitignore: -------------------------------------------------------------------------------- 1 | elm-stuff/ 2 | index.html 3 | -------------------------------------------------------------------------------- /test/data/習慣音.ko-KR.html: -------------------------------------------------------------------------------- 1 |

허락하기 곤란하다.

2 | -------------------------------------------------------------------------------- /test/data/習慣音.ko-Kore.html: -------------------------------------------------------------------------------- 1 |

許諾하기 困難하다.

2 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | data/*.tsv filter=lfs diff=lfs merge=lfs -text 2 | -------------------------------------------------------------------------------- /test/data/initial-sound-raw.ko-KR.html: -------------------------------------------------------------------------------- 1 |

가리

2 |

영리

3 | -------------------------------------------------------------------------------- /test/data/initial-sound-raw.ko-Kore.html: -------------------------------------------------------------------------------- 1 |

可利

2 |

營利

3 | -------------------------------------------------------------------------------- /test/Spec.hs: -------------------------------------------------------------------------------- 1 | {-# OPTIONS_GHC -F -pgmF hspec-discover -optF --module-name=Spec #-} 2 | -------------------------------------------------------------------------------- /scripts/ko-kr-stdict/.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | *.tsv 3 | *.zip 4 | .env/ 5 | .venv/ 6 | env/ 7 | venv/ 8 | -------------------------------------------------------------------------------- /.dockerignore: -------------------------------------------------------------------------------- 1 | *.cabal 2 | *~ 3 | .dockerignore 4 | .git/ 5 | .gitignore 6 | .stack-work/ 7 | Dockerfile 8 | src/Text/Seonbi/kHangul.txt 9 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.cabal 2 | *.prof 3 | *~ 4 | .stack-work/ 5 | dist-newstyle/ 6 | out/ 7 | seonbi.iml 8 | src/Text/Seonbi/kHangul.txt 9 | stack*.yaml.lock 10 | -------------------------------------------------------------------------------- /data/ko-kr-stdict.tsv: -------------------------------------------------------------------------------- 1 | version https://git-lfs.github.com/spec/v1 2 | oid sha256:ae62c488fe95e5c8f3ec36d6c3c7b3a84a89530b14e167c95008289bba67a7f6 3 | size 4970476 4 | -------------------------------------------------------------------------------- /.vscode/extensions.json: -------------------------------------------------------------------------------- 1 | { 2 | "recommendations": [ 3 | "EditorConfig.EditorConfig", 4 | "haskell.haskell", 5 | "streetsidesoftware.code-spell-checker" 6 | ] 7 | } 8 | -------------------------------------------------------------------------------- /test/data/이런날.ko-Kore.html: -------------------------------------------------------------------------------- 1 |

아이들에게 하로의 乾燥한 學課로
2 | 해말간 倦怠가 깃들고、
3 | "矛盾" 두자를 理解치 못하도록
4 | 머리가 單純하였구나。

5 |

尹東柱 <이런날>

6 | -------------------------------------------------------------------------------- /test/data/이런날.ko-KR.html: -------------------------------------------------------------------------------- 1 |

아이들에게 하로의 건조한 학과로
2 | 해말간 권태가 깃들고,
3 | “모순” 두자를 이해치 못하도록
4 | 머리가 단순하였구나.

5 |

윤동주 〈이런날〉

6 | -------------------------------------------------------------------------------- /en.utf-8.add: -------------------------------------------------------------------------------- 1 | deno 2 | guillemets 3 | hanja 4 | inequal 5 | interpunct 6 | interpuncts 7 | phoneticize 8 | phoneticized 9 | punct 10 | seonbi 11 | Sino 12 | stdict 13 | submap 14 | typeclass 15 | typeclasses 16 | -------------------------------------------------------------------------------- /demo/README.md: -------------------------------------------------------------------------------- 1 | Seonbi demo 2 | =========== 3 | 4 | 5 | 6 | This is a web app to demo Seonbi's options and behaviors, and written in 7 | the [Elm] language. 8 | 9 | [Elm]: https://elm-lang.org/ 10 | -------------------------------------------------------------------------------- /scripts/showcase-svg/extract.xq: -------------------------------------------------------------------------------- 1 | xquery version "3.0"; 2 | 3 | declare namespace svg = "http://www.w3.org/2000/svg"; 4 | declare namespace html = "http://www.w3.org/1999/xhtml"; 5 | 6 | /svg:svg/svg:foreignObject/html:div//html:p[@id="input"] 7 | -------------------------------------------------------------------------------- /test/doctest.json: -------------------------------------------------------------------------------- 1 | { 2 | "ignore": [], 3 | "sourceFolders": [ 4 | "src", 5 | "src/Text", 6 | "src/Text/Seonbi", 7 | "src/Text/Seonbi/Html" 8 | ], 9 | "doctestOptions": [ 10 | "-XHaskell2010" 11 | ] 12 | } 13 | -------------------------------------------------------------------------------- /test/data/ellipsis.ko-Kore.html: -------------------------------------------------------------------------------- 1 |
2 | 3 | <동물기계>, 이쪽에선 꽤 유명한 고전임에도 불구하고 국내에는 두달 전에 번역되었다. (...) 4 | 그나저나 요새 책 사서 표지랑 목차만 읽는 것이 취미가 되어버린 것 같아서 반성 중임... 5 | 6 |
7 | -------------------------------------------------------------------------------- /test/data/ellipsis.ko-KR.html: -------------------------------------------------------------------------------- 1 |
2 | 3 | 〈동물기계〉, 이쪽에선 꽤 유명한 고전임에도 불구하고 국내에는 두달 전에 번역되었다. (…) 4 | 그나저나 요새 책 사서 표지랑 목차만 읽는 것이 취미가 되어버린 것 같아서 반성 중임… 5 | 6 |
7 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Unihan/README: -------------------------------------------------------------------------------- 1 | The .txt data files in this directory are imported from the unihan-json project: 2 | 3 | https://github.com/dahlia/unihan-json/releases/tag/12.1.0 4 | 5 | If the version of these data become outdated please let me know or send a patch 6 | to update data files! 7 | -------------------------------------------------------------------------------- /.editorconfig: -------------------------------------------------------------------------------- 1 | root = true 2 | 3 | [*] 4 | end_of_line = lf 5 | trim_trailing_whitespace = true 6 | insert_final_newline = true 7 | charset = utf-8 8 | indent_style = space 9 | indent_size = 4 10 | max_line_length = 80 11 | 12 | [*.yaml] 13 | indent_size = 2 14 | 15 | [{Makefile,**.mk}] 16 | indent_style = tab 17 | -------------------------------------------------------------------------------- /test/hlint.hs: -------------------------------------------------------------------------------- 1 | import Language.Haskell.HLint (hlint) 2 | import System.Exit (exitFailure, exitSuccess) 3 | 4 | arguments :: [String] 5 | arguments = ["app", "src", "test"] 6 | 7 | main :: IO () 8 | main = do 9 | hlints <- hlint arguments 10 | case hlints of 11 | [] -> exitSuccess 12 | _ -> exitFailure 13 | -------------------------------------------------------------------------------- /stack-ghc-8.10.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-18.28 2 | packages: 3 | - . 4 | extra-deps: 5 | - bytestring-trie-0.2.5.0 6 | - html-charset-0.1.0 7 | flags: 8 | seonbi: 9 | iconv: true 10 | allow-newer: true 11 | ghc-options: 12 | "$everything": -haddock 13 | "$locals": -Werror -fhide-source-paths 14 | require-stack-version: ">=2.7.0" 15 | -------------------------------------------------------------------------------- /stack-ghc-8.8.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-16.31 2 | packages: 3 | - . 4 | extra-deps: 5 | - bytestring-trie-0.2.5.0 6 | - html-charset-0.1.0 7 | flags: 8 | seonbi: 9 | iconv: true 10 | allow-newer: false 11 | ghc-options: 12 | "$everything": -haddock 13 | "$locals": -Werror -fhide-source-paths 14 | require-stack-version: ">=2.7.0" 15 | -------------------------------------------------------------------------------- /test/hspec.hs: -------------------------------------------------------------------------------- 1 | import Control.Monad 2 | import GHC.IO.Encoding 3 | import System.Info (os) 4 | 5 | import System.IO.CodePage (withCP65001) 6 | import Test.Hspec.Runner 7 | 8 | import qualified Spec 9 | 10 | main :: IO () 11 | main = withCP65001 $ do 12 | when (System.Info.os == "ming32") $ setLocaleEncoding utf8 13 | hspecWith defaultConfig Spec.spec 14 | -------------------------------------------------------------------------------- /stack-ghc-9.2.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-20.26 2 | packages: 3 | - . 
4 | extra-deps: 5 | - html-charset-0.1.0 6 | flags: 7 | seonbi: 8 | iconv: false # iconv seems unmaintained and only supports bytestring < 0.11 9 | allow-newer: false 10 | ghc-options: 11 | "$everything": -haddock 12 | "$locals": -Werror -fhide-source-paths 13 | require-stack-version: ">=2.7.0" 14 | -------------------------------------------------------------------------------- /stack-ghc-9.4.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-21.21 2 | packages: 3 | - . 4 | extra-deps: 5 | - cmark-0.6.1 6 | - html-charset-0.1.0 7 | - iconv-0.4.1.3 8 | flags: 9 | seonbi: 10 | iconv: false # iconv seems unmaintained and only supports bytestring < 0.11 11 | allow-newer: false 12 | ghc-options: 13 | "$everything": -haddock 14 | "$locals": -Werror -fhide-source-paths 15 | require-stack-version: ">=2.7.0" 16 | -------------------------------------------------------------------------------- /scripts/showcase-svg/replace.xq: -------------------------------------------------------------------------------- 1 | xquery version "3.0"; 2 | 3 | declare namespace html = "http://www.w3.org/1999/xhtml"; 4 | declare variable $ko-kr external; 5 | declare variable $ko-kp external; 6 | declare variable $ko-kore external; 7 | 8 | replace node //html:div/html:p[@id="placeholder-ko-kr"] 9 | with doc($ko-kr), 10 | replace node //html:div/html:p[@id="placeholder-ko-kp"] 11 | with doc($ko-kp), 12 | replace node //html:div/html:p[@id="placeholder-ko-kore"] 13 | with doc($ko-kore) 14 | -------------------------------------------------------------------------------- /stack-ghc-9.0.yaml: -------------------------------------------------------------------------------- 1 | resolver: lts-19.33 2 | packages: 3 | - . 4 | extra-deps: 5 | - aeson-1.5.6.0 6 | - html-charset-0.1.0 7 | flags: 8 | seonbi: 9 | iconv: true 10 | mintty: 11 | Win32-2-13-1: false # https://github.com/RyanGlScott/mintty/issues/4 12 | allow-newer: false 13 | ghc-options: 14 | "$everything": -haddock 15 | "$locals": -Werror -fhide-source-paths 16 | extra-include-dirs: 17 | - /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/ffi 18 | # https://gitlab.haskell.org/ghc/ghc/-/issues/20592#note_403426 19 | require-stack-version: ">=2.7.0" 20 | -------------------------------------------------------------------------------- /test/doctest.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE CPP #-} 2 | #if __GLASGOW_HASKELL__ >= 810 3 | main :: IO () 4 | main = do 5 | putStrLn "Temporarily, doctests are ignored for GHC >= 8.10 due to bugs:\n" 6 | putStrLn " https://github.com/sol/doctest/issues/301" 7 | #elif __GLASGOW_HASKELL__ >= 808 && defined(mingw32_HOST_OS) 8 | main :: IO () 9 | main = do 10 | putStr "Temporarily, doctests are ignored for GHC >= 8.8 on Windows " 11 | putStrLn "due to bugs:\n" 12 | putStrLn " https://github.com/sol/doctest/issues/300" 13 | #else 14 | {-# OPTIONS_GHC -F -pgmF doctest-discover -optF test/doctest.json #-} 15 | #endif 16 | -------------------------------------------------------------------------------- /bucket/seonbi.json: -------------------------------------------------------------------------------- 1 | { 2 | "version": "0.5.0", 3 | "description": "SmartyPants for Korean language", 4 | "homepage": "https://github.com/dahlia/seonbi", 5 | "license": "LGPL-2.1", 6 | "architecture": { 7 | "64bit": { 8 | "url": "https://github.com/dahlia/seonbi/releases/download/0.5.0/seonbi-0.5.0.win64.zip", 9 | "hash": 
"8103ff6d7a541827d55ac3d28bcb9182e8e5ceced87de24a63662f2a10e4c610" 10 | } 11 | }, 12 | "bin": ["seonbi.exe", "seonbi-api.exe"], 13 | "checkver": "github", 14 | "autoupdate": { 15 | "architecture": { 16 | "64bit": { 17 | "url": "https://github.com/dahlia/seonbi/releases/download/$version/seonbi-$version.win64.zip" 18 | } 19 | } 20 | } 21 | } 22 | -------------------------------------------------------------------------------- /test/data/大韓民國憲法第十號前文.ko-KP.html: -------------------------------------------------------------------------------- 1 |
2 |

3 | 대한민국헌법 4 |

5 |

시행 1988년 2월 25일. 헌법 제10호, 1987년 10월 29일, 전부개정.

6 |

전문

7 |

유구한 력사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 8 | 대한민국림시정부의 법통과 불의에 항거한 4·19민주리념을 계승하고, 9 | 조국의 민주개혁과 평화적 통일의 사명에 입각하여 정의·인도와 동포애로써 10 | 민족의 단결을 공고히 하고, 모든 사회적 폐습과 불의를 타파하며, 11 | 자률과 조화를 바탕으로 자유민주적 기본질서를 더욱 확고히 하여 12 | 정치·경제·사회·문화의 모든 령역에 있어서 각인의 13 | 기회를 균등히 하고, 14 | 능력을 최고도로 발휘하게 하며, 자유와 권리에 따르는 책임과 의무를 완수하게 하여, 15 | 안으로는 국민생활의 균등한 향상을 기하고 밖으로는 항구적인 세계평화와 16 | 인류공영에 이바지함으로써 우리들과 우리들의 자손의 안전과 자유와 행복을 17 | 영원히 확보할 것을 다짐하면서 1948년 7월 12일에 제정되고 8차에 걸쳐 개정된 18 | 헌법을 이제 국회의 의결을 거쳐 국민투표에 의하여 개정한다.

19 |
20 | -------------------------------------------------------------------------------- /test/data/大韓民國憲法第十號前文.ko-KR.html: -------------------------------------------------------------------------------- 1 |
2 |

3 | 대한민국 헌법 4 |

5 |

시행 1988년 2월 25일. 헌법 제10호, 1987년 10월 29일, 전부개정.

6 |

전문

7 |

유구한 역사와 전통에 빛나는 우리 대한국민은 3·1운동으로 건립된 8 | 대한민국 임시 정부의 법통과 불의에 항거한 4·19민주이념을 계승하고, 9 | 조국의 민주개혁과 평화적 통일의 사명에 입각하여 정의·인도와 동포애로써 10 | 민족의 단결을 공고히 하고, 모든 사회적 폐습과 불의를 타파하며, 11 | 자율과 조화를 바탕으로 자유민주적 기본질서를 더욱 확고히 하여 12 | 정치·경제·사회·문화의 모든 영역에 있어서 각인의 13 | 기회를 균등히 하고, 14 | 능력을 최고도로 발휘하게 하며, 자유와 권리에 따르는 책임과 의무를 완수하게 하여, 15 | 안으로는 국민생활의 균등한 향상을 기하고 밖으로는 항구적인 세계평화와 16 | 인류공영에 이바지함으로써 우리들과 우리들의 자손의 안전과 자유와 행복을 17 | 영원히 확보할 것을 다짐하면서 1948년 7월 12일에 제정되고 8차에 걸쳐 개정된 18 | 헌법을 이제 국회의 의결을 거쳐 국민 투표에 의하여 개정한다.

19 |
20 | -------------------------------------------------------------------------------- /test/data/大韓民國憲法第十號前文.ko-Kore.html: -------------------------------------------------------------------------------- 1 |
2 |

3 | 大韓民國憲法 4 |

5 |

施行 1988年 2月 25日. 憲法 第10號, 1987年 10月 29日, 全部改正.

6 |

前文

7 |

悠久한 歷史와 傳統에 빛나는 우리 大韓國民은 3·1運動으로 建立된 8 | 大韓民國臨時政府의 法統과 不義에 抗拒한 4·19民主理念을 계승하고, 9 | 祖國의 民主改革과 平和的 統一의 使命에 입각하여 正義·人道와 同胞愛로써 10 | 民族의 團結을 공고히 하고, 모든 社會的 弊習과 不義를 타파하며, 11 | 自律과 調和를 바탕으로 自由民主的 基本秩序를 더욱 확고히 하여 12 | 政治·經濟·社會·文化의 모든 領域에 있어서 各人의 13 | 機會를 균등히 하고, 14 | 能力을 最高度로 발휘하게 하며, 自由와 權利에 따르는 責任과 義務를 완수하게 하여, 15 | 안으로는 國民生活의 균등한 향상을 기하고 밖으로는 항구적인 世界平和와 16 | 人類共榮에 이바지함으로써 우리들과 우리들의 子孫의 安全과 自由와 幸福을 17 | 영원히 확보할 것을 다짐하면서 1948年 7月 12日에 制定되고 8次에 걸쳐 改正된 18 | 憲法을 이제 國會의 議決을 거쳐 國民投票에 의하여 改正한다.

19 |
20 | -------------------------------------------------------------------------------- /scripts/haddock-prologue/omit-rich-elements.lua: -------------------------------------------------------------------------------- 1 | -- Remove the top-level heading since Haddock in itself prints it. 2 | function Header(elem) 3 | if elem.level > 1 then 4 | return elem 5 | end 6 | return {} 7 | end 8 | 9 | -- Removes linked images since Haddock cannot represent them. 10 | function Link(elem) 11 | children = elem.content 12 | if #children ~= 1 or children[1].tag ~= "Image" then 13 | return nil 14 | end 15 | return {} 16 | end 17 | 18 | -- Escape slashes in hrefs of emphasized links as Pandoc's Haddock target 19 | -- does not escape slashes for us. 20 | function Emph(elem) 21 | return pandoc.walk_inline(elem, { 22 | Link = function (elem) 23 | elem.target = string.gsub(elem.target, "/", "\\/") 24 | return elem 25 | end 26 | }) 27 | end 28 | -------------------------------------------------------------------------------- /.vscode/settings.json: -------------------------------------------------------------------------------- 1 | { 2 | "cSpell.customDictionaries": { 3 | "workspace": { 4 | "name": "Workspace Dictionary", 5 | "description": "A custom dictionary for this poject.", 6 | "path": "${workspaceFolder}/en.utf-8.add", 7 | "addWords": true, 8 | "scope": "workspace" 9 | } 10 | }, 11 | "cSpell.dictionaries": [ 12 | "en_US", 13 | "filetypes", 14 | "html", 15 | "softwareTerms", 16 | "typescript" 17 | ], 18 | "cSpell.ignoreRegExpList": [ 19 | "/\\\\x[0-9A-Fa-f]{2}/", 20 | "/\\\\u[0-9A-Fa-f]{4}/", 21 | "/\\\\U[0-9A-Fa-f]{8}/" 22 | ], 23 | 24 | "deno.enable": false, 25 | "deno.enablePaths": [ 26 | "scripts/deno/" 27 | ], 28 | 29 | "haskell.serverEnvironment": { 30 | "STACK_YAML": "stack-ghc-9.4.yaml" 31 | } 32 | } 33 | -------------------------------------------------------------------------------- /fly.toml: -------------------------------------------------------------------------------- 1 | app = "seonbi" 2 | kill_signal = "SIGINT" 3 | kill_timeout = 5 4 | processes = [] 5 | 6 | [build] 7 | image = "ghcr.io/dahlia/seonbi/bin:latest" 8 | 9 | [experimental] 10 | allowed_public_ports = [3800] 11 | auto_rollback = true 12 | cmd = ["seonbi-api", "--allow-origin=*"] 13 | 14 | [[services]] 15 | http_checks = [] 16 | internal_port = 3800 17 | processes = ["app"] 18 | protocol = "tcp" 19 | script_checks = [] 20 | 21 | [services.concurrency] 22 | hard_limit = 25 23 | soft_limit = 20 24 | type = "connections" 25 | 26 | [[services.ports]] 27 | force_https = true 28 | handlers = ["http"] 29 | port = 80 30 | 31 | [[services.ports]] 32 | handlers = ["tls", "http"] 33 | port = 443 34 | 35 | [[services.tcp_checks]] 36 | grace_period = "1s" 37 | interval = "15s" 38 | restart_limit = 0 39 | timeout = "2s" 40 | -------------------------------------------------------------------------------- /scripts/haddock-prologue/build.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | # Prerequisites: 3 | # - Pandoc 2.0+ 4 | # - yq 5 | # - Haskell Stack 6 | # - GNU sed 7 | set -e 8 | root="$(dirname "$0")/../.." 
9 | package="$root/package.yaml" 10 | readme="$root/README.md" 11 | pandoc_script="$(dirname "$0")/omit-rich-elements.lua" 12 | description="$(pandoc --lua-filter "$pandoc_script" -t haddock "$readme")" 13 | backup="$(mktemp)" 14 | cp "$package" "$backup" 15 | cwd="$(pwd)" 16 | exit_code=1 17 | { 18 | yq \ 19 | -y \ 20 | --arg description "$description" \ 21 | '.description = $description' \ 22 | "$backup" > "$package" 23 | cd "$root" 24 | stack haddock --no-haddock-deps 25 | cd "$(stack path --dist-dir)/doc/html/" 26 | hackage_url='https://hackage.haskell.org/package/\1/docs/' 27 | sed -i -E \ 28 | 's|\.\./(([A-Za-z][[:alnum:]]*-)+[0-9]+(\.[0-9]+)*)/|'"$hackage_url|g" \ 29 | ./*/*.html 30 | exit_code=0 31 | } || true 32 | cd "$cwd" 33 | cp "$backup" "$package" 34 | exit "$exit_code" 35 | -------------------------------------------------------------------------------- /demo/elm.json: -------------------------------------------------------------------------------- 1 | { 2 | "type": "application", 3 | "source-directories": [ 4 | "src" 5 | ], 6 | "elm-version": "0.19.1", 7 | "dependencies": { 8 | "direct": { 9 | "elm/browser": "1.0.1", 10 | "elm/core": "1.0.2", 11 | "elm/html": "1.0.0", 12 | "elm/http": "2.0.0", 13 | "elm/json": "1.1.3", 14 | "elm/regex": "1.0.0", 15 | "elm/url": "1.0.0", 16 | "hecrj/html-parser": "2.3.4", 17 | "pablohirafuji/elm-markdown": "2.0.5", 18 | "pablohirafuji/elm-syntax-highlight": "3.4.1", 19 | "rundis/elm-bootstrap": "5.1.0" 20 | }, 21 | "indirect": { 22 | "avh4/elm-color": "1.0.0", 23 | "elm/bytes": "1.0.8", 24 | "elm/file": "1.0.5", 25 | "elm/parser": "1.1.0", 26 | "elm/time": "1.0.0", 27 | "elm/virtual-dom": "1.0.2", 28 | "rtfeldman/elm-hex": "1.0.0" 29 | } 30 | }, 31 | "test-dependencies": { 32 | "direct": {}, 33 | "indirect": {} 34 | } 35 | } 36 | -------------------------------------------------------------------------------- /test/data/preservation.ko-Kore.html: -------------------------------------------------------------------------------- 1 |

Contents in the preserved elements should not be touched.

2 |

To be transformed:

3 | 11 |

Not to be transformed:

12 | 26 | -------------------------------------------------------------------------------- /test/data/preservation.ko-KR.html: -------------------------------------------------------------------------------- 1 |

Contents in the preserved elements should not be touched.

2 |

To be transformed:

3 | 11 |

Not to be transformed:

12 | 26 | -------------------------------------------------------------------------------- /scripts/showcase-svg/Makefile: -------------------------------------------------------------------------------- 1 | SEONBI=seonbi 2 | XQILLA=xqilla 3 | 4 | build/showcase.svg: template.svg build/ko-kr.html build/ko-kp.html build/ko-kore.html 5 | cp template.svg build/showcase.svg 6 | $(XQILLA) \ 7 | -u \ 8 | -i build/showcase.svg \ 9 | -v ko-kr build/ko-kr.html \ 10 | -v ko-kp build/ko-kp.html \ 11 | -v ko-kore build/ko-kore.html \ 12 | replace.xq 13 | sed -i.bak \ 14 | 's|^\s\{0,\} Bool 16 | isPreservedTag tag' = 17 | case tag' of 18 | Code -> True 19 | Kbd -> True 20 | Pre -> True 21 | TextArea -> True 22 | _ -> 23 | case htmlTagKind tag' of 24 | Normal -> False 25 | EscapableRawText -> False 26 | _ -> True 27 | 28 | -- | 'True' if the given tag stack should be preserved from transformation. 29 | isPreservedTagStack :: HtmlTagStack -> Bool 30 | isPreservedTagStack = any isPreservedTag 31 | 32 | -- | 'True' if the given HTML entity should be preserved from transformation. 33 | isPreservedEntity :: HtmlEntity -> Bool 34 | isPreservedEntity HtmlComment {} = 35 | True 36 | isPreservedEntity HtmlStartTag { tagStack, tag } = 37 | isPreservedTag tag || isPreservedTagStack tagStack 38 | isPreservedEntity HtmlEndTag { tagStack, tag } = 39 | isPreservedTag tag || isPreservedTagStack tagStack 40 | isPreservedEntity entity = 41 | isPreservedTagStack $ tagStack entity 42 | -------------------------------------------------------------------------------- /.github/workflows/test-actions.yaml: -------------------------------------------------------------------------------- 1 | name: test-actions 2 | on: 3 | push: [] 4 | schedule: 5 | - cron: 59 14 * * * 6 | pull_request: [] 7 | 8 | jobs: 9 | test-setup-seonbi: 10 | strategy: 11 | matrix: 12 | os: 13 | - ubuntu-20.04 14 | - ubuntu-22.04 15 | - macos-12 # Intel 16 | - macos-13-xlarge # Apple silicon 17 | - windows-2019 18 | - windows-2022 19 | fail-fast: false 20 | runs-on: ${{ matrix.os }} 21 | steps: 22 | - uses: actions/checkout@v4 23 | # Test 1 24 | - id: setup1 25 | uses: ./setup 26 | with: 27 | seonbi-version: 0.3.* 28 | add-to-path: false 29 | - run: | 30 | set -e 31 | [[ "${{ steps.setup1.outputs.seonbi-version }}" = 0.3.* ]] 32 | [[ "${{ steps.setup1.outputs.seonbi-version }}" != "0.3.*" ]] 33 | ! command -v seonbi 34 | ! command -v seonbi-api 35 | shell: bash 36 | - run: >- 37 | ${{ steps.setup1.outputs.seonbi-path }} 38 | test/data/大韓民國憲法第十號前文.ko-Kore.html 39 | # Test 2 40 | - id: setup2 41 | uses: ./setup 42 | with: 43 | seonbi-version: 0.3.0 44 | - run: | 45 | set -e 46 | [[ "${{ steps.setup2.outputs.seonbi-version }}" = "0.3.0" ]] 47 | command -v seonbi 48 | command -v seonbi-api 49 | shell: bash 50 | - run: seonbi test/data/大韓民國憲法第十號前文.ko-Kore.html 51 | - run: >- 52 | ${{ steps.setup1.outputs.seonbi-path }} 53 | test/data/大韓民國憲法第十號前文.ko-Kore.html 54 | -------------------------------------------------------------------------------- /test/Text/Seonbi/HangulSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedStrings #-} 2 | module Text.Seonbi.HangulSpec (spec) where 3 | 4 | import Test.Hspec 5 | 6 | import Text.Seonbi.Hangul 7 | 8 | 9 | spec :: Spec 10 | spec = do 11 | specify "isHangulSyllable" $ do 12 | '가' `shouldSatisfy` isHangulSyllable 13 | '글' `shouldSatisfy` isHangulSyllable 14 | 'A' `shouldNotSatisfy` isHangulSyllable 15 | '?' 
`shouldNotSatisfy` isHangulSyllable 16 | '字' `shouldNotSatisfy` isHangulSyllable 17 | describe "toJamoTriple" $ do 18 | it "returns only initial cosonant and vowel if there is no batchim" $ 19 | toJamoTriple '가' `shouldBe` Just ('ᄀ', 'ᅡ', Nothing) 20 | it "returns all of triple if there is a batchim" $ do 21 | toJamoTriple '글' `shouldBe` Just ('ᄀ', 'ᅳ', Just 'ᆯ') 22 | toJamoTriple '를' `shouldBe` Just ('ᄅ', 'ᅳ', Just 'ᆯ') 23 | it "returns Nothing for non-Hangul letters" $ do 24 | toJamoTriple 'A' `shouldBe` Nothing 25 | toJamoTriple '?' `shouldBe` Nothing 26 | toJamoTriple '字' `shouldBe` Nothing 27 | specify "fromJamoTriple" $ do 28 | fromJamoTriple ('ᄀ', 'ᅡ', Nothing) `shouldBe` Just '가' 29 | fromJamoTriple ('ᄀ', 'ᅳ', Just 'ᆯ') `shouldBe` Just '글' 30 | fromJamoTriple ('ᄅ', 'ᅳ', Just 'ᆯ') `shouldBe` Just '를' 31 | fromJamoTriple ('ᄓ', 'ᅳ', Nothing) `shouldBe` Nothing 32 | fromJamoTriple ('ᄀ', 'ᅶ', Nothing) `shouldBe` Nothing 33 | fromJamoTriple ('ᄀ', 'ᅳ', Just 'ᅡ') `shouldBe` Nothing 34 | fromJamoTriple ('ᄀ', 'ᅳ', Just 'ᇇ') `shouldBe` Nothing 35 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/PrinterSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.PrinterSpec (spec) where 4 | 5 | import Data.Text.Lazy 6 | import Test.Hspec 7 | 8 | import Text.Seonbi.Html.Entity 9 | import Text.Seonbi.Html.Printer 10 | import Text.Seonbi.Html.Tag 11 | 12 | sample :: [HtmlEntity] 13 | sample = 14 | [ HtmlComment { tagStack = [], comment = " foo " } 15 | , HtmlStartTag { tagStack = [], tag = P, rawAttributes = " id=\"a\"" } 16 | , HtmlText { tagStack = [P], rawText = "Hello," } 17 | , HtmlStartTag { tagStack = [P], tag = BR, rawAttributes = "" } 18 | , HtmlEndTag { tagStack = [P], tag = BR } 19 | , HtmlText { tagStack = [P], rawText = "\n" } 20 | , HtmlStartTag { tagStack = [P], tag = Em, rawAttributes = "class=\"b\"" } 21 | , HtmlCdata { tagStack = [P, Em], text = "world" } 22 | , HtmlEndTag { tagStack = [P], tag = Em } 23 | , HtmlText { tagStack = [P], rawText = "!" } 24 | , HtmlEndTag { tagStack = [], tag = P } 25 | , HtmlStartTag { tagStack = [], tag = P, rawAttributes = "" } 26 | , HtmlEndTag { tagStack = [], tag = P } 27 | ] 28 | 29 | spec :: Spec 30 | spec = do 31 | specify "printHtml" $ 32 | printHtml sample `shouldBe` Data.Text.Lazy.concat 33 | [ "

Hello,
\n" 34 | , "!

" 35 | ] 36 | specify "printXhtml" $ 37 | printXhtml sample `shouldBe` Data.Text.Lazy.concat 38 | [ "

Hello,
\n" 39 | , "!

" 40 | ] 41 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | # To correctly make a statically-linked binary, we use Alpine Linux. 2 | # The distro entirely uses musl instead of glibc which is unfriendly to be 3 | # statically linked. 4 | FROM docker.io/alpine:3.19 AS build 5 | 6 | LABEL "org.opencontainers.image.title"="Seonbi" 7 | LABEL "org.opencontainers.image.licenses"="LGPL-2.1" 8 | 9 | RUN apk add --no-cache \ 10 | build-base=0.5-r3 \ 11 | bzip2-dev=1.0.8-r6 \ 12 | ghc=9.4.7-r1 \ 13 | libbz2=1.0.8-r6 \ 14 | xz=5.4.5-r0 \ 15 | zlib-dev=1.3.1-r0 \ 16 | zlib-static=1.3.1-r0 17 | 18 | RUN wget -q "https://github.com/commercialhaskell/stack/releases/download/v3.3.1/stack-3.3.1-linux-$(uname -m)-bin" \ 19 | && mv "stack-3.3.1-linux-$(uname -m)-bin" /usr/bin/stack \ 20 | && chmod +x /usr/bin/stack 21 | 22 | RUN stack config set system-ghc --global true 23 | 24 | # Add just the package.yaml file to capture dependencies 25 | COPY package.yaml /src/seonbi/package.yaml 26 | COPY stack-ghc-9.4.yaml /src/seonbi/stack.yaml 27 | 28 | WORKDIR /src/seonbi 29 | 30 | # Docker will cache this command as a layer, freeing us up to 31 | # modify source code without re-installing dependencies 32 | # (unless the .cabal file changes!) 33 | RUN stack setup --system-ghc 34 | RUN stack build \ 35 | --system-ghc \ 36 | --only-snapshot \ 37 | --flag seonbi:iconv \ 38 | --flag seonbi:static 39 | 40 | COPY . /src/seonbi 41 | RUN cp /src/seonbi/stack-ghc-9.4.yaml /src/seonbi/stack.yaml 42 | 43 | RUN stack build \ 44 | --system-ghc \ 45 | --flag seonbi:iconv \ 46 | --flag seonbi:static \ 47 | --copy-bins 48 | 49 | FROM docker.io/alpine:3.19 50 | 51 | COPY --from=build /root/.local/bin/seonbi* /usr/local/bin/ 52 | ENV LANG=en_US.UTF-8 53 | ENV LANGUAGE=en_US.UTF-8 54 | CMD ["seonbi"] 55 | -------------------------------------------------------------------------------- /scripts/deno/README.md: -------------------------------------------------------------------------------- 1 | [Seonbi] client library for Deno 2 | ================================ 3 | 4 | [![Latest version][Tag badge]][Deno module] 5 | 6 | *[Seonbi] is an HTML preprocessor that makes typographic/orthographic 7 | adjustments on Korean text. See the [website][Seonbi] for details.* 8 | 9 | This directory contains a simple client library which manages and communicates 10 | with Seonbi HTTP API server. The `transform()` function and `Seonbi` class 11 | automatically downloads the Seonbi executable binary and runs the server under 12 | the hood. 
13 | 14 | Here's an example code for one-shot transformation: 15 | 16 | ~~~~ typescript 17 | import { transform } from "https://deno.land/x/seonbi/mod.ts"; 18 | 19 | const input = "디노를 通해 쓰는 선비"; 20 | const output = transform(input); 21 | console.log(output); // 디노를 통해 쓰는 선비 22 | ~~~~ 23 | 24 | When there are multiple inputs to transform, makes a `Seonbi` instance and 25 | call its `transform()` method multiple times so that the server subprocess 26 | are not spawned needlessly more than once: 27 | 28 | 29 | ~~~~ typescript 30 | import { Seonbi } from "https://deno.land/x/seonbi/mod.ts"; 31 | 32 | const inputs = [ 33 | "序詩", 34 | "看板 없는 거리", 35 | "太初의 아침", 36 | "무서운 時間", 37 | "눈 오는 地圖", 38 | "별 헤는 밤", 39 | "슬픈 族屬", 40 | ]; 41 | const seonbi = new Seonbi(); 42 | const outputs = await Promise.all(inputs.map(input => seonbi.transform(input))); 43 | console.log(outputs); 44 | /* 45 | [ 46 | "서시", 47 | "간판 없는 거리", 48 | "태초의 아침", 49 | "무서운 시간", 50 | "눈 오는 지도", 51 | "별 헤는 밤", 52 | "슬픈 족속", 53 | ] 54 | */ 55 | ~~~~ 56 | 57 | [Seonbi]: https://github.com/dahlia/seonbi 58 | [Tag badge]: https://img.shields.io/github/v/tag/dahlia/seonbi 59 | [Deno module]: https://deno.land/x/seonbi 60 | -------------------------------------------------------------------------------- /scripts/Bundle-RequiredDlls.ps1: -------------------------------------------------------------------------------- 1 | [CmdletBinding()] param () 2 | 3 | Set-Variable ObjdumpPath -Option Constant -Value (stack path --compiler-bin ` 4 | | Split-Path -Parent ` 5 | | Join-Path -ChildPath "mingw" -AdditionalChildPath "bin", "objdump.exe") 6 | 7 | function Get-RequiredDlls { 8 | [OutputType([System.IO.FileInfo[]])] 9 | param ( 10 | [Parameter(Mandatory)] 11 | [System.IO.FileInfo]$ObjectPath, 12 | [Parameter(Mandatory)] 13 | [System.IO.FileInfo]$LibraryPath 14 | ) 15 | $dlls = & $ObjdumpPath -p $ObjectPath ` 16 | | Select-String "^`tDLL Name: (.*?`.[Dd][Ll]{2})$" -CaseSensitive ` 17 | | ForEach-Object { $_.Matches.Groups[1].Value } 18 | $dllsToBundle = Get-ChildItem -Filter *.dll -Recurse $LibraryPath ` 19 | | Where-Object { $dlls -contains $_.Name } 20 | if ($null -eq $dllsToBundle -or $dllsToBundle.Length -lt 1) { 21 | return @() 22 | } elseif ($dllsToBundle.GetType() -eq [System.IO.FileInfo]) { 23 | $dllsToBundle = @($dllsToBundle) 24 | } 25 | $dependencies = @() 26 | foreach ($dll in $dllsToBundle) { 27 | $dependencies += Get-RequiredDlls $dll $LibraryPath 28 | } 29 | if ($dependencies.Length -gt 0) { 30 | $dllsToBundle += $dependencies 31 | } 32 | $dllsToBundle = $dllsToBundle | Select-Object -Unique 33 | if ($dllsToBundle.GetType() -eq [System.IO.FileInfo]) { 34 | return @($dllsToBundle) 35 | } 36 | return $dllsToBundle 37 | } 38 | 39 | $localBinDir = stack path --local-install-root | Join-Path -ChildPath "bin" 40 | $objectPaths = Get-ChildItem -Filter *.exe $localBinDir 41 | $libraryPath = stack path --compiler-bin | Split-Path -Parent 42 | 43 | foreach ($obj in $objectPaths) { 44 | Get-RequiredDlls $obj $libraryPath | ForEach-Object { 45 | Write-Verbose $_ 46 | Copy-Item $_ $localBinDir 47 | } 48 | } 49 | -------------------------------------------------------------------------------- /scripts/ko-kr-stdict/README.rst: -------------------------------------------------------------------------------- 1 | Extract hanja words from *Standard Korean Language Dictionary* (標準國語大辭典) 2 | =============================================================================== 3 | 4 | This Python script extracts Sino-Korean words from *Standard Korean 
Language 5 | Dictionary* (標準國語大辭典) published by NIKL (國立國語院) of South Korea. 6 | 7 | First of all, this script requires Python 3.6 or higher. Though it might work 8 | on older versions, I've never tested. I'm sure it won't work on Python 2. 9 | It also works well with PyPy3.5 v6.0 or higher, and is even about 2 times 10 | faster than CPython --- so I recommend PyPy more than CPython. 11 | 12 | Note that this script does not depend on any other than the Python standard 13 | library. 14 | 15 | NIKL has distributed *Standard Korean Language Dictionary* under CC BY-SA 16 | `since 11th March, 2019`__. The data can be downloaded from the 17 | `Standard Korean Language Dictionary`__ website --- although this website 18 | does not have English version and you need to make an account to download 19 | the data. Or, in short, you could download using ``curl`` in one-shot:: 20 | 21 | # Works as of January 2025. 22 | curl \ 23 | -LJ \ 24 | -X POST \ 25 | -F link_key=1404371 \ 26 | -F pageUnit=10 \ 27 | -F pageIndex=1 \ 28 | -o stdict.zip \ 29 | https://stdict.korean.go.kr/common/download.do 30 | 31 | The data is contained by a *.zip* archive, and if you extract it there are 32 | several XML data files. This script reads the *.zip* archive (not *.xml* files) 33 | and then prints the result in the TSV format that Seonbi can interpret:: 34 | 35 | ./main.py stdict.zip | sort > kr-stdict.tsv 36 | 37 | __ https://stdict.korean.go.kr/notice/noticeView.do?board_no=1129 38 | __ https://stdict.korean.go.kr/ 39 | -------------------------------------------------------------------------------- /demo/github-corner.html: -------------------------------------------------------------------------------- 1 | 4 | 33 | 34 | 46 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Unihan/KHangulSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Unihan.KHangulSpec (spec) where 4 | 5 | import Data.Either 6 | 7 | import Data.Aeson 8 | import Data.Map.Strict 9 | import Test.Hspec 10 | 11 | import Text.Seonbi.Unihan.KHangul 12 | 13 | spec :: Spec 14 | spec = do 15 | describe "kHangulData'" $ 16 | it "should be loaded" $ 17 | kHangulData' `shouldSatisfy` isRight 18 | describe "kHangulData" $ 19 | it "contains Hanja Hangul readings" $ 20 | Data.Map.Strict.lookup '識' kHangulData `shouldBe` Just 21 | [ ('식', HanjaReadingCitation KS_X_1001 [Education]) 22 | , ('지', HanjaReadingCitation KS_X_1001 [PersonalName]) 23 | ] 24 | describe "HanjaReadingCitation" $ 25 | specify "parseJSON" $ do 26 | decode "\"\"" `shouldBe` Just (HanjaReadingCitation NonStandard []) 27 | decode "\"E\"" `shouldBe` Just 28 | (HanjaReadingCitation NonStandard [Education]) 29 | decode "\"N\"" `shouldBe` Just 30 | (HanjaReadingCitation NonStandard [PersonalName]) 31 | decode "\"EN\"" `shouldBe` Just 32 | (HanjaReadingCitation NonStandard [Education, PersonalName]) 33 | decode "\"0\"" `shouldBe` Just (HanjaReadingCitation KS_X_1001 []) 34 | decode "\"1\"" `shouldBe` Just (HanjaReadingCitation KS_X_1002 []) 35 | decode "\"0E\"" `shouldBe` Just 36 | (HanjaReadingCitation KS_X_1001 [Education]) 37 | decode "\"1N\"" `shouldBe` Just 38 | (HanjaReadingCitation KS_X_1002 [PersonalName]) 39 | decode "\"2\"" `shouldBe` (Nothing :: Maybe HanjaReadingCitation) 40 | decode "\"00\"" `shouldBe` (Nothing :: Maybe HanjaReadingCitation) 41 | decode "\"0Z\"" `shouldBe` (Nothing :: Maybe 
HanjaReadingCitation) 42 | decode "0" `shouldBe` (Nothing :: Maybe HanjaReadingCitation) 43 | decode "null" `shouldBe` (Nothing :: Maybe HanjaReadingCitation) 44 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Wrapper.hs: -------------------------------------------------------------------------------- 1 | module Text.Seonbi.Html.Wrapper 2 | ( isWrappedBy 3 | , isWrappedBy' 4 | , wrap 5 | ) where 6 | 7 | import Text.Seonbi.Html 8 | import Text.Seonbi.Html.TagStack 9 | 10 | -- | Wraps given entities with an element. 11 | wrap :: HtmlTagStack -> HtmlTag -> HtmlRawAttrs -> [HtmlEntity] -> [HtmlEntity] 12 | wrap baseStack tag' attributes entities = (:) 13 | (HtmlStartTag baseStack tag' attributes) 14 | [ e { tagStack = rebase' (tagStack e) } 15 | | e <- entities 16 | ] ++ [HtmlEndTag baseStack tag'] 17 | where 18 | newBaseStack :: HtmlTagStack 19 | newBaseStack = push tag' baseStack 20 | rebase' :: HtmlTagStack -> HtmlTagStack 21 | rebase' = rebase baseStack newBaseStack 22 | 23 | -- | A shortcut to 'isWrappedBy'' of wildcard attributes match. 24 | isWrappedBy :: [HtmlEntity] -> HtmlTag -> Bool 25 | isWrappedBy entities tag' = 26 | isWrappedBy' entities tag' Nothing 27 | 28 | -- | 'True' if the given @['HtmlEntity']@ is wrapped by a tag and attributes. 29 | -- E.g.: 30 | -- 31 | -- >>> :set -XOverloadedLists 32 | -- >>> :set -XOverloadedStrings 33 | -- >>> :{ 34 | -- let entities = 35 | -- [ HtmlStartTag [] Em " id=foo" 36 | -- , HtmlText [Em] "Hello" 37 | -- , HtmlEndTag [] Em 38 | -- ] :: [HtmlEntity] 39 | -- :} 40 | -- 41 | -- >>> isWrappedBy' entities Em $ Just " id=foo" 42 | -- True 43 | -- >>> isWrappedBy' entities Div $ Just " id=foo" 44 | -- False 45 | -- >>> isWrappedBy' entities Em $ Just " id=wrong" 46 | -- False 47 | -- 48 | -- In order to match to any attributes (wildcard match), give 'Nothing' to 49 | -- the third argument: 50 | -- 51 | -- >>> isWrappedBy' entities Em Nothing 52 | -- True 53 | -- >>> isWrappedBy' entities Span Nothing 54 | -- False 55 | -- 56 | -- Or you can use 'isWrappedBy' function which is a shortcut for that. 
57 | isWrappedBy' :: [HtmlEntity] -> HtmlTag -> Maybe HtmlRawAttrs -> Bool 58 | isWrappedBy' entities@(HtmlStartTag s t a : _) tag' attributes = 59 | case Prelude.last entities of 60 | HtmlEndTag s' t' -> 61 | t == tag' && t' == tag' && s == s' && maybe True (== a) attributes 62 | _ -> 63 | False 64 | isWrappedBy' _ _ _ = False 65 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/WrapperSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.WrapperSpec (spec) where 4 | 5 | import Test.Hspec 6 | 7 | import Text.Seonbi.Html.Entity 8 | import Text.Seonbi.Html.Tag 9 | import Text.Seonbi.Html.Wrapper 10 | 11 | spec :: Spec 12 | spec = 13 | specify "wrap" $ 14 | wrap [Div, Article] BlockQuote " class=\"q\"" 15 | [ HtmlStartTag 16 | { tagStack = [Div, Article] 17 | , tag = P 18 | , rawAttributes = "" 19 | } 20 | , HtmlText { tagStack = [Div, Article, P], rawText = "foo" } 21 | , HtmlStartTag 22 | { tagStack = [Div, Article, P] 23 | , tag = Em 24 | , rawAttributes = "" 25 | } 26 | , HtmlCdata { tagStack = [Div, Article, P, Em], text = "bar" } 27 | , HtmlEndTag { tagStack = [Div, Article, P], tag = Em } 28 | , HtmlComment { tagStack = [Div, Article, P], comment = " baz " } 29 | , HtmlEndTag { tagStack = [Div, Article], tag = P } 30 | ] `shouldBe` 31 | [ HtmlStartTag 32 | { tagStack = [Div, Article] 33 | , tag = BlockQuote 34 | , rawAttributes = " class=\"q\"" 35 | } 36 | , HtmlStartTag 37 | { tagStack = [Div, Article, BlockQuote] 38 | , tag = P 39 | , rawAttributes = "" 40 | } 41 | , HtmlText 42 | { tagStack = [Div, Article, BlockQuote, P] 43 | , rawText = "foo" 44 | } 45 | , HtmlStartTag 46 | { tagStack = [Div, Article, BlockQuote, P] 47 | , tag = Em 48 | , rawAttributes = "" 49 | } 50 | , HtmlCdata 51 | { tagStack = [Div, Article, BlockQuote, P, Em] 52 | , text = "bar" 53 | } 54 | , HtmlEndTag { tagStack = [Div, Article, BlockQuote, P], tag = Em } 55 | , HtmlComment 56 | { tagStack = [Div, Article, BlockQuote, P] 57 | , comment = " baz " 58 | } 59 | , HtmlEndTag { tagStack = [Div, Article, BlockQuote], tag = P } 60 | , HtmlEndTag { tagStack = [Div, Article], tag = BlockQuote } 61 | ] 62 | -------------------------------------------------------------------------------- /test/Text/Seonbi/ContentTypesSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedStrings #-} 2 | module Text.Seonbi.ContentTypesSpec (spec) where 3 | 4 | import Data.Text 5 | import qualified Data.Text.Lazy as LT 6 | import Data.Text.Lazy.Builder 7 | 8 | import Test.Hspec 9 | 10 | import Text.Seonbi.Html 11 | import Text.Seonbi.ContentTypes 12 | import qualified HTMLEntities.Builder 13 | import HTMLEntities.Decoder 14 | 15 | textReverser :: (Monad m, MonadFail m) => HtmlTransformer m 16 | textReverser entities = 17 | return $ reverseText <$> entities 18 | where 19 | reverseText :: HtmlEntity -> HtmlEntity 20 | reverseText e@HtmlText { rawText = t } = 21 | e { rawText = encode $ Data.Text.reverse $ decode t } 22 | reverseText e@HtmlCdata { text = t } = 23 | e { text = Data.Text.reverse t } 24 | reverseText e = 25 | e 26 | decode :: Text -> Text 27 | decode = LT.toStrict . toLazyText . htmlEncodedText 28 | encode :: Text -> Text 29 | encode = LT.toStrict . toLazyText . 
HTMLEntities.Builder.text 30 | 31 | spec :: Spec 32 | spec = do 33 | specify "asHtmlTransformer" $ do 34 | r <- asHtmlTransformer textReverser "

foo bar
baz

" 35 | r `shouldBe` "

oofrab
zab

" 36 | specify "asXhtmlTransformer" $ do 37 | r <- asXhtmlTransformer textReverser "

foo bar
baz

" 38 | r `shouldBe` "

oofrab
zab

" 39 | specify "asPlainTextTransformer" $ do 40 | r <- asPlainTextTransformer textReverser 41 | "

foo bar
baz

" 42 | r `shouldBe` ">p/rb<>me/me< oof>p<" 43 | specify "asCommonMarkTransformer" $ do 44 | r <- asCommonMarkTransformer textReverser 45 | "# Foo\n\nBar *Baz*\nQux\n\n> Quote tag\n" 46 | r `shouldBe` "# ooF\n\n raB*zaB*\nxuQ\n\n> etouQgat\n" 47 | specify "transformWithContentType" $ do 48 | let input = "*foo* bar
" 49 | h <- transformWithContentType "text/html" textReverser input 50 | h `shouldBe` " *oof*rab
" 51 | x <- transformWithContentType "application/xhtml+xml" textReverser input 52 | x `shouldBe` " *oof*rab
" 53 | p <- transformWithContentType "text/plain" textReverser input 54 | p `shouldBe` ">rb<>me/me< *oof*" 55 | m <- transformWithContentType "text/markdown" textReverser input 56 | m `shouldBe` "*oof* rab
\n" 57 | -------------------------------------------------------------------------------- /setup/README.md: -------------------------------------------------------------------------------- 1 | `dahlia/seonbi/setup`: GitHub action to install [Seonbi] 2 | ======================================================== 3 | 4 | This action installs executables `seonbi` and `seonbi-api` during GitHub Actions 5 | workflow: 6 | 7 | ~~~ yaml 8 | - uses: dahlia/seonbi/setup@main 9 | ~~~ 10 | 11 | It installs the latest version of Seonbi by default. To explicitly specify 12 | the version to install, use the `seonbi-version` option:[^1] 13 | 14 | ~~~ yaml 15 | - uses: dahlia/seonbi/setup@main 16 | with: 17 | seonbi-version: 0.3.* 18 | ~~~ 19 | 20 | The wildcard in the version number chooses the latest released version. 21 | Also, `seonbi-version: 0.*` is equivalent to `seonbi-version: 0.*.*`, 22 | and `seonbi-version: *` is equivalent to `seonbi-version: *.*.*`. 23 | Therefore, `seonbi-version: *` means the latest version. 24 | 25 | to get the exact version number of the installed Seonbi from the later steps, 26 | use the `seonbi-version` output: 27 | 28 | ~~~ yaml 29 | - id: setup-seonbi 30 | uses: dahlia/seonbi/setup@main 31 | with: 32 | seonbi-version: * 33 | - run: | 34 | echo "Installed seonbi version:" \ 35 | "${{ steps.setup-seonbi.outputs.seonbi-version }}" 36 | shell: bash 37 | ~~~ 38 | 39 | To prevent the installed Seonbi from being added to the `PATH`, turn off 40 | the `add-to-path` option (which is turned on by default) and use 41 | the `seonbi-path` and `seonbi-api-path` outputs instead: 42 | 43 | ~~~ yaml 44 | - id: setup-seonbi 45 | uses: dahlia/seonbi/setup@main 46 | with: 47 | add-to-path: false 48 | - run: ${{ steps.setup-seonbi.outputs.seonbi-path }} README.md 49 | shell: bash 50 | ~~~ 51 | 52 | [^1]: Note that the action version and the Seonbi versions are distinct. 53 | However, it's recommended to match major and minor versions for both. 54 | 55 | [Seonbi]: https://github.com/dahlia/seonbi 56 | 57 | 58 | Input parameters 59 | ---------------- 60 | 61 | - `seonbi-version`: Version of executable binaries `seonbi` and `seonbi-api` 62 | to install. Note that asterisks can be used to choose the latest version, 63 | e.g., `1.2.*`, `1.*`, `*`. (Default: `*`.) 64 | - `add-to-path`: Whether to add the installed `seonbi` and `seonbi-api` to 65 | the `PATH`. Turned on by default. (Default: `true`.) 66 | 67 | 68 | Output parameters 69 | ----------------- 70 | 71 | - `seonbi-version`: Exact version number of the installed Seonbi. 72 | - `seonbi-path`: Absolute path of the installed executable `seonbi`. 73 | - `seonbi-api-path`: Absolute path of the installed executable `seonbi-api`. 
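The `seonbi-api-path` output can be used the same way to run the bundled HTTP API
server during a job. The step below is only a sketch under a couple of assumptions:
that `seonbi-api` starts without any arguments and that it listens on port 3800 by
default (inferred from the `fly.toml` in this repository); adjust it to the release
you actually install.

~~~ yaml
- id: setup-seonbi
  uses: dahlia/seonbi/setup@main
  with:
    add-to-path: false
- run: |
    # Start the API server in the background using the output path.
    "${{ steps.setup-seonbi.outputs.seonbi-api-path }}" &
    api_pid=$!
    sleep 2  # give the server a moment to come up
    # ... send requests to http://localhost:3800/ here (port assumed from fly.toml) ...
    kill "$api_pid"
  shell: bash
~~~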
74 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/PreservationSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.PreservationSpec (spec) where 4 | 5 | import Test.Hspec 6 | 7 | import Text.Seonbi.Html.Entity 8 | import Text.Seonbi.Html.Preservation 9 | import Text.Seonbi.Html.Tag 10 | 11 | spec :: Spec 12 | spec = do 13 | specify "isPreservedTag" $ do 14 | P `shouldNotSatisfy` isPreservedTag 15 | Em `shouldNotSatisfy` isPreservedTag 16 | Title `shouldNotSatisfy` isPreservedTag 17 | Canvas `shouldSatisfy` isPreservedTag 18 | Code `shouldSatisfy` isPreservedTag 19 | Kbd `shouldSatisfy` isPreservedTag 20 | Pre `shouldSatisfy` isPreservedTag 21 | Script `shouldSatisfy` isPreservedTag 22 | Style `shouldSatisfy` isPreservedTag 23 | Template `shouldSatisfy` isPreservedTag 24 | TextArea `shouldSatisfy` isPreservedTag 25 | specify "isPreservedTagStack" $ do 26 | [] `shouldNotSatisfy` isPreservedTagStack 27 | [P, Em] `shouldNotSatisfy` isPreservedTagStack 28 | [Html, Head, Title] `shouldNotSatisfy` isPreservedTagStack 29 | [Div, Script] `shouldSatisfy` isPreservedTagStack 30 | [Html, Head, Style] `shouldSatisfy` isPreservedTagStack 31 | [P, Kbd] `shouldSatisfy` isPreservedTagStack 32 | [Pre, Code] `shouldSatisfy` isPreservedTagStack 33 | [Template, P] `shouldSatisfy` isPreservedTagStack 34 | specify "shouldBePreserved" $ do 35 | HtmlStartTag [] P "" `shouldNotSatisfy` isPreservedEntity 36 | HtmlEndTag [] P `shouldNotSatisfy` isPreservedEntity 37 | HtmlText [] "" `shouldNotSatisfy` isPreservedEntity 38 | HtmlCdata [] "" `shouldNotSatisfy` isPreservedEntity 39 | HtmlComment [] " ... " `shouldSatisfy` isPreservedEntity 40 | HtmlStartTag [P] Em "" `shouldNotSatisfy` isPreservedEntity 41 | HtmlEndTag [P] Em `shouldNotSatisfy` isPreservedEntity 42 | HtmlText [P] "" `shouldNotSatisfy` isPreservedEntity 43 | HtmlCdata [P] "" `shouldNotSatisfy` isPreservedEntity 44 | HtmlComment [P] " ... " `shouldSatisfy` isPreservedEntity 45 | HtmlStartTag [P] Code "" `shouldSatisfy` isPreservedEntity 46 | HtmlEndTag [P] Code `shouldSatisfy` isPreservedEntity 47 | HtmlStartTag [Pre] Span "" `shouldSatisfy` isPreservedEntity 48 | HtmlEndTag [Pre] Span `shouldSatisfy` isPreservedEntity 49 | HtmlText [Pre] "" `shouldSatisfy` isPreservedEntity 50 | HtmlCdata [Pre] "" `shouldSatisfy` isPreservedEntity 51 | HtmlComment [Pre] " ... " `shouldSatisfy` isPreservedEntity 52 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Hangul.hs: -------------------------------------------------------------------------------- 1 | module Text.Seonbi.Hangul 2 | ( JamoTriple 3 | , fromJamoTriple 4 | , isHangulSyllable 5 | , toJamoTriple 6 | ) where 7 | 8 | -- $setup 9 | -- >>> import qualified Text.Show.Unicode 10 | -- >>> :set -interactive-print=Text.Show.Unicode.uprint 11 | 12 | -- | A triple of an initial consonant, a vowel, and an optional final consonant. 13 | type JamoTriple = (Char, Char, Maybe Char) 14 | 15 | -- | Checks if a character is a hangul letter and a complete syllable. 
16 | -- 17 | -- >>> isHangulSyllable '가' 18 | -- True 19 | -- >>> isHangulSyllable 'ㄱ' 20 | -- False 21 | isHangulSyllable :: Char -> Bool 22 | isHangulSyllable c = 23 | c >= '\xac00' && c <= '\xd7a3'; 24 | 25 | syllableBase :: Int 26 | syllableBase = 0xac00 27 | 28 | initialBase :: Int 29 | initialBase = 0x1100 30 | 31 | vowelBase :: Int 32 | vowelBase = 0x1161 33 | 34 | finalBase :: Int 35 | finalBase = 0x11a7 36 | 37 | vowelCount :: Int 38 | vowelCount = 21; 39 | 40 | finalCount :: Int 41 | finalCount = 28; 42 | 43 | -- | Takes a complete hangul syllable apart into consonants and a vowel. 44 | -- Returns 'Nothing' for non-hangul letters. 45 | -- 46 | -- >>> toJamoTriple '가' 47 | -- Just ('ᄀ','ᅡ',Nothing) 48 | -- >>> toJamoTriple '글' 49 | -- Just ('ᄀ','ᅳ',Just 'ᆯ') 50 | -- >>> toJamoTriple 'A' 51 | -- Nothing 52 | toJamoTriple :: Char -> Maybe JamoTriple 53 | toJamoTriple c 54 | | isHangulSyllable c = Just 55 | ( toEnum $ initialBase + ((syllable `div` finalCount) `div` vowelCount) 56 | , toEnum $ vowelBase + ((syllable `div` finalCount) `mod` vowelCount) 57 | , case syllable `mod` finalCount of 58 | 0 -> Nothing 59 | f -> Just $ toEnum (finalBase + f) 60 | ) 61 | | otherwise = Nothing 62 | where 63 | syllable :: Int 64 | syllable = fromEnum c - syllableBase 65 | 66 | -- | Composes hangul jamo triple into a hangul syllable. 67 | -- 68 | -- >>> fromJamoTriple ('ᄀ', 'ᅡ', Nothing) 69 | -- Just '가' 70 | -- >>> fromJamoTriple ('ᄀ', 'ᅳ', Just 'ᆯ') 71 | -- Just '글' 72 | fromJamoTriple :: JamoTriple -> Maybe Char 73 | fromJamoTriple (initial, vowel, final) 74 | | initialIndex < 0 = Nothing 75 | | initialIndex > 18 = Nothing 76 | | vowelIndex < 0 = Nothing 77 | | vowelIndex > 20 = Nothing 78 | | finalIndex < 0 = Nothing 79 | | finalIndex > 27 = Nothing 80 | | otherwise = Just $ toEnum $ syllableBase + 81 | (initialIndex * vowelCount + vowelIndex) * finalCount + finalIndex 82 | where 83 | initialIndex :: Int 84 | initialIndex = fromEnum initial - initialBase 85 | vowelIndex :: Int 86 | vowelIndex = fromEnum vowel - vowelBase 87 | finalIndex :: Int 88 | finalIndex = maybe 0 (\ f -> fromEnum f - finalBase) final 89 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Entity.hs: -------------------------------------------------------------------------------- 1 | module Text.Seonbi.Html.Entity 2 | ( HtmlEntity (..) 3 | , HtmlRawAttrs 4 | ) where 5 | 6 | import Data.Text 7 | 8 | import Text.Seonbi.Html.Tag (HtmlTag) 9 | import Text.Seonbi.Html.TagStack (HtmlTagStack) 10 | 11 | -- | All element attributes in a string. 12 | type HtmlRawAttrs = Text 13 | 14 | -- | An event entity emitted by 'scanHtml'. 15 | data HtmlEntity 16 | -- | Represent a token which [opens an HTML element 17 | -- ](https://www.w3.org/TR/html5/syntax.html#start-tags). 18 | -- 19 | -- Note that 'rawAttributes' is not a parsed and structured data but a raw 20 | -- string as its name implies. 21 | -- 22 | -- The 'tagStack' doesn't include the corresponding opened 'tag'. 23 | = HtmlStartTag 24 | { -- | A stack of 'HtmlTag's that represents a hierarchy of a currently 25 | -- parsing position in an 'HtmlTag' tree. 26 | tagStack :: HtmlTagStack 27 | , tag :: HtmlTag 28 | , rawAttributes :: HtmlRawAttrs 29 | } 30 | -- | Represent a token which [closes an HTML element 31 | -- ](https://www.w3.org/TR/html5/syntax.html#end-tags). 32 | -- The 'tagStack' doesn't include the corresponding closed 'tag'. 
33 | | HtmlEndTag 34 | { -- | A stack of 'HtmlTag's that represents a hierarchy of a currently 35 | -- parsing position in an 'HtmlTag' tree. 36 | tagStack :: HtmlTagStack 37 | , tag :: HtmlTag 38 | } 39 | -- | Represent a token of a text node. Note that 'rawText' is not a parsed 40 | -- and structured data but a raw string as its name implies. There can be 41 | -- continuously more than one 'HtmlText' values can be emitted even if they 42 | -- are not separated by element openings or closings. 43 | | HtmlText 44 | { -- | A stack of 'HtmlTag's that represents a hierarchy of a currently 45 | -- parsing position in an 'HtmlTag' tree. 46 | tagStack :: HtmlTagStack 47 | , rawText :: Text 48 | } 49 | -- | Represent a token of a 50 | -- [CDATA section](https://www.w3.org/TR/html5/syntax.html#cdata-sections). 51 | | HtmlCdata 52 | { -- | A stack of 'HtmlTag's that represents a hierarchy of a currently 53 | -- parsing position in an 'HtmlTag' tree. 54 | tagStack :: HtmlTagStack 55 | , text :: Text 56 | } 57 | -- | Represent a token of an 58 | -- [HTML comment](https://www.w3.org/TR/html5/syntax.html#comments). 59 | | HtmlComment 60 | { -- | A stack of 'HtmlTag's that represents a hierarchy of a currently 61 | -- parsing position in an 'HtmlTag' tree. 62 | tagStack :: HtmlTagStack 63 | , comment :: Text 64 | } 65 | deriving (Eq, Ord, Show) 66 | -------------------------------------------------------------------------------- /Setup.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedStrings #-} 2 | import Control.Monad 3 | import Prelude hiding (concat) 4 | import System.IO (Handle, IOMode (..), hClose, hSetEncoding, utf8, withFile) 5 | 6 | import Codec.Archive.Zip 7 | import Data.ByteString.Lazy (ByteString, hPut) 8 | import Data.Text 9 | import Data.Text.IO (hGetLine, hPutStrLn) 10 | import Distribution.Simple 11 | import Network.HTTP.Client 12 | import System.Directory 13 | import System.FilePath 14 | import System.IO.Temp 15 | 16 | unihanUrl :: String 17 | unihanUrl = "http://ftp.unicode.org/Public/11.0.0/ucd/Unihan.zip" 18 | 19 | kHangulPath :: FilePath 20 | kHangulPath = "src" "Text" "Seonbi" "kHangul.txt" 21 | 22 | main :: IO () 23 | main = do 24 | exist <- doesFileExist kHangulPath 25 | unless exist $ do 26 | data' <- downloadUnihan 27 | extractUnihanReadings data' $ \ txtPath -> do 28 | values <- withFile txtPath ReadMode (extractProp "kHangul") 29 | withFile kHangulPath WriteMode $ \ handle -> do 30 | hSetEncoding handle utf8 31 | forM_ values $ \ (char, value) -> 32 | hPutStrLn handle $ concat [char, "\t", value] 33 | defaultMain 34 | 35 | extractProp :: Text -> Handle -> IO [(Text, Text)] 36 | extractProp property handle = do 37 | hSetEncoding handle utf8 38 | line <- hGetLine handle 39 | case line of 40 | "" -> 41 | return [] 42 | line' -> 43 | case breakOn "\t" line' of 44 | (_, "") -> 45 | extractProp property handle 46 | (char, rest) 47 | | "U+" `isPrefixOf` char && "\t" `isPrefixOf` rest -> 48 | case breakOn "\t" $ Data.Text.tail rest of 49 | (_, "") -> 50 | extractProp property handle 51 | (prop, value) | prop == property -> 52 | ((char, value) :) <$> extractProp property handle 53 | _ -> 54 | extractProp property handle 55 | _ -> 56 | extractProp property handle 57 | 58 | 59 | extractUnihanReadings :: ByteString -> (FilePath -> IO a) -> IO a 60 | extractUnihanReadings data' callback = 61 | withSystemTempFile "Unihan.zip" $ \ zipPath handle -> do 62 | hPut handle data' 63 | hClose handle 64 | let entryName = 
"Unihan_Readings.txt" 65 | withSystemTempFile entryName $ \ txtPath handle' -> do 66 | hClose handle' 67 | sel <- mkEntrySelector entryName 68 | withArchive zipPath (saveEntry sel txtPath) 69 | callback txtPath 70 | 71 | downloadUnihan :: IO ByteString 72 | downloadUnihan = do 73 | mgr <- newManager defaultManagerSettings 74 | req <- parseRequest unihanUrl 75 | res <- httpLbs req mgr 76 | return $ responseBody res 77 | -------------------------------------------------------------------------------- /test/Text/Seonbi/FacadeSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.FacadeSpec (spec) where 4 | 5 | import Control.Monad 6 | import Data.Maybe (fromJust) 7 | 8 | import Data.Algorithm.Diff 9 | import Data.Text.Lazy 10 | import Data.Text.Lazy.IO 11 | import System.Directory 12 | import System.FilePath 13 | import Test.Hspec 14 | 15 | import Text.Seonbi.Facade 16 | 17 | dataDirPath :: FilePath 18 | dataDirPath = "test" "data" 19 | 20 | inputExtension :: String 21 | inputExtension = ".ko-Kore.html" 22 | 23 | outputExtensions :: Monad m => [(String, Configuration m a)] 24 | outputExtensions = 25 | [ (".ko-KR.html", ko_KR) 26 | , (".ko-KP.html", ko_KP) 27 | ] 28 | 29 | shouldHaveSameText :: HasCallStack => Text -> Text -> Expectation 30 | actual `shouldHaveSameText` expected = 31 | unless (actual == expected) (expectationFailure msg) 32 | where 33 | expectedLines :: [Text] 34 | expectedLines = Data.Text.Lazy.lines expected 35 | actualLines :: [Text] 36 | actualLines = Data.Text.Lazy.lines actual 37 | diffLines :: [Diff Text] 38 | diffLines = getDiff expectedLines actualLines 39 | diff :: Text 40 | diff = Data.Text.Lazy.unlines 41 | [ case d of 42 | First line -> "- " <> line 43 | Second line -> "+ " <> line 44 | Both line _ -> " " <> line 45 | | d <- diffLines 46 | ] 47 | msg :: String 48 | msg = "Two values are not equal:\n\n--- expected\n+++ actual\n\n" ++ 49 | unpack diff 50 | 51 | spec :: Spec 52 | spec = do 53 | testData <- runIO $ do 54 | files <- listDirectory dataDirPath 55 | let inputFiles = [f | f <- files, inputExtension `isExtensionOf` f] 56 | testFiles <- filterM 57 | (\(_, o, _) -> doesFileExist (dataDirPath o)) 58 | [ (i, dropExtension i -<.> oExt, oCfg) 59 | | i <- inputFiles 60 | , (oExt, oCfg) <- outputExtensions 61 | ] 62 | forM testFiles $ \ (input, output, cfg) -> do 63 | inputData <- Data.Text.Lazy.IO.readFile (dataDirPath input) 64 | outputData <- Data.Text.Lazy.IO.readFile (dataDirPath output) 65 | return (input, output, inputData, outputData, cfg) 66 | describe "transformHtmlLazyText" $ 67 | forM_ testData $ \ (iname, oname, input, output, cfg) -> 68 | specify (iname ++ " -> " ++ oname) $ do 69 | let noOpResult = fromJust $ transformHtmlLazyText noOp input 70 | noOpResult `shouldHaveSameText` input 71 | let cfgResult = fromJust $ transformHtmlLazyText cfg input 72 | cfgResult `shouldHaveSameText` output 73 | where 74 | noOp :: Monad m => Configuration m a 75 | noOp = Configuration 76 | { quote = Nothing 77 | , cite = Nothing 78 | , arrow = Nothing 79 | , ellipsis = False 80 | , emDash = False 81 | , stop = Nothing 82 | , hanja = Nothing 83 | , contentType = "text/html" 84 | , debugLogger = Nothing 85 | } 86 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/TextNormalizer.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE 
LambdaCase #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.TextNormalizer 4 | ( escapeHtmlEntities 5 | , normalizeCdata 6 | , normalizeText 7 | ) where 8 | 9 | import Control.Exception 10 | import Data.List 11 | 12 | import Data.Text hiding (groupBy, map) 13 | 14 | import Text.Seonbi.Html.Entity 15 | 16 | -- | As 'scanHtml' may emit two or more continuous 'HtmlText' fragments even 17 | -- if these can be represented as only one 'HtmlText' fragment, it makes 18 | -- postprocessing hard. 19 | -- 20 | -- The 'normalizeText' function concatenates such continuous 'HtmlText' 21 | -- fragments into one if possible so that postprocessing can be easy: 22 | -- 23 | -- >>> :set -XOverloadedStrings -XOverloadedLists 24 | -- >>> normalizeText [HtmlText [] "Hello, ", HtmlText [] "world!"] 25 | -- [HtmlText {tagStack = fromList [], rawText = "Hello, world!"}] 26 | -- 27 | -- It also transforms all 'HtmlCdata' fragments into an 'HtmlText' together. 28 | -- 29 | -- >>> :{ 30 | -- normalizeText [ HtmlText [] "foo " 31 | -- , HtmlCdata [] "", HtmlText [] " baz!" 32 | -- ] 33 | -- :} 34 | -- [HtmlText {tagStack = fromList [], rawText = "foo <bar> baz!"}] 35 | normalizeText :: [HtmlEntity] -> [HtmlEntity] 36 | normalizeText fragments = 37 | [ case map normalizeCdata frags of 38 | [f] -> 39 | f 40 | frags'@(HtmlText { tagStack = s }:_) -> 41 | HtmlText 42 | { tagStack = s 43 | , rawText = Data.Text.concat $ map rawText frags' 44 | } 45 | frags' -> 46 | throw $ AssertionFailed 47 | ("Unexpected error occured; grouping does not work well: " ++ 48 | show frags') 49 | | frags <- groupBy isSibling fragments 50 | ] 51 | where 52 | isSibling :: HtmlEntity -> HtmlEntity -> Bool 53 | isSibling HtmlText { tagStack = a } HtmlText { tagStack = b } = a == b 54 | isSibling HtmlText { tagStack = a } HtmlCdata { tagStack = b } = a == b 55 | isSibling HtmlCdata { tagStack = a } HtmlText { tagStack = b } = a == b 56 | isSibling HtmlCdata { tagStack = a } HtmlCdata { tagStack = b } = a == b 57 | isSibling _ _ = False 58 | 59 | -- | Transform a given 'HtmlCdata' node into an equivalent 'HtmlText' node. 60 | -- 61 | -- >>> import Text.Seonbi.Html.Tag 62 | -- >>> normalizeCdata HtmlCdata { tagStack = [P], text = "

" } 63 | -- HtmlText {tagStack = fromList [P], rawText = "<p id="foo">"} 64 | normalizeCdata :: HtmlEntity -> HtmlEntity 65 | normalizeCdata HtmlCdata { tagStack = s, text = t } = 66 | HtmlText { tagStack = s, rawText = escapeHtmlEntities t } 67 | normalizeCdata entity = entity 68 | 69 | -- | Escape special (control) characters into corresponding character entities 70 | -- in the given HTML text. 71 | -- 72 | -- >>> escapeHtmlEntities "" 73 | -- "<foo & "bar">" 74 | escapeHtmlEntities :: Text -> Text 75 | escapeHtmlEntities = 76 | Data.Text.concatMap $ \ case 77 | '<' -> "<" 78 | '>' -> ">" 79 | '&' -> "&" 80 | '"' -> """ 81 | c -> Data.Text.singleton c 82 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/TextNormalizerSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.TextNormalizerSpec (spec) where 4 | 5 | import Control.Monad 6 | 7 | import Test.Hspec 8 | 9 | import Text.Seonbi.Html.Entity 10 | import Text.Seonbi.Html.Tag 11 | import Text.Seonbi.Html.TagStack 12 | import Text.Seonbi.Html.TextNormalizer 13 | 14 | spec :: Spec 15 | spec = do 16 | specify "normalizeText" $ 17 | normalizeText 18 | [ HtmlText { tagStack = [], rawText = "foo " } 19 | , HtmlText { tagStack = [], rawText = "& bar" } 20 | , HtmlCdata { tagStack = [], text = " & baz " } 21 | , HtmlStartTag { tagStack = [], tag = P, rawAttributes = "" } 22 | , HtmlText { tagStack = [P], rawText = "qux " } 23 | , HtmlCdata { tagStack = [P], text = "& \"quux\"" } 24 | , HtmlEndTag { tagStack = [], tag = P } 25 | , HtmlCdata { tagStack = [], text = " " } 26 | ] `shouldBe` 27 | [ HtmlText { tagStack = [], rawText = "foo & bar & baz " } 28 | , HtmlStartTag { tagStack = [], tag = P, rawAttributes = "" } 29 | , HtmlText 30 | { tagStack = [P] 31 | , rawText = "qux & "quux"" 32 | } 33 | , HtmlEndTag { tagStack = [], tag = P } 34 | , HtmlText { tagStack = [], rawText = " <end>" } 35 | ] 36 | 37 | describe "normalizeCdata" $ do 38 | let s1 = [] :: HtmlTagStack 39 | let s2 = [Div, P] :: HtmlTagStack 40 | specify "HtmlStartTag" $ do 41 | let entity1 = HtmlStartTag 42 | { tagStack = s1 43 | , tag = P 44 | , rawAttributes = "" 45 | } 46 | normalizeCdata entity1 `shouldBe` entity1 47 | let entity2 = HtmlStartTag 48 | { tagStack = s2 49 | , tag = P 50 | , rawAttributes = " class=\"entity2\"" 51 | } 52 | normalizeCdata entity2 `shouldBe` entity2 53 | let stacks = [s1, s2] :: [HtmlTagStack] 54 | forM_ stacks $ \ s -> do 55 | specify ("HtmlEndTag: " ++ show s) $ do 56 | let e = HtmlEndTag { tagStack = s, tag = P } 57 | normalizeCdata e `shouldBe` e 58 | specify ("HtmlText: " ++ show s) $ do 59 | let e = HtmlText { tagStack = s, rawText = "foo & bar" } 60 | normalizeCdata e `shouldBe` e 61 | specify ("HtmlComment: " ++ show s) $ do 62 | let e = HtmlComment { tagStack = s, comment = "foo" } 63 | normalizeCdata e `shouldBe` e 64 | specify ("HtmlCdata: " ++ show s) $ do 65 | let e = HtmlCdata { tagStack = s, text = "

foo & bar

" } 66 | normalizeCdata e `shouldBe` 67 | HtmlText 68 | { tagStack = s 69 | , rawText = "<p>foo & bar</p>" 70 | } 71 | 72 | specify "escapeHtmlEntities" $ do 73 | escapeHtmlEntities "

" `shouldBe` 74 | "<p id="foo">" 75 | escapeHtmlEntities "AT&T" `shouldBe` 76 | "AT&T" 77 | -------------------------------------------------------------------------------- /scripts/deno/test.ts: -------------------------------------------------------------------------------- 1 | import { 2 | Configuration, 3 | DEFAULT_CONFIGURATION, 4 | Options, 5 | Seonbi, 6 | transform, 7 | } from "./mod.ts"; 8 | import { assertEquals } from "https://deno.land/std@0.106.0/testing/asserts.ts"; 9 | 10 | const hanjaInParens: Options = { 11 | contentType: "text/html", 12 | quote: "CurvedQuotes", 13 | cite: null, 14 | arrow: null, 15 | ellipsis: false, 16 | emDash: false, 17 | stop: null, 18 | hanja: { 19 | rendering: "HanjaInParentheses", 20 | reading: { 21 | initialSoundLaw: true, 22 | useDictionaries: ["kr-stdict"], 23 | dictionary: {}, 24 | }, 25 | }, 26 | }; 27 | 28 | const customDict: Options = { 29 | ...hanjaInParens, 30 | hanja: { 31 | rendering: "HanjaInParentheses", 32 | reading: { 33 | initialSoundLaw: true, 34 | useDictionaries: [], 35 | dictionary: { "言語": "말", "文字": "글" }, 36 | }, 37 | }, 38 | }; 39 | 40 | let config: Configuration = { 41 | ...DEFAULT_CONFIGURATION, 42 | process: { distType: "nightly" }, 43 | }; 44 | 45 | try { 46 | const binPath = Deno.env.get("SEONBI_API"); 47 | if (binPath != null && "process" in config) config.process = { binPath }; 48 | } catch (e) { 49 | if (!(e instanceof Deno.errors.PermissionDenied)) throw e; 50 | } 51 | 52 | try { 53 | const port = Deno.env.get("SEONBI_API_PORT"); 54 | if (port != null && port.match(/^[0-9]+$/) && "process" in config) { 55 | config.port = parseInt(port); 56 | } 57 | } catch (e) { 58 | if (!(e instanceof Deno.errors.PermissionDenied)) throw e; 59 | } 60 | 61 | try { 62 | const apiUrl = Deno.env.get("SEONBI_API_URL"); 63 | if (apiUrl != null) config = { apiUrl }; 64 | } catch (e) { 65 | if (!(e instanceof Deno.errors.PermissionDenied)) throw e; 66 | } 67 | 68 | Deno.test("transform()", async () => { 69 | const koKr = await transform("

言語와 文字

", config); 70 | assertEquals(koKr, "

언어와 문자

"); 71 | }); 72 | 73 | Deno.test("Seonbi#start()", async () => { 74 | const seonbi = new Seonbi(config); 75 | await seonbi.start(); 76 | try { 77 | for (let i = 0; i < 5; i++) { 78 | try { 79 | const response = await fetch(seonbi.apiUrl); 80 | assertEquals( 81 | { message: "Unsupported method: GET", success: false }, 82 | await response.json(), 83 | ); 84 | break; 85 | } catch (e) { 86 | if ( 87 | !(e instanceof TypeError) || 88 | e.message.indexOf("os error 61") < 0 && 89 | e.message.indexOf("os error 111") < 0 90 | ) { 91 | throw e; 92 | } 93 | 94 | return new Promise((r) => setTimeout(r, 1000)); 95 | } 96 | } 97 | } finally { 98 | await seonbi.stop(); 99 | } 100 | }); 101 | 102 | function withSeonbi(fn: (s: Seonbi) => Promise): () => Promise { 103 | return async () => { 104 | const seonbi = new Seonbi(config); 105 | await seonbi.start(); 106 | try { 107 | await fn(seonbi); 108 | } finally { 109 | await seonbi.stop(); 110 | } 111 | }; 112 | } 113 | 114 | function testWithSeonbi(label: string, fn: (s: Seonbi) => Promise): void { 115 | Deno.test(label, withSeonbi(fn)); 116 | } 117 | 118 | testWithSeonbi("Seonbi#transform()", async (seonbi: Seonbi) => { 119 | assertEquals( 120 | await seonbi.transform("

言語와 文字

"), 121 | "

언어와 문자

", 122 | ); 123 | assertEquals( 124 | await seonbi.transform("

言語와 文字

", hanjaInParens), 125 | "

언어(言語)와 문자(文字)

", 126 | ); 127 | assertEquals( 128 | await seonbi.transform("

言語와 文字

", customDict), 129 | "

말(言語)와 글(文字)

", 130 | ); 131 | }); 132 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Unihan/KHangul.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE CPP #-} 2 | {-# LANGUAGE FlexibleInstances #-} 3 | {-# LANGUAGE LambdaCase #-} 4 | {-# LANGUAGE OverloadedStrings #-} 5 | {-# LANGUAGE TemplateHaskell #-} 6 | {-# LANGUAGE TypeSynonymInstances #-} 7 | module Text.Seonbi.Unihan.KHangul 8 | ( CharacterSet (..) 9 | , HanjaReadings 10 | , HanjaReadingCitation (..) 11 | , KHangulData 12 | , Purpose (..) 13 | , kHangulData 14 | , kHangulData' 15 | ) where 16 | 17 | import Data.Either 18 | 19 | import Data.Aeson 20 | import Data.Attoparsec.Text 21 | import Data.ByteString.Lazy (fromStrict) 22 | import Data.FileEmbed 23 | import Data.Map.Strict 24 | import Data.Set hiding (empty) 25 | import System.FilePath (takeDirectory, ()) 26 | 27 | -- $setup 28 | -- >>> import qualified Text.Show.Unicode 29 | -- >>> :set -interactive-print=Text.Show.Unicode.uprint 30 | 31 | -- | Maps all Hanja characters to their possible readings. 32 | type KHangulData = Map Char HanjaReadings 33 | 34 | -- | All readings of a Hanja character. 35 | type HanjaReadings = Map Char HanjaReadingCitation 36 | 37 | -- | Represents what standard a reading of character belongs to and a purpose 38 | -- of the reading. 39 | data HanjaReadingCitation = 40 | HanjaReadingCitation CharacterSet (Set Purpose) deriving (Eq, Ord, Show) 41 | 42 | -- | Represents character set standards for Korean writing system. 43 | data CharacterSet 44 | -- | KS X 1001 (정보 교환용 부호계). 45 | = KS_X_1001 46 | -- | KS X 1002 (정보 교환용 부호 확장 세트). 47 | | KS_X_1002 48 | -- | Represents that a Hanja character is not included in any Korean 49 | -- character set standards. 50 | | NonStandard 51 | deriving (Eq, Ord, Show) 52 | 53 | -- | Represents purposes of Hanja characters. 54 | data Purpose 55 | -- | Basic Hanja for educational use (漢文敎育用基礎漢字), a subset of 56 | -- Hanja defined in 1972 by a South Korean standard for educational use. 57 | = Education 58 | -- | Hanja for personal names (人名用漢字). 59 | | PersonalName 60 | deriving (Eq, Ord, Show) 61 | 62 | citationParser :: Parser HanjaReadingCitation 63 | citationParser = do 64 | charset' <- option NonStandard charset 65 | purposes <- many' purpose 66 | return $ HanjaReadingCitation charset' $ Data.Set.fromList purposes 67 | where 68 | charset :: Parser CharacterSet 69 | charset = do 70 | d <- digit 71 | case d of 72 | '0' -> return KS_X_1001 73 | '1' -> return KS_X_1002 74 | c -> fail ("Invalid kHangul character set code: " ++ show c) 75 | purpose :: Parser Purpose 76 | purpose = do 77 | l <- letter 78 | case l of 79 | 'E' -> return Education 80 | 'N' -> return PersonalName 81 | c -> fail ("Invalid kHangul purpose code: " ++ show c) 82 | 83 | instance FromJSON HanjaReadingCitation where 84 | parseJSON = 85 | withText "kHangul value (e.g., 0E, 1N, 0EN)" $ \ t -> 86 | case parseOnly (citationParser <* endOfInput) t of 87 | Right cite -> return cite 88 | Left msg -> fail msg 89 | 90 | kHangulData' :: Either String KHangulData 91 | kHangulData' = eitherDecode $ 92 | fromStrict $(embedFile $ takeDirectory __FILE__ "kHangul.json") 93 | 94 | -- | Data that map Hanja characters to their corresponding kHangul entries 95 | -- (i.e., Hanja readings and citations). 
96 | -- 97 | -- >>> import Data.Map.Strict as M 98 | -- >>> let Just entries = M.lookup '天' kHangulData 99 | -- >>> entries 100 | -- fromList [('천',HanjaReadingCitation KS_X_1001 (fromList [Education]))] 101 | kHangulData :: KHangulData 102 | kHangulData = fromRight empty kHangulData' 103 | 104 | {- HLINT ignore "Unused LANGUAGE pragma" -} 105 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Printer.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE LambdaCase #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.Printer 4 | ( printHtml 5 | , printText 6 | , printXhtml 7 | ) where 8 | 9 | import Data.Char 10 | import Data.List 11 | 12 | import qualified Data.Text 13 | import Data.Text.Lazy 14 | import Data.Text.Lazy.Builder 15 | import HTMLEntities.Decoder 16 | 17 | import Text.Seonbi.Html.Entity 18 | import Text.Seonbi.Html.Tag 19 | 20 | -- $setup 21 | -- >>> :set -XOverloadedStrings 22 | -- >>> import Text.Seonbi.Html.Scanner 23 | -- >>> :set -interactive-print=Text.Show.Unicode.uprint 24 | 25 | -- | Print the list of 'HtmlEntity' into a lazy 'Text'. 26 | -- 27 | -- >>> let Done "" tokens = scanHtml "

Hello,
\nworld!

" 28 | -- >>> printHtml tokens 29 | -- "

Hello,
\nworld!

" 30 | printHtml :: [HtmlEntity] -> Text 31 | printHtml = printHtml' False 32 | 33 | -- | Similar to 'printHtml' except it renders void (self-closing) tags as 34 | -- like @
@ instead of @
@. 35 | -- 36 | -- >>> let Done "" tokens = scanHtml "

Hello,
\nworld!

" 37 | -- >>> printXhtml tokens 38 | -- "

Hello,
\nworld!

" 39 | -- 40 | -- Note that normal tags are not rendered as self-closed; only void tags 41 | -- according to HTML specification are: 42 | -- 43 | -- >>> let Done "" tokens' = scanHtml "


" 44 | -- >>> printXhtml tokens' 45 | -- "


" 46 | printXhtml :: [HtmlEntity] -> Text 47 | printXhtml = printHtml' True 48 | 49 | printHtml' :: Bool -> [HtmlEntity] -> Text 50 | printHtml' xhtml = 51 | Data.Text.Lazy.concat . Prelude.concatMap render . Data.List.groupBy isVoid 52 | where 53 | isVoid :: HtmlEntity -> HtmlEntity -> Bool 54 | isVoid (HtmlStartTag stck tg _) (HtmlEndTag stck' tg') = 55 | htmlTagKind tg == Void && stck == stck' && tg == tg' 56 | isVoid _ _ = False 57 | render :: [HtmlEntity] -> [Text] 58 | render [a@HtmlStartTag { tag = t, rawAttributes = at }, b@HtmlEndTag {}] = 59 | if isVoid a b 60 | then 61 | [ "<" 62 | , fromStrict (htmlTagName t) 63 | , renderAttrs at 64 | , if xhtml then "/>" else ">" 65 | ] 66 | else e a ++ e b 67 | render entities = Prelude.concatMap e entities 68 | e :: HtmlEntity -> [Text] 69 | e HtmlStartTag { tag = t, rawAttributes = a } = 70 | ["<", fromStrict (htmlTagName t), renderAttrs a, ">"] 71 | e HtmlEndTag { tag = t } = [""] 72 | e HtmlText { rawText = t } = [fromStrict t] 73 | e HtmlCdata { text = t } = [""] 74 | e HtmlComment { comment = c } = [""] 75 | renderAttrs :: Data.Text.Text -> Text 76 | renderAttrs "" = "" 77 | renderAttrs attrs 78 | | isSpace (Data.Text.head attrs) = fromStrict attrs 79 | | otherwise = ' ' `cons` fromStrict attrs 80 | 81 | -- | Print only the text contents (including CDATA sections) without tags 82 | -- into a lazy 'Text'. 83 | -- 84 | -- >>> let Done "" tokens = scanHtml "

Hello,
\nworld!

" 85 | -- >>> printText tokens 86 | -- "Hello,\nworld!" 87 | -- 88 | -- Entities are decoded: 89 | -- 90 | -- >>> let Done "" tokens = scanHtml "

<>"&

" 91 | -- >>> printText tokens 92 | -- "<>\"&" 93 | printText :: [HtmlEntity] -> Text 94 | printText [] = Data.Text.Lazy.empty 95 | printText (x:xs) = 96 | render x <> printText xs 97 | where 98 | render :: HtmlEntity -> Text 99 | render = \ case 100 | HtmlText { rawText = t } -> toLazyText $ htmlEncodedText t 101 | HtmlCdata { text = t } -> fromStrict t 102 | _ -> Data.Text.Lazy.empty 103 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/LangSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.LangSpec (spec) where 4 | 5 | import Test.Hspec 6 | 7 | import Text.Seonbi.Html 8 | import Text.Seonbi.Html.Lang 9 | 10 | source :: [HtmlEntity] 11 | source = 12 | [ HtmlStartTag 13 | { tagStack = [] 14 | , tag = P 15 | , rawAttributes = "id=\"foo\" lang=\"en\"" 16 | } 17 | , HtmlText { tagStack = [P], rawText = "English" } 18 | , HtmlEndTag { tagStack = [], tag = P } 19 | , HtmlStartTag { tagStack = [], tag = Div, rawAttributes = "" } 20 | , HtmlStartTag 21 | { tagStack = [Div] 22 | , tag = P 23 | , rawAttributes = "class=bar lang=ja" 24 | } 25 | , HtmlStartTag { tagStack = [Div, P], tag = B, rawAttributes = "" } 26 | , HtmlText { tagStack = [Div, P, B], rawText = "日本語" } 27 | , HtmlEndTag { tagStack = [Div, P], tag = B } 28 | , HtmlStartTag 29 | { tagStack = [Div, P] 30 | , tag = Span 31 | , rawAttributes = "lang='yue-Hant'" 32 | } 33 | , HtmlText { tagStack = [Div, P, Span], rawText = "與" } 34 | , HtmlStartTag { tagStack = [Div, P, Span], tag = B, rawAttributes = "" } 35 | , HtmlText { tagStack = [Div, P, Span, B], rawText = "與粵語" } 36 | , HtmlEndTag { tagStack = [Div, P, Span], tag = B } 37 | , HtmlEndTag { tagStack = [Div, P], tag = Span } 38 | , HtmlEndTag { tagStack = [Div], tag = P } 39 | , HtmlEndTag { tagStack = [], tag = Div } 40 | ] 41 | 42 | annotated :: [LangHtmlEntity] 43 | annotated = 44 | [ LangHtmlEntity 45 | (Just "en") 46 | HtmlStartTag 47 | { tagStack = [] 48 | , tag = P 49 | , rawAttributes = "id=\"foo\" lang=\"en\"" 50 | } 51 | , LangHtmlEntity 52 | (Just "en") 53 | HtmlText { tagStack = [P], rawText = "English" } 54 | , LangHtmlEntity (Just "en") HtmlEndTag { tagStack = [], tag = P } 55 | , LangHtmlEntity 56 | Nothing 57 | HtmlStartTag { tagStack = [], tag = Div, rawAttributes = "" } 58 | , LangHtmlEntity 59 | (Just "ja") 60 | HtmlStartTag 61 | { tagStack = [Div] 62 | , tag = P 63 | , rawAttributes = "class=bar lang=ja" 64 | } 65 | , LangHtmlEntity 66 | (Just "ja") 67 | HtmlStartTag { tagStack = [Div, P], tag = B, rawAttributes = "" } 68 | , LangHtmlEntity 69 | (Just "ja") 70 | HtmlText { tagStack = [Div, P, B], rawText = "日本語" } 71 | , LangHtmlEntity (Just "ja") HtmlEndTag { tagStack = [Div, P], tag = B } 72 | , LangHtmlEntity 73 | (Just "yue-hant") 74 | HtmlStartTag 75 | { tagStack = [Div, P] 76 | , tag = Span 77 | , rawAttributes = "lang='yue-Hant'" 78 | } 79 | , LangHtmlEntity 80 | (Just "yue-hant") 81 | HtmlText { tagStack = [Div, P, Span], rawText = "與" } 82 | , LangHtmlEntity 83 | (Just "yue-hant") 84 | HtmlStartTag { tagStack = [Div, P, Span], tag = B, rawAttributes = "" } 85 | , LangHtmlEntity 86 | (Just "yue-hant") 87 | HtmlText { tagStack = [Div, P, Span, B], rawText = "與粵語" } 88 | , LangHtmlEntity 89 | (Just "yue-hant") 90 | HtmlEndTag { tagStack = [Div, P, Span], tag = B } 91 | , LangHtmlEntity 92 | (Just "yue-hant") 93 | HtmlEndTag { tagStack = [Div, P], tag = Span } 
94 | , LangHtmlEntity (Just "ja") HtmlEndTag { tagStack = [Div] , tag = P } 95 | , LangHtmlEntity Nothing HtmlEndTag { tagStack = [], tag = Div } 96 | ] 97 | 98 | spec :: Spec 99 | spec = do 100 | specify "extractLang" $ do 101 | extractLang "" `shouldBe` Nothing 102 | extractLang "lang=en" `shouldBe` Just "en" 103 | extractLang "lang=en-US" `shouldBe` Just "en-us" 104 | extractLang "lang='ko-KR'" `shouldBe` Just "ko-kr" 105 | extractLang "lang=\"zh-Hant\"" `shouldBe` Just "zh-hant" 106 | extractLang "lang=\"yue-Hans-HK\"" `shouldBe` Just "yue-hans-hk" 107 | extractLang "id=\"foo\" lang=\"en\"" `shouldBe` Just "en" 108 | extractLang "id=\"foo\" lang=zh-CN class=bar" `shouldBe` Just "zh-cn" 109 | specify "annotateWithLang" $ do 110 | annotateWithLang [] `shouldBe` [] 111 | annotateWithLang source `shouldBe` annotated 112 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Clipper.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE LambdaCase #-} 2 | module Text.Seonbi.Html.Clipper 3 | ( clipPrefixText 4 | , clipSuffixText 5 | , clipText 6 | ) where 7 | 8 | import Control.Monad 9 | import Data.List (dropWhileEnd) 10 | 11 | import Data.Text 12 | 13 | import Text.Seonbi.Html 14 | 15 | -- | Clip the given prefix text and suffix text from the HTML fragments. 16 | -- It simply is composed of 'clipPrefixText' and 'clipSuffixText' functions. 17 | -- It returns 'Nothing' if any of a prefix and a suffix does not match. 18 | clipText :: Text -> Text -> [HtmlEntity] -> Maybe [HtmlEntity] 19 | clipText prefix suffix = 20 | clipSuffixText suffix <=< clipPrefixText prefix 21 | 22 | -- | Clip the given prefix text from the HTML fragments. If its first 23 | -- text element does not have the same prefix, or the first element is not 24 | -- an 'HtmlText' node, or the list of HTML fragments have nothing at all, 25 | -- it returns 'Nothing'. 26 | -- 27 | -- >>> :set -XOverloadedLists 28 | -- >>> :set -XOverloadedStrings 29 | -- >>> clipPrefixText "foo" [HtmlText [] "bar", HtmlStartTag [] P ""] 30 | -- Nothing 31 | -- >>> clipPrefixText "foo" [HtmlStartTag [] P "", HtmlText [] "foo"] 32 | -- Nothing 33 | -- >>> clipPrefixText "foo" [] 34 | -- Nothing 35 | -- 36 | -- If the first element is an 'HtmlText' node, and its 'rawText' contains 37 | -- the common prefix text, it returns a 'Just' value holding a list of 38 | -- HTML fragments with the common prefix removed. 39 | -- 40 | -- >>> clipPrefixText "foo" [HtmlText [] "foobar", HtmlStartTag [] P ""] 41 | -- Just [HtmlText {... 
"bar"},HtmlStartTag {...}] 42 | -- >>> clipPrefixText "foo" [HtmlText [] "foo", HtmlStartTag [] P ""] 43 | -- Just [HtmlStartTag {..., tag = P, ...}] 44 | -- 45 | -- A given text is treated as a raw text, which means even if some HTML 46 | -- entities refer to the same characters it may fails to match unless 47 | -- they share the exactly same representation, e.g.: 48 | -- 49 | -- >>> clipPrefixText "&" [HtmlText [] "&"] 50 | -- Nothing 51 | -- 52 | -- In the same manner, it doesn't find a prefix from 'HtmlCdata', e.g.: 53 | -- 54 | -- >>> clipPrefixText "foo" [HtmlCdata [] "foo", HtmlStartTag [] P ""] 55 | -- Nothing 56 | -- 57 | -- In order to remove a prefix from both 'HtmlText' and 'HtmlCdata', 58 | -- apply 'normalizeText' first so that all 'HtmlCdata' entities are transformed 59 | -- to equivalent 'HtmlText' entities: 60 | -- 61 | -- >>> import Text.Seonbi.Html.TextNormalizer (normalizeText) 62 | -- >>> let normalized = normalizeText [HtmlCdata [] "foo", HtmlStartTag [] P ""] 63 | -- >>> clipPrefixText "foo" normalized 64 | -- Just [HtmlStartTag {..., tag = P, ...}] 65 | -- 66 | -- Plus, it works even if HTML fragments contain some 'HtmlComment' entities, 67 | -- but these are not touched at all, e.g.: 68 | -- 69 | -- >>> clipPrefixText "bar" [HtmlComment [] "foo", HtmlText [] "barbaz"] 70 | -- Just [HtmlComment {... "foo"},HtmlText {... "baz"}] 71 | clipPrefixText :: Text -> [HtmlEntity] -> Maybe [HtmlEntity] 72 | clipPrefixText prefix [] 73 | | Data.Text.null prefix = Just [] 74 | | otherwise = Nothing 75 | clipPrefixText prefix (x@HtmlComment {} : xs) = 76 | (x :) <$> clipPrefixText prefix xs 77 | clipPrefixText prefix (x@HtmlText { rawText = rawText' } : xs) 78 | | prefix == rawText' = Just xs 79 | | prefix `isPrefixOf` rawText' = Just $ 80 | x { rawText = Data.Text.drop (Data.Text.length prefix) rawText' } : xs 81 | | otherwise = Nothing 82 | clipPrefixText _ _ = Nothing 83 | 84 | -- | Clip the given suffix text from the HTML fragments, in the same manner 85 | -- to 'clipPrefixText'. 86 | clipSuffixText :: Text -> [HtmlEntity] -> Maybe [HtmlEntity] 87 | clipSuffixText suffix [] 88 | | Data.Text.null suffix = Just [] 89 | | otherwise = Nothing 90 | clipSuffixText suffix entities = 91 | case Prelude.last entities' of 92 | e@HtmlText { rawText = rawText' } 93 | | suffix == rawText' -> Just (init' ++ comments) 94 | | suffix `isSuffixOf` rawText' -> 95 | let 96 | sLen = Data.Text.length suffix 97 | rtLen = Data.Text.length rawText' 98 | clipped = Data.Text.take (rtLen - sLen) rawText' 99 | in 100 | Just (init' ++ e { rawText = clipped } : comments) 101 | | otherwise -> Nothing 102 | _ -> Nothing 103 | where 104 | entities' :: [HtmlEntity] 105 | entities' = (`Data.List.dropWhileEnd` entities) $ \ case 106 | HtmlComment {} -> True 107 | _ -> False 108 | init' :: [HtmlEntity] 109 | init' = Prelude.init entities' 110 | comments :: [HtmlEntity] 111 | comments = Prelude.drop (Prelude.length entities') entities 112 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Trie.hs: -------------------------------------------------------------------------------- 1 | {-# OPTIONS_GHC -fno-warn-orphans #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | {-# LANGUAGE RankNTypes #-} 4 | {-# LANGUAGE TypeFamilies #-} 5 | -- | A trie from 'Text' keys to values. 
6 | module Text.Seonbi.Trie 7 | ( Trie 8 | , elems 9 | , empty 10 | , fromList 11 | , insert 12 | , keys 13 | , lookup 14 | , member 15 | , mergeBy 16 | , null 17 | , singleton 18 | , size 19 | , toList 20 | , unionL 21 | , unionR 22 | ) where 23 | 24 | import Prelude hiding (lookup, null) 25 | 26 | import Control.Monad (ap) 27 | import qualified GHC.Exts 28 | 29 | import Data.ByteString (ByteString) 30 | import Data.Text hiding (empty, null, singleton) 31 | import Data.Text.Encoding (encodeUtf8, decodeUtf8) 32 | import qualified Data.Trie as BTrie 33 | 34 | -- | A trie from 'Text' keys to 'a' values. 35 | newtype Trie a 36 | = Trie (BTrie.Trie a) 37 | deriving (Eq, Show) 38 | 39 | encodeKey :: Text -> ByteString 40 | encodeKey = encodeUtf8 41 | 42 | decodeKey :: ByteString -> Text 43 | decodeKey = decodeUtf8 44 | 45 | -- | The empty trie. 46 | empty :: Trie a 47 | empty = Trie BTrie.empty 48 | 49 | -- | Checks if the trie is empty. 50 | null :: Trie a -> Bool 51 | null (Trie btrie) = BTrie.null btrie 52 | 53 | -- | Constructs a singleton trie. 54 | singleton :: Text -> a -> Trie a 55 | singleton key value = Trie $ BTrie.singleton (encodeKey key) value 56 | 57 | -- | Gets the number of elements in the trie. 58 | size :: Trie a -> Int 59 | size (Trie btrie) = BTrie.size btrie 60 | 61 | fromList' :: [(Text, a)] -> Trie a 62 | fromList' list = Trie $ BTrie.fromList [(encodeKey k, v) | (k, v) <- list] 63 | 64 | toList' :: Trie a -> [(Text, a)] 65 | toList' (Trie btrie) = [(decodeKey k, v) | (k, v) <- BTrie.toList btrie] 66 | 67 | -- | Converts a list of associated pairs into a trie. For duplicate keys, 68 | -- values earlier in the list shadow later ones. 69 | fromList :: [(Text, a)] -> Trie a 70 | fromList = fromList' 71 | 72 | -- | Converts a trie into a list of associated pairs. Keys will be ordered. 73 | toList :: Trie a -> [(Text, a)] 74 | toList = toList' 75 | 76 | -- | Lists all keys in the trie. Keys will be ordered. 77 | keys :: Trie a -> [Text] 78 | keys (Trie btrie) = Prelude.map decodeKey $ BTrie.keys btrie 79 | 80 | -- | Lists all values in the trie. Values are ordered by their associated keys. 81 | elems :: Trie a -> [a] 82 | elems (Trie btrie) = BTrie.elems btrie 83 | 84 | -- | Gets the value associated with a key if it exists. 85 | lookup :: Text -> Trie a -> Maybe a 86 | lookup key (Trie btrie) = BTrie.lookup (encodeKey key) btrie 87 | 88 | -- | Checks if a key has a value in the trie. 89 | member :: Text -> Trie a -> Bool 90 | member key (Trie btrie) = BTrie.member (encodeKey key) btrie 91 | 92 | -- | Inserts a new key into the trie. 93 | insert 94 | :: Text 95 | -- ^ A new key to insert. If there is already the same key in the trie, 96 | -- the existing value is overwritten by the new value. 97 | -> a 98 | -- ^ A value associated to the key. 99 | -> Trie a 100 | -- ^ An existing trie. 101 | -> Trie a 102 | -- ^ The new trie with the inserted key. 103 | insert key value (Trie btrie) = Trie $ BTrie.insert (encodeKey key) value btrie 104 | 105 | -- | Combines two tries, using a function to resolve collisions. This can only 106 | -- define the space of functions between union and symmetric difference but, 107 | -- with those two, all set operations can be defined (albeit inefficiently). 108 | mergeBy :: (a -> a -> Maybe a) -> Trie a -> Trie a -> Trie a 109 | mergeBy f (Trie a) (Trie b) = Trie $ BTrie.mergeBy f a b 110 | 111 | -- | Combines two tries, resolving conflicts by choosing the value from the 112 | -- left (former) trie. 
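--
-- For instance (an illustrative sketch, assuming @OverloadedStrings@; it is
-- not one of this module's doctests):
--
-- > unionL (fromList [("가", 1)]) (fromList [("가", 2), ("나", 3)])
-- >     == fromList [("가", 1), ("나", 3)]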
113 | unionL :: Trie a -> Trie a -> Trie a 114 | unionL (Trie left) (Trie right) = Trie $ BTrie.unionL left right 115 | 116 | -- | Combines two tries, resolving conflicts by choosing the value from the 117 | -- right (latter) trie. 118 | unionR :: Trie a -> Trie a -> Trie a 119 | unionR (Trie left) (Trie right) = Trie $ BTrie.unionR left right 120 | 121 | instance Functor Trie where 122 | fmap f (Trie btrie) = Trie $ fmap f btrie 123 | 124 | instance Foldable Trie where 125 | foldMap f (Trie btrie) = foldMap f btrie 126 | 127 | instance Traversable Trie where 128 | traverse f (Trie btrie) = Trie <$> traverse f btrie 129 | 130 | instance Applicative Trie where 131 | pure = singleton "" 132 | (<*>) = ap 133 | 134 | instance Monad Trie where 135 | Trie btrie >>= f = Trie $ btrie >>= (\ v -> case f v of { Trie b -> b }) 136 | 137 | instance (Semigroup a) => Semigroup (Trie a) where 138 | (Trie a) <> (Trie b) = Trie (a <> b) 139 | 140 | instance (Monoid a) => Monoid (Trie a) where 141 | mempty = Trie mempty 142 | 143 | instance GHC.Exts.IsList (Trie a) where 144 | type Item (Trie a) = (Text, a) 145 | fromList = fromList' 146 | toList = toList' 147 | -------------------------------------------------------------------------------- /setup/action.yaml: -------------------------------------------------------------------------------- 1 | name: Setup Seonbi 2 | description: Set up a specific version of Seonbi and add it to the PATH. 3 | author: Hong Minhee 4 | branding: 5 | icon: package 6 | color: gray-dark 7 | inputs: 8 | seonbi-version: 9 | description: >- 10 | Version of Seonbi to install. Note that asterisks can be used to 11 | choose the latest version, e.g., 1.2.*, 1.*, *. 12 | default: "*" 13 | add-to-path: 14 | description: >- 15 | Whether to add the installed seonbi and seonbi-api to the PATH. Turned 16 | on by default. 17 | default: true 18 | outputs: 19 | seonbi-version: 20 | description: Exact version number of the installed Seonbi. 21 | value: ${{ steps.prepare.outputs.seonbi-version }} 22 | seonbi-path: 23 | description: Absolute path of the installed executable seonbi. 24 | value: ${{ steps.prepare.outputs.seonbi-path }} 25 | seonbi-api-path: 26 | description: Absolute path of the installed executable seonbi-api. 
27 | value: ${{ steps.prepare.outputs.seonbi-api-path }} 28 | runs: 29 | using: composite 30 | steps: 31 | - id: prepare 32 | shell: python 33 | run: | 34 | from __future__ import print_function 35 | import fnmatch 36 | import json 37 | import os 38 | import os.path 39 | try: from urllib import request as urllib2 40 | except ImportError: import urllib2 41 | import tempfile 42 | 43 | suffixes = { 44 | ('Linux', 'X64'): 'linux-x86_64.tar.bz2', 45 | ('Linux', 'ARM64'): 'linux-arm64.tar.bz2', 46 | ('macOS', 'X64'): 'macos-x86_64.tar.bz2', 47 | ('macOS', 'ARM64'): 'macos-arm64.tar.bz2', 48 | ('Windows', 'X64'): 'win64.zip', 49 | } 50 | os_ = os.environ['RUNNER_OS'] 51 | arch = os.environ['RUNNER_ARCH'] 52 | try: 53 | suffix = suffixes[os_, arch] 54 | except KeyError: 55 | print( 56 | "::error title=Unsupported OS and architecture::Seonbi doesn't", 57 | 'support {0}/{1}'.format(os_, arch) 58 | ) 59 | raise SystemExit(1) 60 | 61 | # TODO: paging 62 | req = urllib2.Request( 63 | 'https://api.github.com/repos/dahlia/seonbi/releases?per_page=100', 64 | headers={'Authorization': 'Bearer ' + os.environ['GH_TOKEN']} 65 | ) 66 | res = urllib2.urlopen(req) 67 | tags = json.load(res) 68 | tags.sort( 69 | key=lambda tag: tuple(map(int, tag['tag_name'].split('.'))), 70 | reverse=True 71 | ) 72 | res.close() 73 | version_pattern = os.environ['SEONBI_VERSION'].strip() 74 | for tag in tags: 75 | if not fnmatch.fnmatch(tag['tag_name'], version_pattern): 76 | continue 77 | for asset in tag['assets']: 78 | if asset['name'] == 'seonbi-{0}.{1}'.format(tag['tag_name'], suffix): 79 | print('::set-output name=seonbi-version::' + tag['tag_name']) 80 | print( 81 | '::set-output name=download-url::' + asset['browser_download_url'] 82 | ) 83 | break 84 | else: 85 | continue 86 | break 87 | else: 88 | print( 89 | '::error title=Unsupported platform::Seonbi', version_pattern, 90 | 'does not support', os_, '&', arch + '.' 
91 | ) 92 | 93 | dir_path = tempfile.mkdtemp('seonbi', dir=os.environ.get('RUNNER_TEMP')) 94 | seonbi_path = os.path.join( 95 | dir_path, 96 | 'seonbi.exe' if os_ == 'Windows' else 'seonbi' 97 | ) 98 | seonbi_api_path = os.path.join( 99 | dir_path, 100 | 'seonbi-api.exe' if os_ == 'Windows' else 'seonbi-api' 101 | ) 102 | print('::set-output name=dir-path::' + dir_path) 103 | print('::set-output name=seonbi-path::' + seonbi_path) 104 | print('::set-output name=seonbi-api-path::' + seonbi_api_path) 105 | env: 106 | GH_TOKEN: ${{ github.token }} 107 | SEONBI_VERSION: ${{ inputs.seonbi-version }} 108 | # Linux & macOS 109 | - if: runner.os != 'Windows' 110 | shell: bash 111 | run: | 112 | set -e 113 | wget "$DOWNLOAD_URL" 114 | tar xvfj "$(basename "$DOWNLOAD_URL")" 115 | chmod +x seonbi seonbi-api 116 | if [[ "$ADD_TO_PATH" = "true" ]]; then 117 | pwd >> "$GITHUB_PATH" 118 | fi 119 | env: 120 | DOWNLOAD_URL: ${{ steps.prepare.outputs.download-url }} 121 | DIR_PATH: ${{ steps.prepare.outputs.dir-path }} 122 | ADD_TO_PATH: ${{ inputs.add-to-path }} 123 | working-directory: ${{ steps.prepare.outputs.dir-path }} 124 | # Windows 125 | - if: runner.os == 'Windows' 126 | shell: pwsh 127 | run: | 128 | Invoke-WebRequest ` 129 | $env:DOWNLOAD_URL ` 130 | -OutFile $env:DOWNLOAD_URL.Split("/")[-1] 131 | 7z x $env:DOWNLOAD_URL.Split("/")[-1] 132 | if (ConvertFrom-Json $env:ADD_TO_PATH) { 133 | Add-Content ` 134 | -Path $env:GITHUB_PATH ` 135 | -Value "$(Get-Location)" 136 | } 137 | env: 138 | DOWNLOAD_URL: ${{ steps.prepare.outputs.download-url }} 139 | DIR_PATH: ${{ steps.prepare.outputs.dir-path }} 140 | ADD_TO_PATH: ${{ inputs.add-to-path }} 141 | working-directory: ${{ steps.prepare.outputs.dir-path }} 142 | -------------------------------------------------------------------------------- /scripts/showcase-svg/template.svg: -------------------------------------------------------------------------------- 1 | 3 | 4 |
5 | 9 | 121 |
122 |     선비Seonbi transforms: 
124 |
125 |

悠久한 歷史와 傳統에 빛나는 우리 大韓國民은 3·1運動으로 126 | 建立된 大韓民國臨時政府의 法統과 不義에 抗拒한 4·19民主理念을 계승하고, 127 | 祖國의 民主改革과 平和的 統一의 使命에 입각하여 正義·人道와 同胞愛로써 128 | 民族의 團結을 공고히 하고, 모든 社會的 弊習과 不義를 타파하며, 129 | 自律과 調和를 바탕으로 自由民主的 基本秩序를 더욱 확고히 하여 130 | 政治·經濟·社會·文化의 모든 領域에 있어서 各人의 機會를 균등히 하고, 131 | 能力을 最高度로 발휘하게 하며, 自由와 權利에 따르는 責任과 義務를 132 | 완수하게 하여, 안으로는 國民生活의 균등한 향상을 기하고 밖으로는 133 | 항구적인 世界平和와 人類共榮에 이바지함으로써 우리들과 우리들의 子孫의 134 | 安全과 自由와 幸福을 영원히 확보할 것을 다짐하면서 1948年 7月 12日에 135 | 制定되고 8次에 걸쳐 改正된 憲法을 이제 國會의 議決을 거쳐 國民投票에 136 | 의하여 改正한다.

137 |
138 |
South Korean orthography
139 |
140 |

PLACEHOLDER: ko-kr

141 |
142 |
North Korean orthography
143 |
144 |

PLACEHOLDER: ko-kp

145 |
146 |
Mixed script with 147 | <ruby>
148 |
149 |

PLACEHOLDER: ko-Kore

150 |
151 |
152 |
153 | 154 | 155 | -------------------------------------------------------------------------------- /package.yaml: -------------------------------------------------------------------------------- 1 | name: seonbi 2 | version: 0.6.0 3 | synopsis: SmartyPants for Korean language 4 | category: Text 5 | author: Hong Minhee 6 | maintainer: Hong Minhee 7 | copyright: "\xa9 2018\u20132023 Hong Minhee" 8 | license: LGPL-2.1 9 | homepage: https://github.com/dahlia/seonbi 10 | bug-reports: https://github.com/dahlia/seonbi/issues 11 | git: git://github.com/dahlia/seonbi.git 12 | description: 13 | Please see the README.md on GitHub at . 14 | extra-source-files: 15 | - src/Text/Seonbi/Unihan/*.json 16 | - CHANGES.md 17 | - README.md 18 | data-dir: data 19 | data-files: 20 | - '*.tsv' 21 | build-type: Custom 22 | custom-setup: 23 | dependencies: 24 | - base 25 | - bytestring 26 | - Cabal 27 | - directory >= 1 && < 2 28 | - filepath 29 | - http-client >= 0.5 && < 0.8 30 | - temporary >= 1.2 && < 1.4 31 | - text 32 | - zip >= 1.1 && < 3.0 33 | dependencies: 34 | - aeson >= 1.3.1 && < 3 35 | - base >= 4.12 && < 5 36 | - bytestring 37 | - containers 38 | - html-entities >= 1 && < 2 39 | - text 40 | flags: 41 | static: 42 | description: Static link 43 | manual: true 44 | default: false 45 | iconv: 46 | description: Use iconv; however it is ignored on Windows 47 | manual: true 48 | default: false 49 | embed-dictionary: 50 | description: Embed dictionary rather than load from file 51 | manual: true 52 | default: false 53 | when: 54 | - condition: os(darwin) 55 | else: 56 | ghc-options: 57 | - -Wall 58 | - -fprint-explicit-kinds 59 | then: 60 | ghc-options: 61 | - -Wall 62 | - -fprint-explicit-kinds 63 | - -optP-Wno-nonportable-include-path 64 | # The above option works around https://github.com/haskell/cabal/issues/4739 65 | library: 66 | source-dirs: src 67 | dependencies: 68 | - attoparsec >= 0.12 && < 1 69 | - bytestring-trie >= 0.2.5 && < 0.3 70 | - cassava >= 0.5 && < 0.6 71 | - cmark >= 0.6 && < 1 72 | - data-default >= 0.2 && < 1 73 | - filepath >= 1 && < 2 74 | - file-embed >= 0.0.10 && < 0.0.16 75 | - http-media >= 0.8 && < 1 76 | when: 77 | - condition: flag(static) || flag(embed-dictionary) 78 | then: 79 | cpp-options: 80 | - -DEMBED_DICTIONARY 81 | else: 82 | cpp-options: 83 | - -DNO_EMBED_DICTIONARY 84 | executables: 85 | seonbi: 86 | main: seonbi.hs 87 | source-dirs: app 88 | when: 89 | - condition: flag(iconv) && !os(windows) 90 | else: 91 | dependencies: &executable-seonbi-dependencies 92 | cases: ">= 0.1.3.2 && < 0.1.5" 93 | code-page: ">= 0.2 && < 0.3" 94 | html-charset: ">= 0.1 && < 0.2" 95 | optparse-applicative: ">= 0.14 && < 0.18" 96 | seonbi: ">= 0" 97 | then: 98 | dependencies: 99 | <<: *executable-seonbi-dependencies 100 | iconv: ">= 0.4 && < 0.5" 101 | cpp-options: 102 | - -DICONV 103 | - &executable-ghc-options 104 | condition: flag(static) 105 | then: 106 | when: 107 | - condition: os(darwin) || os(windows) 108 | then: 109 | ghc-options: 110 | - -Wall 111 | - -fwarn-incomplete-uni-patterns 112 | - -threaded 113 | - -rtsopts 114 | - -with-rtsopts=-N 115 | # Static link 116 | - -static 117 | - -optc-Os 118 | else: 119 | ghc-options: 120 | - -Wall 121 | - -fwarn-incomplete-uni-patterns 122 | - -threaded 123 | - -rtsopts 124 | - -with-rtsopts=-N 125 | # Static link 126 | - -static 127 | - -optl-static 128 | - -optl-pthread 129 | - -optc-Os 130 | - -fPIC 131 | ld-options: 132 | - -static 133 | else: 134 | ghc-options: 135 | - -Wall 136 | - -fwarn-incomplete-uni-patterns 137 | - -threaded 
138 | - -rtsopts 139 | - -with-rtsopts=-N 140 | seonbi-api: 141 | main: seonbi-api.hs 142 | source-dirs: app 143 | dependencies: 144 | - http-types >= 0.12 && < 0.13 145 | - optparse-applicative >= 0.14 && < 0.18 146 | - seonbi 147 | - wai >= 3.2 && < 3.4 148 | - warp >= 3.2 && < 3.4 149 | when: 150 | - *executable-ghc-options 151 | tests: 152 | doctest: 153 | main: doctest.hs 154 | source-dirs: test 155 | other-modules: [] 156 | ghc-options: 157 | - -threaded 158 | dependencies: 159 | - doctest 160 | - doctest-discover 161 | - QuickCheck 162 | - seonbi 163 | - unicode-show 164 | spec: 165 | main: hspec.hs 166 | source-dirs: test 167 | ghc-options: 168 | - -threaded 169 | - -rtsopts 170 | - -with-rtsopts=-N 171 | dependencies: 172 | - code-page >= 0.2 && < 0.3 173 | - Diff >= 0.3.4 && < 0.5 174 | - directory >= 1 && < 2 175 | - filepath >= 1 && < 2 176 | - hspec >= 2.4.8 && < 3 177 | - hspec-discover >= 2.4.8 && < 3 178 | - interpolatedstring-perl6 >= 1.0.1 && < 2 179 | - random >= 1.1 && < 1.3 180 | - seonbi 181 | - text 182 | hlint: 183 | main: hlint.hs 184 | source-dirs: test 185 | other-modules: [] 186 | ghc-options: 187 | - -threaded 188 | dependencies: 189 | - hlint >= 2.1.7 && < 3.6 190 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/TagStack.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE TypeFamilies #-} 2 | module Text.Seonbi.Html.TagStack 3 | ( HtmlTagStack 4 | , Text.Seonbi.Html.TagStack.any 5 | , descendsFrom 6 | , Text.Seonbi.Html.TagStack.elem 7 | , depth 8 | , empty 9 | , fromList 10 | , last 11 | , pop 12 | , push 13 | , rebase 14 | , toList 15 | ) where 16 | 17 | import Prelude hiding (last) 18 | 19 | import Data.List hiding (last) 20 | import GHC.Exts (IsList (..)) 21 | 22 | import Text.Seonbi.Html.Tag 23 | 24 | -- | Represents a hierarchy of a currently parsing position in an 'HtmlTag' 25 | -- tree. 26 | -- 27 | -- For example, if an 'scanHtml' has read "@\\\foo\ bar@" 28 | -- it is represented as @'HtmlTagStack' ['B', 'A']@. 29 | -- 30 | -- Note that the tags are stored in reverse order, from the deepest to 31 | -- the shallowest, to make inserting a more deeper tag efficient. 32 | newtype HtmlTagStack = HtmlTagStack [HtmlTag] deriving (Eq, Ord) 33 | 34 | instance IsList HtmlTagStack where 35 | type Item HtmlTagStack = HtmlTag 36 | fromList = HtmlTagStack . reverse 37 | toList (HtmlTagStack tags) = reverse tags 38 | 39 | instance Show HtmlTagStack where 40 | show tags = "fromList " ++ show (toList tags) 41 | 42 | -- | An empty stack. 43 | empty :: HtmlTagStack 44 | empty = HtmlTagStack [] 45 | 46 | -- | Count the depth of a stack. 47 | -- 48 | -- >>> :set -XOverloadedLists 49 | -- >>> depth empty 50 | -- 0 51 | -- >>> depth [Div, Article, P, Em] 52 | -- 4 53 | depth :: HtmlTagStack -> Int 54 | depth (HtmlTagStack stack) = Data.List.length stack 55 | 56 | -- | Get the deepest tag from a 'HtmlTagStack'. 57 | -- 58 | -- >>> :set -XOverloadedLists 59 | -- >>> let stack = [Div, Article, P, Em] :: HtmlTagStack 60 | -- >>> last stack 61 | -- Just Em 62 | -- >>> last [] 63 | -- Nothing 64 | last :: HtmlTagStack -> Maybe HtmlTag 65 | last (HtmlTagStack []) = Nothing 66 | last (HtmlTagStack (tag:_)) = Just tag 67 | 68 | -- | Build a new stack from a stack by replacing its bottom with a new base. 
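--
-- (Informally: the tags are stored from deepest to shallowest, so the bottom
-- of the stack is the suffix of the internal list; when @base@ matches that
-- suffix it is swapped for @newBase@, and otherwise the stack is returned
-- unchanged, as the examples below show.)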
69 | -- 70 | -- >>> :set -XOverloadedLists 71 | -- >>> rebase [Article, BlockQuote] [Div] [Article, BlockQuote, P, Em] 72 | -- fromList [Div,P,Em] 73 | -- 74 | -- If there are no such bottom elements, it replaces nothing. 75 | -- 76 | -- >>> rebase [Div, Article, BlockQuote] [Div] [Article, BlockQuote, P, Em] 77 | -- fromList [Article,BlockQuote,P,Em] 78 | rebase :: HtmlTagStack -> HtmlTagStack -> HtmlTagStack -> HtmlTagStack 79 | rebase (HtmlTagStack base) (HtmlTagStack newBase) stack@(HtmlTagStack l) 80 | | base `isSuffixOf` l = HtmlTagStack $ 81 | take (depth stack - length base) l ++ newBase 82 | | otherwise = stack 83 | 84 | -- | Push one deeper @tag@ to a 'HtmlTagStack'. 85 | -- 86 | -- >>> push A empty 87 | -- fromList [A] 88 | -- >>> push B (push A empty) 89 | -- fromList [A,B] 90 | push :: HtmlTag -> HtmlTagStack -> HtmlTagStack 91 | push tag (HtmlTagStack tags) = 92 | HtmlTagStack (tag : tags) 93 | 94 | -- | Pop the deepest @tag@ from a 'HtmlTagStack'. 95 | -- 96 | -- >>> :set -XOverloadedLists 97 | -- >>> pop Em [A, B, Em] 98 | -- fromList [A,B] 99 | -- 100 | -- It may pop a @tag@ in the middle if a @tag@ looking for is not the deepest: 101 | -- 102 | -- >>> pop B [A, B, Em] 103 | -- fromList [A,Em] 104 | -- 105 | -- It does not affect to the input if there is no such @tag@ in the input: 106 | -- 107 | -- >>> pop P [A, B, Em] 108 | -- fromList [A,B,Em] 109 | -- >>> pop A empty 110 | -- fromList [] 111 | pop :: HtmlTag -> HtmlTagStack -> HtmlTagStack 112 | pop tag (HtmlTagStack tags'@(t : ags)) = 113 | if t == tag 114 | then HtmlTagStack ags 115 | else 116 | let 117 | (head', rest) = span (/= tag) tags' 118 | tail' = case uncons rest of 119 | Just (_, tail'') -> tail'' 120 | Nothing -> [] 121 | in 122 | HtmlTagStack (head' ++ tail') 123 | pop _ (HtmlTagStack []) = empty 124 | 125 | -- | Check if a node ('HtmlEntity') that a 'HtmlTagStack' (the first argument) 126 | -- refers is contained by a node that another 'HtmlTagStack' (the second 127 | -- argument), or they are sibling at least. 128 | -- 129 | -- >>> :set -XOverloadedLists 130 | -- >>> descendsFrom [Div, P, A, Em] [Div, P, A] 131 | -- True 132 | -- >>> descendsFrom [Div, P, A] [Div, P, A] 133 | -- True 134 | -- >>> descendsFrom [Div, P, Em] [Div, P, A] 135 | -- False 136 | -- >>> descendsFrom [Div, P] [Div, P, A] 137 | -- False 138 | descendsFrom :: HtmlTagStack -> HtmlTagStack -> Bool 139 | HtmlTagStack a `descendsFrom` HtmlTagStack b = 140 | b `isSuffixOf` a 141 | 142 | -- | Determine whether any element of the tag stack satisfies the predicate. 143 | -- 144 | -- >>> :set -XOverloadedLists 145 | -- >>> Text.Seonbi.Html.TagStack.any ((== Void) . htmlTagKind) [Div, P, Script] 146 | -- False 147 | -- >>> Text.Seonbi.Html.TagStack.any ((== Void) . htmlTagKind) [BR, P, Script] 148 | -- True 149 | any :: (HtmlTag -> Bool) -> HtmlTagStack -> Bool 150 | any fn (HtmlTagStack stack) = 151 | Prelude.any fn stack 152 | 153 | -- | Determine whether the element occurs in the tag stack. 
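--
-- (A simple linear scan over the stack, i.e. /O(depth)/.)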
154 | -- 155 | -- >>> :set -XOverloadedLists 156 | -- >>> A `Text.Seonbi.Html.TagStack.elem` [A, B, Code] 157 | -- True 158 | -- >>> Em `Text.Seonbi.Html.TagStack.elem` [A, B, Code] 159 | -- False 160 | elem :: HtmlTag -> HtmlTagStack -> Bool 161 | elem tag (HtmlTagStack stack) = tag `Prelude.elem` stack 162 | -------------------------------------------------------------------------------- /src/Text/Seonbi/PairedTransformer.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE ScopedTypeVariables #-} 2 | module Text.Seonbi.PairedTransformer 3 | ( PairedTransformer (..) 4 | , transformPairs 5 | ) where 6 | 7 | import Data.Text hiding (break, reverse) 8 | 9 | import Text.Seonbi.Html 10 | 11 | -- | Settings for 'transformPairs'. 12 | data PairedTransformer match = PairedTransformer 13 | { ignoresTagStack :: HtmlTagStack -> Bool 14 | , matchStart :: [match] -> Text -> Maybe (match, Text, Text, Text) 15 | , matchEnd :: Text -> Maybe (match, Text, Text, Text) 16 | , areMatchesPaired :: match -> match -> Bool 17 | , transformPair :: match -> match -> [HtmlEntity] -> [HtmlEntity] 18 | } 19 | 20 | -- | Some transformations should be done only if a start and an end are paired 21 | -- like parentheses. These even usually can be nested. Even if there is 22 | -- a start and an end they should not be paired unless they are sibling in 23 | -- an HTML tree. 24 | -- 25 | -- These kinds of scanning are easily turned highly stateful and imperative, 26 | -- hence hard to debug. This base class provides the common logic between 27 | -- these kinds of paired transformations so that an implementation class fill 28 | -- several abstract methods triggered by the state machine. 29 | transformPairs :: forall m . PairedTransformer m -> [HtmlEntity] -> [HtmlEntity] 30 | transformPairs (PairedTransformer ignores start end arePaired transform) = 31 | iter [] . normalizeText 32 | where 33 | iter :: [Unclosed m] -> [HtmlEntity] -> [HtmlEntity] 34 | iter [] [] = [] 35 | iter stack [] = unstack stack 36 | iter stack (x@HtmlText { tagStack = ts, rawText = txt } : xs) = 37 | case (startMatch, endMatch) of 38 | (Just captured, Nothing) -> 39 | roll stack captured ts xs 40 | (Nothing, Just captured@(m, _, _, _)) 41 | | Prelude.any ((`arePaired` m) . match) stack -> 42 | unroll stack captured ts xs 43 | (Just captured@(_, pre, _, _), Just captured'@(m', pre', _, _)) -> 44 | if Data.Text.length pre >= Data.Text.length pre' && 45 | Prelude.any ((`arePaired` m') . 
match) stack 46 | then unroll stack captured' ts xs 47 | else roll stack captured ts xs 48 | (Nothing, _) -> 49 | case stack of 50 | [] -> x : iter stack xs 51 | s : ss -> iter (s { buffer = x : buffer s } : ss) xs 52 | where 53 | startMatch :: Maybe (m, Text, Text, Text) 54 | startMatch = start (reverse $ fmap match stack) txt 55 | endMatch :: Maybe (m, Text, Text, Text) 56 | endMatch = end txt 57 | iter (s@Unclosed {} : ss) (x : xs) = 58 | iter (s { buffer = x : buffer s } : ss) xs 59 | iter [] (x : xs) = x : iter [] xs 60 | roll :: [Unclosed m] 61 | -> (m, Text, Text, Text) 62 | -> HtmlTagStack 63 | -> [HtmlEntity] 64 | -> [HtmlEntity] 65 | roll [] (startMatch, pre, t, post) tagStack_ entities = 66 | prependText tagStack_ pre $ iter 67 | [Unclosed startMatch [HtmlText tagStack_ t]] 68 | (normalizeText (prependText tagStack_ post entities)) 69 | roll (s : ss) (startMatch, pre, t, post) tagStack_ entities = iter 70 | ( Unclosed startMatch [HtmlText tagStack_ t] 71 | : s { buffer = prependText tagStack_ pre $ buffer s } 72 | : ss 73 | ) 74 | (normalizeText (prependText tagStack_ post entities)) 75 | unroll :: [Unclosed m] 76 | -> (m, Text, Text, Text) 77 | -> HtmlTagStack 78 | -> [HtmlEntity] 79 | -> [HtmlEntity] 80 | unroll stack (endMatch, pre, t, post) tagStack_ es = 81 | case remainStack of 82 | [] -> unrolled ++ iter [] remainEntities 83 | s : ss -> iter 84 | (s { buffer = reverse unrolled ++ buffer s } : ss) 85 | remainEntities 86 | where 87 | prependText' :: Text -> [HtmlEntity] -> [HtmlEntity] 88 | prependText' = prependText tagStack_ 89 | unrolled :: [HtmlEntity] 90 | remainStack :: [Unclosed m] 91 | (unrolled, remainStack) = case findPair endMatch stack of 92 | (_, []) -> 93 | ([HtmlText tagStack_ (pre `append` t)], []) 94 | (stack', s@Unclosed { match = startMatch } : ss) -> 95 | let 96 | buf = prependText' pre (unstack' stack' ++ buffer s) 97 | buf' = prependText' t buf 98 | buf'' = reverse buf' 99 | transformed = if Prelude.any (ignores . tagStack) buf'' 100 | then buf'' 101 | else transform startMatch endMatch buf'' 102 | in 103 | (transformed, ss) 104 | remainEntities :: [HtmlEntity] 105 | remainEntities = prependText' post es 106 | findPair :: m -> [Unclosed m] -> ([Unclosed m], [Unclosed m]) 107 | findPair m = break (arePaired m . match) 108 | unstack :: [Unclosed m] -> [HtmlEntity] 109 | unstack = reverse . unstack' 110 | unstack' :: [Unclosed m] -> [HtmlEntity] 111 | unstack' [] = [] 112 | unstack' (Unclosed { buffer = b } : ss) = b ++ unstack' ss 113 | prependText :: HtmlTagStack -> Text -> [HtmlEntity] -> [HtmlEntity] 114 | prependText tagStack_ txt 115 | | Data.Text.null txt = id 116 | | otherwise = (HtmlText tagStack_ txt :) 117 | 118 | data Unclosed match = Unclosed 119 | { match :: match 120 | , buffer :: [HtmlEntity] -- in reverse order 121 | } 122 | -------------------------------------------------------------------------------- /demo/src/Markdown/HtmlString.elm: -------------------------------------------------------------------------------- 1 | module Markdown.HtmlString exposing (render) 2 | 3 | import List 4 | import Markdown.Block exposing (..) 5 | import Markdown.Inline exposing (..) 
6 | import Maybe exposing (andThen, withDefault) 7 | import String 8 | 9 | 10 | escape : String -> String 11 | escape = 12 | String.replace "&" "&" 13 | >> String.replace "<" "<" 14 | >> String.replace ">" ">" 15 | >> String.replace "\"" """ 16 | 17 | 18 | render : List (Block b i) -> String 19 | render blocks = 20 | List.map renderBlock blocks |> String.concat 21 | 22 | 23 | renderBlock : Block b i -> String 24 | renderBlock block = 25 | case block of 26 | BlankLine text -> 27 | escape text 28 | 29 | ThematicBreak -> 30 | "\n
\n" 31 | 32 | Heading _ level inlines -> 33 | "\n" 36 | ++ renderInlines inlines 37 | ++ "\n" 40 | 41 | CodeBlock _ code -> 42 | "
" ++ escape code ++ "
\n" 43 | 44 | Paragraph _ text -> 45 | "

" ++ renderInlines text ++ "

\n" 46 | 47 | BlockQuote blocks -> 48 | "
\n" ++ render blocks ++ "
\n" 49 | 50 | List list items -> 51 | let 52 | ( open, close ) = 53 | case list.type_ of 54 | Unordered -> 55 | ( "
    ", "
" ) 56 | 57 | Ordered start -> 58 | ( "
    " 59 | , "
" 60 | ) 61 | 62 | renderItem = 63 | \blocks -> 64 | "
  • " ++ render blocks ++ "
  • \n" 65 | in 66 | open 67 | ++ "\n" 68 | ++ String.concat (List.map renderItem items) 69 | ++ close 70 | ++ "\n" 71 | 72 | PlainInlines inlines -> 73 | renderInlines inlines 74 | 75 | Markdown.Block.Custom _ blocks -> 76 | render blocks 77 | 78 | 79 | renderInlines : List (Inline i) -> String 80 | renderInlines inlines = 81 | List.map renderInline inlines 82 | |> String.concat 83 | 84 | 85 | renderInline : Inline i -> String 86 | renderInline inline = 87 | case inline of 88 | Text text -> 89 | escape text 90 | 91 | HardLineBreak -> 92 | "
    \n" 93 | 94 | CodeInline text -> 95 | "" ++ escape text ++ "" 96 | 97 | Link href title label -> 98 | "
    andThen (\t -> Just <| " title=\"" ++ t ++ "\"") 103 | |> withDefault "" 104 | ) 105 | ++ ">" 106 | ++ (List.map renderInline label |> String.concat) 107 | ++ "" 108 | 109 | Image src title alt -> 110 | " andThen (\t -> Just <| " title=\"" ++ t ++ "\"") 115 | |> withDefault "" 116 | ) 117 | ++ " alt=\"" 118 | ++ (List.map simplifyInline alt |> String.concat) 119 | ++ "\">" 120 | 121 | HtmlInline tag attrs inlines -> 122 | renderHtmlInline tag attrs inlines 123 | 124 | Emphasis 1 inlines -> 125 | "" ++ renderInlines inlines ++ "" 126 | 127 | Emphasis _ inlines -> 128 | "" ++ renderInlines inlines ++ "" 129 | 130 | Markdown.Inline.Custom _ inlines -> 131 | renderInlines inlines 132 | 133 | 134 | simplifyInlines : List (Inline i) -> String 135 | simplifyInlines inlines = 136 | List.map simplifyInline inlines |> String.concat 137 | 138 | 139 | simplifyInline : Inline i -> String 140 | simplifyInline inline = 141 | case inline of 142 | Text text -> 143 | escape text 144 | 145 | HardLineBreak -> 146 | "\n" 147 | 148 | CodeInline text -> 149 | escape text 150 | 151 | Link _ _ label -> 152 | simplifyInlines label 153 | 154 | Image _ _ alt -> 155 | simplifyInlines alt 156 | 157 | HtmlInline _ _ inlines -> 158 | simplifyInlines inlines 159 | 160 | Emphasis _ inlines -> 161 | simplifyInlines inlines 162 | 163 | Markdown.Inline.Custom _ inlines -> 164 | simplifyInlines inlines 165 | 166 | 167 | renderHtmlInline : 168 | String 169 | -> List ( String, Maybe String ) 170 | -> List (Inline i) 171 | -> String 172 | renderHtmlInline tag attrs inlines = 173 | let 174 | attrsString = 175 | String.concat <| List.map renderAttr attrs 176 | 177 | renderAttr = 178 | \( attr, value ) -> 179 | case value of 180 | Just v -> 181 | " " ++ attr ++ "=\"" ++ escape v ++ "\"" 182 | 183 | Nothing -> 184 | " " ++ attr 185 | in 186 | "<" 187 | ++ tag 188 | ++ attrsString 189 | ++ ">" 190 | ++ renderInlines inlines 191 | ++ "" 194 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Lang.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedStrings #-} 2 | module Text.Seonbi.Html.Lang 3 | ( LangHtmlEntity (..) 4 | , LanguageTag 5 | , annotateWithLang 6 | , extractLang 7 | , isKorean 8 | , isNeverKorean 9 | ) where 10 | 11 | import Control.Applicative 12 | import Data.Char (isSpace) 13 | import Data.Maybe 14 | 15 | import Data.Attoparsec.Text 16 | import Data.Text 17 | 18 | import Text.Seonbi.Html.Entity 19 | import Text.Seonbi.Html.Tag (HtmlTag) 20 | 21 | -- | Represents a language tag. Although it is defined as an alias for 'Text', 22 | -- it can be structured in the future. Do not use its contents directly. 23 | type LanguageTag = Text 24 | 25 | -- | Extracts the language tag from the given raw HTML attributes if it has 26 | -- @lang@ attribute. 27 | -- 28 | -- >>> extractLang "" 29 | -- Nothing 30 | -- >>> extractLang "lang=en" 31 | -- Just "en" 32 | -- >>> extractLang "lang=\"ko-KR\"" 33 | -- Just "ko-kr" 34 | -- >>> extractLang " lang='ko-Hang'" 35 | -- Just "ko-hang" 36 | extractLang 37 | :: HtmlRawAttrs 38 | -- ^ A raw HTML attributes to extract the language tag from. 39 | -> Maybe LanguageTag 40 | -- ^ A language tag extracted from the given raw HTML attributes. 41 | -- If the given raw HTML attributes does not have @lang@ attribute or 42 | -- its value is invalid, 'Nothing' is returned. 43 | extractLang attrs = 44 | case parseOnly parser' attrs of 45 | Right (Just lang') -> 46 | let lt = toLower . strip . 
normalizeEntities $ lang' 47 | in if Data.Text.null lt then Nothing else Just lt 48 | _ -> Nothing 49 | where 50 | parser' :: Parser (Maybe Text) 51 | parser' = do 52 | skipSpace 53 | attrs' <- langAttr `sepBy` space 54 | skipSpace 55 | return $ listToMaybe $ catMaybes attrs' 56 | langAttr :: Parser (Maybe Text) 57 | langAttr = do 58 | (isLang, cont) <- attrIsLang 59 | value <- if cont then attrValue else return "" 60 | return (if isLang then Just value else Nothing) 61 | attrIsLang :: Parser (Bool, Bool) 62 | attrIsLang = choice 63 | [ asciiCI "lang=" >> return (True, True) 64 | , do { _ <- takeWhile1 (/= '=') 65 | ; eq <- optional (char '=') 66 | ; return (False, isJust eq) 67 | } 68 | ] 69 | attrValue :: Parser Text 70 | attrValue = choice 71 | [ do { skip (== '"'); v <- takeTill (== '"'); skip (== '"'); return v } 72 | , do { skip (== '\'') 73 | ; v <- takeTill (== '\'') 74 | ; skip (== '\''); return v 75 | } 76 | , takeWhile1 (not . isSpace) 77 | ] 78 | normalizeEntities :: Text -> Text 79 | normalizeEntities 80 | = Data.Text.replace "‐" "-" 81 | . Data.Text.replace "‐" "-" 82 | . Data.Text.replace "‐" "-" 83 | . Data.Text.replace "‐" "-" 84 | . Data.Text.replace "‐" "-" 85 | 86 | -- | Annotates 'HtmlEntity' with the 'lang' tag extracted from it or its 87 | -- ancestors. 88 | data LangHtmlEntity = LangHtmlEntity 89 | { -- | The @lang@ tag extracted from the HTML 'entity' or its ancestors. 90 | lang :: Maybe LanguageTag 91 | -- | The annotated HTML 'entity'. 92 | , entity :: HtmlEntity 93 | } deriving (Show, Eq) 94 | 95 | -- | Annotates the given HTML entities with the language tag extracted from 96 | -- their @lang@ attributes. If a parent entity has @lang@ attribute, its 97 | -- all descendants are annotated with the same language tag. 98 | annotateWithLang :: [HtmlEntity] -> [LangHtmlEntity] 99 | annotateWithLang = 100 | annotate [] 101 | where 102 | annotate :: [(HtmlTag, Maybe LanguageTag)] 103 | -> [HtmlEntity] 104 | -> [LangHtmlEntity] 105 | annotate _ [] = [] 106 | annotate stack (x@HtmlStartTag { tag = tag', rawAttributes = attrs } : xs) = 107 | LangHtmlEntity thisLang x : annotate nextStack xs 108 | where 109 | parentLang :: Maybe LanguageTag 110 | parentLang = case stack of 111 | (_, l):_ -> l 112 | _ -> Nothing 113 | thisLang :: Maybe LanguageTag 114 | thisLang = extractLang attrs <|> parentLang 115 | nextStack :: [(HtmlTag, Maybe LanguageTag)] 116 | nextStack = (tag', thisLang) : stack 117 | annotate stack (x@HtmlEndTag { tag = tag' } : xs) = 118 | LangHtmlEntity thisLang x : annotate nextStack xs 119 | where 120 | (nextStack, thisLang) = case stack of 121 | [] -> ([], Nothing) 122 | s@((t, lang'):ys) -> 123 | (if t == tag' then ys else s, lang') 124 | annotate stack (x : xs) = 125 | LangHtmlEntity parentLang x : annotate stack xs 126 | where 127 | parentLang :: Maybe LanguageTag 128 | parentLang = case stack of 129 | (_, l):_ -> l 130 | _ -> Nothing 131 | 132 | -- | Determines whether the given language tag refers to any kind of Korean. 
133 | -- 134 | -- >>> isKorean "ko" 135 | -- True 136 | -- >>> isKorean "ko-KR" 137 | -- True 138 | -- >>> isKorean "kor-Hang" 139 | -- True 140 | -- >>> isKorean "en" 141 | -- False 142 | -- >>> isKorean "en-KR" 143 | -- False 144 | isKorean :: LanguageTag -> Bool 145 | isKorean lang' = 146 | l == "ko" || l == "kor" || 147 | "ko-" `isPrefixOf` l || 148 | "kor-" `isPrefixOf` l 149 | where 150 | l :: Text 151 | l = toLower lang' 152 | 153 | -- | Determines whether the given language tag undoubtedly does not refer 154 | -- to any kind of Korean. 155 | -- 156 | -- >>> isNeverKorean $ Just "ko" 157 | -- False 158 | -- >>> isNeverKorean $ Just "ko-KR" 159 | -- False 160 | -- >>> isNeverKorean Nothing 161 | -- False 162 | -- >>> isNeverKorean $ Just "en" 163 | -- True 164 | isNeverKorean :: Maybe LanguageTag -> Bool 165 | isNeverKorean Nothing = False 166 | isNeverKorean (Just lang') = not (isKorean lang') 167 | -------------------------------------------------------------------------------- /src/Text/Seonbi/Html/Scanner.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedStrings #-} 2 | {-# LANGUAGE TupleSections #-} 3 | module Text.Seonbi.Html.Scanner 4 | ( Result (..) 5 | , scanHtml 6 | ) where 7 | 8 | import Data.Char 9 | import Prelude hiding (takeWhile) 10 | 11 | import Data.Attoparsec.Text.Lazy 12 | import Data.Map.Strict 13 | import qualified Data.Text 14 | import qualified Data.Text.Lazy 15 | 16 | import Text.Seonbi.Html.Entity 17 | import Text.Seonbi.Html.Tag 18 | import Text.Seonbi.Html.TagStack 19 | 20 | htmlFragments :: Parser [HtmlEntity] 21 | htmlFragments = do 22 | result <- option [] $ fragments Text.Seonbi.Html.TagStack.empty 23 | txt <- htmlText Text.Seonbi.Html.TagStack.empty 24 | endOfInput 25 | return $ case txt of 26 | HtmlText { rawText = "" } -> result 27 | _ -> result ++ [txt] 28 | 29 | fragments :: HtmlTagStack -> Parser [HtmlEntity] 30 | fragments tagStack' = do 31 | txt <- htmlText tagStack' 32 | (entities, nextStack) <- htmlEntity tagStack' 33 | nextChunk <- option [] $ fragments nextStack 34 | let chunks = entities ++ nextChunk 35 | return $ case txt of 36 | HtmlText { rawText = "" } -> chunks 37 | txt' -> txt' : chunks 38 | 39 | htmlText :: HtmlTagStack -> Parser HtmlEntity 40 | htmlText tagStack' = do 41 | texts <- many' textFragment 42 | return $ mkText $ Data.Text.concat texts 43 | where 44 | mkText :: Data.Text.Text -> HtmlEntity 45 | mkText txt = HtmlText { tagStack = tagStack', rawText = txt } 46 | 47 | textFragment :: Parser Data.Text.Text 48 | textFragment = choice 49 | [ takeWhile1 (/= '<') 50 | , do 51 | a <- char '<' 52 | b <- satisfy $ \ c -> 53 | not (c == '!' || c == '/' || isAsciiUpper c || isAsciiLower c) 54 | return $ Data.Text.pack [a, b] 55 | ] 56 | 57 | htmlEntity :: HtmlTagStack -> Parser ([HtmlEntity], HtmlTagStack) 58 | htmlEntity tagStack' = choice 59 | [ htmlComment tagStack' 60 | , cdata tagStack' 61 | , startTag tagStack' 62 | , endTag tagStack' 63 | -- fallback: 64 | , (, tagStack') . (: []) . HtmlText tagStack' . 
Data.Text.singleton 65 | <$> anyChar 66 | ] 67 | 68 | -- https://www.w3.org/TR/html5/syntax.html#comments 69 | htmlComment :: HtmlTagStack -> Parser ([HtmlEntity], HtmlTagStack) 70 | htmlComment tagStack' = do 71 | _ <- string "" 84 | return 85 | ( [ HtmlComment 86 | { tagStack = tagStack' 87 | , comment = Data.Text.concat contents 88 | } 89 | ] 90 | , tagStack' 91 | ) 92 | 93 | -- https://www.w3.org/TR/html5/syntax.html#cdata-sections 94 | cdata :: HtmlTagStack -> Parser ([HtmlEntity], HtmlTagStack) 95 | cdata tagStack' = do 96 | _ <- string "' 106 | return $ Data.Text.snoc a b 107 | ] 108 | _ <- string "]]>" 109 | return 110 | ( [HtmlCdata { tagStack = tagStack', text = Data.Text.concat contents }] 111 | , tagStack' 112 | ) 113 | 114 | -- https://www.w3.org/TR/html5/syntax.html#start-tags 115 | startTag :: HtmlTagStack -> Parser ([HtmlEntity], HtmlTagStack) 116 | startTag tagStack' = do 117 | _ <- char '<' 118 | tag' <- htmlTag 119 | attributes <- many' $ choice 120 | [ do 121 | s <- char '"' 122 | c <- takeWhile (/= '"') 123 | e <- char '"' 124 | return (Data.Text.cons s $ Data.Text.snoc c e) 125 | , do 126 | s <- char '\'' 127 | c <- takeWhile (/= '\'') 128 | e <- char '\'' 129 | return (Data.Text.cons s $ Data.Text.snoc c e) 130 | , takeWhile1 $ \ c -> c /= '"' && c /= '\'' && c /= '/' && c /= '>' 131 | ] 132 | selfClosing <- option ' ' $ char '/' 133 | _ <- char '>' 134 | let (trailingEntities, nextTagStack) = 135 | if selfClosing == '/' || htmlTagKind tag' == Void 136 | then ([HtmlEndTag { tagStack = tagStack', tag = tag' }], tagStack') 137 | else ([], push tag' tagStack') 138 | return 139 | ( HtmlStartTag 140 | { tagStack = tagStack' 141 | , tag = tag' 142 | , rawAttributes = Data.Text.concat attributes 143 | } : trailingEntities 144 | , nextTagStack 145 | ) 146 | 147 | -- https://www.w3.org/TR/html5/syntax.html#end-tags 148 | endTag :: HtmlTagStack -> Parser ([HtmlEntity], HtmlTagStack) 149 | endTag tagStack' = do 150 | _ <- string "' 153 | return $ case htmlTagKind tag' of 154 | Void -> ([], tagStack') 155 | _ -> 156 | let 157 | nextTagStack = pop tag' tagStack' 158 | in 159 | ( [HtmlEndTag { tagStack = nextTagStack, tag = tag' }] 160 | , nextTagStack 161 | ) 162 | 163 | htmlTag :: Parser HtmlTag 164 | htmlTag = do 165 | name <- tagName 166 | case Data.Map.Strict.lookup (Data.Text.toLower name) htmlTagNames of 167 | Just t -> return t 168 | _ -> fail ("failed to parse; invalid tag: " ++ Data.Text.unpack name) 169 | 170 | tagName :: Parser Data.Text.Text 171 | tagName = do 172 | first <- satisfy $ \ c -> isAsciiUpper c || isAsciiLower c 173 | rest <- takeWhile $ \ c -> isAsciiUpper c || isAsciiLower c || isDigit c 174 | return $ Data.Text.cons first rest 175 | 176 | scanHtml :: Data.Text.Lazy.Text -> Result [HtmlEntity] 177 | scanHtml = parse htmlFragments 178 | -------------------------------------------------------------------------------- /test/Text/Seonbi/Html/ClipperSpec.hs: -------------------------------------------------------------------------------- 1 | {-# LANGUAGE OverloadedLists #-} 2 | {-# LANGUAGE OverloadedStrings #-} 3 | module Text.Seonbi.Html.ClipperSpec (spec) where 4 | 5 | import Control.Monad 6 | 7 | import Data.Text 8 | import Test.Hspec 9 | 10 | import Text.Seonbi.Html.Clipper 11 | import Text.Seonbi.Html.Entity 12 | import Text.Seonbi.Html.Tag 13 | 14 | spec :: Spec 15 | spec = do 16 | describe "clipPrefixText" $ do 17 | it "returns Nothing if entities are empty and a prefix is not empty" $ 18 | clipPrefixText "foo" [] `shouldBe` Nothing 19 | it "returns 
Nothing if the first entity is not an HtmlText" $ 20 | forM_ (["", "foo"] :: [Text]) $ \ prefix -> do 21 | clipPrefixText prefix 22 | [ HtmlStartTag [] P "" 23 | , HtmlText [P] "foo" 24 | , HtmlEndTag [] P 25 | ] `shouldBe` Nothing 26 | clipPrefixText prefix 27 | [ HtmlComment [] "foo" 28 | , HtmlStartTag [] P "" 29 | , HtmlText [P] "foo" 30 | , HtmlEndTag [] P 31 | ] `shouldBe` Nothing 32 | clipPrefixText prefix [HtmlEndTag [] P] `shouldBe` Nothing 33 | clipPrefixText prefix [HtmlCdata [] "foo"] `shouldBe` Nothing 34 | it "returns Just [] if entities are empty and a prefix is empty too" $ 35 | clipPrefixText "" [] `shouldBe` Just [] 36 | it "returns entities with the prefix text dropped" $ do 37 | clipPrefixText "foo" 38 | [ HtmlText [] "foobar" 39 | , HtmlStartTag [] P "" 40 | , HtmlText [P] "foo" 41 | , HtmlEndTag [] P 42 | ] 43 | `shouldBe` Just 44 | [ HtmlText [] "bar" 45 | , HtmlStartTag [] P "" 46 | , HtmlText [P] "foo" 47 | , HtmlEndTag [] P 48 | ] 49 | clipPrefixText "foo" 50 | [ HtmlText [] "foo" 51 | , HtmlStartTag [] P "" 52 | , HtmlText [P] "foo" 53 | , HtmlEndTag [] P 54 | ] 55 | `shouldBe` Just 56 | [ HtmlStartTag [] P "" 57 | , HtmlText [P] "foo" 58 | , HtmlEndTag [] P 59 | ] 60 | it "ignores HtmlComment entities but preserves them" $ do 61 | clipPrefixText "foo" 62 | [ HtmlComment [] "comment" 63 | , HtmlText [] "foobar" 64 | , HtmlStartTag [] P "" 65 | , HtmlText [P] "foo" 66 | , HtmlEndTag [] P 67 | ] 68 | `shouldBe` Just 69 | [ HtmlComment [] "comment" 70 | , HtmlText [] "bar" 71 | , HtmlStartTag [] P "" 72 | , HtmlText [P] "foo" 73 | , HtmlEndTag [] P 74 | ] 75 | clipPrefixText "foo" 76 | [ HtmlComment [] "comment" 77 | , HtmlText [] "foo" 78 | , HtmlStartTag [] P "" 79 | , HtmlText [P] "foo" 80 | , HtmlEndTag [] P 81 | ] 82 | `shouldBe` Just 83 | [ HtmlComment [] "comment" 84 | , HtmlStartTag [] P "" 85 | , HtmlText [P] "foo" 86 | , HtmlEndTag [] P 87 | ] 88 | 89 | describe "clipSuffixText" $ do 90 | it "returns Nothing if entities are empty and a suffix is not empty" $ 91 | clipSuffixText "foo" [] `shouldBe` Nothing 92 | it "returns Nothing if the last entity is not an HtmlText" $ 93 | forM_ (["", "foo"] :: [Text]) $ \ suffix -> do 94 | clipSuffixText suffix 95 | [ HtmlStartTag [] P "" 96 | , HtmlText [P] "foo" 97 | , HtmlEndTag [] P 98 | ] `shouldBe` Nothing 99 | clipSuffixText suffix 100 | [ HtmlStartTag [] P "" 101 | , HtmlText [P] "foo" 102 | , HtmlEndTag [] P 103 | , HtmlComment [] "foo" 104 | ] `shouldBe` Nothing 105 | clipSuffixText suffix [HtmlEndTag [] P] `shouldBe` Nothing 106 | clipSuffixText suffix [HtmlCdata [] "foo"] `shouldBe` Nothing 107 | it "returns Just [] if entities are empty and a suffix is empty too" $ 108 | clipSuffixText "" [] `shouldBe` Just [] 109 | it "returns entities with the suffix text dropped" $ do 110 | clipSuffixText "bar" 111 | [ HtmlStartTag [] P "" 112 | , HtmlText [P] "foo" 113 | , HtmlEndTag [] P 114 | , HtmlText [] "foobar" 115 | ] 116 | `shouldBe` Just 117 | [ HtmlStartTag [] P "" 118 | , HtmlText [P] "foo" 119 | , HtmlEndTag [] P 120 | , HtmlText [] "foo" 121 | ] 122 | clipSuffixText "foo" 123 | [ HtmlStartTag [] P "" 124 | , HtmlText [P] "foo" 125 | , HtmlEndTag [] P 126 | , HtmlText [] "foo" 127 | ] 128 | `shouldBe` Just 129 | [ HtmlStartTag [] P "" 130 | , HtmlText [P] "foo" 131 | , HtmlEndTag [] P 132 | ] 133 | it "ignores HtmlComment entities but preserves them" $ do 134 | clipSuffixText "bar" 135 | [ HtmlStartTag [] P "" 136 | , HtmlText [P] "foo" 137 | , HtmlEndTag [] P 138 | , HtmlText [] "foobar" 139 | , HtmlComment
[] "comment" 140 | ] 141 | `shouldBe` Just 142 | [ HtmlStartTag [] P "" 143 | , HtmlText [P] "foo" 144 | , HtmlEndTag [] P 145 | , HtmlText [] "foo" 146 | , HtmlComment [] "comment" 147 | ] 148 | clipSuffixText "foo" 149 | [ HtmlStartTag [] P "" 150 | , HtmlText [P] "foo" 151 | , HtmlEndTag [] P 152 | , HtmlText [] "foo" 153 | , HtmlComment [] "comment" 154 | ] 155 | `shouldBe` Just 156 | [ HtmlStartTag [] P "" 157 | , HtmlText [P] "foo" 158 | , HtmlEndTag [] P 159 | , HtmlComment [] "comment" 160 | ] 161 | 162 | specify "clipText" $ do 163 | clipText "foo" "baz" 164 | [ HtmlText [] "foo" 165 | , HtmlStartTag [] P "" 166 | , HtmlText [P] "bar" 167 | , HtmlEndTag [] P 168 | , HtmlText [] "baz" 169 | ] `shouldBe` Just 170 | [ HtmlStartTag [] P "" 171 | , HtmlText [P] "bar" 172 | , HtmlEndTag [] P 173 | ] 174 | clipText "foo" "quux" 175 | [ HtmlText [] "foobar" 176 | , HtmlStartTag [] P "" 177 | , HtmlText [P] "baz" 178 | , HtmlEndTag [] P 179 | , HtmlText [] "quxquux" 180 | ] `shouldBe` Just 181 | [ HtmlText [] "bar" 182 | , HtmlStartTag [] P "" 183 | , HtmlText [P] "baz" 184 | , HtmlEndTag [] P 185 | , HtmlText [] "qux" 186 | ] 187 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Seonbi: SmartyPants for Korean language 2 | ======================================= 3 | 4 | [![][releases-badge]][releases] [![][hackage-badge]][hackage] [![][dockerhub-badge]][dockerhub] [![][ci-status-badge]][ci] 5 | 6 | [![](https://dahlia.github.io/seonbi/showcase.svg)][demo web app] 7 | 8 | (TL;DR: See the [demo web app].) 9 | 10 | Seonbi (선비) is an HTML preprocessor that makes typographic adjustments 11 | to an HTML so that the result uses accurate punctuations according to 12 | the modern Korean orthography. 13 | (It's similar to what [SmartyPants] does for text written in English.) 14 | 15 | It also transforms `ko-Kore` text (國漢文混用; [Korean mixed script]) into 16 | `ko-Hang` text (한글전용; Hangul-only script). 17 | 18 | Seonbi provides a Haskell library, a CLI, and an HTTP API; any of them can 19 | perform the following transformations: 20 | 21 | - All hanja words (e.g., `漢字`) into corresponding hangul-only words 22 | (e.g., `한자`) 23 | - Straight quotes and apostrophes (`"` & `'`) into curly quotes HTML 24 | entities (`“`, `”`, `‘`, & `’`) 25 | - Three consecutive periods (`...` or `。。。`) into an ellipsis entity (`…`) 26 | - Classical (Chinese-style) stops (`。`, `、`, `?`, & `!`) into modern 27 | (English-style) stops (`.`, `,`, `?`, & `!`) 28 | - Pairs of less-than and greater-than inequality symbols (`<` & `>`) into 29 | pairs of proper angle quotes (`〈` & `〉`) 30 | - Pairs of two consecutive inequality symbols (`<<` & `>>`) into 31 | pairs of proper double angle quotes (`《` & `》`) 32 | - A hyphen (`-`) or hangul vowel *eu* (`ㅡ`) surrounded by spaces, or 33 | two/three consecutive hyphens (`--` or `---`) into a proper em dash (`—`) 34 | - A less-than inequality symbol followed by a hyphen or an equality 35 | symbol (`<-`, `<=`) into arrows to the left (`←`, `⇐`) 36 | - A hyphen or an equality symbol followed by a greater-than inequality 37 | symbol (`->`, `=>`) into arrows to the right (`→`, `⇒`) 38 | - A hyphen or an equality symbol wrapped by inequality symbols (`<->`, `<=>`) 39 | into bi-directional arrows (`↔`, `⇔`) 40 | 41 | Each transformations can be partially turned on and off, and some 42 | transformations have many options. 
43 | 44 | All transformations work with both plain text and rich text trees. 45 | In a similar way to SmartyPants, it does not modify characters within 46 | several sensitive HTML elements like
    `/``/`
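As a hedged illustration of the preservation rule above (hypothetical input, not from the test suite, and assuming `<code>` is among the preserved elements, as it is for SmartyPants-style tools):

```html
<!-- Input -->
<p>다음과 같이 씁니다: <code>print("...")</code> 그러면 끝...</p>

<!-- Possible output: the prose outside <code> is transformed,
     while the quotes and periods inside <code> stay untouched. -->
<p>다음과 같이 씁니다: <code>print("...")</code> 그러면 끝…</p>
```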