├── .gitignore
├── README.md
├── project.clj
├── src
│   └── html5_walker
│       ├── core.clj
│       └── walker.clj
├── test
│   └── html5_walker
│       ├── core_test.clj
│       └── walker_test.clj
└── tests.edn

/.gitignore:
--------------------------------------------------------------------------------
1 | /.nrepl-port
2 | /target
3 | /pom.xml
4 | /pom.xml.asc
5 |

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # html5-walker
2 |
3 | A thin Clojure wrapper around
4 | [jfiveparse](https://github.com/digitalfondue/jfiveparse), this lets you find
5 | and replace in HTML5 strings.
6 |
7 | ## Install
8 |
9 | - add `[html5-walker "2023.11.21"]` to `:dependencies` in your project.clj
10 |
11 | or
12 |
13 | - add `html5-walker/html5-walker {:mvn/version "2023.11.21"}` to `:deps` in your deps.edn
14 |
15 | ## Usage
16 |
17 | html5-walker exposes these functions:
18 |
19 | ### html5-walker.walker/find-nodes
20 |
21 | Signature: `(find-nodes html-string path)`
22 |
23 | It returns a sequence of
24 | [Nodes](https://static.javadoc.io/ch.digitalfondue.jfiveparse/jfiveparse/0.6.0/ch/digitalfondue/jfiveparse/Node.html)
25 | matching the path.
26 |
27 | A path is a vector of CSS selectors written as symbols, keywords, or strings. Like this:
28 |
29 | - `'[a]` matches all anchor tags.
30 | - `'[form input]` matches all input tags nested inside a form.
31 | - `'[form > input]` matches all input tags that are direct children of a form.
32 | - `'[div.foo]` matches all div tags that have the "foo" class.
33 | - `'[.button]` matches all elements with the "button" class.
34 | - `'[div#content]` matches the div with "content" as its id.
35 | - `'[:first-child]` matches any element that is the first child.
36 | - `'[:last-child]` matches any element that is the last child.
37 | - `'["meta[property]"]` matches all meta tags with the `property` attribute.
38 | - `'["meta[property=og:title]"]` matches all meta tags with the `property`
39 |   attribute set to "og:title".
40 |
41 | The following additional attribute selectors are also supported, and work like
42 | they do in CSS: `*=`, `$=`, `~=` and `^=`.
43 |
44 | So running:
45 |
46 | ```clj
47 | (require '[html5-walker.walker :as walker])
48 |
49 | (walker/find-nodes "" '[ul li])
50 | ```
51 |
52 | would return a sequence with two `li` nodes in it. [See the javadoc for more
53 | information about these
54 | nodes.](https://static.javadoc.io/ch.digitalfondue.jfiveparse/jfiveparse/0.6.0/ch/digitalfondue/jfiveparse/Node.html)
55 |
56 | ### html5-walker.walker/replace-in-fragment
57 |
58 | Signature: `(replace-in-fragment html-string path->fn)`
59 |
60 | This returns a new html-string with any changes performed by the functions in
61 | the `path->fn` map applied.
62 |
63 | So running:
64 |
65 | ```clj
66 | (require '[html5-walker.walker :as walker])
67 |
68 | (walker/replace-in-fragment
69 |  ""
70 |  {'[ul li] (fn [node] (.setInnerHTML node (str (.getInnerHTML node) "!!!")))})
71 | ```
72 |
73 | would return:
74 |
75 | ```
76 | ""
77 | ```
78 |
79 | ### html5-walker.walker/replace-in-document
80 |
81 | Just like `replace-in-fragment`, except it works on an entire html document.
82 | This means that `html`, `head` and `body` tags are expected to be there. They
83 | will be added if missing.
84 |
85 | Note that `replace-in-fragment` will actually remove these tags when found.
86 |
87 | ## More usage
88 |
89 | Take a look at the tests if you'd like more examples.
90 |
91 | ## About `html5-walker.core`
92 |
93 | The `html5-walker.core` namespace contains the same three functions as above.
94 | These functions work exactly the same, with one important difference: `'[div a]`
95 | is treated as `'[div > a]`, i.e. as a direct child selector. This was the library's
96 | original behavior, and the core namespace is kept around for backwards
97 | compatibility.
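To make the difference concrete, the path rewrite that `html5-walker.core` applies can be sketched in plain Clojure. The function below is copied from `src/html5_walker/core.clj` so that the snippet runs without jfiveparse on the classpath; the sample paths are illustrative assumptions and are not taken from the test suite.

```clj
;; Copied from src/html5_walker/core.clj; needs nothing beyond clojure.core.
(defn enforce-child-selectors
  [path]
  (->> (partition-all 2 1 path)
       (remove (comp #{">"} #(some-> % name) first))
       (mapcat
        (fn [[element descendant]]
          (if (nil? descendant)
            [element]
            [element '>])))))

;; An implicit descendant path gains an explicit child combinator.
(enforce-child-selectors '[div a])      ; => (div > a)

;; A path that already uses `>` comes out unchanged.
(enforce-child-selectors '[div > a])    ; => (div > a)

;; Every gap is filled, so longer paths become chains of direct children.
(enforce-child-selectors '[ul li a])    ; => (ul > li > a)
```

In `html5-walker.walker` the same `'[div a]` path is left as a descendant selector, so it also matches anchors nested deeper inside the div.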
98 |
99 | ## Changes
100 |
101 | #### 2023.11.21
102 |
103 | - Preserve DOCTYPE [(cjohansen)](https://github.com/cjohansen)
104 |
105 | #### 2023.10.22
106 |
107 | - Rename namespace to `html5-walker.walker` to preserve backwards compatibility
108 |   for `html5-walker.core`
109 | - Support lots more CSS selector semantics [(cjohansen)](https://github.com/cjohansen)
110 |
111 | #### 2022.03.07
112 |
113 | - Upgrade jfiveparse to version 0.9.0
114 | - Switch versioning separator to dot, for more Maven friendly version numbers
115 |
116 | #### 2020-01-08
117 |
118 | - Support selecting only by class name, like so: `:.myclass`
119 |
120 | ## License
121 |
122 | Copyright © Magnar Sveen, since 2019
123 |
124 | Distributed under the Eclipse Public License, the same as Clojure.
125 |

--------------------------------------------------------------------------------
/project.clj:
--------------------------------------------------------------------------------
1 | (defproject html5-walker "2023.11.21"
2 |   :description "Search and replace html5."
3 |   :url "https://github.com/magnars/html5-walker"
4 |   :license {:name "Eclipse Public License"
5 |             :url "http://www.eclipse.org/legal/epl-v10.html"}
6 |   :dependencies [[ch.digitalfondue.jfiveparse/jfiveparse "0.9.0"]]
7 |   :profiles {:dev {:dependencies [[org.clojure/clojure "1.10.1"]
8 |                                   [lambdaisland/kaocha "0.0-529"]
9 |                                   [kaocha-noyoda "2019-06-03"]]
10 |                    :aliases {"kaocha" ["run" "-m" "kaocha.runner"]}}})
11 |

--------------------------------------------------------------------------------
/src/html5_walker/core.clj:
--------------------------------------------------------------------------------
1 | (ns html5-walker.core
2 |   (:require [html5-walker.walker :as walker]))
3 |
4 | (defn enforce-child-selectors
5 |   "Preserves the old behavior where [:div :a] enforced a child relationship."
6 |   [path]
7 |   (->> (partition-all 2 1 path)
8 |        (remove (comp #{">"} #(some-> % name) first))
9 |        (mapcat
10 |         (fn [[element descendant]]
11 |           (if (nil? descendant)
12 |             [element]
13 |             [element '>])))))
14 |
15 | (defn ^:export replace-in-document [html path->f]
16 |   (->> (for [[path f] path->f]
17 |          [(enforce-child-selectors path) f])
18 |        (into {})
19 |        (walker/replace-in-document html)))
20 |
21 | (defn ^:export replace-in-fragment [html path->f]
22 |   (->> (for [[path f] path->f]
23 |          [(enforce-child-selectors path) f])
24 |        (into {})
25 |        (walker/replace-in-fragment html)))
26 |
27 | (defn ^:export find-nodes [html path]
28 |   (walker/find-nodes html (enforce-child-selectors path)))
29 |

--------------------------------------------------------------------------------
/src/html5_walker/walker.clj:
--------------------------------------------------------------------------------
1 | (ns html5-walker.walker
2 |   (:require [clojure.string :as str])
3 |   (:import (ch.digitalfondue.jfiveparse Element Parser Selector)))
4 |
5 | (def prefix->kind
6 |   {nil :element
7 |    "#" :id
8 |    "." :class
9 |    "[" :attr})
10 |
11 | (defn parse-selector
12 |   "Breaks a CSS selector element into tag matcher, class matchers, id matcher, and
13 |   attribute matchers."
14 |   [selector]
15 |   (->> (str/replace selector #":(first|last)-child" "")
16 |        (re-seq #"([#\.\[])?([a-z0-9\-\:]+)(?:(.?=)([a-z0-9\-\:]+)])?")
17 |        (map #(into [(prefix->kind (second %))] (remove nil? (drop 2 %))))
18 |        (concat (->> (re-seq #":((?:first|last)-child)" selector)
19 |                     (map (comp vector keyword second))))))
20 |
21 | (comment
22 |
23 |   (parse-selector "div#content.text[property=og:image].mobile[style][data-test~=bla]:first-child")
24 |   (parse-selector "div:first-child")
25 |   (parse-selector ":first-child")
26 |   (parse-selector "[property]")
27 |
28 |   )
29 |
30 | (defn- match-path-fragment [selector element]
31 |   (reduce
32 |    (fn [s [kind m comparator v]]
33 |      (case kind
34 |        :element (.element s m)
35 |        :id (.id s m)
36 |        :class (.hasClass s m)
37 |        :attr (case comparator
38 |                "=" (.attrValEq s m v)
39 |                "*=" (.attrValContains s m v)
40 |                "$=" (.attrValEndWith s m v)
41 |                "~=" (.attrValInList s m v)
42 |                "^=" (.attrValStartWith s m v)
43 |                nil (.attr s m))
44 |        :first-child (.isFirstChild s)
45 |        :last-child (.isLastChild s)))
46 |    selector
47 |    (parse-selector (name element))))
48 |
49 | (defn make-descendants-explicit
50 |   "Walks a path and returns pairs of [descendant element] where descendant is
51 |   either `:descendant` or `:child`, describing the desired relationship to the
52 |   previous path element. `descendant` will be `nil` for the first element. `:>`
53 |   creates a `:child` relationship between two elements, while elements that
54 |   don't have an explicit relationship (e.g. `[:div :a]`) will have a
55 |   `:descendant` interposed between them."
56 |   [path]
57 |   (->> (partition-all 2 1 path)
58 |        (remove (comp #{">"} #(some-> % name) first))
59 |        (mapcat
60 |         (fn [[element descendant]]
61 |           (cond
62 |             (nil? descendant)
63 |             [element]
64 |
65 |             (= ">" (name descendant))
66 |             [element :child]
67 |
68 |             :else
69 |             [element :descendant])))
70 |        (into [nil])
71 |        (partition 2)))
72 |
73 | (defn create-matcher [path]
74 |   (let [path (make-descendants-explicit path)]
75 |     (.toMatcher
76 |      (reduce (fn [selector [descendant element-kw]]
77 |                (-> (case descendant
78 |                      :descendant (.withDescendant selector)
79 |                      :child (.withChild selector))
80 |                    (match-path-fragment element-kw)))
81 |              (-> (Selector/select)
82 |                  (match-path-fragment (second (first path))))
83 |              (next path)))))
84 |
85 | (defn ^:export replace-in-document [html path->f]
86 |   (let [doc (.parse (Parser.) html)]
87 |     (doseq [[path f] path->f]
88 |       (doseq [node (.getAllNodesMatching doc (create-matcher path))]
89 |         (f node)))
90 |     (str (re-find #"^<!DOCTYPE[^>]+>" html)
91 |          (.getOuterHTML (.getDocumentElement doc)))))
92 |
93 | (defn ^:export replace-in-fragment [html path->f]
94 |   (let [el (first (.parseFragment (Parser.) (Element. "div") (str "<div>" html "</div>")))]
95 |     (doseq [[path f] path->f]
96 |       (doseq [node (.getAllNodesMatching el (create-matcher path))]
97 |         (f node)))
98 |     (.getInnerHTML el)))
99 |
100 | (defn ^:export find-nodes [html path]
101 |   (.getAllNodesMatching (.parse (Parser.) html) (create-matcher path)))
102 |

--------------------------------------------------------------------------------
/test/html5_walker/core_test.clj:
--------------------------------------------------------------------------------
1 | (ns html5-walker.core-test
2 |   (:require [clojure.test :refer [deftest is testing]]
3 |             [html5-walker.core :as sut]))
4 |
5 | (deftest find-nodes
6 |   (testing "element selector"
7 |     (is (= (map #(.getAttribute % "href")
8 |                 (sut/find-nodes
9 |                  "Hello!
10 |                   Hi!
11 |                   Howdy!
12 |                   "
13 |                  [:a]))
14 |            ["foo" "bar"])))
15 |
16 |   (testing "element.class selector"
17 |     (is (= (map #(.getAttribute % "href")
18 |                 (sut/find-nodes
19 |                  "Hello!
20 |                   Hi!
21 |                   Howdy!
22 |                   "
23 |                  [:a.bar]))
24 |            ["barn"])))
25 |
26 |   (testing "multiple class selector"
27 |     (is (= (map #(.getAttribute % "href")
28 |                 (sut/find-nodes
29 |                  "Hello!
30 |                   Hi!
31 |                   Howdy!
32 |                   "
33 |                  [:a.foo.baz]))
34 |            ["fool"])))
35 |
36 |   (testing "implicit child selector"
37 |     (is (= (map #(.getAttribute % "href")
38 |                 (sut/find-nodes
39 |                  "Hello!
40 |
Hi!
41 |
Howdy!
42 | " 43 | [:div.bar :a])) 44 | ["barn"]))) 45 | 46 | (testing "child selector does not match any descendant" 47 | (is (= (map #(.getAttribute % "href") 48 | (sut/find-nodes 49 | "Hello! 50 |
Hi!
51 |
Howdy!
52 | " 53 | [:div.bar :> :a])) 54 | []))) 55 | 56 | (testing "explicit child selector" 57 | (is (= (map #(.getAttribute % "href") 58 | (sut/find-nodes 59 | "Hello! 60 |
Hi!
61 |
Howdy!
62 | " 63 | '[div.bar > div > a])) 64 | ["barn"]))) 65 | 66 | (testing ".class only selector" 67 | (is (= (map #(.getAttribute % "id") 68 | (sut/find-nodes 69 | "Hello! 70 | Hi! 71 |
Howdy! 72 | Howdy! 73 | " 74 | [:.foo])) 75 | ["fool" "food" "foot"]))) 76 | 77 | (testing "element#id selector" 78 | (is (= (map #(.getAttribute % "id") 79 | (sut/find-nodes 80 | "Hello! 81 | Hi! 82 |
Howdy! 83 | Howdy! 84 | " 85 | [:span#fool])) 86 | ["fool"]))) 87 | 88 | (testing "#id only selector" 89 | (is (= (map #(.getAttribute % "id") 90 | (sut/find-nodes 91 | "Hello! 92 | Hi! 93 |
Howdy! 94 | Howdy! 95 | " 96 | [:#fool])) 97 | ["fool"]))) 98 | 99 | (testing "element:first-child selector" 100 | (is (= (map #(.getAttribute % "id") 101 | (sut/find-nodes 102 | "Hi!" 103 | [:span:first-child])) 104 | ["fool"]))) 105 | 106 | (testing ":first-child only selector" 107 | (is (= (map #(.getTagName %) 108 | (sut/find-nodes 109 | "" 110 | [":first-child"])) 111 | ["HTML" "HEAD" "SPAN"]))) 112 | 113 | (testing ":first-child combined with :last-child" 114 | (is (= (map #(.getTagName %) 115 | (sut/find-nodes 116 | "" 117 | [":first-child:last-child"])) 118 | ["HTML" "SPAN"]))) 119 | 120 | (testing "attribute selector" 121 | (is (= (map #(.getAttribute % "content") 122 | (sut/find-nodes 123 | " 124 | A sample blog post | Rubberduck 125 | 126 | 127 | 128 | 129 | " 130 | ["[property]"])) 131 | ["A short open graph description" 132 | "A sample blog post"]))) 133 | 134 | (testing "attribute= selector" 135 | (is (= (map #(.getAttribute % "content") 136 | (sut/find-nodes 137 | " 138 | A sample blog post | Rubberduck 139 | 140 | 141 | 142 | 143 | " 144 | ["[property=og:title]"])) 145 | ["A sample blog post"]))) 146 | 147 | (testing "attribute*= selector" 148 | (is (= (map #(.getAttribute % "content") 149 | (sut/find-nodes 150 | " 151 | A sample blog post | Rubberduck 152 | 153 | 154 | 155 | 156 | " 157 | ["[property*=title]"])) 158 | ["A sample blog post"]))) 159 | 160 | (testing "attribute$= selector" 161 | (is (= (map #(.getAttribute % "content") 162 | (sut/find-nodes 163 | " 164 | A sample blog post | Rubberduck 165 | 166 | 167 | 168 | 169 | " 170 | ["[property$=title]"])) 171 | ["A sample blog post"]))) 172 | 173 | (testing "attribute^= selector" 174 | (is (= (map #(.getAttribute % "content") 175 | (sut/find-nodes 176 | " 177 | A sample blog post | Rubberduck 178 | 179 | 180 | 181 | 182 | " 183 | ["[property^=og:]"])) 184 | ["A short open graph description" 185 | "A sample blog post"]))) 186 | 187 | (testing "attribute~= selector" 188 | (is (= (map 
#(.getTextContent %) 189 | (sut/find-nodes 190 | " 191 |
One
192 |
Two
193 | " 194 | ["[class~=butt]"])) 195 | ["Two"])))) 196 | 197 | (deftest replace-in-document 198 | (is (= (sut/replace-in-document 199 | "Hello! 200 | Hi! 201 | Howdy first-name-goes-here? 202 | " 203 | 204 | {[:a] (fn [node] (.setAttribute node "href" "http://example.com")) 205 | [:span.first-name-holder] (fn [node] (.setInnerHTML node "Arthur B Ablabab"))}) 206 | "Hello! 207 | Hi! 208 | Howdy Arthur B Ablabab? 209 | ")) 210 | 211 | (testing "Preserves DOCTYPE when present" 212 | (is (= (sut/replace-in-document 213 | "Hello! 214 | Hi! 215 | Howdy first-name-goes-here? 216 | " 217 | 218 | {[:a] (fn [node] (.setAttribute node "href" "http://example.com")) 219 | [:span.first-name-holder] (fn [node] (.setInnerHTML node "Arthur B Ablabab"))}) 220 | "Hello! 221 | Hi! 222 | Howdy Arthur B Ablabab? 223 | "))) 224 | 225 | (testing "Only preserves DOCTYPE when at the start" 226 | (is (= (sut/replace-in-document 227 | "Hello!" 228 | {}) 229 | "Hello!")))) 230 | 231 | (deftest replace-in-fragment 232 | (is (= (sut/replace-in-fragment 233 | "
Hello! 234 | Hi! 235 | Howdy first-name-goes-here? 236 |
" 237 | 238 | {[:a] (fn [node] (.setAttribute node "href" "http://example.com")) 239 | [:span.first-name-holder] (fn [node] (.setInnerHTML node "Arthur B Ablabab"))}) 240 | "
Hello! 241 | Hi! 242 | Howdy Arthur B Ablabab? 243 |
"))) 244 | -------------------------------------------------------------------------------- /test/html5_walker/walker_test.clj: -------------------------------------------------------------------------------- 1 | (ns html5-walker.walker-test 2 | (:require [clojure.test :refer [deftest is testing]] 3 | [html5-walker.walker :as sut])) 4 | 5 | (deftest selector-test 6 | (testing "implicit descendant selector" 7 | (is (= (map #(.getAttribute % "href") 8 | (sut/find-nodes 9 | "Hello! 10 | 11 | 12 | " 13 | '[div.bar a])) 14 | ["barn"]))) 15 | 16 | (testing "child selector does not match any descendant" 17 | (is (= (map #(.getAttribute % "href") 18 | (sut/find-nodes 19 | "Hello! 20 | 21 | 22 | " 23 | '[div.bar > a])) 24 | []))) 25 | 26 | (testing "explicit child selector" 27 | (is (= (map #(.getAttribute % "href") 28 | (sut/find-nodes 29 | "Hello! 30 | 31 | 32 | " 33 | '[div.bar > div > a])) 34 | ["barn"])))) 35 | -------------------------------------------------------------------------------- /tests.edn: -------------------------------------------------------------------------------- 1 | #kaocha/v1 2 | {:plugins [:noyoda.plugin/swap-actual-and-expected] 3 | :tests [{:id :unit 4 | :source-paths ["src" "test-data"] 5 | :focus-meta [:focus]}]} 6 | --------------------------------------------------------------------------------