├── README.md
└── css2rss.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# CSS2RSS
A scraper post-processing script for RSSGuard ( https://github.com/martinrotter/rssguard ).

## Arguments - each is a CSS selector ( https://www.w3schools.com/cssref/css_selectors.asp ):
1) item
2) item title (optional - otherwise the link's text is used as the title)
3) item description (optional - otherwise all the text inside the item is used as the description)
4) item link (optional - otherwise the first link found inside the item is used, or the item itself if it is a link)
5) item title 2nd part (optional, auto-enabled if the static-main-title or multi-link option is on; otherwise just the title is used, e.g. the title is "Batman" and the 2nd part is "chapter 94")
6) item date (optional - otherwise every entry is dated "just now") - aim this selector either at text nodes (e.g. `span`) or at elements (`a`, `img`) whose `title` or `alt` attribute contains the date (e.g. the flashing "New!" image badges that show the date when hovered over)

## Options for arguments:
* for `1) item` - `@` at the start enables searching for multiple links inside the found item, e.g. one `div` item with multiple `a` links inside it that you want as separate feed items
* for everything after `1) item` - `~` as the whole argument lets the script decide what to do (the default action) - e.g. use the first link found inside the item, use all the text inside the item as the description, etc. (not really an option, but rather a placeholder format for the argument), e.g. `python css2rss.py div.itemclass ~ span.description` (here the link's inner text is used as the title by the default action of the 2nd argument, while the description is looked up via the 3rd argument)
* for `2) title`, `3) item description`, `4) item link`, `5) item title 2nd part` - `!` at the start makes it a static value (whatever follows the `!`), e.g. `"!my title"`; if you make the 1st part of the title static, the title's 2nd part gets auto-enabled and uses the text inside the found link (unless you specify what to use manually as the 5th argument)
* for everything (even `1) item`) - `$` at the start executes (via `eval()`) a Python expression instead of a CSS selector; the return value of that expression (which can be a list if you're using `@` for `1) item`) is used for that argument, e.g. `$found_link.find('img')['alt']` returns the `alt` text of an `img` element inside the found link - see https://www.crummy.com/software/BeautifulSoup/bs4/doc/ for things you can do with the soup, e.g. go one level up (to the parent element) or to the next element, select elements that CSS selectors can't, or anything else you can do with Python - see the Examples section below. Useful script variables are `found_link` and `item` (the found link and the whole found item) - you can CSS-`.select` inside them, e.g. `$item.select('img.class')` finds an image element of class "class" inside your found root item, or use the `soup` variable - `$soup.select('img.class')` - to select an element globally
* for `6) date` - `?` at the start tells the parser to expect the American date format - "Month/Day/Year"
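
To make the `$` option more concrete, here is a minimal sketch (not part of css2rss.py - the HTML, class names and expression are made-up examples) of what the script does with such an argument: it parses the page into `soup`, exposes `item` and `found_link`, and hands everything after the `$` to `eval()`:

```python
# Minimal sketch of the "$" eval context (illustrative only).
from bs4 import BeautifulSoup

html = '<div class="itemclass"><a href="/ch94"><img alt="Chapter 94"></a></div>'
soup = BeautifulSoup(html, 'html.parser')  # the whole page
item = soup.select('div.itemclass')[0]     # what argument 1 found
found_link = item.find('a')                # the default link inside the item

expression = "found_link.find('img')['alt']"  # what you would pass as "$..."
print(eval(expression))                       # -> Chapter 94
```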
## Notes:
- `1) item` is searched for in the whole document, while the rest is searched for inside the found `item` node (but you can point the `item` selector straight at the `a` hyperlink - it will then be used as the link by default)

- use a space ` ` as the separator between arguments if they contain no spaces themselves; if they do, also enclose such arguments in quotation marks `"`, e.g. `python css2rss.py div.class "div.subclass > h1.title" span.description` (you can also enclose arguments without any spaces in quotation marks if you like)
**Warning**: starting with RSSGuard v4.5.2, which supports single quotation marks `'` as well, you have to either enclose arguments in single quotation marks `'` to pass them as-is, or escape backslashes and double quotes with backslashes, e.g. `python css2rss.py "\\:argument starting with\\:"` or `python css2rss.py '\:argument starting with\:'`
- if no item is found, a feed item is generated with an HTML dump of the whole page so you can see what went wrong (e.g. a Cloudflare block page)
- content you need to log in to see is available: the scraper uses RSSGuard's cookies, so if you log in to a website using RSSGuard's built-in browser, the scraper can access and scrape that content into a feed as well

## Limitations:
- No JavaScript runs on scraped pages, so sites that populate their content with JavaScript can't be scraped; instead, their initial version (what you'd see via `right click -> view page source`) gets scraped.
- You can try to get the needed content from other pages of the site - e.g. the main page, a releases page, or even the search page - one of them may be static and not built with JavaScript

# Installation

1) Have Python 3 or newer ( https://www.python.org/downloads/ ) installed (and added to PATH during install)

2) Have BeautifulSoup ( https://www.crummy.com/software/BeautifulSoup/ ) installed (Win+R -> cmd -> Enter -> `pip install beautifulsoup4`)

3) (optional) If you'd like to parse dates for articles, have Maya ( https://github.com/timofurrer/maya/ ) installed (right-click the Start menu -> run PowerShell as administrator -> `pip install maya`)

4) Put css2rss.py into your `data4` folder (so you can call the script with just `python css2rss.py`; otherwise you'd need to specify the full path to the `.py` file)

![data4](https://user-images.githubusercontent.com/1309656/162590050-0c6d4d9d-4c57-4123-9959-06a83f0af61b.jpg)
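
To verify the installation, you can pipe a tiny page into the script and check that a JSON Feed comes back (a hypothetical smoke test - it assumes css2rss.py is in the current folder and `python` is on PATH):

```python
# Smoke test: feed a one-link page to css2rss.py over stdin.
import subprocess

html = ('<html><head><title>Test</title></head>'
        '<body><a href="/hello">Hello</a></body></html>')
result = subprocess.run(['python', 'css2rss.py', 'a'],  # selector: every <a> is an item
                        input=html, capture_output=True, text=True, encoding='utf-8')
print(result.stdout)  # -> a JSON Feed with one item titled "Hello"
```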
# Examples

## *
- a simple link makeover into an RSS feed (right-click a link -> Inspect Element -> use its CSS selector):

url: `https://www.foxnews.com/media`
script: `python css2rss.py ".title > a"` (a link `a` directly inside an element with the `title` class)
![](https://user-images.githubusercontent.com/1309656/162590533-dcc261f4-3a24-4c59-9e24-60d312a4e3ec.jpg)
![](https://user-images.githubusercontent.com/1309656/162590684-c452b64f-7916-43e1-b440-3889b2d6a82c.jpg)
![](https://user-images.githubusercontent.com/1309656/162590622-66bf2f9e-e2cb-4434-a377-3ebdcc573f20.jpg)

## *
- the reason for implementing static titles

url: `https://kumascans.com/manga/sokushi-cheat-ga-saikyou-sugite-isekai-no-yatsura-ga-marude-aite-ni-naranai-n-desu-ga/`
script: `python css2rss.py ".eph-num > a" "!Sokushi Cheat" ".chapterdate" ~ ".chapternum"`

![](https://user-images.githubusercontent.com/1309656/162590790-1995cd7e-ea6f-41b5-a24c-cb669de851d2.jpg)
![](https://user-images.githubusercontent.com/1309656/162590821-d3388846-fb47-41e4-866a-5aaa3754d022.jpg)

## *
- the reason for implementing searching for multiple links inside one item

url: `https://www.asurascans.com/`
script: `python css2rss.py "@.uta" "h4" img "li > a" "li > a"`

![](https://user-images.githubusercontent.com/1309656/162590919-4374ba05-9c1f-4f39-b27c-f723d4afda1f.jpg)
![](https://user-images.githubusercontent.com/1309656/162590934-4c28c614-7548-4048-b147-b7a5b036a842.jpg)
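
Roughly, the `@` option makes the script emit one feed entry per link found inside a single item. A simplified sketch of the idea (made-up HTML; the real script recurses with an increasing depth index and also resolves titles, descriptions and dates per link):

```python
# One item element, several links inside it, one feed entry per link.
from bs4 import BeautifulSoup

html = '''<div class="uta">
  <h4>Some Series</h4>
  <ul>
    <li><a href="/chapter-1">Chapter 1</a></li>
    <li><a href="/chapter-2">Chapter 2</a></li>
  </ul>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

for item in soup.select('.uta'):        # argument 1: "@.uta"
    title = item.select('h4')[0].text   # argument 2: "h4"
    for link in item.select('li > a'):  # arguments 4 and 5: "li > a"
        print(title, '-', link.text, '->', link['href'])
# Some Series - Chapter 1 -> /chapter-1
# Some Series - Chapter 2 -> /chapter-2
```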
## *
- the reason for implementing eval expressions for titles (CSS selectors can't select text nodes that sit outside any tag)

url: `https://reaperscans.com/`
script: `python css2rss.py "@div.space-y-4:first-of-type div.relative.bg-white" "p.font-medium" "img" "a.border" "$found_link.contents[0].text"` (it was just `$contents[0]` previously, as seen on the screenshot, but more freedom was given later, so now you have to write the full expression)

![image](https://user-images.githubusercontent.com/1309656/194601286-7c7b399a-7561-4274-9444-89508dd51681.png)
![image](https://user-images.githubusercontent.com/1309656/194601403-578c9550-785e-44bd-98d7-88c50f785a5d.png)

## *
- using img titles (`alt`) as RSS titles with the help of `$` eval

url: `https://imginn.com/gothicarchitectures/`
script: `python css2rss.py ".img" "$found_link.find('img')['alt'][:60]" "$str(found_link.find('img'))+'<br>'+found_link.find('img')['alt']"`

![410191698-b9165ecf-461d-4fd1-91f7-c9cfe1124c10](https://github.com/user-attachments/assets/22a901ac-e971-4e31-8d51-6eaf9b8a3bfb)


## *
- cutting out part of the text to use (as a date)

url: `https://support.microsoft.com/en-us/topic/windows-10-update-history-8127c2c6-6edf-4fdf-8b9f-0f7be1ef3562`
script: `python css2rss.py "#supLeftNav > div > ul:nth-child(2) > li.supLeftNavArticle:nth-child(2) > .supLeftNavLink" ~ "$found_link.contents[0].text[:found_link.contents[0].text.find('—')]" ~ ~ "$found_link.contents[0].text[:found_link.contents[0].text.find('—')]"`

![410204374-17a8d1a3-66a2-4d7d-a121-108b85bb8d94](https://github.com/user-attachments/assets/95a709c0-723f-4afb-a744-5cabea0b6de7)


## *
- using static links

url: `https://www.mobygames.com/changelog/`
script: `python css2rss.py @main h2 ul !https://www.mobygames.com/changelog/`

![447137625-50015afb-0daf-4ebd-9d6e-505990c9f846](https://github.com/user-attachments/assets/1f6cbee9-d15a-4e06-9432-250db6e58521)


## *
url: `https://reader.kireicake.com/`
script: `python css2rss.py @.group a[href*='/series/'] .meta_r ".element > .title a" ".element > .title a"`

![](https://user-images.githubusercontent.com/1309656/162591038-3664255c-8e8b-4065-b0a9-a0d2eb4977c7.jpg)
![](https://user-images.githubusercontent.com/1309656/162591089-6951e712-384f-4109-8c57-1caa05ac49f6.jpg)


## *
- an example of parsing dates for articles; the CSS selector here uses OR (a comma) and looks for either an `a` element (the "New!" badge) with the date inside its tooltip (`title` or `alt`) **OR** a `span` element without any child nodes (both elements have the `.post-on` class)

url: `https://drakescans.com/`
script: `python css2rss.py "@.page-item-detail" ".post-title a" "img" "span.chapter > a" ~ ".post-on > a,.post-on:not(:has(*))"`

![](https://github.com/Owyn/CSS2RSS/assets/1309656/692796e0-8caa-4b1b-ac05-2be60388aa28)
![](https://github.com/Owyn/CSS2RSS/assets/1309656/55220446-4c22-498a-9bb7-1c27294996bb)
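
The extracted date strings are handed to the Maya library - the script tries `maya.parse()` for absolute dates first and falls back to `maya.when()` for relative ones like "2 hours ago". A small sketch of that flow (assuming `maya` and its `tzlocal` dependency are installed):

```python
# Sketch of the date handling used by argument 6.
import maya
from tzlocal import get_localzone

tz = get_localzone().key
# The third argument mirrors the script's bNotAmerican_Date flag:
# day-first unless you prefix argument 6 with "?".
print(maya.parse('05/04/2024', tz, True).datetime().isoformat())   # day first -> April 5th
print(maya.parse('05/04/2024', tz, False).datetime().isoformat())  # month first -> May 4th
# Relative/human dates go through maya.when():
print(maya.when('2 hours ago', tz).datetime().isoformat())
```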
## *
- the workaround for scraping sites that serve their content via JavaScript (the workaround is to find a static page - right-click -> view page source - and check whether your text is already in there; if it is, the page is static and not populated later via JS)

url: `https://manhuaus.com/?s=Wo+Wei+Xie+Di&post_type=wp-manga&post_type=wp-manga`
script: `python css2rss.py ".latest-chap a" "!I'm an Evil God"`

--------------------------------------------------------------------------------
/css2rss.py:
--------------------------------------------------------------------------------

# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "beautifulsoup4",
#     "maya",
# ]
# ///
# CSS2RSS
# the input html page must be provided on stdin
# arguments: item, item title, item description, item link, item title 2nd part, item date

import json
import sys
import datetime

from bs4 import BeautifulSoup


def css_to_rss(item, depth):
    find_links_near = False
    found_link = None
    if not bDefault_link and sys.argv[4][0] == '!':  # static link
        item_link = sys.argv[4][1:]
        # found_link = item  # for the default description - maybe a bad idea?
        if bMulti_enabled and not bDefault_main_title and (depth + 1 < len(item.select(sys.argv[2]))):
            # the link is static, so count found main titles instead to know when to stop
            find_links_near = True
    elif aEval[4]:  # eval expression for the link
        found_link = eval(sys.argv[4])
        if isinstance(found_link, list):
            if (link_l := len(found_link)) > depth:
                if bMulti_enabled and depth + 1 < link_l:
                    find_links_near = True
                found_link = found_link[depth]
                item_link = found_link['href']
        else:
            item_link = found_link['href']
    elif not bDefault_link and (link_l := len(found_link := item.select(sys.argv[4]))) > depth:  # CSS selector for the link
        found_link = found_link[depth]
        item_link = found_link['href']
        if bMulti_enabled and depth + 1 < link_l:
            find_links_near = True
    else:  # default link handling
        if item.name == "a":  # the item itself is a link
            found_link = item
            item_link = item['href']
        elif bDefault_link:  # use the 1st link found
            found_link = item.find("a")
            if found_link:
                item_link = found_link['href']
        if not found_link:  # we found something without a link, or the specified link selector matched nothing
            global found_items_bad_n
            found_items_bad_n += 1
            return

    if bFixed_main_title:
        main_title = sys.argv[2]
    elif aEval[2]:
        main_title = eval(sys.argv[2])  # add .text at the end of your eval selector yourself if it returns an html element
        if isinstance(main_title, list):
            main_title = main_title[depth if len(main_title) > depth else 0]
    elif not bDefault_main_title and (mt_l := len(main_title := item.select(sys.argv[2]))) != 0:
        main_title = main_title[depth if mt_l > depth else 0].text  # not sure if we should look for more main titles?
    elif found_link:
        main_title = found_link.text  # use the link's text
        # main_title = item.text  # use all the text inside - bad idea
    else:
        main_title = ""
        # raise ValueError("Title & Link were not found - can't do anything now, please adjust your Title selector: " + sys.argv[2])
        global found_items_bad_t
        found_items_bad_t += 1
        return

    if bFixed_addon_title:
        addon_title = sys.argv[5]
    elif not bDefault_addon_title:
        if aEval[5]:
            addon_title = eval(sys.argv[5])  # add .text at the end of your eval selector yourself if it returns an html element
            if isinstance(addon_title, list):
                addon_title = addon_title[depth]  # we need all the addon titles
        elif len(addon_title := item.select(sys.argv[5])) > depth:
            addon_title = addon_title[depth].text
        else:
            addon_title = found_link.text
    elif bFixed_main_title or (bMulti_enabled and found_link):  # enable the addon title by default for these options
        addon_title = found_link.text
    else:
        addon_title = ""
    # raise ValueError(addon_title)  # debug: see what we've found
    item_title = main_title + (" - " if addon_title != "" else "") + addon_title

    if bComment_fixed:  # static description
        item_description = sys.argv[3]
    elif aEval[3]:  # eval expression for the description
        item_description = eval(sys.argv[3])
        if isinstance(item_description, list):
            item_description = item_description[depth if len(item_description) > depth else 0]
    elif not bDefault_comment and (desc_l := len(tDescr := item.select(sys.argv[3]))) != 0:
        item_description = str(tDescr[depth if desc_l > depth else 0])  # keep the html; fall back to the 1st match if there aren't enough
        # item_description = item_description.replace('&', '&amp;').replace('<', '&le;')  # don't keep the html
    else:
        item_description = str(item)  # use everything inside the found item

    item_date = ""
    DateCurEl = ""
    if bFind_date:
        if ((tDate := eval(sys.argv[6])) if aEval[6] else (date_l := len(tDate := item.select(sys.argv[6])))) != 0:
            if aEval[6]:
                if not isinstance(tDate, list):
                    DateCurEl = tDate
                else:
                    date_l = len(tDate)

            if DateCurEl == "":
                if date_l > depth:
                    DateCurEl = tDate[depth]
                else:
                    DateCurEl = "no Date element found for THIS item (there are fewer date elements than total items)"

            if isinstance(DateCurEl, str):
                item_date = DateCurEl
            else:  # prefer machine-readable attributes, then fall back to the element's text
                item_date = (DateCurEl['datetime'] if DateCurEl.has_attr('datetime') else DateCurEl['alt'] if DateCurEl.has_attr('alt') else DateCurEl['title'] if DateCurEl.has_attr('title') else "") or DateCurEl.text
            try:
                item_date = maya.parse(item_date, get_localzone().key, bNotAmerican_Date).datetime().isoformat()
            except BaseException:
                try:  # maybe it's a relative ("2 hours ago") date
                    item_date = maya.when(item_date, get_localzone().key).datetime().isoformat()
                except BaseException:  # typically ValueError
                    # the date is invalid but the entry itself is fine - report the problem inside the entry instead of failing the whole feed
                    item_description += "\nCSS2RSS: Date '" + item_date + "' from element '" + str(DateCurEl).replace('&', '&amp;').replace('<', '&le;') + "' could not be parsed for this entry, please adjust your CSS selector: " + sys.argv[6].replace('&', '&amp;').replace('<', '&le;')
                    global found_items_w_bad_dates
                    found_items_w_bad_dates += 1
                    item_date = ""
        else:
            global found_items_wo_dates
            found_items_wo_dates += 1

    items.append("{{\"title\": {title}, \"content_html\": {html}, \"url\": {url}, \"date_published\": {date}}}".format(
        title=json.dumps(item_title),
        html=json.dumps(item_description),
        url=json.dumps(item_link),
        date=json.dumps(item_date)))

    if find_links_near:  # there are more links inside this item - revisit it for the next one
        global found_items_n
        found_items_n += 1
        css_to_rss(item, depth + 1)


# from urllib.request import Request, urlopen
# url = "https://en.wikivoyage.org/wiki/Main_Page"
# req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# web_byte = urlopen(req).read()
# input_data = web_byte.decode('utf-8')

sys.stdin.reconfigure(encoding='utf-8')
input_data = sys.stdin.read()

soup = BeautifulSoup(input_data, 'html.parser')
items = list()

# options: ! - fixed result text, @ - look for multiple links inside one item, ~ - do the default action, $ - eval code
if sys.argv[1].startswith('@'):
    sys.argv[1] = sys.argv[1][1:]
    bMulti_enabled = True
else:
    bMulti_enabled = False

aEval = [False] * 7  # which arguments are "$" eval expressions
i = 1
while i < len(sys.argv):
    if sys.argv[i].startswith('$'):
        sys.argv[i] = sys.argv[i][1:]  # cut the $
        aEval[i] = True
    i += 1

bDefault_main_title = False
bFixed_main_title = False
if len(sys.argv) > 2:
    if sys.argv[2] == '' or sys.argv[2][0] == '~':
        bDefault_main_title = True
    elif sys.argv[2][0] == '!':
        sys.argv[2] = sys.argv[2][1:]
        bFixed_main_title = True
else:
    bDefault_main_title = True

bDefault_addon_title = False
bFixed_addon_title = False
if len(sys.argv) > 5:
    if sys.argv[5] == '' or sys.argv[5][0] == '~':
        bDefault_addon_title = True
    elif sys.argv[5][0] == '!':
        sys.argv[5] = sys.argv[5][1:]
        bFixed_addon_title = True
else:
    bDefault_addon_title = True

bDefault_comment = False
bComment_fixed = False
if len(sys.argv) > 3:
    if sys.argv[3] == '' or sys.argv[3][0] == '~':
        bDefault_comment = True
    elif sys.argv[3][0] == '!':
        sys.argv[3] = sys.argv[3][1:]
        bComment_fixed = True
else:
    bDefault_comment = True

if len(sys.argv) > 4:
    bDefault_link = sys.argv[4] == '' or sys.argv[4][0] == '~'
else:
    bDefault_link = True

bFind_date = False
bNotAmerican_Date = True
if len(sys.argv) > 6:
    if sys.argv[6] != '' and sys.argv[6][0] != '~':
        bFind_date = True
        try:
            import maya
            from tzlocal import get_localzone
        except BaseException as e:
            raise SystemExit("Couldn't import the Maya module to parse dates - have you installed it? error: " + str(e))
error: ", e)) 231 | if sys.argv[6][0] == '?': 232 | sys.argv[6] = sys.argv[6][1:] 233 | bNotAmerican_Date = False 234 | 235 | # end options 236 | 237 | if aEval[1]: 238 | item_selector = eval(sys.argv[1]) 239 | else: 240 | item_selector = sys.argv[1] 241 | found_items = soup.select(item_selector) 242 | found_items_n = len(found_items) 243 | found_items_bad_n = 0 244 | found_items_bad_t = 0 245 | found_items_wo_dates = 0 246 | found_items_w_bad_dates = 0 247 | 248 | jsonfeed_version = "https://jsonfeed.org/version/1.1" 249 | description_addon = "" 250 | if found_items_n != 0: 251 | for item in found_items: 252 | css_to_rss(item, 0) 253 | if found_items_bad_n != 0: 254 | description_addon += ", Found items with NO Link: " + str(found_items_bad_n) 255 | if found_items_bad_t != 0: 256 | description_addon += ", Found items with NO Title: " + str(found_items_bad_t) 257 | if found_items_wo_dates != 0: 258 | description_addon += ", Found no Date item for: " + str(found_items_wo_dates) 259 | if found_items_w_bad_dates != 0: 260 | description_addon += ", Failed to parse Dates for: " + str(found_items_w_bad_dates) 261 | json_feed = "{{\"version\": {version}, \"title\": {title}, \"description\": {description}, \"items\": [{items}]}}" 262 | json_feed = json_feed.format(version = json.dumps(jsonfeed_version), title = json.dumps(soup.title.text), description = json.dumps("Script found "+str(found_items_n)+" items"+description_addon), items = ", ".join(items)) 263 | else: 264 | raise(SystemExit("CSS selector found no items - is the content generated with JavaScript? - Or did the website change its structure?")) 265 | items.append("{{\"title\": {title}, \"content_html\": {html}, \"url\": {url}}}".format( 266 | title=json.dumps("ERROR page @ " + str(datetime.datetime.now()) + (" - " + soup.title.text) if soup.title else ""), 267 | html=json.dumps(soup.prettify()), 268 | url=json.dumps(""))) 269 | json_feed = "{{\"version\": {version}, \"title\": {title}, \"description\": {description}, \"items\": [{items}]}}" 270 | json_feed = json_feed.format(version = json.dumps(jsonfeed_version), title = (json.dumps("ERROR: " + soup.title.text) if soup.title else json.dumps("ERROR")), description = json.dumps("Error: - CSS selector found no items"), items = ", ".join(items)) 271 | 272 | print(json_feed) 273 | --------------------------------------------------------------------------------