├── .github
│   └── workflows
│       └── jekyll-gh-pages.yml
├── LICENSE
├── README.md
├── REQUIREMENTS
└── bin
    ├── rsscluster.py
    ├── rsscount.py
    ├── rssdir.py
    ├── rssfind.py
    ├── rssinternetdraft.py
    └── rssmerge.py

/.github/workflows/jekyll-gh-pages.yml:
--------------------------------------------------------------------------------
1 | # Sample workflow for building and deploying a Jekyll site to GitHub Pages
2 | name: Deploy Jekyll with GitHub Pages dependencies preinstalled
3 | 
4 | on:
5 |   # Runs on pushes targeting the default branch
6 |   push:
7 |     branches: ["master"]
8 | 
9 |   # Allows you to run this workflow manually from the Actions tab
10 |   workflow_dispatch:
11 | 
12 | # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
13 | permissions:
14 |   contents: read
15 |   pages: write
16 |   id-token: write
17 | 
18 | # Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
19 | # However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
20 | concurrency:
21 |   group: "pages"
22 |   cancel-in-progress: false
23 | 
24 | jobs:
25 |   # Build job
26 |   build:
27 |     runs-on: ubuntu-latest
28 |     steps:
29 |       - name: Checkout
30 |         uses: actions/checkout@v4
31 |       - name: Setup Pages
32 |         uses: actions/configure-pages@v4
33 |       - name: Build with Jekyll
34 |         uses: actions/jekyll-build-pages@v1
35 |         with:
36 |           source: ./
37 |           destination: ./_site
38 |       - name: Upload artifact
39 |         uses: actions/upload-pages-artifact@v3
40 | 
41 |   # Deployment job
42 |   deploy:
43 |     environment:
44 |       name: github-pages
45 |       url: ${{ steps.deployment.outputs.page_url }}
46 |     runs-on: ubuntu-latest
47 |     needs: build
48 |     steps:
49 |       - name: Deploy to GitHub Pages
50 |         id: deployment
51 |         uses: actions/deploy-pages@v4
52 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2007-2024 Alexandre Dulaunoy
2 | 
3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
4 | 
5 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
6 | 
7 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
8 | 
9 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # RSS tools
2 | 
3 | Following an old idea from 2007, published in my ancient blog post titled [RSS Everything?](http://www.foo.be/cgi-bin/wiki.pl/2007-02-11_RSS_Everything), this set of tools is designed to work with RSS (Really Simple Syndication) in a manner consistent with the [Unix philosophy](http://en.wikipedia.org/wiki/Unix_philosophy).
4 | 
5 | The code committed in this repository is old Python code originally written in 2007. It might break your PC, harm your cat, or cause the Flying Spaghetti Monster to lose a meatball.
6 | 
7 | As 2024 marks the resurgence of RSS and Atom[^1], I decided to update my rudimentary RSS tools to make them contemporary.
8 | 
9 | [Forks and pull requests](https://github.com/adulau/rss-tools) are more than welcome. Be warned: this code was initially created for experimenting with RSS workflows.
10 | 
11 | ## Requirements
12 | 
13 | * Python 3
14 | * feedparser (plus requests, bs4, and orjson for rssfind; see the REQUIREMENTS file)
15 | 
16 | ## Tools
17 | 
18 | ### rssfind
19 | 
20 | [rssfind.py](https://github.com/adulau/rss-tools/blob/master/bin/rssfind.py) is a simple script designed to discover RSS or Atom feeds from a given URL.
21 | 
22 | It employs two techniques:
23 | 
24 | - The first involves searching for direct link references to the feed within the HTML page.
25 | - The second uses a brute-force approach, trying a series of known paths for feeds to determine if they are valid RSS or Atom feeds.
26 | 
27 | The script returns an array in JSON format containing all the potential feeds it discovers.
28 | 
29 | ~~~shell
30 | Usage: Find RSS or Atom feeds from a URL
31 | usage: rssfind.py [options]
32 | 
33 | Options:
34 |   -h, --help            show this help message and exit
35 |   -l LINK, --link=LINK  http link where to find one or more feed source(s)
36 |   -d, --disable-strict  Include empty feeds in the list, default strict is
37 |                         enabled
38 |   -b, --brute-force     Search RSS/Atom feeds by brute-forcing url path
39 |                         (useful if the page is missing a link entry)
40 | ~~~
41 | 
42 | ### rsscluster
43 | 
44 | [rsscluster.py](https://github.com/adulau/rss-tools/blob/master/bin/rsscluster.py) is a simple script that clusters items from an RSS feed based on a specified time interval, expressed in days.
45 | The `maxitem` parameter defines the maximum number of items to keep after clustering. This script can be particularly useful for platforms like Mastodon, where a user might be very active in a single day and you want to cluster their activity into a single RSS item for a defined time slot.
46 | 
47 | ~~~shell
48 | rsscluster.py --interval 2 --maxitem 20 "http://paperbay.org/@a.rss" > adulau.xml
49 | ~~~
50 | 
51 | ### rssmerge
52 | 
53 | [rssmerge.py](https://github.com/adulau/rss-tools/blob/master/bin/rssmerge.py) is a simple script designed to aggregate RSS feeds and merge them in reverse chronological order. It outputs the merged content in text, HTML, or Markdown format. This tool is useful for tracking recent events from various feeds and publishing them on your website.
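
At its core, the script collects the entries of every feed, normalises each entry's timestamp, and sorts the combined list newest-first. The idea fits in a few lines; here is a rough sketch (it assumes only that `feedparser` is installed, and the two URLs are placeholders):

~~~python
import time

import feedparser

# Hypothetical feeds; replace with the ones you want to merge.
urls = ["https://example.org/index.xml", "https://example.net/atom.xml"]

entries = []
for url in urls:
    for entry in feedparser.parse(url).entries:
        # Prefer the update date, fall back to the publication date.
        parsed = entry.get("updated_parsed") or entry.get("published_parsed")
        if parsed:  # skip entries without any usable date
            epoch = time.mktime(parsed)
            entries.append((epoch, entry.get("title", ""), entry.get("link", "")))

# Merge in reverse chronological order across all feeds.
entries.sort(key=lambda e: e[0], reverse=True)
for epoch, title, link in entries[:30]:
    print(f"- [{title}]({link})")
~~~

rssmerge.py itself adds output formats, summary truncation, and HTML escaping on top of this basic loop.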
54 | 
55 | ~~~shell
56 | python3 rssmerge.py --maxitem 30 --output markdown "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom" "http://www.foo.be/cgi-bin/wiki.pl?action=journal&tile=AdulauMessyDesk" "http://paperbay.org/@a.rss" "http://infosec.exchange/@adulau.rss"
57 | ~~~
58 | 
59 | ~~~shell
60 | Usage: rssmerge.py [options] url
61 | 
62 | Options:
63 |   -h, --help            show this help message and exit
64 |   -m MAXITEM, --maxitem=MAXITEM
65 |                         maximum item to list in the feed, default 200
66 |   -s SUMMARYSIZE, --summarysize=SUMMARYSIZE
67 |                         maximum size of the summary if a title is not present
68 |   -o OUTPUT, --output=OUTPUT
69 |                         output format (text, phtml, markdown), default text
70 | ~~~
71 | 
72 | ~~~shell
73 | python3 rssmerge.py --maxitem 5 --output markdown "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom" "http://www.foo.be/cgi-bin/wiki.pl?action=journal&tile=AdulauMessyDesk" "http://paperbay.org/@a.rss" "http://infosec.exchange/@adulau.rss"
74 | ~~~
75 | 
76 | #### Sample output from rssmerge
77 | 
78 | ~~~markdown
79 | 
80 | - [harvesting society #street #streetphotography #paris #societ](https://paperbay.org/@a/111908018263388808)
81 | - [harvesting society](https://www.flickr.com/photos/adulau/53520731553/)
82 | - [late in the night#bynight #leica #streetphotography #street ](https://paperbay.org/@a/111907960149305774)
83 | - [late in the night](https://www.flickr.com/photos/adulau/53520867709/)
84 | - [geography of illusion#photography #art #photo #bleu #blue #a](https://paperbay.org/@a/111907911876620745)
85 | ~~~
86 | 
87 | ### rssdir
88 | 
89 | [rssdir.py](https://github.com/adulau/rss-tools/blob/master/bin/rssdir.py) is a straightforward script that turns any directory on the filesystem into an RSS feed.
90 | 
91 | ~~~shell
92 | rssdir.py --prefix https://www.foo.be/cours/ . >rss.xml
93 | ~~~
94 | 
95 | ~~~shell
96 | Usage: rssdir.py [options] directory
97 | 
98 | Options:
99 |   -h, --help            show this help message and exit
100 |   -p PREFIX, --prefix=PREFIX
101 |                         http prefix to be used for each entry, default none
102 |   -t TITLE, --title=TITLE
103 |                         set a title to the rss feed, default using prefix
104 |   -l LINK, --link=LINK  http link set, default is prefix and none if prefix
105 |                         not set
106 |   -m MAXITEM, --maxitem=MAXITEM
107 |                         maximum item to list in the feed, default 32
108 | ~~~
109 | 
110 | ### rsscount
111 | 
112 | [rsscount.py](https://github.com/adulau/rss-tools/blob/master/bin/rsscount.py) is a straightforward script that counts the number of items in an RSS feed per day. It is used to build the [wiki creativity index](http://www.foo.be/cgi-bin/wiki.pl/WikiCreativityIndex). The script accepts any number of URL arguments, and its output can feed statistical tools.
113 | 
114 | ~~~shell
115 | python3 rsscount.py https://paperbay.org/@a.rss | sort
116 | 20240121	3
117 | 20240124	1
118 | 20240128	4
119 | 20240130	1
120 | 20240131	1
121 | 20240201	1
122 | 20240203	2
123 | 20240204	3
124 | 20240210	4
125 | ~~~
126 | 
127 | ## License
128 | 
129 | rss-tools are open source/free software licensed under the permissive 2-clause BSD license.
130 | 
131 | Copyright 2007-2024 Alexandre Dulaunoy
132 | 
133 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
134 | 
135 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
136 | 
137 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
138 | 
139 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
140 | 
141 | [^1]: As web platforms continue to deteriorate in quality, and with the diminishing visibility across various pseudo-social networks coupled with the decline of RSS culture, the emergence of new open-source, federated networks using ActivityPub (a protocol that plays a similar role to RSS) seems particularly timely. I believe that reviving open-source tools developed in 2007 for handling RSS is increasingly relevant. Many of these new federated platforms are revitalizing RSS, which is a trend that deserves encouragement and support.
142 | 
143 | 
--------------------------------------------------------------------------------
/REQUIREMENTS:
--------------------------------------------------------------------------------
1 | arrow
2 | bs4
3 | feedparser
4 | feedgen
5 | orjson
6 | requests
7 | 
--------------------------------------------------------------------------------
/bin/rsscluster.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
5 | #
6 | # rsscluster.py is a simple script to cluster items from an rss feed based on a
7 | # time interval (expressed in number of days). The maxitem option caps the
8 | # number of items kept after the clustering.
9 | #
10 | # an example use is for Mastodon, where an account can produce many toots in
11 | # one day that you want to cluster into a single RSS or (X)HTML item.
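#
# For instance, with --interval 2, entries whose publication times fall within
# the same two-day window are merged into one RSS item: the item's description
# concatenates the entry titles (or summaries) and the covered period is
# recorded as "from: <start> to: <end>".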
12 | #
13 | # example of use :
14 | # python3 rsscluster.py --interval 5 --maxitem 20 "https://paperbay.org/@a.rss" >adulau.xml
15 | 
16 | import feedparser
17 | import sys, os
18 | import time
19 | import datetime
20 | import xml.etree.ElementTree as ET
21 | import hashlib
22 | from optparse import OptionParser
23 | 
24 | # print sys.stdout.encoding
25 | version = "0.2"
26 | 
27 | feedparser.USER_AGENT = "rsscluster.py " + version + " +https://github.com/adulau/rss-tools"
28 | 
29 | 
30 | def date_as_rfc(value):
31 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
32 | 
33 | 
34 | def build_rss(myitem, maxitem):
35 | 
36 |     RSSroot = ET.Element("rss", {"version": "2.0"})
37 |     RSSchannel = ET.SubElement(RSSroot, "channel")
38 | 
39 |     ET.SubElement(RSSchannel, "title").text = (
40 |         "RSS cluster of " + str(url) + " per " + str(options.interval) + " days"
41 |     )
42 |     ET.SubElement(RSSchannel, "link").text = str(url)
43 |     ET.SubElement(RSSchannel, "description").text = (
44 |         "RSS cluster of " + str(url) + " per " + str(options.interval) + " days"
45 |     )
46 |     ET.SubElement(RSSchannel, "generator").text = "by rsscluster.py " + version
47 |     ET.SubElement(RSSchannel, "pubDate").text = date_as_rfc(time.time())
48 | 
49 |     for bloodyitem in myitem[0:maxitem]:
50 | 
51 |         RSSitem = ET.SubElement(RSSchannel, "item")
52 |         ET.SubElement(RSSitem, "title").text = (
53 |             "clustered data of "
54 |             + date_as_rfc(float(bloodyitem[0]))
55 |             + " for "
56 |             + str(url)
57 |         )
58 |         ET.SubElement(RSSitem, "pubDate").text = date_as_rfc(float(bloodyitem[0]))
59 |         ET.SubElement(RSSitem, "description").text = bloodyitem[1]
60 |         h = hashlib.md5()
61 |         h.update(bloodyitem[1].encode("utf-8"))
62 |         ET.SubElement(RSSitem, "guid").text = h.hexdigest()
63 | 
64 |     RSSfeed = ET.ElementTree(RSSroot)
65 |     feed = ET.tostring(RSSroot, encoding="unicode")  # return str, not bytes
66 |     return feed
67 | 
68 | 
69 | def complete_feed(myfeed):
70 | 
71 |     myheader = '<?xml version="1.0" encoding="UTF-8"?>'
72 |     return myheader + str(myfeed)
73 | 
74 | 
75 | def DaysInSec(val):
76 | 
77 |     return int(val) * 24 * 60 * 60
78 | 
79 | 
80 | usage = "usage: %prog [options] url"
81 | parser = OptionParser(usage)
82 | 
83 | parser.add_option(
84 |     "-m",
85 |     "--maxitem",
86 |     dest="maxitem",
87 |     help="maximum item to list in the feed, default 200",
88 | )
89 | parser.add_option(
90 |     "-i",
91 |     "--interval",
92 |     dest="interval",
93 |     help="time interval expressed in days, default 1 day",
94 | )
95 | 
96 | # 2007-11-10 11:25:51
97 | pattern = "%Y-%m-%d %H:%M:%S"
98 | 
99 | (options, args) = parser.parse_args()
100 | 
101 | if options.interval is None:
102 |     options.interval = 1
103 | 
104 | 
105 | if options.maxitem is None:
106 |     options.maxitem = 200
107 | 
108 | 
109 | if len(args) != 1:
110 |     parser.print_help()
111 |     parser.error("incorrect number of arguments")
112 | 
113 | allitem = {}
114 | url = args[0]
115 | 
116 | d = feedparser.parse(url)
117 | 
118 | if options.interval is None:
119 |     options.interval = 0
120 | 
121 | interval = DaysInSec(options.interval)
122 | 
123 | previousepoch = []
124 | clusteredepoch = []
125 | tcluster = []
126 | 
127 | for el in d.entries:
128 |     if 'modified_parsed' in el:
129 |         eldatetime = datetime.datetime.fromtimestamp(time.mktime(el.modified_parsed))
130 |     else:
131 |         eldatetime = datetime.datetime.fromtimestamp(time.mktime(el.published_parsed))
132 | 
133 |     elepoch = int(time.mktime(time.strptime(str(eldatetime), pattern)))
134 | 
135 |     if len(previousepoch):
136 | 
137 |         # print el.link, int(previousepoch[0])-int(elepoch), interval
138 | 
139 |         if len(clusteredepoch):
140 |             value = clusteredepoch.pop()
141 |         else:
142 |             value = ""
143 |         if 'title' in el:
144 |             clusteredepoch.append(value + ' ' + el.title)
145 |         else:
146 |             clusteredepoch.append(value + ' ' + el.summary)
147 | 
148 |         if not ((int(previousepoch[0]) - int(elepoch)) < interval):
149 | 
150 |             value = clusteredepoch.pop()
151 | 
152 |             starttimetuple = datetime.datetime.fromtimestamp(previousepoch[0])
153 |             endttimetuple = datetime.datetime.fromtimestamp(previousepoch.pop())
154 |             clusteredepoch.append(
155 |                 value
156 |                 + " from: "
157 |                 + str(starttimetuple.ctime())
158 |                 + " to: "
159 |                 + str(endttimetuple.ctime())
160 |             )
161 |             if previousepoch:
162 |                 startdatelist = str(previousepoch[0]), str(
163 |                     clusteredepoch[len(clusteredepoch) - 1]
164 |                 )
165 |                 tcluster.append(startdatelist)
166 |             del previousepoch[0 : len(previousepoch)]
167 |             del clusteredepoch[0 : len(clusteredepoch)]
168 |     else:
169 |         if 'title' in el:
170 |             clusteredepoch.append(' ' + el.title)
171 |         else:
172 |             clusteredepoch.append(' ' + el.summary)
173 | 
174 |     previousepoch.append(elepoch)
175 | 
176 | # if the last cluster list was not complete, we add the time period information.
177 | if len(previousepoch):
178 |     value = clusteredepoch.pop()
179 |     starttimetuple = datetime.datetime.fromtimestamp(previousepoch[0])
180 |     endttimetuple = datetime.datetime.fromtimestamp(previousepoch.pop())
181 |     clusteredepoch.append(
182 |         value
183 |         + " from: "
184 |         + str(starttimetuple.ctime())
185 |         + " to: "
186 |         + str(endttimetuple.ctime())
187 |     )
188 |     del previousepoch[0 : len(previousepoch)]
189 | 
190 | 
191 | tcluster.sort()
192 | tcluster.reverse()
193 | print(complete_feed(build_rss(tcluster, int(options.maxitem))))
--------------------------------------------------------------------------------
/bin/rsscount.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
5 | #
6 | # rsscount.py is a simple script to count how many items appear in an RSS feed per day
7 | #
8 | # The output is the date (YYYYMMDD) and the number of items, separated by a tab.
9 | #
10 | # This is used to build statistics like the wiki creativity index.
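#
# Example run (sample output taken from the README; columns are tab-separated):
#
#   $ python3 rsscount.py https://paperbay.org/@a.rss | sort
#   20240121	3
#   20240124	1
#   20240128	4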
11 | #
12 | 
13 | import feedparser
14 | import sys, os
15 | import time
16 | import datetime
17 | from optparse import OptionParser
18 | 
19 | feedparser.USER_AGENT = "rsscount.py +https://github.com/adulau/rss-tools"
20 | 
21 | usage = "usage: %prog url(s)"
22 | parser = OptionParser(usage)
23 | 
24 | (options, args) = parser.parse_args()
25 | 
26 | if not args:
27 |     parser.print_help()
28 | 
29 | counteditem = {}
30 | 
31 | for url in args:
32 | 
33 |     d = feedparser.parse(url)
34 |     for el in d.entries:
35 | 
36 |         if "modified_parsed" in el:
37 |             eldatetime = datetime.datetime.fromtimestamp(
38 |                 time.mktime(el.modified_parsed)
39 |             )
40 |         else:
41 |             eldatetime = datetime.datetime.fromtimestamp(
42 |                 time.mktime(el.published_parsed)
43 |             )
44 |         eventdate = eldatetime.isoformat(" ").split(" ", 1)
45 |         edate = eventdate[0].replace("-", "")
46 | 
47 |         if edate in counteditem:
48 |             counteditem[edate] = counteditem[edate] + 1
49 |         else:
50 |             counteditem[edate] = 1
51 | 
52 | 
53 | for k in list(counteditem.keys()):
54 | 
55 |     print(f"{k}\t{counteditem[k]}")
--------------------------------------------------------------------------------
/bin/rssdir.py:
--------------------------------------------------------------------------------
1 | # rssdir.py
2 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
3 | #
4 | # rssdir is a simple-and-dirty script to rssify any directory on the filesystem.
5 | #
6 | # an example of use on the current directory :
7 | #
8 | # python3 /usr/local/bin/rssdir.py --prefix http://www.foo.be/cours/ . >rss.xml
9 | #
10 | 
11 | import os, fnmatch
12 | import time
13 | import sys
14 | import xml.etree.ElementTree as ET
15 | from optparse import OptionParser
16 | 
17 | version = "0.2"
18 | 
19 | # recursive list file function from the ASPN cookbook
20 | def all_files(root, patterns="*", single_level=False, yield_folders=False):
21 |     patterns = patterns.split(";")
22 |     for path, subdirs, files in os.walk(root):
23 |         if yield_folders:
24 |             files.extend(subdirs)
25 |         files.sort()
26 |         for name in files:
27 |             for pattern in patterns:
28 |                 if fnmatch.fnmatch(name, pattern):
29 |                     yield os.path.join(path, name)
30 |                     break
31 |         if single_level:
32 |             break
33 | 
34 | 
35 | def date_files(filelist):
36 |     date_filename_list = []
37 | 
38 |     for filename in filelist:
39 |         stats = os.stat(filename)
40 |         last_update = stats.st_mtime  # modification time (index 8)
41 |         date_filename_tuple = last_update, filename
42 |         date_filename_list.append(date_filename_tuple)
43 |     return date_filename_list
44 | 
45 | 
46 | def date_as_rfc(value):
47 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
48 | 
49 | 
50 | def build_rss(myitem, maxitem):
51 | 
52 |     RSSroot = ET.Element("rss", {"version": "2.0"})
53 |     RSSchannel = ET.SubElement(RSSroot, "channel")
54 | 
55 |     ET.SubElement(RSSchannel, "title").text = "RSS feed of " + str(title)
56 |     ET.SubElement(RSSchannel, "link").text = link
57 |     ET.SubElement(RSSchannel, "description").text = (
58 |         "A directory RSSified by rssdir.py " + version
59 |     )
60 |     ET.SubElement(RSSchannel, "generator").text = (
61 |         "A directory RSSified by rssdir.py " + version
62 |     )
63 |     ET.SubElement(RSSchannel, "pubDate").text = date_as_rfc(time.time())
64 | 
65 |     for bloodyitem in myitem[0:maxitem]:
66 | 
67 |         RSSitem = ET.SubElement(RSSchannel, "item")
68 |         ET.SubElement(RSSitem, "title").text = bloodyitem[1]
69 |         ET.SubElement(RSSitem, "pubDate").text = date_as_rfc(bloodyitem[0])
70 |         ET.SubElement(RSSitem, "description").text = prefixurl + bloodyitem[1]
71 |         ET.SubElement(RSSitem, "guid").text = prefixurl + bloodyitem[1]
72 | 
73 |     RSSfeed = ET.ElementTree(RSSroot)
74 |     feed = ET.tostring(RSSroot, encoding="unicode")  # return str, not bytes
75 |     return feed
76 | 
77 | 
78 | def complete_feed(myfeed):
79 | 
80 |     myheader = '<?xml version="1.0" encoding="UTF-8"?>'
81 |     return myheader + str(myfeed)
82 | 
83 | 
84 | usage = "usage: %prog [options] directory"
85 | parser = OptionParser(usage)
86 | 
87 | parser.add_option(
88 |     "-p",
89 |     "--prefix",
90 |     dest="prefix",
91 |     default="",
92 |     help="http prefix to be used for each entry, default none",
93 | )
94 | parser.add_option(
95 |     "-t",
96 |     "--title",
97 |     dest="title",
98 |     help="set a title to the rss feed, default using prefix",
99 |     type="string",
100 | )
101 | parser.add_option(
102 |     "-l",
103 |     "--link",
104 |     dest="link",
105 |     help="http link set, default is prefix and none if prefix not set",
106 | )
107 | parser.add_option(
108 |     "-m",
109 |     "--maxitem",
110 |     dest="maxitem",
111 |     help="maximum item to list in the feed, default 32",
112 |     default=32,
113 |     type="int",
114 | )
115 | 
116 | (options, args) = parser.parse_args()
117 | 
118 | if options.prefix is None:
119 |     prefixurl = ""
120 | else:
121 |     prefixurl = options.prefix
122 | 
123 | if options.link is None:
124 |     link = options.prefix
125 | else:
126 |     link = options.link
127 | 
128 | if options.title is None:
129 |     title = options.prefix
130 | else:
131 |     title = options.title
132 | 
133 | if options.maxitem is None:
134 |     maxitem = 32
135 | else:
136 |     maxitem = options.maxitem
137 | 
138 | if not args:
139 |     print("Missing directory")
140 |     parser.print_help()
141 |     sys.exit(0)
142 | 
143 | file_to_list = []
144 | for x in all_files(args[0]):
145 |     file_to_list.append(x)
146 | 
147 | mylist = date_files(file_to_list)
148 | 
149 | mylist.sort()
150 | mylist.reverse()
151 | 
152 | print(complete_feed(build_rss(mylist, maxitem)))
--------------------------------------------------------------------------------
/bin/rssfind.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python3
2 | # [rssfind.py](https://github.com/adulau/rss-tools/blob/master/bin/rssfind.py) is a simple script designed to discover RSS or Atom feeds from a given URL.
3 | #
4 | # It employs two techniques:
5 | #
6 | # - The first involves searching for direct link references to the feed within the HTML page.
7 | # - The second uses a brute-force approach, trying a series of known paths for feeds to determine if they are valid RSS or Atom feeds.
8 | #
9 | # The script returns an array in JSON format containing all the potential feeds it discovers.
10 | 
11 | import sys
12 | import urllib.parse
13 | from optparse import OptionParser
14 | import random
15 | 
16 | import feedparser
17 | import orjson as json
18 | import requests
19 | from bs4 import BeautifulSoup as bs4
20 | 
21 | brute_force_urls = [
22 |     "index.xml",
23 |     "feed/index.php",
24 |     "feed.xml",
25 |     "feed.atom",
26 |     "feed.rss",
27 |     "feed.json",
28 |     "feed.php",
29 |     "feed.asp",
30 |     "posts.rss",
31 |     "blog.xml",
32 |     "atom.xml",
33 |     "podcasts.xml",
34 |     "main.atom",
35 |     "main.xml",
36 | ]
37 | random.shuffle(brute_force_urls)
38 | 
39 | 
40 | def findfeeds(url=None, disable_strict=False):
41 |     if url is None:
42 |         return None
43 | 
44 |     raw = requests.get(url, headers=headers, timeout=10).text
45 |     results = []
46 |     discovered_feeds = []
47 |     html = bs4(raw, features="lxml")
48 |     feed_urls = html.findAll("link", rel="alternate")
49 |     if feed_urls:
50 |         for f in feed_urls:
51 |             tag = f.get("type", None)
52 |             if tag:
53 |                 if "feed" in tag or "rss" in tag or "xml" in tag:
54 |                     href = f.get("href", None)
55 |                     if href:
56 |                         discovered_feeds.append(href)
57 | 
58 |     parsed_url = urllib.parse.urlparse(url)
59 |     base = f"{parsed_url.scheme}://{parsed_url.hostname}"
60 |     ahreftags = html.findAll("a")
61 | 
62 |     for a in ahreftags:
63 |         href = a.get("href", None)
64 |         if href:
65 |             if "feed" in href or "rss" in href or "xml" in href:
66 |                 discovered_feeds.append(f"{base}{href}")
67 | 
68 |     for url in list(set(discovered_feeds)):
69 |         f = feedparser.parse(url)
70 |         if f.entries:
71 |             if url not in results:
72 |                 results.append(url)
73 | 
74 |     if disable_strict:
75 |         return list(set(discovered_feeds))
76 |     else:
77 |         return results
78 | 
79 | 
80 | def brutefindfeeds(url=None, disable_strict=False):
81 |     if url is None:
82 |         return None
83 |     found_urls = []
84 |     found_valid_feeds = []
85 |     parsed_url = urllib.parse.urlparse(url)
86 |     for path in brute_force_urls:
87 |         url = f"{parsed_url.scheme}://{parsed_url.hostname}/{path}"
88 |         r = requests.get(url, headers=headers, timeout=10)
89 |         if r.status_code == 200:
90 |             found_urls.append(url)
91 |     for url in list(set(found_urls)):
92 |         f = feedparser.parse(url)
93 |         if f.entries:
94 |             if url not in found_valid_feeds:
95 |                 found_valid_feeds.append(url)
96 |     if disable_strict:
97 |         return list(set(found_urls))
98 |     else:
99 |         return found_valid_feeds
100 | 
101 | 
102 | version = "0.2"
103 | 
104 | user_agent = f"rssfind.py {version} +https://github.com/adulau/rss-tools"
105 | 
106 | feedparser.USER_AGENT = user_agent
107 | 
108 | headers = {"User-Agent": user_agent}
109 | 
110 | usage = "Find RSS or Atom feeds from a URL\nusage: %prog [options]"
111 | 
112 | parser = OptionParser(usage)
113 | 
114 | parser.add_option(
115 |     "-l",
116 |     "--link",
117 |     dest="link",
118 |     help="http link where to find one or more feed source(s)",
119 | )
120 | 
121 | parser.add_option(
122 |     "-d",
123 |     "--disable-strict",
124 |     action="store_true",
125 |     default=False,
126 |     help="Include empty feeds in the list, default strict is enabled",
127 | )
128 | 
129 | parser.add_option(
130 |     "-b",
131 |     "--brute-force",
132 |     action="store_true",
133 |     default=False,
134 |     help="Search RSS/Atom feeds by brute-forcing url path (useful if the page is missing a link entry)",
135 | )
136 | 
137 | (options, args) = parser.parse_args()
138 | 
139 | if not options.link:
140 |     print("Link/url missing - -l option")
141 |     parser.print_help()
142 |     sys.exit(0)
143 | 
144 | if not options.brute_force:
145 |     print(
146 |         json.dumps(
147 |             findfeeds(url=options.link, disable_strict=options.disable_strict)
148 |         ).decode("utf-8")
149 |     )
150 | else:
151 |     print(
152 |         json.dumps(
153 |             brutefindfeeds(url=options.link, disable_strict=options.disable_strict)
154 |         ).decode("utf-8")
155 |     )
--------------------------------------------------------------------------------
/bin/rssinternetdraft.py:
--------------------------------------------------------------------------------
1 | #
2 | # quick-and-dirty(tm) script to gather IETF Internet-Draft announcements
3 | # from a mbox and to generate a nice RSS feed of the recent announcements.
4 | #
5 | # for more information : http://www.foo.be/ietf/id/
6 | 
7 | import mailbox
8 | import time
9 | import re
10 | import xml.etree.ElementTree as ET
11 | 
12 | date_rfc2822 = "%a, %d %b %Y %H:%M:%S"
13 | 
14 | tmsg = []
15 | 
16 | def date_as_rfc(value):
17 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
18 | 
19 | def build_rss(myitem,maxitem):
20 | 
21 |     RSSroot = ET.Element( 'rss', {'version':'2.0'} )
22 |     RSSchannel = ET.SubElement( RSSroot, 'channel' )
23 | 
24 |     ET.SubElement( RSSchannel, 'title' ).text = 'Latest Internet-Draft (IDs) Published - IETF - custom RSS feed'
25 |     ET.SubElement( RSSchannel, 'link' ).text = 'http://www.foo.be/ietf/id/'
26 |     ET.SubElement( RSSchannel, 'description' ).text = 'Latest Internet-Draft (IDs) Published - IETF - custom RSS feed'
27 |     ET.SubElement( RSSchannel, 'generator' ).text = 'rssany extended for parsing IETF IDs - http://www.foo.be/cgi-bin/wiki.pl/RssAny'
28 |     # ET.SubElement( RSSchannel, 'pubDate' ).text = date_as_rfc(time.time())
29 |     ET.SubElement( RSSchannel, 'pubDate' ).text = date_as_rfc(time.time()-10000)
30 | 
31 |     for bloodyitem in myitem[0:maxitem]:
32 |         RSSitem = ET.SubElement ( RSSchannel, 'item' )
33 |         ET.SubElement( RSSitem, 'title' ).text = bloodyitem[1]
34 |         ET.SubElement( RSSitem, 'pubDate' ).text = date_as_rfc(bloodyitem[0])
35 |         ET.SubElement( RSSitem, 'description').text = '<pre>'+bloodyitem[2]+'</pre>'
36 |         ET.SubElement( RSSitem, 'guid').text = "http://tools.ietf.org/html/"+bloodyitem[3]
37 |         ET.SubElement( RSSitem, 'link').text = "http://tools.ietf.org/html/"+bloodyitem[3]
38 |     RSSfeed = ET.ElementTree(RSSroot)
39 |     feed = ET.tostring(RSSroot, encoding="unicode")
40 |     return feed
41 | 
42 | for message in mailbox.mbox('/var/spool/mail/ietf'):
43 |     subject = message['subject']
44 |     date = message['date']
45 |     date_epoch = int(time.mktime(time.strptime(date[0:-12], date_rfc2822)))
46 |     message_id = message['Message-Id']
47 |     body = message.get_payload()[0].get_payload()
48 |     id = subject.split(":")[1].split(".")[0]
49 |     tmsg.append([date_epoch,subject,body,id])
50 | 
51 | tmsg.sort()
52 | tmsg.reverse()
53 | print(build_rss(tmsg, 100))
--------------------------------------------------------------------------------
/bin/rssmerge.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - https://git.foo.be/adulau/rss-tools
5 | #
6 | # rssmerge.py is a simple script designed to aggregate RSS feeds and merge them in reverse chronological order.
7 | # It outputs the merged content in text, HTML, or Markdown format. This tool is useful for tracking recent events
8 | # from various feeds and publishing them on your website.
9 | #
10 | # Sample usage:
11 | #
12 | # python3 rssmerge.py "https://git.foo.be/adulau.rss" "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom"
13 | # "https://github.com/adulau.atom" -o markdown --maxitem 20
14 | 
15 | import feedparser
16 | import sys, os
17 | import time
18 | import datetime
19 | import hashlib
20 | from optparse import OptionParser
21 | import html
22 | from bs4 import BeautifulSoup
23 | from urllib.parse import urlparse
24 | 
25 | feedparser.USER_AGENT = "rssmerge.py +https://github.com/adulau/rss-tools"
26 | 
27 | 
28 | def RenderMerge(itemlist, output="text"):
29 |     i = 0
30 |     if output == "text":
31 |         for item in itemlist:
32 |             i = i + 1
33 |             # Keep a consistent datetime representation; otherwise use allitem[item[1]]['updated']
34 |             link = allitem[item[1]]["link"]
35 |             title = html.escape(allitem[item[1]]["title"])
36 |             timestamp = datetime.datetime.fromtimestamp(
37 |                 allitem[item[1]]["epoch"]
38 |             ).ctime()
39 |             print(f'{i}:{title}:{timestamp}:{link}')
40 | 
41 |             if i == int(options.maxitem):
42 |                 break
43 | 
44 |     if output == "phtml":
45 |         print("