├── .github
│   └── workflows
│       └── jekyll-gh-pages.yml
├── LICENSE
├── README.md
├── REQUIREMENTS
└── bin
    ├── rsscluster.py
    ├── rsscount.py
    ├── rssdir.py
    ├── rssfind.py
    ├── rssinternetdraft.py
    └── rssmerge.py

/.github/workflows/jekyll-gh-pages.yml:
--------------------------------------------------------------------------------
1 | # Sample workflow for building and deploying a Jekyll site to GitHub Pages
2 | name: Deploy Jekyll with GitHub Pages dependencies preinstalled
3 | 
4 | on:
5 |   # Runs on pushes targeting the default branch
6 |   push:
7 |     branches: ["master"]
8 | 
9 |   # Allows you to run this workflow manually from the Actions tab
10 |   workflow_dispatch:
11 | 
12 | # Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
13 | permissions:
14 |   contents: read
15 |   pages: write
16 |   id-token: write
17 | 
18 | # Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
19 | # However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
20 | concurrency:
21 |   group: "pages"
22 |   cancel-in-progress: false
23 | 
24 | jobs:
25 |   # Build job
26 |   build:
27 |     runs-on: ubuntu-latest
28 |     steps:
29 |       - name: Checkout
30 |         uses: actions/checkout@v4
31 |       - name: Setup Pages
32 |         uses: actions/configure-pages@v4
33 |       - name: Build with Jekyll
34 |         uses: actions/jekyll-build-pages@v1
35 |         with:
36 |           source: ./
37 |           destination: ./_site
38 |       - name: Upload artifact
39 |         uses: actions/upload-pages-artifact@v3
40 | 
41 |   # Deployment job
42 |   deploy:
43 |     environment:
44 |       name: github-pages
45 |       url: ${{ steps.deployment.outputs.page_url }}
46 |     runs-on: ubuntu-latest
47 |     needs: build
48 |     steps:
49 |       - name: Deploy to GitHub Pages
50 |         id: deployment
51 |         uses: actions/deploy-pages@v4
52 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | Copyright 2007-2024 Alexandre Dulaunoy
2 | 
3 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
4 | 
5 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
6 | 
7 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
8 | 
9 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
10 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # RSS tools
2 | 
3 | Following an old idea from 2007, published in my ancient blog post titled [RSS Everything?](http://www.foo.be/cgi-bin/wiki.pl/2007-02-11_RSS_Everything), this set of tools is designed to work with RSS (Really Simple Syndication) in a manner consistent with the [Unix philosophy](http://en.wikipedia.org/wiki/Unix_philosophy).
4 | 
5 | The code committed in this repository is old Python code originally written in 2007. It might break your PC, harm your cat, or cause the Flying Spaghetti Monster to lose a meatball.
6 | 
7 | As 2024 marks the resurgence of RSS and Atom[^1], I decided to update my rudimentary RSS tools to make them contemporary.
8 | 
9 | [Forks and pull requests](https://github.com/adulau/rss-tools) are more than welcome. Be warned: this code was initially created for experimenting with RSS workflows.
10 | 
11 | ## Requirements
12 | 
13 | * Python 3
14 | * feedparser (plus requests, bs4, and orjson for rssfind; see the REQUIREMENTS file)
15 | 
16 | ## Tools
17 | 
18 | ### rssfind
19 | 
20 | [rssfind.py](https://github.com/adulau/rss-tools/blob/master/bin/rssfind.py) is a simple script designed to discover RSS or Atom feeds from a given URL.
21 | 
22 | It employs two techniques:
23 | 
24 | - The first involves searching for direct link references to the feed within the HTML page.
25 | - The second uses a brute-force approach, trying a series of known paths for feeds to determine if they are valid RSS or Atom feeds.
26 | 
27 | The script returns an array in JSON format containing all the potential feeds it discovers.
28 | 
29 | ~~~shell
30 | Usage: Find RSS or Atom feeds from a URL
31 | usage: rssfind.py [options]
32 | 
33 | Options:
34 |   -h, --help            show this help message and exit
35 |   -l LINK, --link=LINK  http link where to find one or more feed source(s)
36 |   -d, --disable-strict  Include empty feeds in the list, default strict is
37 |                         enabled
38 |   -b, --brute-force     Search RSS/Atom feeds by brute-forcing url path
39 |                         (useful if the page is missing a link entry)
40 | ~~~
41 | 
42 | ### rsscluster
43 | 
44 | [rsscluster.py](https://github.com/adulau/rss-tools/blob/master/bin/rsscluster.py) is a simple script that clusters items from an RSS feed based on a specified time interval, expressed in days.
45 | The `maxitem` parameter defines the maximum number of items to keep after clustering. This script can be particularly useful for platforms like Mastodon, where a user might be very active in a single day and you want to cluster their activity into a single RSS item for a defined time slot.
46 | 
47 | ~~~shell
48 | rsscluster.py --interval 2 --maxitem 20 "http://paperbay.org/@a.rss" > adulau.xml
49 | ~~~
50 | 
51 | ### rssmerge
52 | 
53 | [rssmerge.py](https://github.com/adulau/rss-tools/blob/master/bin/rssmerge.py) is a simple script designed to aggregate RSS feeds and merge them in reverse chronological order. It outputs the merged content in text, HTML, or Markdown format. This tool is useful for tracking recent events from various feeds and publishing them on your website.
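
At its core, the script collects the entries of every feed, normalises each entry's timestamp, and sorts the combined list newest-first. The idea fits in a few lines; here is a rough sketch (it assumes only that `feedparser` is installed, and the two URLs are placeholders):

~~~python
import time

import feedparser

# Hypothetical feeds; replace with the ones you want to merge.
urls = ["https://example.org/index.xml", "https://example.net/atom.xml"]

entries = []
for url in urls:
    for entry in feedparser.parse(url).entries:
        # Prefer the update date, fall back to the publication date.
        parsed = entry.get("updated_parsed") or entry.get("published_parsed")
        if parsed:  # skip entries without any usable date
            epoch = time.mktime(parsed)
            entries.append((epoch, entry.get("title", ""), entry.get("link", "")))

# Merge in reverse chronological order across all feeds.
entries.sort(key=lambda e: e[0], reverse=True)
for epoch, title, link in entries[:30]:
    print(f"- [{title}]({link})")
~~~

rssmerge.py itself adds output formats, summary truncation, and HTML escaping on top of this basic loop.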
54 | 
55 | ~~~shell
56 | python3 rssmerge.py --maxitem 30 --output markdown "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom" "http://www.foo.be/cgi-bin/wiki.pl?action=journal&tile=AdulauMessyDesk" "http://paperbay.org/@a.rss" "http://infosec.exchange/@adulau.rss"
57 | ~~~
58 | 
59 | ~~~shell
60 | Usage: rssmerge.py [options] url
61 | 
62 | Options:
63 |   -h, --help            show this help message and exit
64 |   -m MAXITEM, --maxitem=MAXITEM
65 |                         maximum item to list in the feed, default 200
66 |   -s SUMMARYSIZE, --summarysize=SUMMARYSIZE
67 |                         maximum size of the summary if a title is not present
68 |   -o OUTPUT, --output=OUTPUT
69 |                         output format (text, phtml, markdown), default text
70 | ~~~
71 | 
72 | ~~~shell
73 | python3 rssmerge.py --maxitem 5 --output markdown "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom" "http://www.foo.be/cgi-bin/wiki.pl?action=journal&tile=AdulauMessyDesk" "http://paperbay.org/@a.rss" "http://infosec.exchange/@adulau.rss"
74 | ~~~
75 | 
76 | #### Sample output from rssmerge
77 | 
78 | ~~~markdown
79 | 
80 | - [harvesting society #street #streetphotography #paris #societ](https://paperbay.org/@a/111908018263388808)
81 | - [harvesting society](https://www.flickr.com/photos/adulau/53520731553/)
82 | - [late in the night#bynight #leica #streetphotography #street ](https://paperbay.org/@a/111907960149305774)
83 | - [late in the night](https://www.flickr.com/photos/adulau/53520867709/)
84 | - [geography of illusion#photography #art #photo #bleu #blue #a](https://paperbay.org/@a/111907911876620745)
85 | ~~~
86 | 
87 | ### rssdir
88 | 
89 | [rssdir.py](https://github.com/adulau/rss-tools/blob/master/bin/rssdir.py) is a straightforward script that turns any directory on the filesystem into an RSS feed.
90 | 
91 | ~~~shell
92 | rssdir.py --prefix https://www.foo.be/cours/ . >rss.xml
93 | ~~~
94 | 
95 | ~~~shell
96 | Usage: rssdir.py [options] directory
97 | 
98 | Options:
99 |   -h, --help            show this help message and exit
100 |   -p PREFIX, --prefix=PREFIX
101 |                         http prefix to be used for each entry, default none
102 |   -t TITLE, --title=TITLE
103 |                         set a title to the rss feed, default using prefix
104 |   -l LINK, --link=LINK  http link set, default is prefix and none if prefix
105 |                         not set
106 |   -m MAXITEM, --maxitem=MAXITEM
107 |                         maximum item to list in the feed, default 32
108 | ~~~
109 | 
110 | ### rsscount
111 | 
112 | [rsscount.py](https://github.com/adulau/rss-tools/blob/master/bin/rsscount.py) is a straightforward script that counts the number of items in an RSS feed per day. It is used to build the [wiki creativity index](http://www.foo.be/cgi-bin/wiki.pl/WikiCreativityIndex). The script accepts any number of URL arguments, and its output can feed statistical tools.
113 | 
114 | ~~~shell
115 | python3 rsscount.py https://paperbay.org/@a.rss | sort
116 | 20240121	3
117 | 20240124	1
118 | 20240128	4
119 | 20240130	1
120 | 20240131	1
121 | 20240201	1
122 | 20240203	2
123 | 20240204	3
124 | 20240210	4
125 | ~~~
126 | 
127 | ## License
128 | 
129 | rss-tools are open source/free software licensed under the permissive 2-clause BSD license.
130 | 
131 | Copyright 2007-2024 Alexandre Dulaunoy
132 | 
133 | Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
134 | 
135 | 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
136 | 
137 | 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
138 | 
139 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
140 | 
141 | [^1]: As web platforms continue to deteriorate in quality, and with the diminishing visibility across various pseudo-social networks coupled with the decline of RSS culture, the emergence of new open-source, federated networks using ActivityPub (a protocol that plays a similar role to RSS) seems particularly timely. I believe that reviving open-source tools developed in 2007 for handling RSS is increasingly relevant. Many of these new federated platforms are revitalizing RSS, which is a trend that deserves encouragement and support.
142 | 
143 | 
--------------------------------------------------------------------------------
/REQUIREMENTS:
--------------------------------------------------------------------------------
1 | arrow
2 | bs4
3 | feedparser
4 | feedgen
5 | orjson
6 | requests
7 | 
--------------------------------------------------------------------------------
/bin/rsscluster.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
5 | #
6 | # rsscluster.py is a simple script to cluster items from an rss feed based on a
7 | # time interval (expressed in number of days). The maxitem option caps the
8 | # number of items kept after the clustering.
9 | #
10 | # an example use is for Mastodon, where an account can produce many toots in
11 | # one day that you want to cluster into a single RSS or (X)HTML item.
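#
# For instance, with --interval 2, entries whose publication times fall within
# the same two-day window are merged into one RSS item: the item's description
# concatenates the entry titles (or summaries) and the covered period is
# recorded as "from: <start> to: <end>".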
12 | #
13 | # example of use :
14 | # python3 rsscluster.py --interval 5 --maxitem 20 "https://paperbay.org/@a.rss" >adulau.xml
15 | 
16 | import feedparser
17 | import sys, os
18 | import time
19 | import datetime
20 | import xml.etree.ElementTree as ET
21 | import hashlib
22 | from optparse import OptionParser
23 | 
24 | # print sys.stdout.encoding
25 | version = "0.2"
26 | 
27 | feedparser.USER_AGENT = "rsscluster.py " + version + " +https://github.com/adulau/rss-tools"
28 | 
29 | 
30 | def date_as_rfc(value):
31 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
32 | 
33 | 
34 | def build_rss(myitem, maxitem):
35 | 
36 |     RSSroot = ET.Element("rss", {"version": "2.0"})
37 |     RSSchannel = ET.SubElement(RSSroot, "channel")
38 | 
39 |     ET.SubElement(RSSchannel, "title").text = (
40 |         "RSS cluster of " + str(url) + " per " + str(options.interval) + " days"
41 |     )
42 |     ET.SubElement(RSSchannel, "link").text = str(url)
43 |     ET.SubElement(RSSchannel, "description").text = (
44 |         "RSS cluster of " + str(url) + " per " + str(options.interval) + " days"
45 |     )
46 |     ET.SubElement(RSSchannel, "generator").text = "by rsscluster.py " + version
47 |     ET.SubElement(RSSchannel, "pubDate").text = date_as_rfc(time.time())
48 | 
49 |     for bloodyitem in myitem[0:maxitem]:
50 | 
51 |         RSSitem = ET.SubElement(RSSchannel, "item")
52 |         ET.SubElement(RSSitem, "title").text = (
53 |             "clustered data of "
54 |             + date_as_rfc(float(bloodyitem[0]))
55 |             + " for "
56 |             + str(url)
57 |         )
58 |         ET.SubElement(RSSitem, "pubDate").text = date_as_rfc(float(bloodyitem[0]))
59 |         ET.SubElement(RSSitem, "description").text = bloodyitem[1]
60 |         h = hashlib.md5()
61 |         h.update(bloodyitem[1].encode("utf-8"))
62 |         ET.SubElement(RSSitem, "guid").text = h.hexdigest()
63 | 
64 |     RSSfeed = ET.ElementTree(RSSroot)
65 |     feed = ET.tostring(RSSroot, encoding="unicode")  # return str, not bytes
66 |     return feed
67 | 
68 | 
69 | def complete_feed(myfeed):
70 | 
71 |     myheader = '<?xml version="1.0" encoding="UTF-8"?>'
72 |     return myheader + str(myfeed)
73 | 
74 | 
75 | def DaysInSec(val):
76 | 
77 |     return int(val) * 24 * 60 * 60
78 | 
79 | 
80 | usage = "usage: %prog [options] url"
81 | parser = OptionParser(usage)
82 | 
83 | parser.add_option(
84 |     "-m",
85 |     "--maxitem",
86 |     dest="maxitem",
87 |     help="maximum item to list in the feed, default 200",
88 | )
89 | parser.add_option(
90 |     "-i",
91 |     "--interval",
92 |     dest="interval",
93 |     help="time interval expressed in days, default 1 day",
94 | )
95 | 
96 | # 2007-11-10 11:25:51
97 | pattern = "%Y-%m-%d %H:%M:%S"
98 | 
99 | (options, args) = parser.parse_args()
100 | 
101 | if options.interval is None:
102 |     options.interval = 1
103 | 
104 | 
105 | if options.maxitem is None:
106 |     options.maxitem = 200
107 | 
108 | 
109 | if len(args) != 1:
110 |     parser.print_help()
111 |     parser.error("incorrect number of arguments")
112 | 
113 | allitem = {}
114 | url = args[0]
115 | 
116 | d = feedparser.parse(url)
117 | 
118 | if options.interval is None:
119 |     options.interval = 0
120 | 
121 | interval = DaysInSec(options.interval)
122 | 
123 | previousepoch = []
124 | clusteredepoch = []
125 | tcluster = []
126 | 
127 | for el in d.entries:
128 |     if 'modified_parsed' in el:
129 |         eldatetime = datetime.datetime.fromtimestamp(time.mktime(el.modified_parsed))
130 |     else:
131 |         eldatetime = datetime.datetime.fromtimestamp(time.mktime(el.published_parsed))
132 | 
133 |     elepoch = int(time.mktime(time.strptime(str(eldatetime), pattern)))
134 | 
135 |     if len(previousepoch):
136 | 
137 |         # print el.link, int(previousepoch[0])-int(elepoch), interval
138 | 
139 |         if len(clusteredepoch):
140 |             value = clusteredepoch.pop()
141 |         else:
142 |             value = ""
143 |         if 'title' in el:
144 |             clusteredepoch.append(value + ' ' + el.title)
145 |         else:
146 |             clusteredepoch.append(value + ' ' + el.summary)
147 | 
148 |         if not ((int(previousepoch[0]) - int(elepoch)) < interval):
149 | 
150 |             value = clusteredepoch.pop()
151 | 
152 |             starttimetuple = datetime.datetime.fromtimestamp(previousepoch[0])
153 |             endttimetuple = datetime.datetime.fromtimestamp(previousepoch.pop())
154 |             clusteredepoch.append(
155 |                 value
156 |                 + " from: "
157 |                 + str(starttimetuple.ctime())
158 |                 + " to: "
159 |                 + str(endttimetuple.ctime())
160 |             )
161 |             if previousepoch:
162 |                 startdatelist = str(previousepoch[0]), str(
163 |                     clusteredepoch[len(clusteredepoch) - 1]
164 |                 )
165 |                 tcluster.append(startdatelist)
166 |             del previousepoch[0 : len(previousepoch)]
167 |             del clusteredepoch[0 : len(clusteredepoch)]
168 |     else:
169 |         if 'title' in el:
170 |             clusteredepoch.append(' ' + el.title)
171 |         else:
172 |             clusteredepoch.append(' ' + el.summary)
173 | 
174 |     previousepoch.append(elepoch)
175 | 
176 | # if the last cluster list was not complete, we add the time period information.
177 | if len(previousepoch):
178 |     value = clusteredepoch.pop()
179 |     starttimetuple = datetime.datetime.fromtimestamp(previousepoch[0])
180 |     endttimetuple = datetime.datetime.fromtimestamp(previousepoch.pop())
181 |     clusteredepoch.append(
182 |         value
183 |         + " from: "
184 |         + str(starttimetuple.ctime())
185 |         + " to: "
186 |         + str(endttimetuple.ctime())
187 |     )
188 |     del previousepoch[0 : len(previousepoch)]
189 | 
190 | 
191 | tcluster.sort()
192 | tcluster.reverse()
193 | print(complete_feed(build_rss(tcluster, int(options.maxitem))))
--------------------------------------------------------------------------------
/bin/rsscount.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
5 | #
6 | # rsscount.py is a simple script to count how many items appear in an RSS feed per day
7 | #
8 | # The output is the date (YYYYMMDD) and the number of items, separated by a tab.
9 | #
10 | # This is used to build statistics like the wiki creativity index.
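#
# Example run (sample output taken from the README; columns are tab-separated):
#
#   $ python3 rsscount.py https://paperbay.org/@a.rss | sort
#   20240121	3
#   20240124	1
#   20240128	4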
11 | #
12 | 
13 | import feedparser
14 | import sys, os
15 | import time
16 | import datetime
17 | from optparse import OptionParser
18 | 
19 | feedparser.USER_AGENT = "rsscount.py +https://github.com/adulau/rss-tools"
20 | 
21 | usage = "usage: %prog url(s)"
22 | parser = OptionParser(usage)
23 | 
24 | (options, args) = parser.parse_args()
25 | 
26 | if not args:
27 |     parser.print_help()
28 | 
29 | counteditem = {}
30 | 
31 | for url in args:
32 | 
33 |     d = feedparser.parse(url)
34 |     for el in d.entries:
35 | 
36 |         if "modified_parsed" in el:
37 |             eldatetime = datetime.datetime.fromtimestamp(
38 |                 time.mktime(el.modified_parsed)
39 |             )
40 |         else:
41 |             eldatetime = datetime.datetime.fromtimestamp(
42 |                 time.mktime(el.published_parsed)
43 |             )
44 |         eventdate = eldatetime.isoformat(" ").split(" ", 1)
45 |         edate = eventdate[0].replace("-", "")
46 | 
47 |         if edate in counteditem:
48 |             counteditem[edate] = counteditem[edate] + 1
49 |         else:
50 |             counteditem[edate] = 1
51 | 
52 | 
53 | for k in list(counteditem.keys()):
54 | 
55 |     print(f"{k}\t{counteditem[k]}")
--------------------------------------------------------------------------------
/bin/rssdir.py:
--------------------------------------------------------------------------------
1 | # rssdir.py
2 | # a at foo dot be - Alexandre Dulaunoy - http://www.foo.be/cgi-bin/wiki.pl/RssAny
3 | #
4 | # rssdir is a simple-and-dirty script to rssify any directory on the filesystem.
5 | #
6 | # an example of use on the current directory :
7 | #
8 | # python3 /usr/local/bin/rssdir.py --prefix http://www.foo.be/cours/ . >rss.xml
9 | #
10 | 
11 | import os, fnmatch
12 | import time
13 | import sys
14 | import xml.etree.ElementTree as ET
15 | from optparse import OptionParser
16 | 
17 | version = "0.2"
18 | 
19 | # recursive list file function from the ASPN cookbook
20 | def all_files(root, patterns="*", single_level=False, yield_folders=False):
21 |     patterns = patterns.split(";")
22 |     for path, subdirs, files in os.walk(root):
23 |         if yield_folders:
24 |             files.extend(subdirs)
25 |         files.sort()
26 |         for name in files:
27 |             for pattern in patterns:
28 |                 if fnmatch.fnmatch(name, pattern):
29 |                     yield os.path.join(path, name)
30 |                     break
31 |         if single_level:
32 |             break
33 | 
34 | 
35 | def date_files(filelist):
36 |     date_filename_list = []
37 | 
38 |     for filename in filelist:
39 |         stats = os.stat(filename)
40 |         last_update = stats.st_mtime  # modification time (index 8)
41 |         date_filename_tuple = last_update, filename
42 |         date_filename_list.append(date_filename_tuple)
43 |     return date_filename_list
44 | 
45 | 
46 | def date_as_rfc(value):
47 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
48 | 
49 | 
50 | def build_rss(myitem, maxitem):
51 | 
52 |     RSSroot = ET.Element("rss", {"version": "2.0"})
53 |     RSSchannel = ET.SubElement(RSSroot, "channel")
54 | 
55 |     ET.SubElement(RSSchannel, "title").text = "RSS feed of " + str(title)
56 |     ET.SubElement(RSSchannel, "link").text = link
57 |     ET.SubElement(RSSchannel, "description").text = (
58 |         "A directory RSSified by rssdir.py " + version
59 |     )
60 |     ET.SubElement(RSSchannel, "generator").text = (
61 |         "A directory RSSified by rssdir.py " + version
62 |     )
63 |     ET.SubElement(RSSchannel, "pubDate").text = date_as_rfc(time.time())
64 | 
65 |     for bloodyitem in myitem[0:maxitem]:
66 | 
67 |         RSSitem = ET.SubElement(RSSchannel, "item")
68 |         ET.SubElement(RSSitem, "title").text = bloodyitem[1]
69 |         ET.SubElement(RSSitem, "pubDate").text = date_as_rfc(bloodyitem[0])
70 |         ET.SubElement(RSSitem, "description").text = prefixurl + bloodyitem[1]
71 |         ET.SubElement(RSSitem, "guid").text = prefixurl + bloodyitem[1]
72 | 
73 |     RSSfeed = ET.ElementTree(RSSroot)
74 |     feed = ET.tostring(RSSroot, encoding="unicode")  # return str, not bytes
75 |     return feed
76 | 
77 | 
78 | def complete_feed(myfeed):
79 | 
80 |     myheader = '<?xml version="1.0" encoding="UTF-8"?>'
81 |     return myheader + str(myfeed)
82 | 
83 | 
84 | usage = "usage: %prog [options] directory"
85 | parser = OptionParser(usage)
86 | 
87 | parser.add_option(
88 |     "-p",
89 |     "--prefix",
90 |     dest="prefix",
91 |     default="",
92 |     help="http prefix to be used for each entry, default none",
93 | )
94 | parser.add_option(
95 |     "-t",
96 |     "--title",
97 |     dest="title",
98 |     help="set a title to the rss feed, default using prefix",
99 |     type="string",
100 | )
101 | parser.add_option(
102 |     "-l",
103 |     "--link",
104 |     dest="link",
105 |     help="http link set, default is prefix and none if prefix not set",
106 | )
107 | parser.add_option(
108 |     "-m",
109 |     "--maxitem",
110 |     dest="maxitem",
111 |     help="maximum item to list in the feed, default 32",
112 |     default=32,
113 |     type="int",
114 | )
115 | 
116 | (options, args) = parser.parse_args()
117 | 
118 | if options.prefix is None:
119 |     prefixurl = ""
120 | else:
121 |     prefixurl = options.prefix
122 | 
123 | if options.link is None:
124 |     link = options.prefix
125 | else:
126 |     link = options.link
127 | 
128 | if options.title is None:
129 |     title = options.prefix
130 | else:
131 |     title = options.title
132 | 
133 | if options.maxitem is None:
134 |     maxitem = 32
135 | else:
136 |     maxitem = options.maxitem
137 | 
138 | if not args:
139 |     print("Missing directory")
140 |     parser.print_help()
141 |     sys.exit(0)
142 | 
143 | file_to_list = []
144 | for x in all_files(args[0]):
145 |     file_to_list.append(x)
146 | 
147 | mylist = date_files(file_to_list)
148 | 
149 | mylist.sort()
150 | mylist.reverse()
151 | 
152 | print(complete_feed(build_rss(mylist, maxitem)))
--------------------------------------------------------------------------------
/bin/rssfind.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python3
2 | # [rssfind.py](https://github.com/adulau/rss-tools/blob/master/bin/rssfind.py) is a simple script designed to discover RSS or Atom feeds from a given URL.
3 | #
4 | # It employs two techniques:
5 | #
6 | # - The first involves searching for direct link references to the feed within the HTML page.
7 | # - The second uses a brute-force approach, trying a series of known paths for feeds to determine if they are valid RSS or Atom feeds.
8 | #
9 | # The script returns an array in JSON format containing all the potential feeds it discovers.
10 | 
11 | import sys
12 | import urllib.parse
13 | from optparse import OptionParser
14 | import random
15 | 
16 | import feedparser
17 | import orjson as json
18 | import requests
19 | from bs4 import BeautifulSoup as bs4
20 | 
21 | brute_force_urls = [
22 |     "index.xml",
23 |     "feed/index.php",
24 |     "feed.xml",
25 |     "feed.atom",
26 |     "feed.rss",
27 |     "feed.json",
28 |     "feed.php",
29 |     "feed.asp",
30 |     "posts.rss",
31 |     "blog.xml",
32 |     "atom.xml",
33 |     "podcasts.xml",
34 |     "main.atom",
35 |     "main.xml",
36 | ]
37 | random.shuffle(brute_force_urls)
38 | 
39 | 
40 | def findfeeds(url=None, disable_strict=False):
41 |     if url is None:
42 |         return None
43 | 
44 |     raw = requests.get(url, headers=headers, timeout=10).text
45 |     results = []
46 |     discovered_feeds = []
47 |     html = bs4(raw, features="lxml")
48 |     feed_urls = html.findAll("link", rel="alternate")
49 |     if feed_urls:
50 |         for f in feed_urls:
51 |             tag = f.get("type", None)
52 |             if tag:
53 |                 if "feed" in tag or "rss" in tag or "xml" in tag:
54 |                     href = f.get("href", None)
55 |                     if href:
56 |                         discovered_feeds.append(href)
57 | 
58 |     parsed_url = urllib.parse.urlparse(url)
59 |     base = f"{parsed_url.scheme}://{parsed_url.hostname}"
60 |     ahreftags = html.findAll("a")
61 | 
62 |     for a in ahreftags:
63 |         href = a.get("href", None)
64 |         if href:
65 |             if "feed" in href or "rss" in href or "xml" in href:
66 |                 discovered_feeds.append(f"{base}{href}")
67 | 
68 |     for url in list(set(discovered_feeds)):
69 |         f = feedparser.parse(url)
70 |         if f.entries:
71 |             if url not in results:
72 |                 results.append(url)
73 | 
74 |     if disable_strict:
75 |         return list(set(discovered_feeds))
76 |     else:
77 |         return results
78 | 
79 | 
80 | def brutefindfeeds(url=None, disable_strict=False):
81 |     if url is None:
82 |         return None
83 |     found_urls = []
84 |     found_valid_feeds = []
85 |     parsed_url = urllib.parse.urlparse(url)
86 |     for path in brute_force_urls:
87 |         url = f"{parsed_url.scheme}://{parsed_url.hostname}/{path}"
88 |         r = requests.get(url, headers=headers, timeout=10)
89 |         if r.status_code == 200:
90 |             found_urls.append(url)
91 |     for url in list(set(found_urls)):
92 |         f = feedparser.parse(url)
93 |         if f.entries:
94 |             if url not in found_valid_feeds:
95 |                 found_valid_feeds.append(url)
96 |     if disable_strict:
97 |         return list(set(found_urls))
98 |     else:
99 |         return found_valid_feeds
100 | 
101 | 
102 | version = "0.2"
103 | 
104 | user_agent = f"rssfind.py {version} +https://github.com/adulau/rss-tools"
105 | 
106 | feedparser.USER_AGENT = user_agent
107 | 
108 | headers = {"User-Agent": user_agent}
109 | 
110 | usage = "Find RSS or Atom feeds from a URL\nusage: %prog [options]"
111 | 
112 | parser = OptionParser(usage)
113 | 
114 | parser.add_option(
115 |     "-l",
116 |     "--link",
117 |     dest="link",
118 |     help="http link where to find one or more feed source(s)",
119 | )
120 | 
121 | parser.add_option(
122 |     "-d",
123 |     "--disable-strict",
124 |     action="store_true",
125 |     default=False,
126 |     help="Include empty feeds in the list, default strict is enabled",
127 | )
128 | 
129 | parser.add_option(
130 |     "-b",
131 |     "--brute-force",
132 |     action="store_true",
133 |     default=False,
134 |     help="Search RSS/Atom feeds by brute-forcing url path (useful if the page is missing a link entry)",
135 | )
136 | 
137 | (options, args) = parser.parse_args()
138 | 
139 | if not options.link:
140 |     print("Link/url missing - -l option")
141 |     parser.print_help()
142 |     sys.exit(0)
143 | 
144 | if not options.brute_force:
145 |     print(
146 |         json.dumps(
147 |             findfeeds(url=options.link, disable_strict=options.disable_strict)
148 |         ).decode("utf-8")
149 |     )
150 | else:
151 |     print(
152 |         json.dumps(
153 |             brutefindfeeds(url=options.link, disable_strict=options.disable_strict)
154 |         ).decode("utf-8")
155 |     )
--------------------------------------------------------------------------------
/bin/rssinternetdraft.py:
--------------------------------------------------------------------------------
1 | #
2 | # quick-and-dirty(tm) script to gather IETF Internet-Draft announcements
3 | # from a mbox and to generate a nice RSS feed of the recent announcements.
4 | #
5 | # for more information : http://www.foo.be/ietf/id/
6 | 
7 | import mailbox
8 | import time
9 | import re
10 | import xml.etree.ElementTree as ET
11 | 
12 | date_rfc2822 = "%a, %d %b %Y %H:%M:%S"
13 | 
14 | tmsg = []
15 | 
16 | def date_as_rfc(value):
17 |     return time.strftime("%a, %d %b %Y %H:%M:%S +0000", time.localtime(value))
18 | 
19 | def build_rss(myitem,maxitem):
20 | 
21 |     RSSroot = ET.Element( 'rss', {'version':'2.0'} )
22 |     RSSchannel = ET.SubElement( RSSroot, 'channel' )
23 | 
24 |     ET.SubElement( RSSchannel, 'title' ).text = 'Latest Internet-Draft (IDs) Published - IETF - custom RSS feed'
25 |     ET.SubElement( RSSchannel, 'link' ).text = 'http://www.foo.be/ietf/id/'
26 |     ET.SubElement( RSSchannel, 'description' ).text = 'Latest Internet-Draft (IDs) Published - IETF - custom RSS feed'
27 |     ET.SubElement( RSSchannel, 'generator' ).text = 'rssany extended for parsing IETF IDs - http://www.foo.be/cgi-bin/wiki.pl/RssAny'
28 |     # ET.SubElement( RSSchannel, 'pubDate' ).text = date_as_rfc(time.time())
29 |     ET.SubElement( RSSchannel, 'pubDate' ).text = date_as_rfc(time.time()-10000)
30 | 
31 |     for bloodyitem in myitem[0:maxitem]:
32 |         RSSitem = ET.SubElement ( RSSchannel, 'item' )
33 |         ET.SubElement( RSSitem, 'title' ).text = bloodyitem[1]
34 |         ET.SubElement( RSSitem, 'pubDate' ).text = date_as_rfc(bloodyitem[0])
35 |         ET.SubElement( RSSitem, 'description').text = '<pre>'+bloodyitem[2]+'</pre>'
36 |         ET.SubElement( RSSitem, 'guid').text = "http://tools.ietf.org/html/"+bloodyitem[3]
37 |         ET.SubElement( RSSitem, 'link').text = "http://tools.ietf.org/html/"+bloodyitem[3]
38 |     RSSfeed = ET.ElementTree(RSSroot)
39 |     feed = ET.tostring(RSSroot, encoding="unicode")
40 |     return feed
41 | 
42 | for message in mailbox.mbox('/var/spool/mail/ietf'):
43 |     subject = message['subject']
44 |     date = message['date']
45 |     date_epoch = int(time.mktime(time.strptime(date[0:-12], date_rfc2822)))
46 |     message_id = message['Message-Id']
47 |     body = message.get_payload()[0].get_payload()
48 |     id = subject.split(":")[1].split(".")[0]
49 |     tmsg.append([date_epoch,subject,body,id])
50 | 
51 | tmsg.sort()
52 | tmsg.reverse()
53 | print(build_rss(tmsg, 100))
--------------------------------------------------------------------------------
/bin/rssmerge.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # -*- coding: utf-8 -*-
3 | #
4 | # a at foo dot be - Alexandre Dulaunoy - https://git.foo.be/adulau/rss-tools
5 | #
6 | # rssmerge.py is a simple script designed to aggregate RSS feeds and merge them in reverse chronological order.
7 | # It outputs the merged content in text, HTML, or Markdown format. This tool is useful for tracking recent events
8 | # from various feeds and publishing them on your website.
9 | #
10 | # Sample usage:
11 | #
12 | # python3 rssmerge.py "https://git.foo.be/adulau.rss" "http://api.flickr.com/services/feeds/photos_public.gne?id=31797858@N00&lang=en-us&format=atom"
13 | # "https://github.com/adulau.atom" -o markdown --maxitem 20
14 | 
15 | import feedparser
16 | import sys, os
17 | import time
18 | import datetime
19 | import hashlib
20 | from optparse import OptionParser
21 | import html
22 | from bs4 import BeautifulSoup
23 | from urllib.parse import urlparse
24 | 
25 | feedparser.USER_AGENT = "rssmerge.py +https://github.com/adulau/rss-tools"
26 | 
27 | 
28 | def RenderMerge(itemlist, output="text"):
29 |     i = 0
30 |     if output == "text":
31 |         for item in itemlist:
32 |             i = i + 1
33 |             # Keep a consistent datetime representation; otherwise use allitem[item[1]]['updated']
34 |             link = allitem[item[1]]["link"]
35 |             title = html.escape(allitem[item[1]]["title"])
36 |             timestamp = datetime.datetime.fromtimestamp(
37 |                 allitem[item[1]]["epoch"]
38 |             ).ctime()
39 |             print(f'{i}:{title}:{timestamp}:{link}')
40 | 
41 |             if i == int(options.maxitem):
42 |                 break
43 | 
44 |     if output == "phtml":
45 |         print("