├── .gitignore
├── requirements.txt
├── output.gif
├── nasa_visualization.png
├── config.json
├── sample_config.json
├── README.md
├── visualize.py
├── visualization.html
└── LICENSE
/.gitignore:
--------------------------------------------------------------------------------
nasa_output
nasa
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
feedparser
sentence-transformers
pandas
nltk
numpy
beautifulsoup4
scikit-learn
scipy
tqdm
--------------------------------------------------------------------------------
/output.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/code2k13/feed-visualizer/HEAD/output.gif
--------------------------------------------------------------------------------
/nasa_visualization.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/code2k13/feed-visualizer/HEAD/nasa_visualization.png
--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
{
  "input_directory": "nasa",
  "output_directory": "nasa_output",
  "pretrained_model": "all-mpnet-base-v2",
  "clust_dist_threshold": 1,
  "tsne_iter": 8000,
  "text_max_length": 2048,
  "random_state": 45,
  "topic_str_min_df": 0.20
}
--------------------------------------------------------------------------------
/sample_config.json:
--------------------------------------------------------------------------------
{
  "input_directory": "nasa",
  "output_directory": "nasa_output",
  "pretrained_model": "all-mpnet-base-v2",
  "clust_dist_threshold": 1,
  "tsne_iter": 8000,
  "text_max_length": 2048,
  "random_state": 45,
  "topic_str_min_df": 0.20
}
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## Introduction

Feed Visualizer is a tool that clusters RSS/Atom feed items by semantic similarity and generates an interactive visualization. It can produce a 'semantic summary' of any website by reading its RSS/Atom feed. Shown below is an example of the visualization Feed Visualizer generates. If you like this tool, please consider giving it a ⭐ on GitHub!



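Under the hood, the semantic similarity comes from sentence-transformer embeddings: titles that mean similar things get nearby vectors. A minimal sketch of the idea (using the `all-mpnet-base-v2` model from the sample config; the example titles are made up):

```python
# Semantically related titles land close together in embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-mpnet-base-v2")
embeddings = model.encode([
    "NASA launches new Mars rover",
    "Perseverance rover lifts off for Mars",
    "Stock markets close higher today",
])
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low similarity
```
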
## Interactive Demos:

* Visualization of around 950 items from [Slashdot’s RSS](http://rss.slashdot.org/Slashdot/slashdotMain) feed:
📈https://ashishware.com/static/slashdot_viz.html

* Visualization of [NASA’s RSS](https://www.nasa.gov/rss/dyn/breaking_news.rss) feed:
📈https://ashishware.com/static/nasa_viz.html

* Visualization of [Martin Fowler's Atom](https://martinfowler.com/feed.atom) feed:
📈https://ashishware.com/static/martin_fowler_viz.html

* Visualization of [BBC's RSS](http://feeds.bbci.co.uk/news/rss.xml) feed:
📈https://ashishware.com/static/bbc_viz.html
## Quick Start

Clone the repo

```bash
git clone https://github.com/code2k13/feed-visualizer.git
```

Navigate to the newly created directory
```bash
cd feed-visualizer
```

Install the required modules
```bash
pip install -r requirements.txt
```

> Typically an RSS or Atom file only contains a website's most recent items. To gather older items as well, I would highly recommend the [wayback_machine_downloader](https://github.com/hartator/wayback-machine-downloader) tool. Follow the instructions on that page to install it.

The command below downloads snapshots of the public RSS feed from [NASA](https://www.nasa.gov/rss/dyn/breaking_news.rss) for January through June 2021 and saves them to a folder named 'nasa':
```bash
wayback_machine_downloader https://www.nasa.gov/rss/dyn/breaking_news.rss -s -f 202101 -t 202106 -d nasa
```
> Alternatively, you can simply create a new folder and place your RSS or Atom files in it (if you already have them), for example with the snippet below. Make sure to point your config to this folder (see the next step).

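If you just want a single, current copy of a feed to try the tool with, here is a minimal sketch (standard library only; the 'nasa' folder name matches the sample config):

```python
# Fetch one current copy of NASA's RSS feed into a local 'nasa' folder
# (the input_directory used by the sample config).
import os
import urllib.request

os.makedirs("nasa", exist_ok=True)
urllib.request.urlretrieve(
    "https://www.nasa.gov/rss/dyn/breaking_news.rss",
    "nasa/breaking_news.rss",
)
```
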
Now we need to create a config file for Feed Visualizer. The config file contains the path to the input directory, the name of the output directory, and some other settings (discussed later) that control the tool's output. This is what a sample configuration file looks like:

```json
{
  "input_directory": "nasa",
  "output_directory": "nasa_output",
  "pretrained_model": "all-mpnet-base-v2",
  "clust_dist_threshold": 1,
  "tsne_iter": 8000,
  "text_max_length": 2048,
  "random_state": 45,
  "topic_str_min_df": 0.20
}
```

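Before kicking off a long run, it can help to sanity-check the config. A minimal sketch (a hypothetical helper, not part of the repo) that fails fast if a required key is missing:

```python
# Hypothetical helper: verify config.json contains every key
# that visualize.py reads.
import json

REQUIRED_KEYS = {
    "input_directory", "output_directory", "pretrained_model",
    "clust_dist_threshold", "tsne_iter", "text_max_length",
    "random_state", "topic_str_min_df",
}

with open("config.json") as f:
    config = json.load(f)

missing = REQUIRED_KEYS - config.keys()
if missing:
    raise SystemExit(f"config.json is missing keys: {sorted(missing)}")
print("config.json looks complete.")
```
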
Now it's time to run the tool

```bash
python3 visualize.py -c config.json
```

Once the above command completes, you should see *visualization.html*, *data.csv*, and *convex_hulls.json* in the output folder (nasa_output). Copy these files to a webserver (or use a simple static server like [http-server](https://www.npmjs.com/package/http-server)) and view the visualization.html page in a browser.

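If you don't have a webserver handy, Python's built-in `http.server` module is enough for local viewing; a minimal sketch (the port number is just an example):

```python
# Serve the generated output folder locally (standard library only);
# equivalent to running `python3 -m http.server 8000` inside nasa_output.
import functools
import http.server

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory="nasa_output")
server = http.server.HTTPServer(("localhost", 8000), handler)
print("Open http://localhost:8000/visualization.html in a browser")
server.serve_forever()
```

You should see something like this:
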
75 | 
76 |
77 |
## Config settings

Here is some information on what each config setting does:

```json
{
  "input_directory": "Path to the input directory. It may contain subfolders, but should only contain RSS or Atom files.",
  "output_directory": "Path to the output directory where the visualization will be stored. The directory is created if not present. Its contents are always overwritten.",
  "pretrained_model": "Name of the pretrained model. A list of all valid model names: https://www.sbert.net/docs/pretrained_models.html#model-overview",
  "clust_dist_threshold": "Integer giving the maximum distance within a cluster. There is no single correct value here. Experiment!",
  "tsne_iter": "Integer giving the number of iterations for t-SNE (higher is better).",
  "text_max_length": "Integer giving the number of characters to read from the content/description for semantic encoding.",
  "random_state": "An integer that serves as the random seed while generating the visualization. Use the same random_state for reproducible results on the same data.",
  "topic_str_min_df": "A float. For example, a value of 0.25 means that only phrases present in 25% or more of a cluster's items are considered as candidates for the cluster's name."
}
```

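To get a feel for how `topic_str_min_df` behaves, here is a standalone illustration using scikit-learn's `CountVectorizer`, mirroring how `visualize.py` extracts candidate topic phrases (the titles are made up):

```python
# With min_df=0.5, only phrases appearing in at least half of the
# titles in a cluster survive as topic candidates.
from sklearn.feature_extraction.text import CountVectorizer

titles = [
    "mars rover landing",
    "mars rover update",
    "space station crew launch",
    "mars mission briefing",
]
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=0.5,
                             stop_words="english")
vectorizer.fit(titles)
# ['mars', 'mars rover', 'rover']: each appears in at least 2 of the 4 titles
print(vectorizer.get_feature_names_out())
```
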
## Issues/Feature Requests/Bugs

You can reach out to me on [👨💼 LinkedIn](https://www.linkedin.com/in/ashish-patil-66bb568/) or [🗨️Twitter](https://twitter.com/patilsaheb) to report issues or bugs, or to request features!

--------------------------------------------------------------------------------
/visualize.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import argparse
import glob
import json
import os
import shutil

import feedparser
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup, SoupStrainer
from scipy.spatial import ConvexHull
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from tqdm import tqdm

parser = argparse.ArgumentParser(
    description='Generates a cool visualization from Atom/RSS feeds!')
parser.add_argument('-c', '--configuration', required=True,
                    help='location of configuration file.')
args = parser.parse_args()

with open(args.configuration, 'r') as config_file:
    config = json.load(config_file)

semantic_encoder_model = SentenceTransformer(config["pretrained_model"])


def get_all_entries(path):
    """Parse every feed file under path and collect entries keyed by link."""
    all_entries = {}
    # '**' with recursive=True already matches any depth of subfolders.
    files = glob.glob(path + "/**/*.*", recursive=True)
    for file in tqdm(files, desc='Reading posts from files'):
        feed = feedparser.parse(file)
        for entry in feed['entries']:
            if 'summary' in entry:
                all_entries[entry['link']] = [
                    entry['title'], entry['title'] + " " + entry['summary']]
            elif 'content' in entry:
                all_entries[entry['link']] = [
                    entry['title'], entry['title'] + " " + entry['content'][0]['value']]
    return all_entries


def generate_text_for_entry(raw_text, entry_counts):
    """Strip HTML from an entry's text and count links pointing at known entries."""
    raw_text = raw_text.replace("\n", " ")
    soup = BeautifulSoup(raw_text, features="html.parser")
    output = [soup.text]
    for link in BeautifulSoup(raw_text, parse_only=SoupStrainer('a'),
                              features="html.parser"):
        if link.has_attr('href'):
            url = link['href']
            # Count every occurrence of the link, including the first one.
            entry_counts[url] = entry_counts.get(url, 0) + 1
    return ' '.join(output)


def generate_embeddings(entries, entry_counts):
    """Encode each entry's text and append its embedding vector in place."""
    sentences = [generate_text_for_entry(
        entries[a][1][0:config["text_max_length"]], entry_counts) for a in entries]
    print('Generating embeddings ...')
    embeddings = semantic_encoder_model.encode(sentences)
    print('Generating embeddings ... Done !')
    for index, uri in enumerate(entries):
        entries[uri].append(embeddings[index])
    return entries


def get_coordinates(entries):
    """Project embeddings to 2D with t-SNE and cluster the projected points."""
    X = np.array([entries[e][-1] for e in entries])
    tsne = TSNE(n_iter=config["tsne_iter"], init='pca',
                learning_rate='auto', random_state=config["random_state"])
    clustering_model = AgglomerativeClustering(
        distance_threshold=config["clust_dist_threshold"], n_clusters=None)
    tsne_output = tsne.fit_transform(X)
    # Min-max normalize the coordinates so clust_dist_threshold works on a
    # predictable [0, 1] scale.
    tsne_output = (tsne_output - tsne_output.min()) / \
        (tsne_output.max() - tsne_output.min())
    clusters = clustering_model.fit_predict(tsne_output)
    # Reuse the projection computed above; t-SNE is expensive and its output
    # varies between runs, so the same array serves both axes.
    return tsne_output[:, 0].tolist(), tsne_output[:, 1].tolist(), clusters


def find_topics(df):
    """Pick a representative phrase for each cluster from its item titles."""
    topics = []
    for i in range(0, df["cluster"].max() + 1):
        try:
            df_text = df[df['cluster'] == i]["label"]
            vectorizer = CountVectorizer(ngram_range=(1, 2),
                                         min_df=config["topic_str_min_df"],
                                         stop_words='english')
            vectorizer.fit(df_text)
            possible_topics = vectorizer.get_feature_names_out()
            # Prefer the longest qualifying phrase as the cluster name.
            idx_topic = np.argmax([len(a) for a in possible_topics])
            topics.append(possible_topics[idx_topic])
        except Exception:
            # A cluster can be too small for any phrase to clear min_df.
            topics.append("NA")
    return topics


def get_convex_hulls(df):
    """Compute the convex hull outline of each cluster for drawing its boundary."""
    convex_hulls = []
    cluster_labels = df['cluster'].unique()
    cluster_labels.sort()
    for label in cluster_labels:
        cluster_data = df.loc[df['cluster'] == label]
        x = cluster_data['x'].values
        y = cluster_data['y'].values
        points = np.column_stack((x, y))
        hull = ConvexHull(points)
        # Repeat the first vertex so the outline forms a closed polygon.
        hull_points = np.append(hull.vertices, hull.vertices[0])
        convex_hulls.append(
            {"x": x[hull_points].tolist(), "y": y[hull_points].tolist()})
    return convex_hulls


def main():
    all_entries = get_all_entries(config["input_directory"])
    entry_counts = {}
    entry_texts = []
    distinct_entries = {}
    # De-duplicate entries that share the same title (e.g. the same post
    # captured in multiple feed snapshots).
    for k in all_entries.keys():
        if all_entries[k][0] not in entry_texts:
            distinct_entries[k] = all_entries[k]
            entry_texts.append(all_entries[k][0])

    all_entries = distinct_entries
    entries = generate_embeddings(all_entries, entry_counts)
    print('Creating clusters ...')
    x, y, cluster_info = get_coordinates(entries)
    print('Creating clusters ... Done !')
    labels = [entries[k][0] for k in entries]
    counts = [entry_counts.get(k, 0) for k in entries]
    df = pd.DataFrame({'x': x, 'y': y, 'label': labels,
                       'count': counts, 'url': list(entries.keys()),
                       'cluster': cluster_info})

    print('Assigning cluster names ...')
    topics = find_topics(df)
    df["topic"] = df["cluster"].apply(lambda x: topics[x])
    if not os.path.exists(config["output_directory"]):
        os.makedirs(config["output_directory"])
    df.to_csv(config["output_directory"] + "/data.csv")
    convex_hulls = get_convex_hulls(df)
    with open(config["output_directory"] + '/convex_hulls.json', 'w') as f:
        f.write(json.dumps(convex_hulls))
    shutil.copy('visualization.html', config["output_directory"])
    print('Visualization generation is complete !!')


if __name__ == "__main__":
    main()
--------------------------------------------------------------------------------
/visualization.html:
--------------------------------------------------------------------------------
(HTML markup not captured; the page's visible placeholder text reads "Add information about the visualization here !")