├── LICENSE
├── README.md
├── example_extensive.py
├── example_extensive2.py
├── example_minimal.py
├── example_output
    ├── 8K.png
    ├── example_extensive2_output.png
    ├── example_extensive_output.png
    └── example_minimal_output.png
├── fonts
    ├── ShortStack-Regular.ttf
    ├── Sketch Serif.ttf
    ├── chint___.ttf
    └── readme.txt
├── requirements.txt
└── stackoverflow_users_taginfo.py


/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2017 Divakar
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 2 | 
 3 | # Stack Overflow (Stack Exchange) Users Tag Cloud
 4 | [**Stack Exchange**](http://stackexchange.com/) is a network of `Q&A` websites covering various fields. Users earn reputation and tag scores based on the tags of the questions and answers they contribute to. Each user's profile page has a tag section that lists the tag names and their respective scores. The Python scripts in this repository parse and extract the tag names and scores, which can then be fed to the [wordcloud module for Python](https://github.com/amueller/word_cloud) to produce a word cloud image in which the tags are the words and their sizes scale with their scores. The scripts can extract this information from [**all Stack Exchange Q&A sites**](http://stackexchange.com/sites), including, of course, its biggest `Q&A` site,
 5 | [**Stack Overflow**](http://stackoverflow.com/).
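
As a rough sketch of the last step described above, the extracted names and scores end up as a plain `{tag: score}` dictionary, which is the shape that `WordCloud.generate_from_frequencies` accepts. The helper name `build_frequencies` and the sample scores below are made up for illustration; they are not part of this repository.

```python
def build_frequencies(names, scores, limit=None):
    """Pair tag names with scores and keep the `limit` highest-scoring tags.

    Hypothetical helper, written only to illustrate the data shape.
    """
    pairs = sorted(zip(names, scores), key=lambda p: p[1], reverse=True)
    if limit is not None:
        pairs = pairs[:limit]
    freqs = dict(pairs)
    # If every score is zero, fall back to equal weights so that a
    # frequency-based renderer still has something to scale.
    if freqs and max(freqs.values()) == 0:
        freqs = dict.fromkeys(freqs, 1)
    return freqs


# Made-up scores, for illustration only.
freqs = build_frequencies(["c#", "java", ".net"], [900, 300, 450], limit=2)
print(freqs)
```

A dictionary like `freqs` could then be handed to the word cloud module in place of the scraped data.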
 6 | 
 7 | ## Examples
 8 | 
 9 | Let's take Stack Overflow's highest-reputation user [Jon Skeet](http://stackoverflow.com/users/22656/jon-skeet) as the sample. His profile page link has the ID `22656`, so a minimal Python script to generate his tag-cloud would be:
10 | 
11 | ```python
12 | >>> from stackoverflow_users_taginfo import tag_cloud
13 | >>> tag_cloud(link = 22656)
14 | ```
15 | 
16 | Giving it more options, here's a tag-cloud with the first **`1000`** tags on a `4K canvas` being produced using [example_extensive.py](https://github.com/droyed/stackoverflow_tag_cloud/blob/master/example_extensive.py) -
17 | 
18 | ![Screenshot](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/example_extensive_output.png)
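
Tags are listed page by page on a profile, so asking for the first `1000` tags means fetching several pages. The arithmetic can be sketched as below, assuming Python 3. The function name `pages_needed` and the numbers used are hypothetical; the real totals and page size come from the scraped profile page.

```python
import math


def pages_needed(total_tags, tags_per_page, limit=None):
    """Number of tag pages to fetch for at most `limit` tags (illustrative)."""
    wanted = total_tags if limit is None else min(limit, total_tags)
    if wanted <= 0 or tags_per_page <= 0:
        return 0
    # Round up: a partially needed page still has to be fetched.
    return math.ceil(wanted / tags_per_page)


# Hypothetical figures: 5000 tags in total, 52 tags shown per page.
print(pages_needed(5000, 52, limit=1000))
```

Capping the request at the user's actual tag total avoids asking for pages that do not exist.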
19 | 
20 | As a demo of extracting tag information and generating a tag-cloud from other Q&A sites, here's a tag-cloud of [Jon Skeet's meta.stackexchange profile](http://meta.stackexchange.com/users/22656) generated with [example_extensive2.py](https://github.com/droyed/stackoverflow_tag_cloud/blob/master/example_extensive2.py) -
21 | 
22 | ![Screenshot](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/example_extensive2_output.png)
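
The two examples above pass `link` either as a numeric Stack Overflow user ID or as a full profile URL. A minimal sketch of how both forms can map to a tag-page URL follows. The helper name `tag_page_url` is hypothetical, and the digit check is one possible way to tell the two forms apart; the query string is the one used in `stackoverflow_users_taginfo.py`.

```python
def tag_page_url(link, page=1):
    """Build the URL of one page of a user's tag list (illustrative)."""
    link = str(link)
    if link.isdigit():
        # A bare number is taken to be a Stack Overflow user ID.
        base = "http://stackoverflow.com/users/" + link
    else:
        # Anything else is taken to be a full profile URL on any site.
        base = link.rstrip("/")
    return base + "?tab=tags&sort=votes&page=" + str(page)


print(tag_page_url(22656))
print(tag_page_url("http://meta.stackexchange.com/users/22656", page=3))
```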
23 | 
24 | We are living in the `8K` age, so here's [Jon Skeet's `1000` tags on an `8K` canvas](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/8K.png)!
25 | 
26 | ## Requirements
27 | * Python 2.x or 3.x.
28 | * Python modules : NumPy, Requests, urllib, lxml (the parser name passed to BeautifulSoup in the script).
29 | * [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) - To extract html information. Works with version 4.4.1, might work with older versions too, but not tested. 
30 | * [Word_cloud](https://github.com/amueller/word_cloud) - Word cloud creation : Version 1.3.1 or newer.
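
A quick way to check whether these modules are importable before running the examples is sketched below, assuming Python 3. The import names listed are the ones used by `stackoverflow_users_taginfo.py`, plus `lxml` because the script passes `"lxml"` to BeautifulSoup; the helper `missing_modules` itself is hypothetical.

```python
import importlib.util

# Import names of the third-party modules the script relies on.
REQUIRED = ["bs4", "numpy", "requests", "wordcloud", "lxml"]


def missing_modules(names=REQUIRED):
    """Return the names from `names` that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


print(missing_modules())
```

An empty list means all of the listed modules were found.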
31 | 


--------------------------------------------------------------------------------
/example_extensive.py:
--------------------------------------------------------------------------------
1 | from stackoverflow_users_taginfo import tag_cloud
2 | 
3 | tag_cloud(link=22656, lim_num_tags=1000, image_dims=(3840, 2160),
4 |           out_filepath="example_extensive_output.png")
5 | 


--------------------------------------------------------------------------------
/example_extensive2.py:
--------------------------------------------------------------------------------
1 | from stackoverflow_users_taginfo import tag_cloud
2 | 
3 | tag_cloud(link="http://meta.stackexchange.com/users/22656",
4 |           image_dims=(1920, 1200),
5 |           out_filepath="example_extensive2_output.png")
6 | 


--------------------------------------------------------------------------------
/example_minimal.py:
--------------------------------------------------------------------------------
1 | from stackoverflow_users_taginfo import tag_cloud
2 | 
3 | tag_cloud(link=22656)
4 | 


--------------------------------------------------------------------------------
/example_output/8K.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/8K.png


--------------------------------------------------------------------------------
/example_output/example_extensive2_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_extensive2_output.png


--------------------------------------------------------------------------------
/example_output/example_extensive_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_extensive_output.png


--------------------------------------------------------------------------------
/example_output/example_minimal_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_minimal_output.png


--------------------------------------------------------------------------------
/fonts/ShortStack-Regular.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/ShortStack-Regular.ttf


--------------------------------------------------------------------------------
/fonts/Sketch Serif.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/Sketch Serif.ttf


--------------------------------------------------------------------------------
/fonts/chint___.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/chint___.ttf


--------------------------------------------------------------------------------
/fonts/readme.txt:
--------------------------------------------------------------------------------
1 | Please note that the fonts were obtained from :
2 | 
3 | http://www.dafont.com/chinacat.font
4 | http://www.dafont.com/sketch-serif.font
5 | http://www.dafont.com/short-stack.font


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | beautifulsoup4==4.10.0
2 | numpy==1.20.3
3 | requests==2.22.0
4 | wordcloud==1.8.1
5 | 


--------------------------------------------------------------------------------
/stackoverflow_users_taginfo.py:
--------------------------------------------------------------------------------
  1 | from bs4 import BeautifulSoup
  2 | import numpy as np
  3 | import requests
  4 | import itertools
  5 | import sys
  6 | from wordcloud import WordCloud
  7 | 
  8 | 
  9 | def find_between(in_str, start='>', end='<'):
 10 |     """ Find string between two search patterns.
 11 |     """
 12 |     return in_str.split(start)[1].split(end)[0]
 13 | 
 14 | 
 15 | def unquote_str(in_str):
 16 |     """ Decode URL strings. For example, replace %2b with plus sign and so on.
 17 |     """
 18 |     if (sys.version_info >= (3, 0)):
 19 |         from urllib.parse import unquote
 20 |     else:
 21 |         from urllib import unquote
 22 |     return unquote(in_str)
 23 | 
 24 | 
 25 | def toint(a):
 26 |     """ Convert string to int and also scale them accordingly if they end
 27 |     in "k", "m" or "b".
 28 |     """
 29 |     weights = {'k': 1000, 'm': 1000000, 'b': 1000000000}
 30 |     if len(a) > 1:
 31 |         if a[-1] in weights:
 32 |             return int(round(weights[a[-1]] * float(a[:-1])))
 33 |     return int(a.replace(',', ''))
 34 | 
 35 | 
 36 | def info_mainpage(url):
 37 |     """ Given the main tag page, this function gets basic information about tag
 38 |     pages and tag names and their scores as well.
 39 |     On the basic info, there are three numbers scraped :
 40 |     1. Number of tag pages
 41 |     2. Total number of tags
 42 |     3. Number of tags per page
 43 | 
 44 |     Parameters
 45 |     ----------
 46 |     url : string
 47 |         URL link to user's main tag page.
 48 | 
 49 |     Output
 50 |     ------
 51 |     pginfo : dict
 52 |         Dictionary that holds the three page info as listed earlier.
 53 | 
 54 |     name : list of strings
 55 |         Holds the tag names
 56 | 
 57 |     count : list of ints
 58 |         Holds the tag scores
 59 |     """
 60 | 
 61 |     soup = BeautifulSoup(requests.get(url).text, "lxml")
 62 | 
 63 |     pg_blk = soup.find_all("div", class_="pager fr")
 64 |     tag_blk = soup.find_all("div", class_="answer-votes")
 65 |     tag_blk_str = [str(i) for i in tag_blk]
 66 |     str0 = find_between(str(soup.find_all("span", class_="count")))
 67 |     lim_ntags = int(str0.replace(',', ''))
 68 | 
 69 |     max_tags = None
 70 |     num_pages = 1
 71 |     if len(pg_blk) != 0:
 72 |         max_tags = len(tag_blk)
 73 |         last_page_blk = pg_blk[0].find_all("span", class_="page-numbers")[-2]
 74 |         num_pages = int(find_between(str(last_page_blk)))
 75 |     pginfo = {'pages': num_pages, 'tags': lim_ntags, 'tags_perpage': max_tags}
 76 | 
 77 |     name = [unquote_str(find_between(i, '[', ']')) for i in tag_blk_str]
 78 |     count = [toint(find_between(i)) for i in tag_blk_str]
 79 |     return pginfo, name, count
 80 | 
 81 | 
 82 | def stackoverflow_taginfo(url):
 83 |     """ Get information about a user's tags from their Stack Overflow
 84 |     tag pages fed as the input URL. Mainly two pieces of information are
 85 |     scraped : tag names and their respective counts/scores.
 86 | 
 87 |     Parameters
 88 |     ----------
 89 |     url : string
 90 |         URL link to user's main tag page.
 91 | 
 92 |     Output
 93 |     ------
 94 |     name : list of strings
 95 |         Holds the tag names
 96 | 
 97 |     count : list of ints
 98 |         Holds the tag scores
 99 |     """
100 | 
101 |     soup = BeautifulSoup(requests.get(url).text, "lxml")
102 |     tag_blk = soup.find_all("div", class_="answer-votes")
103 |     tag_blk_str = [str(i) for i in tag_blk]
104 |     name = [unquote_str(find_between(i, '[', ']')) for i in tag_blk_str]
105 |     count = [toint(find_between(i)) for i in tag_blk_str]
106 |     return name, count
107 | 
108 | 
109 | def taginfo(link, lim_num_tags=None, return_sort=True, print_page_count=False):
110 |     """ Get information about Stack Overflow and all Stack Exchange sites users'
111 |     tags (tags and corresponding tag points scored).
112 |     This could be directly used with wordcloud module for generating tag cloud.
113 | 
114 |     Parameters
115 |     ----------
116 | 
117 |     link : int or string
118 |         For int case, we are assuming Stack Overflow site users. So, input as
119 |         int is the user ID of the Stack Overflow user whose information is to
120 |         be extracted. For string case, it's the complete user profile link to
121 |         the Stack Exchange site and is intended to cover all Stack Exchange
122 |         sites' profiles.
123 | 
124 |         A. As an example for int case, consider Stack Overflow's top
125 |         reputation user Jon Skeet. His Stack Overflow link is -
126 |         "http://stackoverflow.com/users/22656/jon-skeet".
127 |         So, we would have - link = 22656.
128 | 
129 |         B. As another example for string case, one can generate Jon's
130 |         meta.stackexchange tag cloud with -
131 |         link = "http://meta.stackexchange.com/users/22656/jon-skeet" or
132 |         link = "http://meta.stackexchange.com/users/22656".
133 | 
134 |     lim_num_tags : int (default=None)
135 |         Number of tags to be tracked. Default is None, which tracks all tags
136 |         possible.
137 | 
138 |     return_sort : bool (default=True)
139 |         Whether to sort the output tags by their counts. WordCloud sorts
140 |         them internally anyway, so sorting could be skipped for performance.
141 |         Note : this flag is currently unused and tags are always sorted.
142 | 
143 |     print_page_count : bool (default=False)
144 |         Print per page progress on processing data.
145 | 
146 |     Output
147 |     ------
148 |     Output is a dictionary with tag names as keys and tag counts as values.
149 |     """
150 | 
151 |     # Get start link (profile page's tag link)
152 |     if not str(link).isdigit():
153 |         start_link = link + "?tab=tags&sort=votes&page="
154 |     else:
155 |         start_link = "http://stackoverflow.com/users/" + str(link) + \
156 |                                                 "?tab=tags&sort=votes&page="
157 | 
158 |     tag_name = []
159 |     tag_count = []
160 | 
161 |     if print_page_count:
162 |         print("Processing page : 1/NA")        
163 | 
164 |     info1 = info_mainpage(start_link + '1')
165 |     num_tags = info1[0]['tags']
166 |     tag_name.append(info1[1])
167 |     tag_count.append(info1[2])
168 |     tags_per_page = len(info1[1])
169 |     
170 |     if lim_num_tags is None:
171 |         num_tags = info1[0]['tags']
172 |     else:
173 |         num_tags = min(lim_num_tags, info1[0]['tags'])
174 |     num_pages = int(np.ceil(num_tags/float(tags_per_page)))
175 | 
176 |     print('tags_per_page : '+str(tags_per_page))
177 |     print('num_tags : '+str(num_tags))
178 |     print('num_pages : '+str(num_pages))
179 | 
180 |     if num_pages > 1:
181 |         num_pages = int(np.ceil(num_tags/float(tags_per_page)))
182 |         for page_id in range(2, num_pages+1):
183 |             if print_page_count:
184 |                 print("Processing page : " + str(page_id) + "/" + str(num_pages))
185 | 
186 |             url = start_link + str(page_id)
187 |             page_tag_name, page_tag_count = stackoverflow_taginfo(url)
188 |             tag_name.append(page_tag_name)
189 |             tag_count.append(page_tag_count)
190 | 
191 |     info0 = list(zip(itertools.chain(*tag_name), itertools.chain(*tag_count)))
192 |     sorted_indx = np.argsort([item[1] for item in info0])[::-1]
193 |     info = [info0[idx] for idx in sorted_indx][:lim_num_tags]
194 | 
195 |     # For a case when all tag counts are zeros, it would throw error.
196 |     # So, for such a case, escape it by setting all counts to "1".
197 |     dict_info = dict(info)
198 |     if info and info[0][1] == 0:
199 |         dict_info = dict.fromkeys(dict_info, 1)
200 | 
201 |     return dict_info
202 | 
203 | def draw_taginfo(info, 
204 |                  image_dims, 
205 |                  out_filepath,
206 |                  skip_tags=(),
207 |                  font_path="fonts/ShortStack-Regular.ttf",
208 |                  ):
209 |     
210 |     W, H = image_dims    # Wordcloud image size (width, height)
211 |     for sk in skip_tags:
212 |         info.pop(sk, None)    # ignore tags that are not present
213 |                   
214 |     if info is None:
215 |         print("Error : No webpage found!")
216 |     else:
217 |         if len(info) == 0:
218 |             print("Error : No tags found!")
219 |         else:         # Successfully extracted tag info
220 |             WC = WordCloud(font_path=font_path, width=W, height=H,
221 |                            max_words=len(info)).generate_from_frequencies(info)
222 |             WC.to_image().save(out_filepath)
223 |             print("Tag Cloud Saved as " + out_filepath)
224 |             
225 |     return
226 | 
227 | def tag_cloud(link=22656, 
228 |               lim_num_tags=200, 
229 |               image_dims=(400, 200),
230 |               skip_tags=(),
231 |               out_filepath="TagCloud.png",
232 |               ):
233 |     """ Generate tag cloud and save it as an image.
234 | 
235 |     Parameters
236 |     ----------
237 |     link : same as used for the function taginfo.
238 | 
239 |     lim_num_tags : same as used for the function taginfo.
240 | 
241 |     image_dims : tuple of two elements.
242 |         Image dimensions of the tag cloud image to be saved.
243 | 
244 |     out_filepath : string
245 |         Output image filepath.
246 | 
247 |     Output
248 |     ------
249 |     None
250 |     """
251 | 
252 |     info = taginfo(link=link, lim_num_tags=lim_num_tags)    
253 |     draw_taginfo(info, image_dims=image_dims, out_filepath=out_filepath, skip_tags = skip_tags)
254 |     return
255 | 


--------------------------------------------------------------------------------