├── LICENSE
├── README.md
├── example_extensive.py
├── example_extensive2.py
├── example_minimal.py
├── example_output
    ├── 8K.png
    ├── example_extensive2_output.png
    ├── example_extensive_output.png
    └── example_minimal_output.png
├── fonts
    ├── ShortStack-Regular.ttf
    ├── Sketch Serif.ttf
    ├── chint___.ttf
    └── readme.txt
├── requirements.txt
└── stackoverflow_users_taginfo.py


/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Divakar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# Stack Overflow (Stack Exchange) Users Tag Cloud
[**Stack Exchange**](http://stackexchange.com/) is a network of `Q&A` websites covering various fields. Users earn reputation and score tag points based on the tags of the questions and answers they participate in. Each user has a tag section on their profile page that lists the tag names and their respective scores. The Python scripts in this repository parse and extract the tag names and scores, which can then be fed to the [wordcloud module for Python](https://github.com/amueller/word_cloud) to produce a word cloud image in which the tags are the words and their sizes are proportional to their scores. The scripts can extract this information from [**all Stack Exchange Q&A sites**](http://stackexchange.com/sites), including, of course, its biggest `Q&A` site, [**Stack Overflow**](http://stackoverflow.com/).

## Examples

Let's take Stack Overflow's highest-reputation user, [Jon Skeet](http://stackoverflow.com/users/22656/jon-skeet), as the sample. His profile page link contains the ID `22656`.
So, a minimal Python script to generate his tag cloud would be:

```python
>>> from stackoverflow_users_taginfo import tag_cloud
>>> tag_cloud(link=22656)
```

With more options, here's a tag cloud of the first **`1000`** tags on a `4K canvas`, produced using [example_extensive.py](https://github.com/droyed/stackoverflow_tag_cloud/blob/master/example_extensive.py):

![Screenshot](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/example_extensive_output.png)

As a demo of extracting tag information and generating a tag cloud from other Q&A sites, here's a tag cloud of [Jon Skeet's meta.stackexchange profile](http://meta.stackexchange.com/users/22656) generated with [example_extensive2.py](https://github.com/droyed/stackoverflow_tag_cloud/blob/master/example_extensive2.py):

![Screenshot](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/example_extensive2_output.png)

We are living in the `8K` age, so here are [Jon Skeet's `1000` tags on an `8K` canvas](https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/master/example_output/8K.png)!

## Requirements
* Python 2.x or 3.x.
* Python modules: NumPy, Requests, and urllib (part of the standard library).
* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) - To extract HTML information. Works with version 4.4.1; older versions might work too, but are not tested.
* [lxml](http://lxml.de/) - Parser backend that the scripts pass to BeautifulSoup.
* [Word_cloud](https://github.com/amueller/word_cloud) - Word cloud creation. Version 1.3.1 or newer.
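
## How `link` is interpreted

The examples above pass either a numeric user ID or a full profile URL as `link`. The sketch below mirrors how `taginfo` in `stackoverflow_users_taginfo.py` appears to turn that argument into the URL of a tag page. The helper name `tags_page_url` and its `page` parameter are hypothetical and are not part of the module; they only illustrate the rule.

```python
def tags_page_url(link, page=1):
    """Build the URL of one tag page for a user (illustrative helper only)."""
    query = "?tab=tags&sort=votes&page=" + str(page)
    # A value containing ".com" is treated as a complete profile URL,
    # which is the same check that taginfo uses.
    if ".com" in str(link):
        return str(link) + query
    # Anything else is assumed to be a Stack Overflow user ID.
    return "http://stackoverflow.com/users/" + str(link) + query


print(tags_page_url(22656))
print(tags_page_url("http://meta.stackexchange.com/users/22656", page=2))
```

Because the check looks for `.com`, a profile URL on a site with a different domain ending would be treated as a user ID under this rule.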

--------------------------------------------------------------------------------
/example_extensive.py:
--------------------------------------------------------------------------------
from stackoverflow_users_taginfo import tag_cloud

tag_cloud(link=22656, lim_num_tags=1000, image_dims=(3840, 2160),
          out_filepath="example_extensive_output.png")

--------------------------------------------------------------------------------
/example_extensive2.py:
--------------------------------------------------------------------------------
from stackoverflow_users_taginfo import tag_cloud

tag_cloud(link="http://meta.stackexchange.com/users/22656",
          image_dims=(1920, 1200),
          out_filepath="example_extensive2_output.png")

--------------------------------------------------------------------------------
/example_minimal.py:
--------------------------------------------------------------------------------
from stackoverflow_users_taginfo import tag_cloud

tag_cloud(link=22656)

--------------------------------------------------------------------------------
/example_output/8K.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/8K.png

--------------------------------------------------------------------------------
/example_output/example_extensive2_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_extensive2_output.png

--------------------------------------------------------------------------------
/example_output/example_extensive_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_extensive_output.png

--------------------------------------------------------------------------------
/example_output/example_minimal_output.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/example_output/example_minimal_output.png

--------------------------------------------------------------------------------
/fonts/ShortStack-Regular.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/ShortStack-Regular.ttf

--------------------------------------------------------------------------------
/fonts/Sketch Serif.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/Sketch Serif.ttf

--------------------------------------------------------------------------------
/fonts/chint___.ttf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/droyed/stackoverflow_tag_cloud/fc7ee540d937f5efa215abd46e5f4d1d73eb1f66/fonts/chint___.ttf

--------------------------------------------------------------------------------
/fonts/readme.txt:
--------------------------------------------------------------------------------
Please note that the fonts were obtained from:

http://www.dafont.com/chinacat.font
http://www.dafont.com/sketch-serif.font
http://www.dafont.com/short-stack.font

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
beautifulsoup4==4.10.0
numpy==1.20.3
requests==2.22.0
wordcloud==1.8.1

--------------------------------------------------------------------------------
/stackoverflow_users_taginfo.py:
--------------------------------------------------------------------------------
from bs4 import BeautifulSoup
import numpy as np
import requests
import itertools
import sys
from wordcloud import WordCloud


def find_between(in_str, start='>', end='<'):
    """ Find string between two search patterns.
    """
    return in_str.split(start)[1].split(end)[0]


def unquote_str(in_str):
    """ Decode URL strings. For example, replace %2b with plus sign and so on.
    """
    if (sys.version_info >= (3, 0)):
        from urllib.parse import unquote
    else:
        from urllib import unquote
    return unquote(in_str)


def toint(a):
    """ Convert a string to int, scaling it accordingly if it ends in
    "k", "m" or "b".
    """
    weights = {'k': 1000, 'm': 1000000, 'b': 1000000000}
    # Drop surrounding whitespace and thousands separators, e.g. "1,234"
    a = a.strip().replace(',', '')
    suffix = a[-1:].lower()
    if len(a) > 1 and suffix in weights:
        # float() also accepts abbreviated values with a decimal part ("1.2k")
        return int(round(float(a[:-1]) * weights[suffix]))
    return int(a)


def info_mainpage(url):
    """ Given the main tag page, this function gets basic information about tag
    pages and tag names and their scores as well.
    On the basic info, there are three numbers scraped :
        1. Number of tag pages
        2. Total number of tags
        3. Number of tags per page

    Parameters
    ----------
    url : string
        URL link to user's main tag page.

    Output
    ------
    pginfo : dict
        Dictionary that holds the three page info as listed earlier.

    name : list of strings
        Holds the tag names

    count : list of ints
        Holds the tag scores
    """

    soup = BeautifulSoup(requests.get(url).text, "lxml")

    pg_blk = soup.find_all("div", class_="pager fr")
    tag_blk = soup.find_all("div", class_="answer-votes")
    tag_blk_str = [str(i) for i in tag_blk]
    str0 = find_between(str(soup.find_all("span", class_="count")))
    lim_ntags = int(str0.replace(',', ''))

    max_tags = None
    num_pages = 1
    if len(pg_blk) != 0:
        max_tags = len(tag_blk)
        last_page_blk = pg_blk[0].find_all("span", class_="page-numbers")[-2]
        num_pages = int(find_between(str(last_page_blk)))
    pginfo = {'pages': num_pages, 'tags': lim_ntags, 'tags_perpage': max_tags}

    name = [unquote_str(find_between(i, '[', ']')) for i in tag_blk_str]
    count = [toint(find_between(i)) for i in tag_blk_str]
    return pginfo, name, count


def stackoverflow_taginfo(url):
    """ Get information about a user's tags from their Stack Overflow
    tag pages fed as the input URL. Mainly two pieces of information are
    scraped : tag names and their respective counts/scores.

    Parameters
    ----------
    url : string
        URL link to one of the user's tag pages.

    Output
    ------
    name : list of strings
        Holds the tag names

    count : list of ints
        Holds the tag scores
    """

    soup = BeautifulSoup(requests.get(url).text, "lxml")
    tag_blk = soup.find_all("div", class_="answer-votes")
    tag_blk_str = [str(i) for i in tag_blk]
    name = [unquote_str(find_between(i, '[', ']')) for i in tag_blk_str]
    count = [toint(find_between(i)) for i in tag_blk_str]
    return name, count


def taginfo(link, lim_num_tags=None, return_sort=True, print_page_count=False):
    """ Get information about the tags of users on Stack Overflow and all other
    Stack Exchange sites (tags and corresponding tag points scored).
    This could be directly used with wordcloud module for generating tag cloud.

    Parameters
    ----------

    link : int or string
        For the int case, a Stack Overflow user is assumed, and the input
        is the user ID of the Stack Overflow user whose information is to be
        extracted. For the string case, it's the complete user profile link
        on a Stack Exchange site, which is intended to cover profiles on all
        Stack Exchange sites.

        A. As an example for the int case, consider Stack Overflow's top
        reputation user Jon Skeet. His Stack Overflow link is -
        "http://stackoverflow.com/users/22656/jon-skeet".
        So, we would have - link = 22656.

        B. As another example for the string case, one can generate Jon's
        meta.stackexchange tag cloud with -
        link = "http://meta.stackexchange.com/users/22656/jon-skeet" or
        link = "http://meta.stackexchange.com/users/22656".

    lim_num_tags : int (default=None)
        Number of tags to be tracked. Default is None, which tracks all tags
        possible.

    return_sort : bool (default=True)
        This boolean flag decides whether the output has the tags
        sorted by their counts.
        Since the WordCloud module sorts them internally anyway, one can
        turn it off for performance.

    print_page_count : bool (default=False)
        Print per page progress on processing data.

    Output
    ------
    Output is a dictionary with tag names as keys and tag counts as values.
    """

    # Get start link (profile page's tag link)
    if '.com' in str(link):
        start_link = link + "?tab=tags&sort=votes&page="
    else:
        start_link = "http://stackoverflow.com/users/" + str(link) + \
            "?tab=tags&sort=votes&page="

    tag_name = []
    tag_count = []

    if print_page_count:
        print("Processing page : 1/NA")

    info1 = info_mainpage(start_link + '1')
    tag_name.append(info1[1])
    tag_count.append(info1[2])
    tags_per_page = len(info1[1])

    # No tags on the first page means there is nothing to collect.
    if tags_per_page == 0:
        return {}

    if lim_num_tags is None:
        num_tags = info1[0]['tags']
    else:
        num_tags = min(lim_num_tags, info1[0]['tags'])
    # float() keeps this a true division on Python 2 as well.
    num_pages = int(np.ceil(num_tags/float(tags_per_page)))

    if print_page_count:
        print('tags_per_page : '+str(tags_per_page))
        print('num_tags : '+str(num_tags))
        print('num_pages : '+str(num_pages))

    # The page count is derived from num_tags, so this also works when
    # lim_num_tags is None or larger than the number of available tags.
    for page_id in range(2, num_pages+1):
        if print_page_count:
            print("Processing page : " + str(page_id) + "/" + str(num_pages))

        url = start_link + str(page_id)
        page_tag_name, page_tag_count = stackoverflow_taginfo(url)
        tag_name.append(page_tag_name)
        tag_count.append(page_tag_count)

    info = list(zip(itertools.chain(*tag_name), itertools.chain(*tag_count)))
    if return_sort:
        info.sort(key=lambda item: item[1], reverse=True)
    info = info[:lim_num_tags]

    # For a case when all tag counts are zeros, it would throw error.
    # So, for such a case, escape it by setting all counts to "1".
    dict_info = dict(info)
    if dict_info and max(dict_info.values()) == 0:
        dict_info = dict.fromkeys(dict_info, 1)

    return dict_info


def draw_taginfo(info,
                 image_dims,
                 out_filepath,
                 skip_tags=(),
                 font_path="fonts/ShortStack-Regular.ttf",
                 ):
    """ Draw the tag cloud for the given tag info and save it as an image.
    """

    W, H = image_dims  # Wordcloud image size (width, height)

    if info is None:
        print("Error : No webpage found!")
        return

    # Drop unwanted tags; names that are not present are ignored.
    for sk in skip_tags:
        info.pop(sk, None)

    if len(info) == 0:
        print("Error : No tags found!")
    else:  # Successfully extracted tag info
        WC = WordCloud(font_path=font_path, width=W, height=H,
                       max_words=len(info)).generate_from_frequencies(info)
        WC.to_image().save(out_filepath)
        print("Tag Cloud Saved as " + out_filepath)


def tag_cloud(link=22656,
              lim_num_tags=200,
              image_dims=(400, 200),
              skip_tags=(),
              out_filepath="TagCloud.png",
              ):
    """ Generate tag cloud and save it as an image.

    Parameters
    ----------
    link : same as used for the function taginfo.

    lim_num_tags : same as used for the function taginfo.

    image_dims : tuple of two elements.
        Image dimensions (width, height) of the tag cloud image to be saved.

    skip_tags : list of strings
        Tag names to leave out of the tag cloud.

    out_filepath : string
        Output image filepath.

    Output
    ------
    None
    """

    info = taginfo(link=link, lim_num_tags=lim_num_tags)
    draw_taginfo(info, image_dims=image_dims, out_filepath=out_filepath,
                 skip_tags=skip_tags)

--------------------------------------------------------------------------------