├── LICENSE ├── README.md ├── app.py ├── repo.py ├── scrap_test.ipynb ├── sshot1.png └── topics.py /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Github Crawler 2 | 3 | Github Crawler is a simple tool that scrapes data from GitHub. 4 | Currently, it only scrapes data from the Topics section on GitHub. 5 | ## Features 6 | 7 | - Scrape any amount of data 8 | - Data can be downloaded as CSV 9 | - All links are provided in the dataset 10 | - Useful for statistical analysis 11 | 12 | 13 | ## Screenshots 14 | 15 | ![Image link](https://github.com/abhroroy365/githubcrawler/blob/main/sshot1.png) 16 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import pandas as pd 3 | import numpy as np 4 | from repo import scrap_repo 5 | from topics import scrap_topics 6 | 7 | st.markdown("

<h1 style='text-align: center;'>Github Crawler</h1>

", unsafe_allow_html=True) 8 | 9 | # def make_clickable(link): 10 | # # target _blank to open new window 11 | # # extract clickable text to display for your link 12 | # text = link.split('=')[0] 13 | # return f'{text}' 14 | 15 | @st.cache 16 | def convert_df(df): 17 | return df.to_csv().encode('utf-8') 18 | 19 | def display_repos(x,urls,topic): 20 | repo_df = scrap_repo(urls[x]) 21 | # download button for repositories data 22 | csv = convert_df(repo_df) 23 | st.download_button( 24 | label="Save {} repositories as CSV".format(topic[x]), 25 | data=csv, 26 | file_name=str(topic[x])+'.csv', 27 | mime='text/csv', 28 | ) 29 | # display repositories 30 | # repo_df['User Link'] = repo_df['User Link'].apply(make_clickable) 31 | # repo_df['Repository Link'] = repo_df['Repository Link'].apply(make_clickable) 32 | st.dataframe(repo_df) 33 | 34 | 35 | def display_topics(): 36 | number = 30 37 | with st.sidebar: 38 | agree = st.checkbox('Show more topics') 39 | if agree: 40 | number = st.radio( 41 | "Choose number of topics :", 42 | (30, 60, 90, 120, 150, 180) 43 | ) 44 | df= scrap_topics(number) 45 | # download button for topic data 46 | csv = convert_df(df) 47 | st.download_button( 48 | label="Download topics as CSV", 49 | data=csv, 50 | file_name='trending_github_topics.csv', 51 | mime='text/csv', 52 | ) 53 | user_table = df 54 | user_table['Repositories'] ='' 55 | col = st.columns((1, 2, 2, 1, 1)) 56 | st.write('Showing top {} topics out of 180 trending topics'.format(number)) 57 | for x, topic in enumerate(user_table['topic']): 58 | col1, col2, col3, col4, col5 = st.columns((1, 2, 2, 1, 1)) 59 | col1.write(x+1) # index 60 | col2.write(user_table['topic'][x]) 61 | col3.write(user_table['description'][x]) 62 | col4.write(user_table['url'][x]) 63 | show_repo = user_table['Repositories'][x] # flexible type of button 64 | button_type = "Show Repositories" 65 | button_phold = col5.empty() # create a placeholder 66 | do_action = button_phold.button(button_type, key=x) 67 | if 
do_action: 68 | button_phold.empty() 69 | with st.spinner("Loading..."): 70 | display_repos(x,user_table['url'],user_table['topic']) 71 | 72 | st.markdown("
<p style='text-align: center;'>Developed by Abhra Ray Chaudhuri</p>
", unsafe_allow_html=True) 73 | 74 | 75 | def main(): 76 | st.markdown("

<h3 style='text-align: center;'>What are the trending GitHub topics currently?</h3>


", unsafe_allow_html=True) 77 | display_topics() 78 | 79 | if __name__ == '__main__': 80 | main() -------------------------------------------------------------------------------- /repo.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import requests 3 | from bs4 import BeautifulSoup 4 | 5 | def scrap_repo(url): 6 | base_url = 'https://github.com' 7 | response= requests.get(url) 8 | topic_page = response.text 9 | topic_doc = BeautifulSoup(topic_page, 'html.parser') 10 | repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed') 11 | username = [] 12 | repository = [] 13 | user_link = [] 14 | repo_link = [] 15 | for tag in repo_tag: 16 | user_link.append(base_url+tag.find_all('a')[0]['href'].strip()) 17 | repo_link.append(base_url+tag.find_all('a')[1]['href'].strip()) 18 | username.append(tag.find_all('a')[0].text.strip()) 19 | repository.append(tag.find_all('a')[1].text.strip()) 20 | star_tag = topic_doc.find_all('span',{'id' : "repo-stars-counter-star"}) 21 | stars = [] 22 | for tag in star_tag: 23 | star = tag['title'] 24 | star = int(star.replace(',','')) 25 | stars.append(star) 26 | dict = { 27 | 'Username':username, 28 | 'User Link':user_link, 29 | 'Repository': repository, 30 | 'Repository Link':repo_link, 31 | 'Stars': stars 32 | } 33 | df = pd.DataFrame(dict) 34 | return df 35 | -------------------------------------------------------------------------------- /scrap_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Scraping a web page" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# 1.Picking a website and describing the objective" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "- Identify the information you'd like to scrape from 
the site. Decide the format of the output CSV file.\n", 22 | "- Summarize your project idea and outline your strategy in a Juptyer notebook." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "site = \"https://github.com/topics\"" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "# 2.Use the requests library to download web pages" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "- Inspect the website's HTML source and identify the right URLs to download.\n", 46 | "- Download and save web pages locally using the requests library.\n", 47 | "- Create a function to automate downloading for different topics/search queries." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import requests" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "page = requests.get(site)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "200" 77 | ] 78 | }, 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "page.status_code" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "page_content = page.text" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 6, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "'\\n\\n\\n\\n \\n \\n \\n \\n \\n \\n \\n \\n\\n\\n\\n 3D

,\n", 189 | "

Ajax

,\n", 190 | "

Algorithm

,\n", 191 | "

Amp

,\n", 192 | "

Android

]" 193 | ] 194 | }, 195 | "execution_count": 11, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "topic_title_tag[:5]" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 12, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "topic_titles = []\n", 211 | "for tag in topic_title_tag:\n", 212 | " topic_titles.append(tag.text)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 13, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']" 224 | ] 225 | }, 226 | "execution_count": 13, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "topic_titles[:5]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "- getting topic description" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 14, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "selection_class = \"f5 color-fg-muted mb-0 mt-1\"\n", 249 | "topic_desc_tag = doc.find_all('p',class_ = selection_class)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 15, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "topic_descs = []\n", 259 | "for desc in topic_desc_tag:\n", 260 | " topic_descs.append(desc.text.strip())" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 16, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "['3D modeling is the process of virtually developing the surface and structure of a 3D object.',\n", 272 | " 'Ajax is a technique for creating interactive web applications.',\n", 273 | " 'Algorithms are self-contained sequences that carry out a variety of tasks.',\n", 274 | " 'Amp is a non-blocking concurrency library for PHP.',\n", 275 | " 'Android is 
an operating system built by Google designed for mobile devices.']" 276 | ] 277 | }, 278 | "execution_count": 16, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "topic_descs[:5]" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "- getting topic url" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 17, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "selection_class = \"no-underline flex-1 d-flex flex-column\"\n", 301 | "topic_url_tag = doc.find_all('a',class_ = selection_class)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 18, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "base_url = 'https://github.com'\n", 311 | "topic_url_list = []\n", 312 | "for i in range(len(topic_url_tag)):\n", 313 | " href = topic_url_tag[i]['href']\n", 314 | " topic_url = base_url + href\n", 315 | " topic_url_list.append(topic_url)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 19, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "['https://github.com/topics/3d',\n", 327 | " 'https://github.com/topics/ajax',\n", 328 | " 'https://github.com/topics/algorithm',\n", 329 | " 'https://github.com/topics/amphp',\n", 330 | " 'https://github.com/topics/android',\n", 331 | " 'https://github.com/topics/angular',\n", 332 | " 'https://github.com/topics/ansible',\n", 333 | " 'https://github.com/topics/api',\n", 334 | " 'https://github.com/topics/arduino',\n", 335 | " 'https://github.com/topics/aspnet']" 336 | ] 337 | }, 338 | "execution_count": 19, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "topic_url_list[:10]" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "# 4.Create CSV file(s) with the extracted information\n", 
352 | "\n" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.\n", 360 | "- Execute the function with different inputs to create a dataset of CSV files.\n", 361 | "- Verify the information in the CSV files by reading them back using Pandas." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 20, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "import pandas as pd" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 21, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "from operator import index\n", 380 | "\n", 381 | "\n", 382 | "dict = {\n", 383 | " 'topic':topic_titles,\n", 384 | " 'decription': topic_descs,\n", 385 | " 'url' : topic_url_list\n", 386 | "}\n", 387 | "topic_df = pd.DataFrame(dict,index=None)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 22, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/html": [ 398 | "
\n", 399 | "\n", 412 | "\n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | "
topicdecriptionurl
03D3D modeling is the process of virtually develo...https://github.com/topics/3d
1AjaxAjax is a technique for creating interactive w...https://github.com/topics/ajax
2AlgorithmAlgorithms are self-contained sequences that c...https://github.com/topics/algorithm
3AmpAmp is a non-blocking concurrency library for ...https://github.com/topics/amphp
4AndroidAndroid is an operating system built by Google...https://github.com/topics/android
5AngularAngular is an open source web application plat...https://github.com/topics/angular
6AnsibleAnsible is a simple and powerful automation en...https://github.com/topics/ansible
7APIAn API (Application Programming Interface) is ...https://github.com/topics/api
8ArduinoArduino is an open source hardware and softwar...https://github.com/topics/arduino
9ASP.NETASP.NET is a web framework for building modern...https://github.com/topics/aspnet
\n", 484 | "
" 485 | ], 486 | "text/plain": [ 487 | " topic decription \\\n", 488 | "0 3D 3D modeling is the process of virtually develo... \n", 489 | "1 Ajax Ajax is a technique for creating interactive w... \n", 490 | "2 Algorithm Algorithms are self-contained sequences that c... \n", 491 | "3 Amp Amp is a non-blocking concurrency library for ... \n", 492 | "4 Android Android is an operating system built by Google... \n", 493 | "5 Angular Angular is an open source web application plat... \n", 494 | "6 Ansible Ansible is a simple and powerful automation en... \n", 495 | "7 API An API (Application Programming Interface) is ... \n", 496 | "8 Arduino Arduino is an open source hardware and softwar... \n", 497 | "9 ASP.NET ASP.NET is a web framework for building modern... \n", 498 | "\n", 499 | " url \n", 500 | "0 https://github.com/topics/3d \n", 501 | "1 https://github.com/topics/ajax \n", 502 | "2 https://github.com/topics/algorithm \n", 503 | "3 https://github.com/topics/amphp \n", 504 | "4 https://github.com/topics/android \n", 505 | "5 https://github.com/topics/angular \n", 506 | "6 https://github.com/topics/ansible \n", 507 | "7 https://github.com/topics/api \n", 508 | "8 https://github.com/topics/arduino \n", 509 | "9 https://github.com/topics/aspnet " 510 | ] 511 | }, 512 | "execution_count": 22, 513 | "metadata": {}, 514 | "output_type": "execute_result" 515 | } 516 | ], 517 | "source": [ 518 | "topic_df[:10]" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 23, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "topic_df.to_csv('topics.csv',index=None)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "# Getting information from a topic page" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 24, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "'https://github.com/topics/3d'" 546 | ] 547 | }, 548 | 
"execution_count": 24, 549 | "metadata": {}, 550 | "output_type": "execute_result" 551 | } 552 | ], 553 | "source": [ 554 | "topic_url_list[0]" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 25, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "response= requests.get(topic_url_list[0])" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 26, 569 | "metadata": {}, 570 | "outputs": [], 571 | "source": [ 572 | "topic_page = response.text" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 27, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "topic_doc = BeautifulSoup(topic_page, 'html.parser')" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 28, 587 | "metadata": {}, 588 | "outputs": [ 589 | { 590 | "data": { 591 | "text/plain": [ 592 | "[\n", 593 | " mrdoob\n", 594 | " ,\n", 595 | " \n", 596 | " three.js\n", 597 | " ]" 598 | ] 599 | }, 600 | "execution_count": 28, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed')\n", 607 | "repo_tag[0].find_all('a')" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 29, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "username = []\n", 617 | "repository = []\n", 618 | "user_link = []\n", 619 | "repo_link = []\n", 620 | "for tag in repo_tag:\n", 621 | " user_link.append(base_url+tag.find_all('a')[0]['href'].strip())\n", 622 | " repo_link.append(base_url+tag.find_all('a')[1]['href'].strip())\n", 623 | " username.append(tag.find_all('a')[0].text.strip())\n", 624 | " repository.append(tag.find_all('a')[1].text.strip())" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 30, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "star_tag = topic_doc.find_all('span',{'id' : 
\"repo-stars-counter-star\"})" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 31, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "stars = []\n", 643 | "for tag in star_tag:\n", 644 | " star = tag['title']\n", 645 | " star = int(star.replace(',',''))\n", 646 | " stars.append(star)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 32, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "dict = {\n", 656 | " 'Username':username,\n", 657 | " 'User Link':user_link,\n", 658 | " 'Repository': repository,\n", 659 | " 'Repo Link':repo_link,\n", 660 | " 'Stars': stars\n", 661 | "}\n", 662 | "D3_repo_df = pd.DataFrame(dict)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 33, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/html": [ 673 | "
\n", 674 | "\n", 687 | "\n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | "
UsernameUser LinkRepositoryRepo LinkStars
0mrdoobhttps://github.com/mrdoobthree.jshttps://github.com/mrdoob/three.js83130
1libgdxhttps://github.com/libgdxlibgdxhttps://github.com/libgdx/libgdx20147
2pmndrshttps://github.com/pmndrsreact-three-fiberhttps://github.com/pmndrs/react-three-fiber18517
3BabylonJShttps://github.com/BabylonJSBabylon.jshttps://github.com/BabylonJS/Babylon.js17680
4aframevrhttps://github.com/aframevraframehttps://github.com/aframevr/aframe14300
\n", 741 | "
" 742 | ], 743 | "text/plain": [ 744 | " Username User Link Repository \\\n", 745 | "0 mrdoob https://github.com/mrdoob three.js \n", 746 | "1 libgdx https://github.com/libgdx libgdx \n", 747 | "2 pmndrs https://github.com/pmndrs react-three-fiber \n", 748 | "3 BabylonJS https://github.com/BabylonJS Babylon.js \n", 749 | "4 aframevr https://github.com/aframevr aframe \n", 750 | "\n", 751 | " Repo Link Stars \n", 752 | "0 https://github.com/mrdoob/three.js 83130 \n", 753 | "1 https://github.com/libgdx/libgdx 20147 \n", 754 | "2 https://github.com/pmndrs/react-three-fiber 18517 \n", 755 | "3 https://github.com/BabylonJS/Babylon.js 17680 \n", 756 | "4 https://github.com/aframevr/aframe 14300 " 757 | ] 758 | }, 759 | "execution_count": 33, 760 | "metadata": {}, 761 | "output_type": "execute_result" 762 | } 763 | ], 764 | "source": [ 765 | "D3_repo_df[:5]" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "# Summarizing the tasks" 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": 42, 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [ 781 | "def scrap_topics_repos():\n", 782 | " i=0\n", 783 | " for url in topic_url_list:\n", 784 | " response= requests.get(url)\n", 785 | " topic_page = response.text\n", 786 | " topic_doc = BeautifulSoup(topic_page, 'html.parser')\n", 787 | " repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed')\n", 788 | " username = []\n", 789 | " repository = []\n", 790 | " user_link = []\n", 791 | " repo_link = []\n", 792 | " for tag in repo_tag:\n", 793 | " user_link.append(base_url+tag.find_all('a')[0]['href'].strip())\n", 794 | " repo_link.append(base_url+tag.find_all('a')[1]['href'].strip())\n", 795 | " username.append(tag.find_all('a')[0].text.strip())\n", 796 | " repository.append(tag.find_all('a')[1].text.strip())\n", 797 | " star_tag = topic_doc.find_all('span',{'id' : \"repo-stars-counter-star\"})\n", 798 | " stars = 
[]\n", 799 | " for tag in star_tag:\n", 800 | " star = tag['title']\n", 801 | " star = int(star.replace(',',''))\n", 802 | " stars.append(star)\n", 803 | " dict = {\n", 804 | " 'Username':username,\n", 805 | " 'User Link':user_link,\n", 806 | " 'Repository': repository,\n", 807 | " 'Repo Link':repo_link,\n", 808 | " 'Stars': stars\n", 809 | " }\n", 810 | " df = pd.DataFrame(dict)\n", 811 | " filename = topic_titles[i] +'.csv'\n", 812 | " df.to_csv(filename,index=None)\n", 813 | " i+=1\n", 814 | " " 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 43, 820 | "metadata": {}, 821 | "outputs": [], 822 | "source": [ 823 | "scrap_topics_repos()" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "# 5.Document and share your work" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "- Add proper headings and documentation in your Jupyter notebook.\n", 838 | "- Publish your Jupyter notebook to your Jovian profile\n", 839 | "- (Optional) Write a blog post about your project and share it online." 
840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": null, 852 | "metadata": {}, 853 | "outputs": [], 854 | "source": [] 855 | } 856 | ], 857 | "metadata": { 858 | "kernelspec": { 859 | "display_name": "Python 3.9.7 64-bit", 860 | "language": "python", 861 | "name": "python3" 862 | }, 863 | "language_info": { 864 | "codemirror_mode": { 865 | "name": "ipython", 866 | "version": 3 867 | }, 868 | "file_extension": ".py", 869 | "mimetype": "text/x-python", 870 | "name": "python", 871 | "nbconvert_exporter": "python", 872 | "pygments_lexer": "ipython3", 873 | "version": "3.9.7" 874 | }, 875 | "orig_nbformat": 4, 876 | "vscode": { 877 | "interpreter": { 878 | "hash": "5f988359990852b9f05d733faad7e7ee387134e25f4a1ce8f641bacdeab60818" 879 | } 880 | } 881 | }, 882 | "nbformat": 4, 883 | "nbformat_minor": 2 884 | } 885 | -------------------------------------------------------------------------------- /sshot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/abhroroy365/githubcrawler/99d1c99e48a32189d63dd5f52a6fa6d03637c69f/sshot1.png -------------------------------------------------------------------------------- /topics.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import requests 3 | from bs4 import BeautifulSoup 4 | 5 | def scrap_topics(number): 6 | base_site = 'https://github.com/topics' 7 | total = int(number/30) 8 | topic_titles = [] 9 | topic_descs = [] 10 | topic_url_list = [] 11 | for i in range(total): 12 | page = '?page='+str(i+1) 13 | site = base_site+page 14 | req = requests.get(site) 15 | page_content = req.text 16 | doc = BeautifulSoup(page_content, "html.parser") 17 | # getting topic title 18 | selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary" 19 | 
topic_title_tag = doc.find_all('p', class_=selection_class) 20 | for tag in topic_title_tag: 21 | topic_titles.append(tag.text.strip()) 22 | # getting topic description 23 | selection_class = "f5 color-fg-muted mb-0 mt-1" 24 | topic_desc_tag = doc.find_all('p', class_=selection_class) 25 | for desc in topic_desc_tag: 26 | topic_descs.append(desc.text.strip()) 27 | 28 | # getting topic links 29 | selection_class = "no-underline flex-1 d-flex flex-column" 30 | topic_url_tag = doc.find_all('a', class_=selection_class) 31 | base_url = 'https://github.com' 32 | for tag in topic_url_tag: 33 | href = tag['href'] 34 | topic_url = base_url + href 35 | topic_url_list.append(topic_url) 36 | 37 | # creating dataframe 38 | topic_dict = { 39 | 'topic': topic_titles, 40 | 'description': topic_descs, 41 | 'url': topic_url_list, 42 | } 43 | topic_df = pd.DataFrame(topic_dict) 44 | return topic_df --------------------------------------------------------------------------------