├── LICENSE ├── README.md ├── app.py ├── repo.py ├── scrap_test.ipynb ├── sshot1.png └── topics.py /LICENSE: -------------------------------------------------------------------------------- 1 | Apache License 2 | Version 2.0, January 2004 3 | http://www.apache.org/licenses/ 4 | 5 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 6 | 7 | 1. Definitions. 8 | 9 | "License" shall mean the terms and conditions for use, reproduction, 10 | and distribution as defined by Sections 1 through 9 of this document. 11 | 12 | "Licensor" shall mean the copyright owner or entity authorized by 13 | the copyright owner that is granting the License. 14 | 15 | "Legal Entity" shall mean the union of the acting entity and all 16 | other entities that control, are controlled by, or are under common 17 | control with that entity. For the purposes of this definition, 18 | "control" means (i) the power, direct or indirect, to cause the 19 | direction or management of such entity, whether by contract or 20 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 21 | outstanding shares, or (iii) beneficial ownership of such entity. 22 | 23 | "You" (or "Your") shall mean an individual or Legal Entity 24 | exercising permissions granted by this License. 25 | 26 | "Source" form shall mean the preferred form for making modifications, 27 | including but not limited to software source code, documentation 28 | source, and configuration files. 29 | 30 | "Object" form shall mean any form resulting from mechanical 31 | transformation or translation of a Source form, including but 32 | not limited to compiled object code, generated documentation, 33 | and conversions to other media types. 34 | 35 | "Work" shall mean the work of authorship, whether in Source or 36 | Object form, made available under the License, as indicated by a 37 | copyright notice that is included in or attached to the work 38 | (an example is provided in the Appendix below). 
39 | 40 | "Derivative Works" shall mean any work, whether in Source or Object 41 | form, that is based on (or derived from) the Work and for which the 42 | editorial revisions, annotations, elaborations, or other modifications 43 | represent, as a whole, an original work of authorship. For the purposes 44 | of this License, Derivative Works shall not include works that remain 45 | separable from, or merely link (or bind by name) to the interfaces of, 46 | the Work and Derivative Works thereof. 47 | 48 | "Contribution" shall mean any work of authorship, including 49 | the original version of the Work and any modifications or additions 50 | to that Work or Derivative Works thereof, that is intentionally 51 | submitted to Licensor for inclusion in the Work by the copyright owner 52 | or by an individual or Legal Entity authorized to submit on behalf of 53 | the copyright owner. For the purposes of this definition, "submitted" 54 | means any form of electronic, verbal, or written communication sent 55 | to the Licensor or its representatives, including but not limited to 56 | communication on electronic mailing lists, source code control systems, 57 | and issue tracking systems that are managed by, or on behalf of, the 58 | Licensor for the purpose of discussing and improving the Work, but 59 | excluding communication that is conspicuously marked or otherwise 60 | designated in writing by the copyright owner as "Not a Contribution." 61 | 62 | "Contributor" shall mean Licensor and any individual or Legal Entity 63 | on behalf of whom a Contribution has been received by Licensor and 64 | subsequently incorporated within the Work. 65 | 66 | 2. Grant of Copyright License. 
Subject to the terms and conditions of 67 | this License, each Contributor hereby grants to You a perpetual, 68 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 69 | copyright license to reproduce, prepare Derivative Works of, 70 | publicly display, publicly perform, sublicense, and distribute the 71 | Work and such Derivative Works in Source or Object form. 72 | 73 | 3. Grant of Patent License. Subject to the terms and conditions of 74 | this License, each Contributor hereby grants to You a perpetual, 75 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 76 | (except as stated in this section) patent license to make, have made, 77 | use, offer to sell, sell, import, and otherwise transfer the Work, 78 | where such license applies only to those patent claims licensable 79 | by such Contributor that are necessarily infringed by their 80 | Contribution(s) alone or by combination of their Contribution(s) 81 | with the Work to which such Contribution(s) was submitted. If You 82 | institute patent litigation against any entity (including a 83 | cross-claim or counterclaim in a lawsuit) alleging that the Work 84 | or a Contribution incorporated within the Work constitutes direct 85 | or contributory patent infringement, then any patent licenses 86 | granted to You under this License for that Work shall terminate 87 | as of the date such litigation is filed. 88 | 89 | 4. Redistribution. 
You may reproduce and distribute copies of the 90 | Work or Derivative Works thereof in any medium, with or without 91 | modifications, and in Source or Object form, provided that You 92 | meet the following conditions: 93 | 94 | (a) You must give any other recipients of the Work or 95 | Derivative Works a copy of this License; and 96 | 97 | (b) You must cause any modified files to carry prominent notices 98 | stating that You changed the files; and 99 | 100 | (c) You must retain, in the Source form of any Derivative Works 101 | that You distribute, all copyright, patent, trademark, and 102 | attribution notices from the Source form of the Work, 103 | excluding those notices that do not pertain to any part of 104 | the Derivative Works; and 105 | 106 | (d) If the Work includes a "NOTICE" text file as part of its 107 | distribution, then any Derivative Works that You distribute must 108 | include a readable copy of the attribution notices contained 109 | within such NOTICE file, excluding those notices that do not 110 | pertain to any part of the Derivative Works, in at least one 111 | of the following places: within a NOTICE text file distributed 112 | as part of the Derivative Works; within the Source form or 113 | documentation, if provided along with the Derivative Works; or, 114 | within a display generated by the Derivative Works, if and 115 | wherever such third-party notices normally appear. The contents 116 | of the NOTICE file are for informational purposes only and 117 | do not modify the License. You may add Your own attribution 118 | notices within Derivative Works that You distribute, alongside 119 | or as an addendum to the NOTICE text from the Work, provided 120 | that such additional attribution notices cannot be construed 121 | as modifying the License. 
122 | 123 | You may add Your own copyright statement to Your modifications and 124 | may provide additional or different license terms and conditions 125 | for use, reproduction, or distribution of Your modifications, or 126 | for any such Derivative Works as a whole, provided Your use, 127 | reproduction, and distribution of the Work otherwise complies with 128 | the conditions stated in this License. 129 | 130 | 5. Submission of Contributions. Unless You explicitly state otherwise, 131 | any Contribution intentionally submitted for inclusion in the Work 132 | by You to the Licensor shall be under the terms and conditions of 133 | this License, without any additional terms or conditions. 134 | Notwithstanding the above, nothing herein shall supersede or modify 135 | the terms of any separate license agreement you may have executed 136 | with Licensor regarding such Contributions. 137 | 138 | 6. Trademarks. This License does not grant permission to use the trade 139 | names, trademarks, service marks, or product names of the Licensor, 140 | except as required for reasonable and customary use in describing the 141 | origin of the Work and reproducing the content of the NOTICE file. 142 | 143 | 7. Disclaimer of Warranty. Unless required by applicable law or 144 | agreed to in writing, Licensor provides the Work (and each 145 | Contributor provides its Contributions) on an "AS IS" BASIS, 146 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 147 | implied, including, without limitation, any warranties or conditions 148 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 149 | PARTICULAR PURPOSE. You are solely responsible for determining the 150 | appropriateness of using or redistributing the Work and assume any 151 | risks associated with Your exercise of permissions under this License. 152 | 153 | 8. Limitation of Liability. 
In no event and under no legal theory, 154 | whether in tort (including negligence), contract, or otherwise, 155 | unless required by applicable law (such as deliberate and grossly 156 | negligent acts) or agreed to in writing, shall any Contributor be 157 | liable to You for damages, including any direct, indirect, special, 158 | incidental, or consequential damages of any character arising as a 159 | result of this License or out of the use or inability to use the 160 | Work (including but not limited to damages for loss of goodwill, 161 | work stoppage, computer failure or malfunction, or any and all 162 | other commercial damages or losses), even if such Contributor 163 | has been advised of the possibility of such damages. 164 | 165 | 9. Accepting Warranty or Additional Liability. While redistributing 166 | the Work or Derivative Works thereof, You may choose to offer, 167 | and charge a fee for, acceptance of support, warranty, indemnity, 168 | or other liability obligations and/or rights consistent with this 169 | License. However, in accepting such obligations, You may act only 170 | on Your own behalf and on Your sole responsibility, not on behalf 171 | of any other Contributor, and only if You agree to indemnify, 172 | defend, and hold each Contributor harmless for any liability 173 | incurred by, or claims asserted against, such Contributor by reason 174 | of your accepting any such warranty or additional liability. 175 | 176 | END OF TERMS AND CONDITIONS 177 | 178 | APPENDIX: How to apply the Apache License to your work. 179 | 180 | To apply the Apache License to your work, attach the following 181 | boilerplate notice, with the fields enclosed by brackets "[]" 182 | replaced with your own identifying information. (Don't include 183 | the brackets!) The text should be enclosed in the appropriate 184 | comment syntax for the file format. 
We also recommend that a 185 | file or class name and description of purpose be included on the 186 | same "printed page" as the copyright notice for easier 187 | identification within third-party archives. 188 | 189 | Copyright [yyyy] [name of copyright owner] 190 | 191 | Licensed under the Apache License, Version 2.0 (the "License"); 192 | you may not use this file except in compliance with the License. 193 | You may obtain a copy of the License at 194 | 195 | http://www.apache.org/licenses/LICENSE-2.0 196 | 197 | Unless required by applicable law or agreed to in writing, software 198 | distributed under the License is distributed on an "AS IS" BASIS, 199 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 200 | See the License for the specific language governing permissions and 201 | limitations under the License. 202 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Github Crawler 2 | 3 | Github Crawler is a simple tool that scrapes data from GitHub. 4 | Currently, it only scrapes data from the Topics section on GitHub. 5 | ## Features 6 | 7 | - Scrape any amount of data 8 | - Data can be downloaded as CSV 9 | - All links are provided in the dataset 10 | - Useful for statistical analysis 11 | 12 | 13 | ## Screenshots 14 | 15 | ![Image link](https://github.com/abhroroy365/githubcrawler/blob/main/sshot1.png) 16 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | import pandas as pd 3 | import numpy as np 4 | from repo import scrap_repo 5 | from topics import scrap_topics 6 | 7 | st.markdown("

<h1 style='text-align: center;'>Github Crawler</h1>

", unsafe_allow_html=True) 8 | 9 | # def make_clickable(link): 10 | # # target _blank to open new window 11 | # # extract clickable text to display for your link 12 | # text = link.split('=')[0] 13 | # return f'{text}' 14 | 15 | @st.cache 16 | def convert_df(df): 17 | return df.to_csv().encode('utf-8') 18 | 19 | def display_repos(x,urls,topic): 20 | repo_df = scrap_repo(urls[x]) 21 | # download button for repositories data 22 | csv = convert_df(repo_df) 23 | st.download_button( 24 | label="Save {} repositories as CSV".format(topic[x]), 25 | data=csv, 26 | file_name=str(topic[x])+'.csv', 27 | mime='text/csv', 28 | ) 29 | # display repositories 30 | # repo_df['User Link'] = repo_df['User Link'].apply(make_clickable) 31 | # repo_df['Repository Link'] = repo_df['Repository Link'].apply(make_clickable) 32 | st.dataframe(repo_df) 33 | 34 | 35 | def display_topics(): 36 | number = 30 37 | with st.sidebar: 38 | agree = st.checkbox('Show more topics') 39 | if agree: 40 | number = st.radio( 41 | "Choose number of topics :", 42 | (30, 60, 90, 120, 150, 180) 43 | ) 44 | df= scrap_topics(number) 45 | # download button for topic data 46 | csv = convert_df(df) 47 | st.download_button( 48 | label="Download topics as CSV", 49 | data=csv, 50 | file_name='trending_github_topics.csv', 51 | mime='text/csv', 52 | ) 53 | user_table = df 54 | user_table['Repositories'] ='' 55 | col = st.columns((1, 2, 2, 1, 1)) 56 | st.write('Showing top {} topics out of 180 trending topics'.format(number)) 57 | for x, topic in enumerate(user_table['topic']): 58 | col1, col2, col3, col4, col5 = st.columns((1, 2, 2, 1, 1)) 59 | col1.write(x+1) # index 60 | col2.write(user_table['topic'][x]) 61 | col3.write(user_table['description'][x]) 62 | col4.write(user_table['url'][x]) 63 | show_repo = user_table['Repositories'][x] # flexible type of button 64 | button_type = "Show Repositories" 65 | button_phold = col5.empty() # create a placeholder 66 | do_action = button_phold.button(button_type, key=x) 67 | if 
do_action: 68 | button_phold.empty() 69 | with st.spinner("Loading..."): 70 | display_repos(x,user_table['url'],user_table['topic']) 71 | 72 | st.markdown("
<p style='text-align: center;'>Developed by Abhra Ray Chaudhuri</p>
", unsafe_allow_html=True) 73 | 74 | 75 | def main(): 76 | st.markdown("

<h3 style='text-align: center;'>What are the trending GitHub topics currently?</h3>


", unsafe_allow_html=True) 77 | display_topics() 78 | 79 | if __name__ == '__main__': 80 | main() -------------------------------------------------------------------------------- /repo.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import requests 3 | from bs4 import BeautifulSoup 4 | 5 | def scrap_repo(url): 6 | base_url = 'https://github.com' 7 | response= requests.get(url) 8 | topic_page = response.text 9 | topic_doc = BeautifulSoup(topic_page, 'html.parser') 10 | repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed') 11 | username = [] 12 | repository = [] 13 | user_link = [] 14 | repo_link = [] 15 | for tag in repo_tag: 16 | user_link.append(base_url+tag.find_all('a')[0]['href'].strip()) 17 | repo_link.append(base_url+tag.find_all('a')[1]['href'].strip()) 18 | username.append(tag.find_all('a')[0].text.strip()) 19 | repository.append(tag.find_all('a')[1].text.strip()) 20 | star_tag = topic_doc.find_all('span',{'id' : "repo-stars-counter-star"}) 21 | stars = [] 22 | for tag in star_tag: 23 | star = tag['title'] 24 | star = int(star.replace(',','')) 25 | stars.append(star) 26 | dict = { 27 | 'Username':username, 28 | 'User Link':user_link, 29 | 'Repository': repository, 30 | 'Repository Link':repo_link, 31 | 'Stars': stars 32 | } 33 | df = pd.DataFrame(dict) 34 | return df 35 | -------------------------------------------------------------------------------- /scrap_test.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "### Scraping a web page" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "# 1.Picking a website and describing the objective" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "- Identify the information you'd like to scrape from 
the site. Decide the format of the output CSV file.\n", 22 | "- Summarize your project idea and outline your strategy in a Juptyer notebook." 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 1, 28 | "metadata": {}, 29 | "outputs": [], 30 | "source": [ 31 | "site = \"https://github.com/topics\"" 32 | ] 33 | }, 34 | { 35 | "cell_type": "markdown", 36 | "metadata": {}, 37 | "source": [ 38 | "# 2.Use the requests library to download web pages" 39 | ] 40 | }, 41 | { 42 | "cell_type": "markdown", 43 | "metadata": {}, 44 | "source": [ 45 | "- Inspect the website's HTML source and identify the right URLs to download.\n", 46 | "- Download and save web pages locally using the requests library.\n", 47 | "- Create a function to automate downloading for different topics/search queries." 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": 2, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "import requests" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "page = requests.get(site)" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": 4, 71 | "metadata": {}, 72 | "outputs": [ 73 | { 74 | "data": { 75 | "text/plain": [ 76 | "200" 77 | ] 78 | }, 79 | "execution_count": 4, 80 | "metadata": {}, 81 | "output_type": "execute_result" 82 | } 83 | ], 84 | "source": [ 85 | "page.status_code" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": 5, 91 | "metadata": {}, 92 | "outputs": [], 93 | "source": [ 94 | "page_content = page.text" 95 | ] 96 | }, 97 | { 98 | "cell_type": "code", 99 | "execution_count": 6, 100 | "metadata": {}, 101 | "outputs": [ 102 | { 103 | "data": { 104 | "text/plain": [ 105 | "'\\n\\n\\n\\n \\n \\n \\n \\n \\n \\n \\n \\n\\n\\n\\n 3D

,\n", 189 | "

Ajax

,\n", 190 | "

Algorithm

,\n", 191 | "

Amp

,\n", 192 | "

Android

]" 193 | ] 194 | }, 195 | "execution_count": 11, 196 | "metadata": {}, 197 | "output_type": "execute_result" 198 | } 199 | ], 200 | "source": [ 201 | "topic_title_tag[:5]" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": 12, 207 | "metadata": {}, 208 | "outputs": [], 209 | "source": [ 210 | "topic_titles = []\n", 211 | "for tag in topic_title_tag:\n", 212 | " topic_titles.append(tag.text)" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 13, 218 | "metadata": {}, 219 | "outputs": [ 220 | { 221 | "data": { 222 | "text/plain": [ 223 | "['3D', 'Ajax', 'Algorithm', 'Amp', 'Android']" 224 | ] 225 | }, 226 | "execution_count": 13, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "topic_titles[:5]" 233 | ] 234 | }, 235 | { 236 | "cell_type": "markdown", 237 | "metadata": {}, 238 | "source": [ 239 | "- getting topic description" 240 | ] 241 | }, 242 | { 243 | "cell_type": "code", 244 | "execution_count": 14, 245 | "metadata": {}, 246 | "outputs": [], 247 | "source": [ 248 | "selection_class = \"f5 color-fg-muted mb-0 mt-1\"\n", 249 | "topic_desc_tag = doc.find_all('p',class_ = selection_class)" 250 | ] 251 | }, 252 | { 253 | "cell_type": "code", 254 | "execution_count": 15, 255 | "metadata": {}, 256 | "outputs": [], 257 | "source": [ 258 | "topic_descs = []\n", 259 | "for desc in topic_desc_tag:\n", 260 | " topic_descs.append(desc.text.strip())" 261 | ] 262 | }, 263 | { 264 | "cell_type": "code", 265 | "execution_count": 16, 266 | "metadata": {}, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "['3D modeling is the process of virtually developing the surface and structure of a 3D object.',\n", 272 | " 'Ajax is a technique for creating interactive web applications.',\n", 273 | " 'Algorithms are self-contained sequences that carry out a variety of tasks.',\n", 274 | " 'Amp is a non-blocking concurrency library for PHP.',\n", 275 | " 'Android is 
an operating system built by Google designed for mobile devices.']" 276 | ] 277 | }, 278 | "execution_count": 16, 279 | "metadata": {}, 280 | "output_type": "execute_result" 281 | } 282 | ], 283 | "source": [ 284 | "topic_descs[:5]" 285 | ] 286 | }, 287 | { 288 | "cell_type": "markdown", 289 | "metadata": {}, 290 | "source": [ 291 | "- getting topic url" 292 | ] 293 | }, 294 | { 295 | "cell_type": "code", 296 | "execution_count": 17, 297 | "metadata": {}, 298 | "outputs": [], 299 | "source": [ 300 | "selection_class = \"no-underline flex-1 d-flex flex-column\"\n", 301 | "topic_url_tag = doc.find_all('a',class_ = selection_class)" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": 18, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "base_url = 'https://github.com'\n", 311 | "topic_url_list = []\n", 312 | "for i in range(len(topic_url_tag)):\n", 313 | " href = topic_url_tag[i]['href']\n", 314 | " topic_url = base_url + href\n", 315 | " topic_url_list.append(topic_url)" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 19, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "['https://github.com/topics/3d',\n", 327 | " 'https://github.com/topics/ajax',\n", 328 | " 'https://github.com/topics/algorithm',\n", 329 | " 'https://github.com/topics/amphp',\n", 330 | " 'https://github.com/topics/android',\n", 331 | " 'https://github.com/topics/angular',\n", 332 | " 'https://github.com/topics/ansible',\n", 333 | " 'https://github.com/topics/api',\n", 334 | " 'https://github.com/topics/arduino',\n", 335 | " 'https://github.com/topics/aspnet']" 336 | ] 337 | }, 338 | "execution_count": 19, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "topic_url_list[:10]" 345 | ] 346 | }, 347 | { 348 | "cell_type": "markdown", 349 | "metadata": {}, 350 | "source": [ 351 | "# 4.Create CSV file(s) with the extracted information\n", 
352 | "\n" 353 | ] 354 | }, 355 | { 356 | "cell_type": "markdown", 357 | "metadata": {}, 358 | "source": [ 359 | "- Create functions for the end-to-end process of downloading, parsing, and saving CSVs.\n", 360 | "- Execute the function with different inputs to create a dataset of CSV files.\n", 361 | "- Verify the information in the CSV files by reading them back using Pandas." 362 | ] 363 | }, 364 | { 365 | "cell_type": "code", 366 | "execution_count": 20, 367 | "metadata": {}, 368 | "outputs": [], 369 | "source": [ 370 | "import pandas as pd" 371 | ] 372 | }, 373 | { 374 | "cell_type": "code", 375 | "execution_count": 21, 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "from operator import index\n", 380 | "\n", 381 | "\n", 382 | "dict = {\n", 383 | " 'topic':topic_titles,\n", 384 | " 'decription': topic_descs,\n", 385 | " 'url' : topic_url_list\n", 386 | "}\n", 387 | "topic_df = pd.DataFrame(dict,index=None)" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 22, 393 | "metadata": {}, 394 | "outputs": [ 395 | { 396 | "data": { 397 | "text/html": [ 398 | "
\n", 399 | "\n", 412 | "\n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | "
topicdecriptionurl
03D3D modeling is the process of virtually develo...https://github.com/topics/3d
1AjaxAjax is a technique for creating interactive w...https://github.com/topics/ajax
2AlgorithmAlgorithms are self-contained sequences that c...https://github.com/topics/algorithm
3AmpAmp is a non-blocking concurrency library for ...https://github.com/topics/amphp
4AndroidAndroid is an operating system built by Google...https://github.com/topics/android
5AngularAngular is an open source web application plat...https://github.com/topics/angular
6AnsibleAnsible is a simple and powerful automation en...https://github.com/topics/ansible
7APIAn API (Application Programming Interface) is ...https://github.com/topics/api
8ArduinoArduino is an open source hardware and softwar...https://github.com/topics/arduino
9ASP.NETASP.NET is a web framework for building modern...https://github.com/topics/aspnet
\n", 484 | "
" 485 | ], 486 | "text/plain": [ 487 | " topic decription \\\n", 488 | "0 3D 3D modeling is the process of virtually develo... \n", 489 | "1 Ajax Ajax is a technique for creating interactive w... \n", 490 | "2 Algorithm Algorithms are self-contained sequences that c... \n", 491 | "3 Amp Amp is a non-blocking concurrency library for ... \n", 492 | "4 Android Android is an operating system built by Google... \n", 493 | "5 Angular Angular is an open source web application plat... \n", 494 | "6 Ansible Ansible is a simple and powerful automation en... \n", 495 | "7 API An API (Application Programming Interface) is ... \n", 496 | "8 Arduino Arduino is an open source hardware and softwar... \n", 497 | "9 ASP.NET ASP.NET is a web framework for building modern... \n", 498 | "\n", 499 | " url \n", 500 | "0 https://github.com/topics/3d \n", 501 | "1 https://github.com/topics/ajax \n", 502 | "2 https://github.com/topics/algorithm \n", 503 | "3 https://github.com/topics/amphp \n", 504 | "4 https://github.com/topics/android \n", 505 | "5 https://github.com/topics/angular \n", 506 | "6 https://github.com/topics/ansible \n", 507 | "7 https://github.com/topics/api \n", 508 | "8 https://github.com/topics/arduino \n", 509 | "9 https://github.com/topics/aspnet " 510 | ] 511 | }, 512 | "execution_count": 22, 513 | "metadata": {}, 514 | "output_type": "execute_result" 515 | } 516 | ], 517 | "source": [ 518 | "topic_df[:10]" 519 | ] 520 | }, 521 | { 522 | "cell_type": "code", 523 | "execution_count": 23, 524 | "metadata": {}, 525 | "outputs": [], 526 | "source": [ 527 | "topic_df.to_csv('topics.csv',index=None)" 528 | ] 529 | }, 530 | { 531 | "cell_type": "markdown", 532 | "metadata": {}, 533 | "source": [ 534 | "# Getting information from a topic page" 535 | ] 536 | }, 537 | { 538 | "cell_type": "code", 539 | "execution_count": 24, 540 | "metadata": {}, 541 | "outputs": [ 542 | { 543 | "data": { 544 | "text/plain": [ 545 | "'https://github.com/topics/3d'" 546 | ] 547 | }, 548 | 
"execution_count": 24, 549 | "metadata": {}, 550 | "output_type": "execute_result" 551 | } 552 | ], 553 | "source": [ 554 | "topic_url_list[0]" 555 | ] 556 | }, 557 | { 558 | "cell_type": "code", 559 | "execution_count": 25, 560 | "metadata": {}, 561 | "outputs": [], 562 | "source": [ 563 | "response= requests.get(topic_url_list[0])" 564 | ] 565 | }, 566 | { 567 | "cell_type": "code", 568 | "execution_count": 26, 569 | "metadata": {}, 570 | "outputs": [], 571 | "source": [ 572 | "topic_page = response.text" 573 | ] 574 | }, 575 | { 576 | "cell_type": "code", 577 | "execution_count": 27, 578 | "metadata": {}, 579 | "outputs": [], 580 | "source": [ 581 | "topic_doc = BeautifulSoup(topic_page, 'html.parser')" 582 | ] 583 | }, 584 | { 585 | "cell_type": "code", 586 | "execution_count": 28, 587 | "metadata": {}, 588 | "outputs": [ 589 | { 590 | "data": { 591 | "text/plain": [ 592 | "[\n", 593 | " mrdoob\n", 594 | " ,\n", 595 | " \n", 596 | " three.js\n", 597 | " ]" 598 | ] 599 | }, 600 | "execution_count": 28, 601 | "metadata": {}, 602 | "output_type": "execute_result" 603 | } 604 | ], 605 | "source": [ 606 | "repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed')\n", 607 | "repo_tag[0].find_all('a')" 608 | ] 609 | }, 610 | { 611 | "cell_type": "code", 612 | "execution_count": 29, 613 | "metadata": {}, 614 | "outputs": [], 615 | "source": [ 616 | "username = []\n", 617 | "repository = []\n", 618 | "user_link = []\n", 619 | "repo_link = []\n", 620 | "for tag in repo_tag:\n", 621 | " user_link.append(base_url+tag.find_all('a')[0]['href'].strip())\n", 622 | " repo_link.append(base_url+tag.find_all('a')[1]['href'].strip())\n", 623 | " username.append(tag.find_all('a')[0].text.strip())\n", 624 | " repository.append(tag.find_all('a')[1].text.strip())" 625 | ] 626 | }, 627 | { 628 | "cell_type": "code", 629 | "execution_count": 30, 630 | "metadata": {}, 631 | "outputs": [], 632 | "source": [ 633 | "star_tag = topic_doc.find_all('span',{'id' : 
\"repo-stars-counter-star\"})" 634 | ] 635 | }, 636 | { 637 | "cell_type": "code", 638 | "execution_count": 31, 639 | "metadata": {}, 640 | "outputs": [], 641 | "source": [ 642 | "stars = []\n", 643 | "for tag in star_tag:\n", 644 | " star = tag['title']\n", 645 | " star = int(star.replace(',',''))\n", 646 | " stars.append(star)" 647 | ] 648 | }, 649 | { 650 | "cell_type": "code", 651 | "execution_count": 32, 652 | "metadata": {}, 653 | "outputs": [], 654 | "source": [ 655 | "dict = {\n", 656 | " 'Username':username,\n", 657 | " 'User Link':user_link,\n", 658 | " 'Repository': repository,\n", 659 | " 'Repo Link':repo_link,\n", 660 | " 'Stars': stars\n", 661 | "}\n", 662 | "D3_repo_df = pd.DataFrame(dict)" 663 | ] 664 | }, 665 | { 666 | "cell_type": "code", 667 | "execution_count": 33, 668 | "metadata": {}, 669 | "outputs": [ 670 | { 671 | "data": { 672 | "text/html": [ 673 | "
\n", 674 | "\n", 687 | "\n", 688 | " \n", 689 | " \n", 690 | " \n", 691 | " \n", 692 | " \n", 693 | " \n", 694 | " \n", 695 | " \n", 696 | " \n", 697 | " \n", 698 | " \n", 699 | " \n", 700 | " \n", 701 | " \n", 702 | " \n", 703 | " \n", 704 | " \n", 705 | " \n", 706 | " \n", 707 | " \n", 708 | " \n", 709 | " \n", 710 | " \n", 711 | " \n", 712 | " \n", 713 | " \n", 714 | " \n", 715 | " \n", 716 | " \n", 717 | " \n", 718 | " \n", 719 | " \n", 720 | " \n", 721 | " \n", 722 | " \n", 723 | " \n", 724 | " \n", 725 | " \n", 726 | " \n", 727 | " \n", 728 | " \n", 729 | " \n", 730 | " \n", 731 | " \n", 732 | " \n", 733 | " \n", 734 | " \n", 735 | " \n", 736 | " \n", 737 | " \n", 738 | " \n", 739 | " \n", 740 | "
UsernameUser LinkRepositoryRepo LinkStars
0mrdoobhttps://github.com/mrdoobthree.jshttps://github.com/mrdoob/three.js83130
1libgdxhttps://github.com/libgdxlibgdxhttps://github.com/libgdx/libgdx20147
2pmndrshttps://github.com/pmndrsreact-three-fiberhttps://github.com/pmndrs/react-three-fiber18517
3BabylonJShttps://github.com/BabylonJSBabylon.jshttps://github.com/BabylonJS/Babylon.js17680
4aframevrhttps://github.com/aframevraframehttps://github.com/aframevr/aframe14300
\n", 741 | "
" 742 | ], 743 | "text/plain": [ 744 | " Username User Link Repository \\\n", 745 | "0 mrdoob https://github.com/mrdoob three.js \n", 746 | "1 libgdx https://github.com/libgdx libgdx \n", 747 | "2 pmndrs https://github.com/pmndrs react-three-fiber \n", 748 | "3 BabylonJS https://github.com/BabylonJS Babylon.js \n", 749 | "4 aframevr https://github.com/aframevr aframe \n", 750 | "\n", 751 | " Repo Link Stars \n", 752 | "0 https://github.com/mrdoob/three.js 83130 \n", 753 | "1 https://github.com/libgdx/libgdx 20147 \n", 754 | "2 https://github.com/pmndrs/react-three-fiber 18517 \n", 755 | "3 https://github.com/BabylonJS/Babylon.js 17680 \n", 756 | "4 https://github.com/aframevr/aframe 14300 " 757 | ] 758 | }, 759 | "execution_count": 33, 760 | "metadata": {}, 761 | "output_type": "execute_result" 762 | } 763 | ], 764 | "source": [ 765 | "D3_repo_df[:5]" 766 | ] 767 | }, 768 | { 769 | "cell_type": "markdown", 770 | "metadata": {}, 771 | "source": [ 772 | "# Summarizing the tasks" 773 | ] 774 | }, 775 | { 776 | "cell_type": "code", 777 | "execution_count": 42, 778 | "metadata": {}, 779 | "outputs": [], 780 | "source": [ 781 | "def scrap_topics_repos():\n", 782 | " i=0\n", 783 | " for url in topic_url_list:\n", 784 | " response= requests.get(url)\n", 785 | " topic_page = response.text\n", 786 | " topic_doc = BeautifulSoup(topic_page, 'html.parser')\n", 787 | " repo_tag = topic_doc.find_all('h3',class_ = 'f3 color-fg-muted text-normal lh-condensed')\n", 788 | " username = []\n", 789 | " repository = []\n", 790 | " user_link = []\n", 791 | " repo_link = []\n", 792 | " for tag in repo_tag:\n", 793 | " user_link.append(base_url+tag.find_all('a')[0]['href'].strip())\n", 794 | " repo_link.append(base_url+tag.find_all('a')[1]['href'].strip())\n", 795 | " username.append(tag.find_all('a')[0].text.strip())\n", 796 | " repository.append(tag.find_all('a')[1].text.strip())\n", 797 | " star_tag = topic_doc.find_all('span',{'id' : \"repo-stars-counter-star\"})\n", 798 | " stars = 
[]\n", 799 | " for tag in star_tag:\n", 800 | " star = tag['title']\n", 801 | " star = int(star.replace(',',''))\n", 802 | " stars.append(star)\n", 803 | " dict = {\n", 804 | " 'Username':username,\n", 805 | " 'User Link':user_link,\n", 806 | " 'Repository': repository,\n", 807 | " 'Repo Link':repo_link,\n", 808 | " 'Stars': stars\n", 809 | " }\n", 810 | " df = pd.DataFrame(dict)\n", 811 | " filename = topic_titles[i] +'.csv'\n", 812 | " df.to_csv(filename,index=None)\n", 813 | " i+=1\n", 814 | " " 815 | ] 816 | }, 817 | { 818 | "cell_type": "code", 819 | "execution_count": 43, 820 | "metadata": {}, 821 | "outputs": [], 822 | "source": [ 823 | "scrap_topics_repos()" 824 | ] 825 | }, 826 | { 827 | "cell_type": "markdown", 828 | "metadata": {}, 829 | "source": [ 830 | "# 5.Document and share your work" 831 | ] 832 | }, 833 | { 834 | "cell_type": "markdown", 835 | "metadata": {}, 836 | "source": [ 837 | "- Add proper headings and documentation in your Jupyter notebook.\n", 838 | "- Publish your Jupyter notebook to your Jovian profile\n", 839 | "- (Optional) Write a blog post about your project and share it online." 
840 | ] 841 | }, 842 | { 843 | "cell_type": "code", 844 | "execution_count": null, 845 | "metadata": {}, 846 | "outputs": [], 847 | "source": [] 848 | }, 849 | { 850 | "cell_type": "code", 851 | "execution_count": null, 852 | "metadata": {}, 853 | "outputs": [], 854 | "source": [] 855 | } 856 | ], 857 | "metadata": { 858 | "kernelspec": { 859 | "display_name": "Python 3.9.7 64-bit", 860 | "language": "python", 861 | "name": "python3" 862 | }, 863 | "language_info": { 864 | "codemirror_mode": { 865 | "name": "ipython", 866 | "version": 3 867 | }, 868 | "file_extension": ".py", 869 | "mimetype": "text/x-python", 870 | "name": "python", 871 | "nbconvert_exporter": "python", 872 | "pygments_lexer": "ipython3", 873 | "version": "3.9.7" 874 | }, 875 | "orig_nbformat": 4, 876 | "vscode": { 877 | "interpreter": { 878 | "hash": "5f988359990852b9f05d733faad7e7ee387134e25f4a1ce8f641bacdeab60818" 879 | } 880 | } 881 | }, 882 | "nbformat": 4, 883 | "nbformat_minor": 2 884 | } 885 | -------------------------------------------------------------------------------- /sshot1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/abhroroy365/githubcrawler/99d1c99e48a32189d63dd5f52a6fa6d03637c69f/sshot1.png -------------------------------------------------------------------------------- /topics.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import requests 3 | from bs4 import BeautifulSoup 4 | 5 | def scrap_topics(number): 6 | base_site = 'https://github.com/topics' 7 | total = int(number/30) 8 | topic_titles = [] 9 | topic_descs = [] 10 | topic_url_list = [] 11 | for i in range(total): 12 | page = '?page='+str(i+1) 13 | site = base_site+page 14 | req = requests.get(site) 15 | page_content = req.text 16 | doc = BeautifulSoup(page_content, "html.parser") 17 | # getting topic title 18 | selection_class = "f3 lh-condensed mb-0 mt-1 Link--primary" 19 | 
topic_title_tag = doc.find_all('p', class_=selection_class) 20 | for tag in topic_title_tag: 21 | topic_titles.append(tag.text.strip()) 22 | # getting topic description 23 | selection_class = "f5 color-fg-muted mb-0 mt-1" 24 | topic_desc_tag = doc.find_all('p', class_=selection_class) 25 | for desc in topic_desc_tag: 26 | topic_descs.append(desc.text.strip()) 27 | 28 | # getting topic links 29 | selection_class = "no-underline flex-1 d-flex flex-column" 30 | topic_url_tag = doc.find_all('a', class_=selection_class) 31 | base_url = 'https://github.com' 32 | for tag in topic_url_tag: 33 | href = tag['href'] 34 | topic_url = base_url + href 35 | topic_url_list.append(topic_url) 36 | 37 | # creating dataframe 38 | topic_dict = { 39 | 'topic': topic_titles, 40 | 'description': topic_descs, 41 | 'url': topic_url_list, 42 | } 43 | topic_df = pd.DataFrame(topic_dict) 44 | return topic_df --------------------------------------------------------------------------------