├── MOOC Insights through Scraping edX.pdf ├── README.md ├── courses_new_working.pkl ├── courses_whole.csv ├── edXScraper.ipynb └── edXcourses_working.ipynb /MOOC Insights through Scraping edX.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pmasiphelps/ScrapingProject/e3504eea13e2dae387cbf5f9c5faee2bf47af330/MOOC Insights through Scraping edX.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # edX Web Scraping Project 2 | ## Patrick Masi-Phelps 3 | 4 | ### Blog post: https://blog.nycdatascience.com/student-works/mooc-insights-scraping-edx/ 5 | ### NYC Data Science Academy 6 | 7 | The purpose of this excercise was to successfully scrape edX's website for information on online courses currently offered, and then conduct exploratory data analysis on the scraped data. This information could be useful for educational institutions - getting a clear picture of the current supply and characteristics of MOOCs could better inform current and potential market participants. This can also be useful for students looking understand the availability of alternative online options. 8 | 9 | The "edXScraper" python notebook contains the code used to scrape edX. 10 | 11 | The "edXcourses_working" python notebook shows the code used to clean and manipulate the scraped data, and then perform some basic visualizations and data analysis. 12 | 13 | The .pkl file contains the master dataframe used for all visualizations and analysis. This file has already undergone the cleaning and manipulating process outlined in "edXcourses_working". 14 | 15 | The .csv file contains the initial scraped data from edX, with some minor tweaks. 16 | 17 | The pdf contains a presentation outlining the process and findings. 18 | -------------------------------------------------------------------------------- /courses_new_working.pkl: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pmasiphelps/ScrapingProject/e3504eea13e2dae387cbf5f9c5faee2bf47af330/courses_new_working.pkl -------------------------------------------------------------------------------- /courses_whole.csv: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pmasiphelps/ScrapingProject/e3504eea13e2dae387cbf5f9c5faee2bf47af330/courses_whole.csv -------------------------------------------------------------------------------- /edXScraper.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# edX Web Scraping Project \n", 8 | "\n", 9 | "\n", 10 | "\n", 11 | "### Patrick Masi-Phelps\n", 12 | "### NYCDSA Cohort 10\n", 13 | "\n", 14 | "\n", 15 | "#### This document shows the code used to scrape edX's courses offered in English as of July 28, 2017. " 16 | ] 17 | }, 18 | { 19 | "cell_type": "markdown", 20 | "metadata": {}, 21 | "source": [ 22 | "#### The purpose of this excercise was to successfully scrape edX's website for information on online courses currently offered, and then conduct exploratory data analysis on the scraped data. This information could be useful for educational institutions - getting a clear picture of the current supply and characteristics of MOOCs could better inform current and potential market participants. This can also be useful for students looking understand the availability of alternative online options." 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "### 1. Import packages, initialize driver, and open csv writer" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": 55, 35 | "metadata": { 36 | "collapsed": false 37 | }, 38 | "outputs": [], 39 | "source": [ 40 | "from selenium import webdriver\n", 41 | "from selenium.webdriver.common.by import By\n", 42 | "from selenium.webdriver.support.ui import WebDriverWait\n", 43 | "from selenium.webdriver.support import expected_conditions as EC\n", 44 | "import time\n", 45 | "import csv\n", 46 | "import re\n", 47 | "\n", 48 | "driver = webdriver.Chrome('path to driver')\n", 49 | "\n", 50 | "#main edX page of all English courses -- this scraper excludes edX courses in other languages\n", 51 | "driver.get('https://www.edx.org/course?language=English')\n", 52 | "\n", 53 | "#open a new blank csv\n", 54 | "csv_file = open('courses_whole.csv', 'w')\n", 55 | "\n", 56 | "writer = csv.writer(csv_file)\n", 57 | "\n", 58 | "#write column headers for each of the variables to scrape\n", 59 | "writer.writerow(['course_link','price', 'title', \n", 60 | " 'subject', 'level', 'institution', 'length', \n", 61 | " 'prerequisites', 'short_description', 'effort'])\n", 62 | "\n" 63 | ] 64 | }, 65 | { 66 | "cell_type": "markdown", 67 | "metadata": {}, 68 | "source": [ 69 | "### 2. Scrape the main course page \n", 70 | "#### Link: https://www.edx.org/course?language=English" 71 | ] 72 | }, 73 | { 74 | "cell_type": "code", 75 | "execution_count": 4, 76 | "metadata": { 77 | "collapsed": false, 78 | "scrolled": true 79 | }, 80 | "outputs": [ 81 | { 82 | "name": "stdout", 83 | "output_type": "stream", 84 | "text": [ 85 | "Message: \n", 86 | "\n", 87 | "1257\n", 88 | "==================================================\n", 89 | "1257\n", 90 | "\n", 91 | "\n", 92 | "\n", 93 | "\n", 94 | "==================================================\n", 95 | "==================================================\n", 96 | "1257\n", 97 | "https://www.edx.org/course/introduction-web-accessibility-microsoft-dev240x\n" 98 | ] 99 | } 100 | ], 101 | "source": [ 102 | "### this code scrapes the main course list page, returning a list of all english course links\n", 103 | "\n", 104 | "#edX lists the total number of courses on the top of the main page. This scrapes that number.\n", 105 | "num_classes_str = driver.find_element_by_xpath('//span[@class=\"js-count result-count\"]').text\n", 106 | "\n", 107 | "#convert total course number to an integer\n", 108 | "num_classes = int((re.findall(r'\\d+', num_classes_str))[0])\n", 109 | "\n", 110 | "#initialize page number = 0\n", 111 | "page = 0\n", 112 | "\n", 113 | "### this while loop scrolls down the main course page until all courses are loaded\n", 114 | "\n", 115 | "while page < num_classes:\n", 116 | " \n", 117 | " #driver does an initial scroll down to bottom of the page\n", 118 | " driver.execute_script(\"window.scrollTo(0, document.body.scrollHeight);\")\n", 119 | " \n", 120 | " #this try command waits until it can see the \"loading...\" icon. Once it sees the icon, we add 1 to page counter\n", 121 | " #and continue at the top of the while loop to do another scroll\n", 122 | " try:\n", 123 | " WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//div[@class=\"loading\"]')))\n", 124 | " page += 1\n", 125 | " \n", 126 | " #when the driver waits 10 seconds and still cannot see the \"loading...\" icon, it will raise an exception\n", 127 | " #at this point, we will be at bottom of the page, all courses visible, break out of the loop\n", 128 | " except Exception as e:\n", 129 | " print(e)\n", 130 | " print(page)\n", 131 | " break\n", 132 | "\n", 133 | "#get a list of all course link xpath elements\n", 134 | "courses = driver.find_elements_by_xpath('//div[@class=\"discovery-card-inner-wrapper\"]/a[@class=\"course-link\"]')\n", 135 | " \n", 136 | "#initialize empty list\n", 137 | "course_links = []\n", 138 | "\n", 139 | "#for each course link xpath, grab the link itself (the href element) and append it to the course_link list\n", 140 | "for course in courses:\n", 141 | " course_links.append(course.get_attribute('href'))\n", 142 | "\n" 143 | ] 144 | }, 145 | { 146 | "cell_type": "markdown", 147 | "metadata": {}, 148 | "source": [ 149 | "#### (Some optional test code)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": 27, 155 | "metadata": { 156 | "collapsed": false 157 | }, 158 | "outputs": [], 159 | "source": [ 160 | "### optional testing code to scrape a sample of courses\n", 161 | "### course_links_test1 = ['https://www.edx.org/course/apr-italian-language-culture-wellesleyx-apita-x', \n", 162 | "### 'https://www.edx.org/course/ramp-ap-biology-weston-high-school-bio101x-1',\n", 163 | "### 'https://www.edx.org/course/selling-ideas-how-influence-others-get-wharton-sellingideas101x-2']" 164 | ] 165 | }, 166 | { 167 | "cell_type": "markdown", 168 | "metadata": {}, 169 | "source": [ 170 | "### 3. Scrape the individual course pages" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": 56, 176 | "metadata": { 177 | "collapsed": false 178 | }, 179 | "outputs": [], 180 | "source": [ 181 | "###each of these functions returns the corresponding value from each edX course page###\n", 182 | "###if the scraper can't find a value, it returns the string 'Missing'###\n", 183 | "\n", 184 | "#get title of course\n", 185 | "def get_title():\n", 186 | " try:\n", 187 | " title = driver.find_element_by_xpath('.//*[@id=\"course-info-page\"]//h1[@id=\"course-intro-heading\"]').text\n", 188 | " except:\n", 189 | " title = 'Missing'\n", 190 | " finally:\n", 191 | " return title\n", 192 | "\n", 193 | "#get short description of course\n", 194 | "def get_short_description():\n", 195 | " try:\n", 196 | " short_description = driver.find_element_by_xpath('.//*[@id=\"course-info-page\"]//p[@class=\"course-intro-lead-in\"]').text\n", 197 | " except:\n", 198 | " short_description = 'Missing'\n", 199 | " finally:\n", 200 | " return short_description\n", 201 | "\n", 202 | "#get length of course (typically number of weeks, or total number of hours)\n", 203 | "def get_length():\n", 204 | " try:\n", 205 | " length = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"length\"]/span[2]').text\n", 206 | " except:\n", 207 | " length = 'Missing' \n", 208 | " finally:\n", 209 | " return length\n", 210 | "\n", 211 | "#get the effort of course (typically hours per week, or total course hours)\n", 212 | "def get_effort():\n", 213 | " try:\n", 214 | " effort = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"effort\"]//span[@class=\"block-list__desc\"]').text\n", 215 | " except:\n", 216 | " effort = 'Missing'\n", 217 | " finally:\n", 218 | " return effort\n", 219 | "\n", 220 | "#get the price of course. The first \"try\" only works for free courses. This grabs the text \"FREE\" by xpath\n", 221 | "#to get the price of not-free courses, the \"except, try\" gets the unique \"tag\" icon, then jumps to the parent \n", 222 | "#span class, then to a sibling span class to get the price amount. Unfortunately, the price amount doesn't \n", 223 | "#have a unique identifier.\n", 224 | "\n", 225 | "def get_price():\n", 226 | " try:\n", 227 | " price = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"price\"]//span[@class=\"block-list__desc\"]/span[@class=\"uppercase\"]').text\n", 228 | " except:\n", 229 | " try: \n", 230 | " price = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//span[@class=\"fa fa-tag fa-lg\"]]/../parent::span/following-sibling::span').text()\n", 231 | " except:\n", 232 | " price = \"Missing\"\n", 233 | " finally:\n", 234 | " return price\n", 235 | " \n", 236 | "#gets the institution\n", 237 | "def get_institution():\n", 238 | " try:\n", 239 | " institution = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"school\"]/span[2]/a').text\n", 240 | " except:\n", 241 | " institution = 'Missing'\n", 242 | " finally:\n", 243 | " return institution\n", 244 | "\n", 245 | "#gets the subject\n", 246 | "def get_subject():\n", 247 | " try:\n", 248 | " subject = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"subject\"]/span[2]/a').text\n", 249 | " except:\n", 250 | " subject = 'Missing'\n", 251 | " finally:\n", 252 | " return subject\n", 253 | "\n", 254 | "#gets the level (introductory, intermediate, advanced)\n", 255 | "def get_level():\n", 256 | " try:\n", 257 | " level = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]//li[@data-field=\"level\"]//span[@class=\"block-list__desc\"]').text\n", 258 | " except:\n", 259 | " level = 'Missing'\n", 260 | " finally:\n", 261 | " return level\n", 262 | "\n", 263 | "#gets the prerequisites, if any\n", 264 | "def get_prerequisites():\n", 265 | " try:\n", 266 | " prerequisites = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]/div[2]/p').text\n", 267 | " except:\n", 268 | " try:\n", 269 | " prerequisites = driver.find_element_by_xpath('.//*[@id=\"course-summary-area\"]/div[2]/ul/li[1]')\n", 270 | " except:\n", 271 | " prerequisites = 'Missing'\n", 272 | " finally:\n", 273 | " return prerequisites\n", 274 | "\n", 275 | "### this for loop:\n", 276 | "### 1) iterates through each of the course links\n", 277 | "### 2) creates a new empty dictionary\n", 278 | "### 3) directs the driver to the link\n", 279 | "### 4) calls each of the scraping functions above and saves the return values in the dictionary\n", 280 | "### 5) writes the dictionary values out to the csv\n", 281 | "\n", 282 | "for course_link in course_links:\n", 283 | " course_dict = {}\n", 284 | " driver = webdriver.Chrome('/Users/Patrick/Downloads/chromedriver')\n", 285 | " driver.get(course_link)\n", 286 | " \n", 287 | " \n", 288 | " course_dict['link'] = course_link\n", 289 | " course_dict['title'] = get_title()\n", 290 | " course_dict['short_description'] = get_short_description()\n", 291 | " course_dict['length'] = get_length()\n", 292 | " course_dict['effort'] = get_effort()\n", 293 | " course_dict['price'] = get_price()\n", 294 | " course_dict['institution'] = get_institution()\n", 295 | " course_dict['subject'] = get_subject()\n", 296 | " course_dict['level'] = get_level()\n", 297 | " course_dict['prerequisites'] = get_prerequisites()\n", 298 | " writer.writerow(course_dict.values())\n", 299 | " driver.close()\n", 300 | "\n", 301 | "#close the csv once all course info is scraped\n", 302 | "csv_file.close()" 303 | ] 304 | } 305 | ], 306 | "metadata": { 307 | "anaconda-cloud": {}, 308 | "kernelspec": { 309 | "display_name": "Python [conda root]", 310 | "language": "python", 311 | "name": "conda-root-py" 312 | }, 313 | "language_info": { 314 | "codemirror_mode": { 315 | "name": "ipython", 316 | "version": 3 317 | }, 318 | "file_extension": ".py", 319 | "mimetype": "text/x-python", 320 | "name": "python", 321 | "nbconvert_exporter": "python", 322 | "pygments_lexer": "ipython3", 323 | "version": "3.5.2" 324 | } 325 | }, 326 | "nbformat": 4, 327 | "nbformat_minor": 1 328 | } 329 | --------------------------------------------------------------------------------