├── .gitignore
├── LICENSE
├── README.md
├── companies.csv
├── scrape.gif
└── scraper.py

/.gitignore:
--------------------------------------------------------------------------------
.DS_Store
api-credentials.json
credentials2.json
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 ginglis13

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# magic-formula-scraper

Python script for scraping [magicformulainvesting.com](https://www.magicformulainvesting.com/) and appending data to a Google Sheet using [selenium](https://www.seleniumhq.org/), the [Google Sheets API](https://developers.google.com/sheets/api/), and [gspread](https://gspread.readthedocs.io/en/latest/).

My brother and I make investments by following Joel Greenblatt's Magic Formula.
The site above applies this formula and lists the top 30 or 50 companies that
currently fit its criteria. However, the site does not allow a user to copy the
information for these companies directly from the webpage. Manually typing out
the names of 30+ companies and their information is a time-suck, so I created
this script to scrape the information instead.

Example GIF
------
Here is the script running with a headless version of the Google Chrome browser, i.e. one without a GUI. It is also running with my credentials already supplied, so there is no interaction between the program and the user.

![scrape](scrape.gif)

Features
------
+ opens a Chrome browser to the magic formula login page, then uses selenium's send_keys and the getpass library to enter login information
+ once logged in, selects the number of stocks to view and clicks the corresponding button to display them
+ scrapes information about the listed companies and writes it to a csv file titled 'companies.csv'
+ appends the data to a spreadsheet using the Google Sheets API and gspread
+ Optional: can be run as a cron job; instructions below

### Main Loop
This is where the data is both written to a csv file and added to a Google worksheet:
```python
# find all rows of the results table, then pull the td cells from each row
trs = driver.find_elements_by_xpath('//table[@class="divheight screeningdata"]/tbody/tr')

for tr in trs:
    td = tr.find_elements_by_xpath(".//td")
    # first cell holds the company name, second holds its ticker
    company_name = td[0].get_attribute("innerHTML")
    company_tikr = td[1].get_attribute("innerHTML")
    # write to csv file
    writer.writerow([company_name, company_tikr])
    # append row to worksheet
    # value_input_option="USER_ENTERED" lets the GOOGLEFINANCE formula evaluate, so the price is pulled live
    worksheet.append_row([company_name, company_tikr, '=GOOGLEFINANCE("' + company_tikr + '","price")'], value_input_option="USER_ENTERED")

driver.quit()
```
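
The Features to Implement list below mentions checking a file of companies already researched before writing new rows. Here is a minimal sketch of how that check could work against the existing companies.csv output; the helper name `load_known_tickers` is hypothetical and not part of scraper.py:

```python
import csv

def load_known_tickers(path='companies.csv'):
    # collect the tickers already recorded in the csv, skipping the header row
    with open(path, newline='') as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the "Company,Ticker" header
        return {row[1] for row in reader if len(row) > 1}

known = load_known_tickers()
# in the main loop above, a row would then only be written when the ticker is new:
# if company_tikr not in known:
#     writer.writerow([company_name, company_tikr])
print(len(known), "companies already recorded")
```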

Usage
------
1. [Create a Google Developer Account](https://console.developers.google.com/). This gives you access to Google's Drive and Sheets APIs, as well as a ton of other resources. At the time of writing, signing up also came with $300 in credit.

2. [Read the gspread docs on how to generate credentials](https://gspread.readthedocs.io/en/latest/oauth2.html). This will help with linking your worksheet to the script. Make sure you put the path to the JSON file in the `ServiceAccountCredentials.from_json_keyfile_name(...)` call in scraper.py!

3. Some parts of the script have to be personalized by the user. These sections of scraper.py are listed below.

#### Add OAuth Credentials
```python
credentials = ServiceAccountCredentials.from_json_keyfile_name('/path/to/your/credentials', scope)
```

#### Add URL to Your Spreadsheet
```python
# access sheet by url
worksheet = gc.open_by_url('URL_TO_YOUR_SPREADSHEET').get_worksheet(1) # worksheet index (0-based)
```

### Cron Job

I have set up my script to run as a cron job at 1 am on the first day of every third month.

Edit the credential-prompt lines shown below if you wish to hardcode your login credentials:

```python
# enter email and password. uses getpass to hide password (i.e. not using plaintext)
your_email = input("Please enter your email for magicformulainvesting.com: ")
your_password = getpass.getpass("Please enter your password for magicformulainvesting.com: ")
username.send_keys(your_email)
password.send_keys(your_password)
```

To run selenium from a cron job, the browser must be headless. I am using Chrome and giving it the option to run headless in my personal script. The Chrome webdriver must also be installed:

```sh
brew install --cask chromedriver
```

Add these lines to scraper.py in place of the current `driver = ...` line:

```python
options = webdriver.ChromeOptions()
options.add_argument('headless')

# declare driver as a headless Chrome instance
driver = webdriver.Chrome(executable_path="path/to/chromedriver", options=options)
```

Below is my cron job, accessed on Mac or Linux by running `crontab -e` at the terminal. I first had to give the iTerm and Terminal apps permission to read/write from my SSD.

```bash
SHELL=/bin/bash
PATH=/usr/local/bin/:/usr/bin:/usr/sbin
0 1 1 */3 * export DISPLAY=:0 && cd /path/to/scraper && /usr/bin/python3 scraper.py
```

A cron job has no terminal attached to standard input, so the interactive prompts above fail with an end-of-file error. For the cron job I have therefore hardcoded my username and password, which is bad practice; however, since this site doesn't really contain sensitive information, I'm okay with that. The script provided in this repository still uses getpass, so the password is never echoed in plaintext.
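
A safer alternative to hardcoding is reading the credentials from environment variables defined in the crontab (alongside the SHELL and PATH lines above), falling back to the interactive prompts when they are unset. A minimal sketch follows; the variable names `MFI_EMAIL` and `MFI_PASSWORD` are just examples, not something scraper.py currently reads:

```python
import os
import getpass

# read credentials from the environment when running under cron,
# e.g. MFI_EMAIL=you@example.com and MFI_PASSWORD=... lines in the crontab;
# fall back to interactive prompts when run from a terminal
your_email = os.environ.get('MFI_EMAIL') or input("Please enter your email for magicformulainvesting.com: ")
your_password = os.environ.get('MFI_PASSWORD') or getpass.getpass("Please enter your password for magicformulainvesting.com: ")
```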

Features to Implement
------
+ have a file of companies already researched/invested in, and check this list before writing to the csv or updating the Google worksheet (see the sketch after the Main Loop section above)
+ add a blank row before appending all company info to the Google worksheet
+ maybe scrape for company descriptions and add these to the spreadsheet

--------------------------------------------------------------------------------
/companies.csv:
--------------------------------------------------------------------------------
Company,Ticker
--------------------------------------------------------------------------------
/scrape.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ginglis13/magic-formula-scraper/8cf3a98793e10ea93fa275b796df03bdc31ae0cb/scrape.gif
--------------------------------------------------------------------------------
/scraper.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
# scraper.py
# web scraper for magicformulainvesting.com
# pulls company information from the site to save the time that would be spent manually typing it out
# Gavin Inglis
# January 2019

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

import zipfile
import time
import datetime
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import requests
import re
import getpass

# Get latest chromedriver zip file for mac, extract into same folder
try:
    version = requests.get('https://chromedriver.storage.googleapis.com/LATEST_RELEASE').text
    url = 'https://chromedriver.storage.googleapis.com/{0}/{1}'.format(version, 'chromedriver_mac64.zip')
    r = requests.get(url, allow_redirects=True)
    with open('chromedriver.zip', 'wb') as f:
        f.write(r.content)
    with zipfile.ZipFile("chromedriver.zip", "r") as zip_ref:
        zip_ref.extractall()
except Exception:
    # fall back to whatever chromedriver is already in this folder
    pass

'''Globals'''

GOOGLE_URL = 'http://www.google.com/search'

# scope of access for api
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

# credentials file generated by google developer console when creating sheets api
credentials = ServiceAccountCredentials.from_json_keyfile_name('PATH TO YOUR CREDENTIALS', scope)
gc = gspread.authorize(credentials)

# login url for site
url = 'https://www.magicformulainvesting.com/Account/LogOn'

options = webdriver.ChromeOptions()
options.add_argument('headless')

# declare driver as chrome headless instance
driver = webdriver.Chrome(executable_path="./chromedriver", options=options)

'''Functions'''

def scrapeSite():
    print("Scraping stock info...")  # update for terminal

    # find all rows of the results table
    trs = driver.find_elements_by_xpath('//table[@class="divheight screeningdata"]/tbody/tr')

    names = []
    tikrs = []

    for tr in trs:
        td = tr.find_elements_by_xpath(".//td")

        company_name = td[0].get_attribute("innerHTML")
        company_tikr = td[1].get_attribute("innerHTML")

        names.append(company_name)
        tikrs.append(company_tikr)

    return names, tikrs

def writeSheet(names, tikrs):
    print("Writing to sheet...")  # update to terminal

    # access sheet by url
    wks = gc.open_by_url("YOUR URL HERE").get_worksheet(1)  # worksheet index (0-based)

    # wks.append_row([' '], table_range='A1')  # append a blank line before tickers as requested by OC

    date = datetime.datetime.today().strftime('%Y-%m-%d')  # current date
    wks.append_row([date], table_range='A1')  # append the date, starts in first column

    for i in range(len(names)):
        price = '=GOOGLEFINANCE("' + tikrs[i] + '","price")'

        url = getUrl(names[i])

        # USER_ENTERED lets the GOOGLEFINANCE formula evaluate in the sheet
        wks.append_row([names[i], tikrs[i], price, url], table_range='A1', value_input_option="USER_ENTERED")  # start in first column

def getUrl(companyName):
    # let requests build and encode the query string properly
    result = requests.get(GOOGLE_URL, params={'q': companyName})
    # fancy regex courtesy of pbui
    urls = re.findall(r'/url\?q=([^&]*)', result.text)
    return urls[0] if urls else ''

'''Main Execution'''

# go to page url
driver.get(url)

# find the input elements for logging in
username = driver.find_element_by_name("Email")
password = driver.find_element_by_name("Password")

# enter email and password. uses getpass to hide password (i.e. not using plaintext)
your_email = input("Please enter your email for magicformulainvesting.com: ")
your_password = getpass.getpass("Please enter your password for magicformulainvesting.com: ")
username.send_keys(your_email)
password.send_keys(your_password)

# enter email and password (for hard coding only)
# username.send_keys("EMAIL")
# password.send_keys("PASSWORD")

# click login button
button = driver.find_element_by_name("login")
button.click()

time.sleep(1)  # seconds

# use xpathing to find the radio button element for 50 stocks and click it
radio = driver.find_element_by_xpath('//input[@value="false" and contains(@name,"Select30")]')
radio.click()

button2 = driver.find_element_by_name("stocks")
button2.click()

time.sleep(0.5)

names, tikrs = scrapeSite()

driver.quit()

writeSheet(names, tikrs)
--------------------------------------------------------------------------------