├── .gitignore
├── LICENSE.txt
├── README.md
├── cryptopanic_scraper.py
├── images
│   ├── logo.png
│   └── screenshot.png
├── jupyter
│   ├── Scratchpad.ipynb
│   └── eda.ipynb
├── requirements.txt
└── test.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
notes.txt
data/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea
.DS_Store
--------------------------------------------------------------------------------
/LICENSE.txt:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2019 Paul Mendes

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
<!-- PROJECT SHIELDS -->

[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![LinkedIn][linkedin-shield]][linkedin-url]



<!-- PROJECT LOGO -->
<br />
<p align="center">
  <a href="https://github.com/grilledchickenthighs/cryptopanic_scraper">
    <img src="images/logo.png" alt="Logo" width="80" height="80">
  </a>

  <h3 align="center">Cryptopanic Scraper</h3>

  <p align="center">
    Headless chromedriver for automatic scraping of CryptoPanic's asynchronous news feed.
    <br />
    <a href="https://github.com/grilledchickenthighs/cryptopanic_scraper"><strong>Explore the docs »</strong></a>
    <br />
    <br />
    <a href="https://github.com/grilledchickenthighs/cryptopanic_scraper/issues">Report Bug</a>
    ·
    <a href="https://github.com/grilledchickenthighs/cryptopanic_scraper/issues">Request Feature</a>
  </p>
</p>

<!-- TABLE OF CONTENTS -->
## Table of Contents

* [About the Project](#about-the-project)
  * [Built With](#built-with)
* [Getting Started](#getting-started)
  * [Prerequisites](#prerequisites)
  * [Installation](#installation)
* [Usage](#usage)
* [Roadmap](#roadmap)
* [Contributing](#contributing)
* [License](#license)
* [Contact](#contact)



<!-- ABOUT THE PROJECT -->
## About The Project

[![Product Name Screen Shot][product-screenshot]](https://cryptopanic.com/)

CryptoPanic is a crypto news aggregator that offers real-time news feeds of all things crypto,
along with community votes on each story.
This project was designed to scrape that data from the website so it can later be analyzed using NLP.

### Built With

* [Python](https://github.com/topics/python)
* [Selenium](https://github.com/topics/selenium)



<!-- GETTING STARTED -->
## Getting Started

To get a local copy up and running, follow these simple steps.

### Prerequisites

* Python 3
* pip

### Installation

1. Clone the repository
```sh
git clone https://github.com/grilledchickenthighs/cryptopanic_scraper.git
```
2. Change directory
```sh
cd cryptopanic_scraper
```
3. Install packages
```sh
pip install -r requirements.txt
```



<!-- USAGE EXAMPLES -->
## Usage
Simply run:
```sh
python cryptopanic_scraper.py --headless
```
If you want to see it in action, run the script without any flags:
```sh
python cryptopanic_scraper.py
```
If you want to filter the type of news to scrape, add the `--filter` flag and choose
a type: {all,hot,rising,bullish,bearish,lol,commented,important,saved}
```sh
python cryptopanic_scraper.py --filter hot
```
You can always use the `--help` flag if you forget these commands:
```sh
python cryptopanic_scraper.py --help

usage: cryptopanic_scraper.py [-h] [-v]
                              [-f {all,hot,rising,bullish,bearish,lol,commented,important,saved}]
                              [-s] [-l LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase output verbosity
  -f {all,hot,rising,bullish,bearish,lol,commented,important,saved}, --filter {all,hot,rising,bullish,bearish,lol,commented,important,saved}
                        Type of News filter
  -s, --headless        Run Chrome driver headless
  -l LIMIT, --limit LIMIT
                        Maximum number of news rows to scrape
```

If you're interested in analyzing the data:

Please feel free to check out the [jupyter](https://github.com/GrilledChickenThighs/cryptopanic_scraper/tree/master/jupyter) directory for getting started.
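For a quick start, here is a minimal sketch of loading one of the saved pickles into pandas (the file name below is only an example; `saveData()` builds it from your filter and the scraped date range, so substitute a file from your own `data/` directory):

```python
import pandas as pd

# Load a pickle produced by cryptopanic_scraper.py (example file name).
data = pd.read_pickle("data/cryptopanic_all_2019-02-01->2019-02-05.pickle")

# The scraper saves a dict of row dicts keyed by integer index,
# so transpose after loading to get one news item per row.
df = pd.DataFrame(data).T
print(df.head())
```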
## Roadmap

See the [open issues](https://github.com/grilledchickenthighs/cryptopanic_scraper/issues) for a list of proposed features (and known issues).



<!-- CONTRIBUTING -->
## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request



<!-- LICENSE -->
## License

Distributed under the MIT License. See `LICENSE.txt` for more information.



<!-- CONTACT -->
## Contact

[Paul Mendes](https://grilledchickenthighs.github.io/) - [@BTCTradeNation](https://twitter.com/BTCTradeNation) - [paulseperformance@gmail.com](mailto:paulseperformance@gmail.com)

Project Link: [https://github.com/grilledchickenthighs/cryptopanic_scraper](https://github.com/grilledchickenthighs/cryptopanic_scraper)



<!-- MARKDOWN LINKS & IMAGES -->
[contributors-shield]: https://img.shields.io/github/contributors/grilledchickenthighs/cryptopanic_scraper?style=flat-square
[contributors-url]: https://github.com/GrilledChickenThighs/cryptopanic_scraper/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/grilledchickenthighs/cryptopanic_scraper?style=flat-square
[forks-url]: https://github.com/GrilledChickenThighs/cryptopanic_scraper/network/members
[stars-shield]: https://img.shields.io/github/stars/grilledchickenthighs/cryptopanic_scraper?style=flat-square
[stars-url]: https://github.com/grilledchickenthighs/cryptopanic_scraper/stargazers
[issues-shield]: https://img.shields.io/github/issues/grilledchickenthighs/cryptopanic_scraper.svg?style=flat-square
[issues-url]: https://github.com/grilledchickenthighs/cryptopanic_scraper/issues
[license-shield]: https://img.shields.io/github/license/grilledchickenthighs/cryptopanic_scraper.svg?style=flat-square
[license-url]: https://github.com/grilledchickenthighs/cryptopanic_scraper/blob/master/LICENSE.txt
[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555
[linkedin-url]: https://linkedin.com/in/paul-mendes
[product-screenshot]: images/screenshot.png
--------------------------------------------------------------------------------
/cryptopanic_scraper.py:
--------------------------------------------------------------------------------
from selenium import webdriver
import os
import time
import datetime
import re
import pickle
import urllib.parse
import argparse
from webdriver_manager.chrome import ChromeDriverManager
import pathlib

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", help="increase output verbosity",
                    action="store_true")

parser.add_argument("-f", "--filter",
                    help="Type of News filter",
                    default="all",
                    choices=["all", "hot", "rising", "bullish",
                             "bearish", "lol", "commented", "important", "saved"])

parser.add_argument("-s", "--headless", help="Run Chrome driver headless",
                    action="store_true")

parser.add_argument("-l", "--limit", help="Maximum number of news rows to scrape",
                    type=int, default=None)


# parse_known_args so the module can be imported (e.g. by test.py) without
# choking on argv entries meant for the host process.
args, _ = parser.parse_known_args()

if args.verbose:
    print("verbosity turned on")

# TODO: Create logger for exception handling
# TODO: Replace print with logger
# TODO: Create bash script or cron job to automate this script


SCROLL_PAUSE_TIME = 1


def setUp():

    url = "https://www.cryptopanic.com/news?filter={}".format(args.filter)

    options = webdriver.ChromeOptions()

    # initialize headless mode
    if args.headless:
        options.add_argument('--headless')

    # Don't load images
    prefs = {"profile.managed_default_content_settings.images": 2}
    options.add_experimental_option("prefs", prefs)

    # Set the window size
    options.add_argument('--window-size=1200,800')

    # initialize the driver
    print("Initializing chromedriver.\n")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

    print("Navigating to %s\n" % url)
    driver.get(url)

    # wait up to 2.5 seconds for elements to become available
    driver.implicitly_wait(2.5)

    return driver


def loadMore(len_elements):
    # Infinite scroll

    # Load More News
    load_more = driver.find_element_by_class_name('btn-outline-primary')
    driver.execute_script("arguments[0].scrollIntoView();", load_more)

    time.sleep(SCROLL_PAUSE_TIME)

    elements = driver.find_elements_by_css_selector('div.news-row.news-row-link')
    if len_elements < len(elements):
        if args.verbose:
            print("Loading %s more rows" % (len(elements) - len_elements))
        return True
    else:
        if args.verbose:
            print("No more rows to load :/")
        print("Total rows loaded: %s\n" % len(elements))
        return False


def getData():
    data = dict()
    elements = driver.find_elements_by_css_selector('div.news-row.news-row-link')

    total_rows = len(elements) - 7  # the element list repeats the first 7 rows at the end, so exclude them.
    print("Downloading Data...\n")
    start = datetime.datetime.now()
    print("Time Start: %s\n" % start)

    for i in range(total_rows):
        if args.limit is not None and i >= args.limit:
            print(f'Limit argument of {args.limit} hit.')
            break
        time.sleep(.5)  # Busy sleep to keep the cpu cool
        try:
            # Get date posted
            date_time = elements[i].find_element_by_css_selector('time').get_attribute('datetime')
            # string_date = re.sub('-.*', '', date_time)
            # date_time = datetime.datetime.strptime(string_date, "%a %b %d %Y %H:%M:%S %Z")

            # Get Title of News, scrolling the row into view if it has not rendered yet
            title = elements[i].find_element_by_css_selector("span.title-text span:nth-child(1)").text
            if title == '':
                driver.execute_script("arguments[0].scrollIntoView();",
                                      elements[i].find_element_by_css_selector("span.title-text"))
                title = elements[i].find_element_by_css_selector("span.title-text span:nth-child(1)").text

            # Get Source URL
            elements[i].find_element_by_css_selector("a.news-cell.nc-title").click()
            source_name = elements[i].find_element_by_css_selector("span.si-source-name").text
            source_link = driver.find_element_by_xpath("//div/h1/a[2]").get_property('href')
            source_url = re.sub(".*=", '', urllib.parse.unquote(source_link))
            driver.back()

            # Get Currency Tags
            currencies = []
            currency_elements = elements[i].find_elements_by_class_name("colored-link")
            for currency in currency_elements:
                currencies.append(currency.text)

            # Parse vote tooltips such as "12 positive votes" into {"positive": 12}
            votes = dict()
            nc_votes = elements[i].find_elements_by_css_selector("span.nc-vote-cont")
            for nc_vote in nc_votes:
                vote = nc_vote.get_attribute('title')
                value = vote[:2]
                action = vote.replace(value, '').replace('votes', '').strip()
                votes[action] = int(value)

            data[i] = {"Date": date_time,
                       "Title": title,
                       "Currencies": currencies,
                       "Votes": votes,
                       "Source": source_name,
                       "URL": source_url}
            if args.verbose:
                print("Downloaded %s of %s\nPublished: %s\nTitle: %s\nSource: %s\nURL: %s\n" % (i + 1,
                                                                                                total_rows,
                                                                                                data[i]["Date"],
                                                                                                data[i]["Title"],
                                                                                                data[i]["Source"],
                                                                                                data[i]["URL"]))
        except Exception as e:
            print(e)
            raise

    print("Finished gathering %s rows of data\n" % len(data))
    print("Time End: %.19s" % datetime.datetime.now())
    print("Elapsed Time Gathering Data: %.7s\n" % (datetime.datetime.now() - start))

    return data


def saveData(data):
    # Save the website data
    file_name = "cryptopanic_{}_{:.10}->{:.10}.pickle".format(args.filter.lower(),
                                                              str(data[len(data) - 1]['Date']),
                                                              str(data[0]['Date']))
    # Make sure the data directory exists; if not, make one.
    pathlib.Path("data").mkdir(parents=True, exist_ok=True)

    with open(os.path.join(os.getcwd(), 'data', file_name), 'wb') as f:
        pickle.dump(data, f)

    print("Saved data to %s\n" % file_name)


def tearDown():
    if args.verbose:
        print("Exiting Chrome Driver")
    driver.quit()


if __name__ == "__main__":
    driver = setUp()
    if args.limit is not None:
        data_limit = args.limit
    else:
        data_limit = 100000  # Effectively no limit.
    print("Loading News Feed...\n")
    while True:

        elements = driver.find_elements_by_css_selector('div.news-row.news-row-link')

        if len(elements) <= data_limit and loadMore(len(elements)):
            continue
        else:
            data = getData()
            saveData(data)
            tearDown()
            break
--------------------------------------------------------------------------------
/images/logo.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pAulseperformance/cryptopanic_scraper/a2784d9d1297bd8f1ec5ea32c33f671b35ea85a7/images/logo.png
--------------------------------------------------------------------------------
/images/screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/pAulseperformance/cryptopanic_scraper/a2784d9d1297bd8f1ec5ea32c33f671b35ea85a7/images/screenshot.png
--------------------------------------------------------------------------------
/jupyter/Scratchpad.ipynb:
--------------------------------------------------------------------------------
1 | {
2 |  "cells": [
3 |   {
4 |    "cell_type": "code",
5 |    "execution_count": 1,
6 |    "metadata": {},
7 |    "outputs": [],
8 |    "source": [
9 |     "# df['Currencies'].apply(', '.join)"
10 |    ]
11 |   },
12 |   {
13 |    "cell_type": "code",
14 |    "execution_count": 2,
15 |    "metadata": {},
16 |    "outputs": [
17 |     {
18 |      "data": {
19 |       "text/html": [
20 |        "
\n", 22 | " \n", 23 | " Loading BokehJS ...\n", 24 | "
" 25 | ] 26 | }, 27 | "metadata": {}, 28 | "output_type": "display_data" 29 | }, 30 | { 31 | "data": { 32 | "application/javascript": [ 33 | "\n", 34 | "(function(root) {\n", 35 | " function now() {\n", 36 | " return new Date();\n", 37 | " }\n", 38 | "\n", 39 | " var force = true;\n", 40 | "\n", 41 | " if (typeof (root._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n", 42 | " root._bokeh_onload_callbacks = [];\n", 43 | " root._bokeh_is_loading = undefined;\n", 44 | " }\n", 45 | "\n", 46 | " var JS_MIME_TYPE = 'application/javascript';\n", 47 | " var HTML_MIME_TYPE = 'text/html';\n", 48 | " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", 49 | " var CLASS_NAME = 'output_bokeh rendered_html';\n", 50 | "\n", 51 | " /**\n", 52 | " * Render data to the DOM node\n", 53 | " */\n", 54 | " function render(props, node) {\n", 55 | " var script = document.createElement(\"script\");\n", 56 | " node.appendChild(script);\n", 57 | " }\n", 58 | "\n", 59 | " /**\n", 60 | " * Handle when an output is cleared or removed\n", 61 | " */\n", 62 | " function handleClearOutput(event, handle) {\n", 63 | " var cell = handle.cell;\n", 64 | "\n", 65 | " var id = cell.output_area._bokeh_element_id;\n", 66 | " var server_id = cell.output_area._bokeh_server_id;\n", 67 | " // Clean up Bokeh references\n", 68 | " if (id != null && id in Bokeh.index) {\n", 69 | " Bokeh.index[id].model.document.clear();\n", 70 | " delete Bokeh.index[id];\n", 71 | " }\n", 72 | "\n", 73 | " if (server_id !== undefined) {\n", 74 | " // Clean up Bokeh references\n", 75 | " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", 76 | " cell.notebook.kernel.execute(cmd, {\n", 77 | " iopub: {\n", 78 | " output: function(msg) {\n", 79 | " var id = msg.content.text.trim();\n", 80 | " if (id in Bokeh.index) {\n", 81 | " Bokeh.index[id].model.document.clear();\n", 82 | " delete Bokeh.index[id];\n", 83 | " }\n", 84 | " }\n", 85 | " }\n", 86 | " });\n", 87 | " // Destroy server and session\n", 88 | " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", 89 | " cell.notebook.kernel.execute(cmd);\n", 90 | " }\n", 91 | " }\n", 92 | "\n", 93 | " /**\n", 94 | " * Handle when a new output is added\n", 95 | " */\n", 96 | " function handleAddOutput(event, handle) {\n", 97 | " var output_area = handle.output_area;\n", 98 | " var output = handle.output;\n", 99 | "\n", 100 | " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", 101 | " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", 102 | " return\n", 103 | " }\n", 104 | "\n", 105 | " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", 106 | "\n", 107 | " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", 108 | " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", 109 | " // store reference to embed id on output_area\n", 110 | " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", 111 | " }\n", 112 | " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", 113 | " var bk_div = document.createElement(\"div\");\n", 114 | " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", 115 | " var script_attrs = bk_div.children[0].attributes;\n", 116 | " for (var i = 0; i < script_attrs.length; i++) {\n", 117 | " toinsert[toinsert.length - 
1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", 118 | " }\n", 119 | " // store reference to server id on output_area\n", 120 | " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", 121 | " }\n", 122 | " }\n", 123 | "\n", 124 | " function register_renderer(events, OutputArea) {\n", 125 | "\n", 126 | " function append_mime(data, metadata, element) {\n", 127 | " // create a DOM node to render to\n", 128 | " var toinsert = this.create_output_subarea(\n", 129 | " metadata,\n", 130 | " CLASS_NAME,\n", 131 | " EXEC_MIME_TYPE\n", 132 | " );\n", 133 | " this.keyboard_manager.register_events(toinsert);\n", 134 | " // Render to node\n", 135 | " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", 136 | " render(props, toinsert[toinsert.length - 1]);\n", 137 | " element.append(toinsert);\n", 138 | " return toinsert\n", 139 | " }\n", 140 | "\n", 141 | " /* Handle when an output is cleared or removed */\n", 142 | " events.on('clear_output.CodeCell', handleClearOutput);\n", 143 | " events.on('delete.Cell', handleClearOutput);\n", 144 | "\n", 145 | " /* Handle when a new output is added */\n", 146 | " events.on('output_added.OutputArea', handleAddOutput);\n", 147 | "\n", 148 | " /**\n", 149 | " * Register the mime type and append_mime function with output_area\n", 150 | " */\n", 151 | " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", 152 | " /* Is output safe? */\n", 153 | " safe: true,\n", 154 | " /* Index of renderer in `output_area.display_order` */\n", 155 | " index: 0\n", 156 | " });\n", 157 | " }\n", 158 | "\n", 159 | " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", 160 | " if (root.Jupyter !== undefined) {\n", 161 | " var events = require('base/js/events');\n", 162 | " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", 163 | "\n", 164 | " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", 165 | " register_renderer(events, OutputArea);\n", 166 | " }\n", 167 | " }\n", 168 | "\n", 169 | " \n", 170 | " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", 171 | " root._bokeh_timeout = Date.now() + 5000;\n", 172 | " root._bokeh_failed_load = false;\n", 173 | " }\n", 174 | "\n", 175 | " var NB_LOAD_WARNING = {'data': {'text/html':\n", 176 | " \"
\\n\"+\n", 177 | " \"

\\n\"+\n", 178 | " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", 179 | " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", 180 | " \"

\\n\"+\n", 181 | " \"\\n\"+\n", 185 | " \"\\n\"+\n", 186 | " \"from bokeh.resources import INLINE\\n\"+\n", 187 | " \"output_notebook(resources=INLINE)\\n\"+\n", 188 | " \"\\n\"+\n", 189 | " \"
\"}};\n", 190 | "\n", 191 | " function display_loaded() {\n", 192 | " var el = document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\");\n", 193 | " if (el != null) {\n", 194 | " el.textContent = \"BokehJS is loading...\";\n", 195 | " }\n", 196 | " if (root.Bokeh !== undefined) {\n", 197 | " if (el != null) {\n", 198 | " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", 199 | " }\n", 200 | " } else if (Date.now() < root._bokeh_timeout) {\n", 201 | " setTimeout(display_loaded, 100)\n", 202 | " }\n", 203 | " }\n", 204 | "\n", 205 | "\n", 206 | " function run_callbacks() {\n", 207 | " try {\n", 208 | " root._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", 209 | " }\n", 210 | " finally {\n", 211 | " delete root._bokeh_onload_callbacks\n", 212 | " }\n", 213 | " console.info(\"Bokeh: all callbacks have finished\");\n", 214 | " }\n", 215 | "\n", 216 | " function load_libs(js_urls, callback) {\n", 217 | " root._bokeh_onload_callbacks.push(callback);\n", 218 | " if (root._bokeh_is_loading > 0) {\n", 219 | " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", 220 | " return null;\n", 221 | " }\n", 222 | " if (js_urls == null || js_urls.length === 0) {\n", 223 | " run_callbacks();\n", 224 | " return null;\n", 225 | " }\n", 226 | " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", 227 | " root._bokeh_is_loading = js_urls.length;\n", 228 | " for (var i = 0; i < js_urls.length; i++) {\n", 229 | " var url = js_urls[i];\n", 230 | " var s = document.createElement('script');\n", 231 | " s.src = url;\n", 232 | " s.async = false;\n", 233 | " s.onreadystatechange = s.onload = function() {\n", 234 | " root._bokeh_is_loading--;\n", 235 | " if (root._bokeh_is_loading === 0) {\n", 236 | " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", 237 | " run_callbacks()\n", 238 | " }\n", 239 | " };\n", 240 | " s.onerror = function() {\n", 241 | " console.warn(\"failed to load library \" + url);\n", 242 | " };\n", 243 | " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", 244 | " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", 245 | " }\n", 246 | " };var element = document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\");\n", 247 | " if (element == null) {\n", 248 | " console.log(\"Bokeh: ERROR: autoload.js configured with elementid 'eaa22502-2ee8-4116-b1ce-02e804e16a5a' but no matching script tag was found. 
\")\n", 249 | " return false;\n", 250 | " }\n", 251 | "\n", 252 | " var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-0.13.0.min.js\"];\n", 253 | "\n", 254 | " var inline_js = [\n", 255 | " function(Bokeh) {\n", 256 | " Bokeh.set_log_level(\"info\");\n", 257 | " },\n", 258 | " \n", 259 | " function(Bokeh) {\n", 260 | " \n", 261 | " },\n", 262 | " function(Bokeh) {\n", 263 | " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.css\");\n", 264 | " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.css\");\n", 265 | " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.css\");\n", 266 | " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.css\");\n", 267 | " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.css\");\n", 268 | " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.css\");\n", 269 | " }\n", 270 | " ];\n", 271 | "\n", 272 | " function run_inline_js() {\n", 273 | " \n", 274 | " if ((root.Bokeh !== undefined) || (force === true)) {\n", 275 | " for (var i = 0; i < inline_js.length; i++) {\n", 276 | " inline_js[i].call(root, root.Bokeh);\n", 277 | " }if (force === true) {\n", 278 | " display_loaded();\n", 279 | " }} else if (Date.now() < root._bokeh_timeout) {\n", 280 | " setTimeout(run_inline_js, 100);\n", 281 | " } else if (!root._bokeh_failed_load) {\n", 282 | " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", 283 | " root._bokeh_failed_load = true;\n", 284 | " } else if (force !== true) {\n", 285 | " var cell = $(document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\")).parents('.cell').data().cell;\n", 286 | " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", 287 | " }\n", 288 | "\n", 289 | " }\n", 290 | "\n", 291 | " if (root._bokeh_is_loading === 0) {\n", 292 | " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", 293 | " run_inline_js();\n", 294 | " } else {\n", 295 | " load_libs(js_urls, function() {\n", 296 | " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", 297 | " run_inline_js();\n", 298 | " });\n", 299 | " }\n", 300 | "}(window));" 301 | ], 302 | "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof (root._bokeh_onload_callbacks) === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n }\n finally {\n delete root._bokeh_onload_callbacks\n }\n console.info(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(js_urls, callback) {\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = js_urls.length;\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var s = document.createElement('script');\n s.src = url;\n s.async = false;\n s.onreadystatechange = s.onload = function() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.log(\"Bokeh: all BokehJS libraries loaded\");\n run_callbacks()\n }\n };\n s.onerror = function() {\n console.warn(\"failed to load library \" + url);\n };\n console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.getElementsByTagName(\"head\")[0].appendChild(s);\n }\n };var element = document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\");\n if (element == null) {\n console.log(\"Bokeh: ERROR: autoload.js configured with elementid 'eaa22502-2ee8-4116-b1ce-02e804e16a5a' but no matching script tag was found. 
\")\n return false;\n }\n\n var js_urls = [\"https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.js\", \"https://cdn.pydata.org/bokeh/release/bokeh-gl-0.13.0.min.js\"];\n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n \n function(Bokeh) {\n \n },\n function(Bokeh) {\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.13.0.min.css\");\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.13.0.min.css\");\n console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.css\");\n Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-tables-0.13.0.min.css\");\n }\n ];\n\n function run_inline_js() {\n \n if ((root.Bokeh !== undefined) || (force === true)) {\n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"eaa22502-2ee8-4116-b1ce-02e804e16a5a\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(js_urls, function() {\n console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" 303 | }, 304 | "metadata": {}, 305 | "output_type": "display_data" 306 | } 307 | ], 308 | "source": [ 309 | "# Now that the data is cleaned and we have a backup of our df let's explore\n", 310 | "from bokeh.plotting import figure, output_notebook, show\n", 311 | "output_notebook()\n", 312 | "from bokeh.models import ColumnDataSource\n", 313 | "from bokeh.models.tools import HoverTool\n", 314 | "from bokeh.transform import factor_cmap\n", 315 | "from bokeh.palettes import Spectral5, Spectral3, inferno, viridis, Category20" 316 | ] 317 | }, 318 | { 319 | "cell_type": "code", 320 | "execution_count": 3, 321 | "metadata": {}, 322 | "outputs": [ 323 | { 324 | "ename": "NameError", 325 | "evalue": "name 'df' is not defined", 326 | "output_type": "error", 327 | "traceback": [ 328 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 329 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 330 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0msource\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mColumnDataSource\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mtypes\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Currencies'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munique\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtolist\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0mcolor_map\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfactor_cmap\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfield_name\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'Currencies'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpalette\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mviridis\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m18\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfactors\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mtypes\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 331 | "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined" 332 | ] 333 | } 334 | ], 335 | "source": [ 336 | "source = ColumnDataSource(df)\n", 337 | "types = df['Currencies'].unique().tolist()\n", 338 | "color_map = factor_cmap(field_name='Currencies', palette=viridis(18), factors=types)" 339 | ] 340 | }, 341 | { 342 | "cell_type": "code", 343 | "execution_count": 4, 344 | "metadata": {}, 345 | "outputs": [ 346 | { 347 | "ename": "NameError", 348 | "evalue": "name 'source' is not defined", 349 | "output_type": "error", 350 | "traceback": [ 351 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 352 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 353 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0mp\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfigure\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx_axis_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'datetime'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcircle\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'Date'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'positive'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msource\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0msource\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msize\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcolor_map\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0;31m# p.title.text = 'Pokemon Attack vs Speed'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 354 | "\u001b[0;31mNameError\u001b[0m: name 'source' is not defined" 355 | ] 356 | } 357 | ], 358 | "source": [ 359 | "p = figure(x_axis_type='datetime')\n", 360 | "\n", 361 | "p.circle(x='Date', y='positive', source=source, size=10, color=color_map)\n", 362 | "\n", 363 | "# p.title.text = 'Pokemon Attack vs Speed'\n", 364 | "# p.xaxis.axis_label = 'Attacking Stats'\n", 365 | "# p.yaxis.axis_label = 'Speed Stats'\n", 366 | "\n", 367 | "hover = HoverTool()\n", 368 | "hover.tooltips=[\n", 369 | " ('Positive', '@positive'),\n", 370 | " ('Negative', '@negative'),\n", 371 | " ('Important', '@{important}'),\n", 372 | " ('Title', '@Title'),\n", 373 | "]\n", 374 | "\n", 375 | "p.add_tools(hover)\n", 376 | "\n", 377 | "show(p)" 378 | ] 379 | }, 380 | { 381 | "cell_type": "code", 382 | 
"execution_count": 5, 383 | "metadata": {}, 384 | "outputs": [ 385 | { 386 | "ename": "NameError", 387 | "evalue": "name 'df' is not defined", 388 | "output_type": "error", 389 | "traceback": [ 390 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 391 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 392 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mattribs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroupby\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Currencies'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'positive'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", 393 | "\u001b[0;31mNameError\u001b[0m: name 'df' is not defined" 394 | ] 395 | } 396 | ], 397 | "source": [ 398 | "attribs = df.groupby('Currencies')['positive'].mean()" 399 | ] 400 | }, 401 | { 402 | "cell_type": "code", 403 | "execution_count": 6, 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "ename": "NameError", 408 | "evalue": "name 'pd' is not defined", 409 | "output_type": "error", 410 | "traceback": [ 411 | "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", 412 | "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", 413 | "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mconcat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdrop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Currencies'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Currencies'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mSeries\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfillna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdescribe\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 414 | "\u001b[0;31mNameError\u001b[0m: name 'pd' is not defined" 415 | ] 416 | } 417 | ], 418 | "source": [ 419 | "df = pd.concat([df.drop(['Currencies'], axis=1), df['Currencies'].apply(pd.Series)], axis=1).fillna(0)\n", 420 | "df.describe()" 421 | ] 422 | }, 423 | { 424 | "cell_type": "code", 425 | "execution_count": null, 426 | "metadata": {}, 427 | "outputs": [], 428 | "source": [] 429 | } 430 | ], 431 | "metadata": { 432 | "kernelspec": { 433 | "display_name": "Python 3", 434 | "language": "python", 435 | "name": "python3" 436 | }, 437 | "language_info": { 438 | "codemirror_mode": { 439 | "name": "ipython", 440 | "version": 3 441 | }, 442 | "file_extension": ".py", 443 | "mimetype": "text/x-python", 444 | "name": "python", 445 | "nbconvert_exporter": "python", 446 | "pygments_lexer": "ipython3", 447 
| "version": "3.6.6" 448 | } 449 | }, 450 | "nbformat": 4, 451 | "nbformat_minor": 2 452 | } 453 | -------------------------------------------------------------------------------- /jupyter/eda.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 3, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "# I want to look at how each cryptocurrency compares with the number of votes it recieves.\n", 10 | "# I want to look at how big of a response each source gets" 11 | ] 12 | }, 13 | { 14 | "cell_type": "code", 15 | "execution_count": 4, 16 | "metadata": {}, 17 | "outputs": [ 18 | { 19 | "data": { 20 | "text/html": [ 21 | "
\n", 22 | "\n", 35 | "\n", 36 | " \n", 37 | " \n", 38 | " \n", 39 | " \n", 40 | " \n", 41 | " \n", 42 | " \n", 43 | " \n", 44 | " \n", 45 | " \n", 46 | " \n", 47 | " \n", 48 | " \n", 49 | " \n", 50 | " \n", 51 | " \n", 52 | " \n", 53 | " \n", 54 | " \n", 55 | " \n", 56 | " \n", 57 | " \n", 58 | " \n", 59 | " \n", 60 | " \n", 61 | " \n", 62 | " \n", 63 | " \n", 64 | " \n", 65 | " \n", 66 | " \n", 67 | " \n", 68 | " \n", 69 | " \n", 70 | " \n", 71 | " \n", 72 | " \n", 73 | " \n", 74 | " \n", 75 | " \n", 76 | " \n", 77 | " \n", 78 | " \n", 79 | " \n", 80 | " \n", 81 | " \n", 82 | " \n", 83 | " \n", 84 | " \n", 85 | " \n", 86 | " \n", 87 | " \n", 88 | " \n", 89 | " \n", 90 | " \n", 91 | " \n", 92 | " \n", 93 | " \n", 94 | "
CurrenciesDateSourceTitleURLVotes
0[]2019-02-05 21:17:27cryptoglobe.comCrypto-Fund Assets at All-Time High Despite Be...https://www.cryptoglobe.com/latest/2019/02/cry...{}
1[BTC]2019-02-05 21:07:00cointelegraph.comCrypto Firm Accused of Fraud, Duping Investor ...https://cointelegraph.com/news/crypto-firm-acc...{}
2[]2019-02-05 20:20:05cryptoslate.comAnalysis of UAE and Saudi Arabia’s Government ...https://cryptoslate.com/uae-saudi-arabia-launc...{}
3[]2019-02-05 20:00:54bitcoinist.comBottom Feeders’ Time to Shinehttps://bitcoinist.com/bottom-feeders-time-to-...{}
4[BTC]2019-02-05 19:56:00cointelegraph.comUS Trading Platform LedgerX Introduces Binary ...https://cointelegraph.com/news/us-trading-plat...{'positive': 1, 'like': 1}
\n", 95 | "
" 96 | ], 97 | "text/plain": [ 98 | " Currencies Date Source \\\n", 99 | "0 [] 2019-02-05 21:17:27 cryptoglobe.com \n", 100 | "1 [BTC] 2019-02-05 21:07:00 cointelegraph.com \n", 101 | "2 [] 2019-02-05 20:20:05 cryptoslate.com \n", 102 | "3 [] 2019-02-05 20:00:54 bitcoinist.com \n", 103 | "4 [BTC] 2019-02-05 19:56:00 cointelegraph.com \n", 104 | "\n", 105 | " Title \\\n", 106 | "0 Crypto-Fund Assets at All-Time High Despite Be... \n", 107 | "1 Crypto Firm Accused of Fraud, Duping Investor ... \n", 108 | "2 Analysis of UAE and Saudi Arabia’s Government ... \n", 109 | "3 Bottom Feeders’ Time to Shine \n", 110 | "4 US Trading Platform LedgerX Introduces Binary ... \n", 111 | "\n", 112 | " URL \\\n", 113 | "0 https://www.cryptoglobe.com/latest/2019/02/cry... \n", 114 | "1 https://cointelegraph.com/news/crypto-firm-acc... \n", 115 | "2 https://cryptoslate.com/uae-saudi-arabia-launc... \n", 116 | "3 https://bitcoinist.com/bottom-feeders-time-to-... \n", 117 | "4 https://cointelegraph.com/news/us-trading-plat... \n", 118 | "\n", 119 | " Votes \n", 120 | "0 {} \n", 121 | "1 {} \n", 122 | "2 {} \n", 123 | "3 {} \n", 124 | "4 {'positive': 1, 'like': 1} " 125 | ] 126 | }, 127 | "execution_count": 4, 128 | "metadata": {}, 129 | "output_type": "execute_result" 130 | } 131 | ], 132 | "source": [ 133 | "import pandas as pd\n", 134 | "\n", 135 | "# Read pickle and transform dataframe\n", 136 | "\n", 137 | "data = pd.read_pickle(\"../data/cryptopanic_all_2019-02-01->2019-02-05.pickle\")\n", 138 | "df = pd.DataFrame(data)\n", 139 | "df = df.T\n", 140 | "df.head()" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": 5, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "# Make Datetime column the index\n", 150 | "df.index = df['Date']\n", 151 | "df.drop(columns='Date', inplace=True)" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": 6, 157 | "metadata": {}, 158 | "outputs": [ 159 | { 160 | "data": { 161 | "text/html": [ 162 | "
\n", 163 | "\n", 176 | "\n", 177 | " \n", 178 | " \n", 179 | " \n", 180 | " \n", 181 | " \n", 182 | " \n", 183 | " \n", 184 | " \n", 185 | " \n", 186 | " \n", 187 | " \n", 188 | " \n", 189 | " \n", 190 | " \n", 191 | " \n", 192 | " \n", 193 | " \n", 194 | " \n", 195 | " \n", 196 | " \n", 197 | " \n", 198 | " \n", 199 | " \n", 200 | " \n", 201 | " \n", 202 | " \n", 203 | " \n", 204 | " \n", 205 | " \n", 206 | " \n", 207 | " \n", 208 | " \n", 209 | " \n", 210 | " \n", 211 | " \n", 212 | " \n", 213 | " \n", 214 | " \n", 215 | " \n", 216 | " \n", 217 | " \n", 218 | " \n", 219 | " \n", 220 | " \n", 221 | " \n", 222 | " \n", 223 | " \n", 224 | " \n", 225 | " \n", 226 | " \n", 227 | " \n", 228 | " \n", 229 | " \n", 230 | " \n", 231 | " \n", 232 | " \n", 233 | " \n", 234 | " \n", 235 | " \n", 236 | " \n", 237 | " \n", 238 | " \n", 239 | " \n", 240 | " \n", 241 | " \n", 242 | " \n", 243 | " \n", 244 | " \n", 245 | " \n", 246 | " \n", 247 | " \n", 248 | " \n", 249 | " \n", 250 | " \n", 251 | " \n", 252 | " \n", 253 | " \n", 254 | " \n", 255 | " \n", 256 | " \n", 257 | " \n", 258 | " \n", 259 | " \n", 260 | " \n", 261 | " \n", 262 | " \n", 263 | " \n", 264 | " \n", 265 | " \n", 266 | " \n", 267 | " \n", 268 | " \n", 269 | " \n", 270 | " \n", 271 | " \n", 272 | " \n", 273 | " \n", 274 | " \n", 275 | " \n", 276 | " \n", 277 | " \n", 278 | " \n", 279 | " \n", 280 | " \n", 281 | " \n", 282 | " \n", 283 | " \n", 284 | " \n", 285 | " \n", 286 | " \n", 287 | " \n", 288 | " \n", 289 | " \n", 290 | " \n", 291 | " \n", 292 | " \n", 293 | "
CurrenciesSourceTitleURLcommentsdislikeimportantlikelolnegativepositivesavesaves
Date
2019-02-05 21:17:27[]cryptoglobe.comCrypto-Fund Assets at All-Time High Despite Be...https://www.cryptoglobe.com/latest/2019/02/cry...0.00.00.00.00.00.00.00.00.0
2019-02-05 21:07:00[BTC]cointelegraph.comCrypto Firm Accused of Fraud, Duping Investor ...https://cointelegraph.com/news/crypto-firm-acc...0.00.00.00.00.00.00.00.00.0
2019-02-05 20:20:05[]cryptoslate.comAnalysis of UAE and Saudi Arabia’s Government ...https://cryptoslate.com/uae-saudi-arabia-launc...0.00.00.00.00.00.00.00.00.0
2019-02-05 20:00:54[]bitcoinist.comBottom Feeders’ Time to Shinehttps://bitcoinist.com/bottom-feeders-time-to-...0.00.00.00.00.00.00.00.00.0
2019-02-05 19:56:00[BTC]cointelegraph.comUS Trading Platform LedgerX Introduces Binary ...https://cointelegraph.com/news/us-trading-plat...0.00.00.01.00.00.01.00.00.0
\n", 294 | "
" 295 | ], 296 | "text/plain": [ 297 | " Currencies Source \\\n", 298 | "Date \n", 299 | "2019-02-05 21:17:27 [] cryptoglobe.com \n", 300 | "2019-02-05 21:07:00 [BTC] cointelegraph.com \n", 301 | "2019-02-05 20:20:05 [] cryptoslate.com \n", 302 | "2019-02-05 20:00:54 [] bitcoinist.com \n", 303 | "2019-02-05 19:56:00 [BTC] cointelegraph.com \n", 304 | "\n", 305 | " Title \\\n", 306 | "Date \n", 307 | "2019-02-05 21:17:27 Crypto-Fund Assets at All-Time High Despite Be... \n", 308 | "2019-02-05 21:07:00 Crypto Firm Accused of Fraud, Duping Investor ... \n", 309 | "2019-02-05 20:20:05 Analysis of UAE and Saudi Arabia’s Government ... \n", 310 | "2019-02-05 20:00:54 Bottom Feeders’ Time to Shine \n", 311 | "2019-02-05 19:56:00 US Trading Platform LedgerX Introduces Binary ... \n", 312 | "\n", 313 | " URL \\\n", 314 | "Date \n", 315 | "2019-02-05 21:17:27 https://www.cryptoglobe.com/latest/2019/02/cry... \n", 316 | "2019-02-05 21:07:00 https://cointelegraph.com/news/crypto-firm-acc... \n", 317 | "2019-02-05 20:20:05 https://cryptoslate.com/uae-saudi-arabia-launc... \n", 318 | "2019-02-05 20:00:54 https://bitcoinist.com/bottom-feeders-time-to-... \n", 319 | "2019-02-05 19:56:00 https://cointelegraph.com/news/us-trading-plat... \n", 320 | "\n", 321 | " comments dislike important like lol negative \\\n", 322 | "Date \n", 323 | "2019-02-05 21:17:27 0.0 0.0 0.0 0.0 0.0 0.0 \n", 324 | "2019-02-05 21:07:00 0.0 0.0 0.0 0.0 0.0 0.0 \n", 325 | "2019-02-05 20:20:05 0.0 0.0 0.0 0.0 0.0 0.0 \n", 326 | "2019-02-05 20:00:54 0.0 0.0 0.0 0.0 0.0 0.0 \n", 327 | "2019-02-05 19:56:00 0.0 0.0 0.0 1.0 0.0 0.0 \n", 328 | "\n", 329 | " positive save saves \n", 330 | "Date \n", 331 | "2019-02-05 21:17:27 0.0 0.0 0.0 \n", 332 | "2019-02-05 21:07:00 0.0 0.0 0.0 \n", 333 | "2019-02-05 20:20:05 0.0 0.0 0.0 \n", 334 | "2019-02-05 20:00:54 0.0 0.0 0.0 \n", 335 | "2019-02-05 19:56:00 1.0 0.0 0.0 " 336 | ] 337 | }, 338 | "execution_count": 6, 339 | "metadata": {}, 340 | "output_type": "execute_result" 341 | } 342 | ], 343 | "source": [ 344 | "# Split Votes list into separate columns and fill NaN values\n", 345 | "df = pd.concat([df.drop(['Votes'], axis=1), df['Votes'].apply(pd.Series)], axis=1).fillna(0)\n", 346 | "df.head()" 347 | ] 348 | }, 349 | { 350 | "cell_type": "code", 351 | "execution_count": 7, 352 | "metadata": {}, 353 | "outputs": [], 354 | "source": [ 355 | "# Remove list from Currency column\n", 356 | "df['Currencies'] = df['Currencies'].apply(', '.join)" 357 | ] 358 | }, 359 | { 360 | "cell_type": "code", 361 | "execution_count": 8, 362 | "metadata": {}, 363 | "outputs": [], 364 | "source": [ 365 | "# find all unique currencies\n", 366 | "unique_currencies = set([c for i in df.Currencies for c in i])" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": 55, 372 | "metadata": {}, 373 | "outputs": [ 374 | { 375 | "data": { 376 | "text/html": [ 377 | "
\n", 378 | "\n", 391 | "\n", 392 | " \n", 393 | " \n", 394 | " \n", 395 | " \n", 396 | " \n", 397 | " \n", 398 | " \n", 399 | " \n", 400 | " \n", 401 | " \n", 402 | " \n", 403 | " \n", 404 | " \n", 405 | " \n", 406 | " \n", 407 | " \n", 408 | " \n", 409 | " \n", 410 | " \n", 411 | " \n", 412 | " \n", 413 | " \n", 414 | " \n", 415 | " \n", 416 | " \n", 417 | " \n", 418 | " \n", 419 | " \n", 420 | " \n", 421 | " \n", 422 | " \n", 423 | " \n", 424 | " \n", 425 | " \n", 426 | " \n", 427 | " \n", 428 | " \n", 429 | " \n", 430 | " \n", 431 | " \n", 432 | " \n", 433 | " \n", 434 | " \n", 435 | " \n", 436 | " \n", 437 | " \n", 438 | " \n", 439 | " \n", 440 | " \n", 441 | " \n", 442 | " \n", 443 | " \n", 444 | " \n", 445 | " \n", 446 | " \n", 447 | " \n", 448 | " \n", 449 | " \n", 450 | " \n", 451 | " \n", 452 | " \n", 453 | " \n", 454 | " \n", 455 | " \n", 456 | " \n", 457 | " \n", 458 | " \n", 459 | " \n", 460 | " \n", 461 | " \n", 462 | " \n", 463 | " \n", 464 | " \n", 465 | " \n", 466 | " \n", 467 | " \n", 468 | " \n", 469 | " \n", 470 | " \n", 471 | " \n", 472 | " \n", 473 | " \n", 474 | " \n", 475 | " \n", 476 | " \n", 477 | " \n", 478 | " \n", 479 | " \n", 480 | " \n", 481 | " \n", 482 | " \n", 483 | " \n", 484 | " \n", 485 | " \n", 486 | " \n", 487 | " \n", 488 | " \n", 489 | " \n", 490 | " \n", 491 | " \n", 492 | " \n", 493 | " \n", 494 | " \n", 495 | " \n", 496 | " \n", 497 | " \n", 498 | " \n", 499 | " \n", 500 | " \n", 501 | " \n", 502 | " \n", 503 | " \n", 504 | " \n", 505 | " \n", 506 | " \n", 507 | " \n", 508 | " \n", 509 | " \n", 510 | " \n", 511 | " \n", 512 | " \n", 513 | " \n", 514 | " \n", 515 | " \n", 516 | "
commentsdislikeimportantlikelolnegativepositivesavesaves
Currencies
XRP282828282828282828
XRP, EOS, XLM111111111
XRP, ETH222222222
XRP, ETH, BCH111111111
XRP, ETH, TRX111111111
XRP, LTC111111111
XRP, LTC, TRX111111111
XRP, TRX, XLM222222222
\n", 517 | "
" 518 | ], 519 | "text/plain": [ 520 | " comments dislike important like lol negative positive \\\n", 521 | "Currencies \n", 522 | "XRP 28 28 28 28 28 28 28 \n", 523 | "XRP, EOS, XLM 1 1 1 1 1 1 1 \n", 524 | "XRP, ETH 2 2 2 2 2 2 2 \n", 525 | "XRP, ETH, BCH 1 1 1 1 1 1 1 \n", 526 | "XRP, ETH, TRX 1 1 1 1 1 1 1 \n", 527 | "XRP, LTC 1 1 1 1 1 1 1 \n", 528 | "XRP, LTC, TRX 1 1 1 1 1 1 1 \n", 529 | "XRP, TRX, XLM 2 2 2 2 2 2 2 \n", 530 | "\n", 531 | " save saves \n", 532 | "Currencies \n", 533 | "XRP 28 28 \n", 534 | "XRP, EOS, XLM 1 1 \n", 535 | "XRP, ETH 2 2 \n", 536 | "XRP, ETH, BCH 1 1 \n", 537 | "XRP, ETH, TRX 1 1 \n", 538 | "XRP, LTC 1 1 \n", 539 | "XRP, LTC, TRX 1 1 \n", 540 | "XRP, TRX, XLM 2 2 " 541 | ] 542 | }, 543 | "execution_count": 55, 544 | "metadata": {}, 545 | "output_type": "execute_result" 546 | } 547 | ], 548 | "source": [ 549 | "for c in unique_currencies:\n", 550 | " a =df[df.Currencies.str.match('XRP')]\n", 551 | "\n", 552 | "votes = [i for i in df.iloc[:, 4:]]\n", 553 | "a.groupby('Currencies')[votes].count()" 554 | ] 555 | }, 556 | { 557 | "cell_type": "code", 558 | "execution_count": 45, 559 | "metadata": {}, 560 | "outputs": [], 561 | "source": [ 562 | "# df['Currencies']\n", 563 | "# df.groupby(['Currencies']).apply(list)\n", 564 | "# df.Currencies.unique()" 565 | ] 566 | }, 567 | { 568 | "cell_type": "code", 569 | "execution_count": 46, 570 | "metadata": {}, 571 | "outputs": [ 572 | { 573 | "data": { 574 | "text/plain": [ 575 | "['comments',\n", 576 | " 'dislike',\n", 577 | " 'important',\n", 578 | " 'like',\n", 579 | " 'lol',\n", 580 | " 'negative',\n", 581 | " 'positive',\n", 582 | " 'save',\n", 583 | " 'saves']" 584 | ] 585 | }, 586 | "execution_count": 46, 587 | "metadata": {}, 588 | "output_type": "execute_result" 589 | } 590 | ], 591 | "source": [ 592 | "votes" 593 | ] 594 | }, 595 | { 596 | "cell_type": "code", 597 | "execution_count": null, 598 | "metadata": {}, 599 | "outputs": [], 600 | "source": [] 601 | } 602 | ], 603 | "metadata": { 604 | "kernelspec": { 605 | "display_name": "Python 3", 606 | "language": "python", 607 | "name": "python3" 608 | }, 609 | "language_info": { 610 | "codemirror_mode": { 611 | "name": "ipython", 612 | "version": 3 613 | }, 614 | "file_extension": ".py", 615 | "mimetype": "text/x-python", 616 | "name": "python", 617 | "nbconvert_exporter": "python", 618 | "pygments_lexer": "ipython3", 619 | "version": "3.6.6" 620 | } 621 | }, 622 | "nbformat": 4, 623 | "nbformat_minor": 2 624 | } 625 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | certifi==2019.9.11 2 | chardet==3.0.4 3 | colorama==0.4.1 4 | configparser==4.0.2 5 | crayons==0.2.0 6 | idna==2.8 7 | requests==2.22.0 8 | selenium==3.141.0 9 | urllib3==1.25.6 10 | webdriver-manager==1.8.2 11 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import cryptopanic_scraper as cw 2 | 3 | def test_setUp(): 4 | assert (cw.setUp()) 5 | 6 | 7 | def test_setUp(): 8 | assert (cw.getData()) 9 | --------------------------------------------------------------------------------