├── .gitignore
├── LICENSE.md
├── Procfile
├── README.md
├── app.json
├── requirements-dev.txt
├── requirements.txt
├── runtime.txt
├── scripts
│   ├── run_dev_flask.sh
│   ├── start_service.sh
│   └── stop_service.sh
└── src
    ├── config.py
    ├── fb_comment_downloader_app.py
    ├── get_fb_comments_from_fb.py
    ├── static
    │   ├── loader.gif
    │   ├── script.js
    │   └── style.css
    ├── templates
    │   └── index.html
    ├── test_data_urls.json
    ├── test_validation.py
    └── validation.py

/.gitignore:
--------------------------------------------------------------------------------
*.pyc
.*
env/
*.bak
__pycache__

--------------------------------------------------------------------------------
/LICENSE.md:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2017 Washington State Department of Transportation

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
--------------------------------------------------------------------------------
/Procfile:
--------------------------------------------------------------------------------
web: gunicorn --chdir src fb_comment_downloader_app:app --timeout 180 --worker-class gevent

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Facebook Comment Downloader #

A small web app for downloading comments from a public Facebook page post.
The comment-downloading code is adapted from https://github.com/minimaxir/facebook-page-post-scraper

Setup
-----

```
pip install -r requirements.txt
```

*Note: this will install [Gunicorn](http://gunicorn.org/) and [Gevent](http://www.gevent.org/). These packages are not required if you choose a different server.*

This application is set up to download comments only from posts on one specified public Facebook page. You will need to [register and configure a Facebook app](https://developers.facebook.com/docs/apps/register/). Once you've done this, fill out `config.py` with your information.

To get comment author and reaction info, you will need a [Page Access token](https://developers.facebook.com/docs/facebook-login/access-tokens/#pagetokens) from a user who has admin rights to the page. You can get a token by setting up a [system user](https://developers.facebook.com/docs/audience-network/reporting-api/systemuser/).

Be aware of the following restriction:
> Devmode Apps — Apps in Devmode are now rate-limited to 200 calls per hour, per page-app pair, and can only access Users who have a role on the app (admin, developer, or tester).

https://developers.facebook.com/docs/graph-api/changelog/breaking-changes

Deployment
----------
This project is built with [Flask](http://flask.pocoo.org/).
Hosting is up to you; the Flask website lists [some options](http://flask.pocoo.org/docs/0.12/deploying/).

Use the link below to deploy the app with [Gunicorn](http://gunicorn.org/) on Heroku.

[Deploy](https://heroku.com/deploy)

Development Setup
----------------

```
pip install -r requirements-dev.txt
```

##### Start Flask dev server

`FLASK_APP=fb_comment_downloader_app.py flask run`

##### Run tests

`python test_validation.py`

Currently, we only have a few tests, which check the Facebook URLs.

Contributing
------------

Find a bug? Got an idea? Send us a pull request or open an issue and we'll take a look. You can also check the issue tracker.

License
-------

MIT

--------------------------------------------------------------------------------
/app.json:
--------------------------------------------------------------------------------
{
  "name": "Facebook Comment Downloader",
  "description": "A web app for downloading Facebook comments as a csv file.",
  "keywords": [
    "python",
    "flask"
  ],
  "repository": "https://github.com/WSDOT/fb-comment-downloader",
  "env": {
    "PAGE_ACCESS_TOKEN": {
      "description": "Your Facebook page access token. Recommend using a permanent system user token.",
      "value": "Replace with your page access token"
    },
    "PAGE_ID": {
      "description": "ID of the public Facebook page where comments will be downloaded from.",
      "value": "Replace with your page ID"
    },
    "PAGE_NAME": {
      "description": "The name of your Facebook page where comments will be downloaded from.",
      "value": "Replace with your page name"
    }
  },
  "formation": {
    "web": {
      "quantity": 1,
      "size": "free"
    }
  },
  "image": "heroku/python"
}

--------------------------------------------------------------------------------
/requirements-dev.txt:
--------------------------------------------------------------------------------
click==6.7
ddt==1.1.1
Flask==1.0.2
itsdangerous==0.24
Jinja2==2.11.3
MarkupSafe==1.0
Werkzeug==0.15.5

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
click==6.7
Flask==1.0.2
itsdangerous==0.24
Jinja2==2.11.3
MarkupSafe==1.0
Werkzeug==0.15.5
gunicorn==19.7.1
gevent==1.2.2

--------------------------------------------------------------------------------
/runtime.txt:
--------------------------------------------------------------------------------
python-2.7.14

--------------------------------------------------------------------------------
/scripts/run_dev_flask.sh:
--------------------------------------------------------------------------------
FLASK_APP=../src/fb_comment_downloader_app.py flask run

--------------------------------------------------------------------------------
/scripts/start_service.sh:
--------------------------------------------------------------------------------
systemctl start fb-comment-downloader.service
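
The `env` block in app.json declares three settings: `PAGE_ACCESS_TOKEN`, `PAGE_ID`, and `PAGE_NAME`. The sketch below shows one way to read them with fallbacks so that an unset variable does not raise `KeyError`. The helper name `load_settings` is hypothetical and not part of this repository, and the fallback values are placeholders only.

```python
import os


def load_settings(environ=None):
    """Hypothetical helper: read the three app.json settings from a mapping.

    Falls back to placeholder values when a variable is unset or empty.
    """
    environ = os.environ if environ is None else environ
    return {
        # .get() returns None for a missing key instead of raising KeyError,
        # so the "or" fallback is actually reached.
        "access_token": environ.get("PAGE_ACCESS_TOKEN") or "x",
        "page_id": environ.get("PAGE_ID") or "-1",
        "page_name": environ.get("PAGE_NAME") or "PAGE NAME",
    }


# Usage with explicit mappings instead of the real environment
defaults = load_settings({})
custom = load_settings({"PAGE_NAME": "ExamplePage", "PAGE_ID": "12345"})
```

Passing the mapping as an argument keeps the helper easy to test without touching the real process environment.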
--------------------------------------------------------------------------------
/scripts/stop_service.sh:
--------------------------------------------------------------------------------
systemctl stop fb-comment-downloader.service

--------------------------------------------------------------------------------
/src/config.py:
--------------------------------------------------------------------------------
import os

# os.environ.get() returns None for an unset variable instead of raising
# KeyError, so the "or" fallbacks below can take effect.
access_token = os.environ.get('PAGE_ACCESS_TOKEN') or "x"

page_id = os.environ.get('PAGE_ID') or "-1"  # ID of the public Facebook page where comments will be downloaded from.
page_name = os.environ.get('PAGE_NAME') or "PAGE NAME"  # Name of page, must be exactly as shown in page url.

--------------------------------------------------------------------------------
/src/fb_comment_downloader_app.py:
--------------------------------------------------------------------------------
import io
import csv
import re

from flask import Flask
from flask import render_template
from flask import stream_with_context
from flask import Response
from flask import request
from flask import jsonify

from werkzeug.datastructures import Headers

from validation import get_post_id
from validation import get_page_name

from get_fb_comments_from_fb import scrapeFacebookPageFeedComments
from get_fb_comments_from_fb import request_once

import config

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html', page_name=config.page_name)

@app.route('/', methods=['POST'])
def index_post():

    error = None
    url = request.form['text']

    # Reject urls that cannot be fetched at all
    if request_once(url) is None:
        message = "Please make sure you entered a valid url"
        return jsonify({"error": message})

    # Only posts from the configured page are allowed
    if get_page_name(url) != config.page_name:
        message = "Please enter a post url for the {0} page".format(config.page_name)
        return jsonify({"error": message})

    post_id = get_post_id(url)

    # Check post_id before formatting: the formatted status_id string is
    # never None. This assumes get_post_id returns None when no id is found.
    if post_id is None:
        message = "Please make sure you entered a valid Facebook url"
        return jsonify({"error": message})

    status_id = "{0}_{1}".format(config.page_id, post_id)

    si = io.StringIO()
    cw = csv.writer(si)

    # add a filename
    headers = Headers()
    headers.set('Content-Disposition', 'attachment', filename='fb_comments.csv')

    # stream the response as the data is generated
    return Response(
        stream_with_context(scrapeFacebookPageFeedComments(
            si,
            cw,
            config.page_id,
            config.access_token,
            status_id)),
        mimetype='application/download', headers=headers
    )

if __name__ == "__main__":
    app.run()

--------------------------------------------------------------------------------
/src/get_fb_comments_from_fb.py:
--------------------------------------------------------------------------------
# MIT License
#
# Copyright (c) 2017 Max Woolf
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation
# files (the "Software"), to deal in the Software without
# restriction, including without limitation the rights to use,
# copy, modify, merge, publish, distribute, sublicense, and/or
# sell copies of the Software, and to permit persons to whom
# the Software is furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included
# in all copies or substantial portions of the Software.

# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT.
# IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR
# THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#
# https://github.com/minimaxir/facebook-page-post-scraper
#
import json
import datetime
import csv
import time

try:
    from urllib.request import urlopen, Request
except ImportError:
    # Python 2 fallback
    from urllib2 import urlopen, Request

# Modified to only attempt once - Logan Sims
def request_once(url):
    req = Request(url)
    try:
        response = urlopen(req)
        if response.getcode() == 200:
            success = True
    except Exception as e:
        print(e)
        print("Error for URL {}: {}".format(url, datetime.datetime.now()))
        return None

    return response.read()

# Needed to write tricky unicode correctly to csv
def unicode_decode(text):
    try:
        return text.encode('utf-8').decode()
    except UnicodeDecodeError:
        return text.encode('utf-8')

def getFacebookCommentFeedUrl(base_url):

    # Construct the URL string
    fields = "&fields=id,message,reactions.limit(0).summary(true)" + \
        ",created_time,comments,from,attachment"
    url = base_url + fields

    return url

def processFacebookComment(comment, status_id, parent_id=''):

    # The status is now a Python dictionary, so for top-level items,
    # we can simply call the key.

    # Additionally, some items may not always exist,
    # so must check for existence first

    comment_id = comment['id']
    # Compare strings with ==; "is" tests identity, not equality
    comment_message = '' if 'message' not in comment or comment['message'] \
        == '' else unicode_decode(comment['message'])

    comment_author = unicode_decode(comment['from']['name'] if 'from' in comment else "user")

    if 'attachment' in comment:
        attachment_type = comment['attachment']['type']
        attachment_type = 'gif' if attachment_type == 'animated_image_share' \
            else attachment_type
        attach_tag = "[[{}]]".format(attachment_type.upper())
        comment_message = attach_tag if comment_message == '' else \
            comment_message + " " + attach_tag

    # Time needs special care since a) it's in UTC and
    # b) it's not easy to use in statistical programs.

    comment_published = datetime.datetime.strptime(
        comment['created_time'], '%Y-%m-%dT%H:%M:%S+0000')
    comment_published = comment_published + datetime.timedelta(hours=-5)  # EST
    comment_published = comment_published.strftime(
        '%Y-%m-%d %H:%M:%S')  # best time format for spreadsheet programs

    # Return a tuple of all processed data

    return (comment_id, status_id, parent_id, comment_message, comment_author,
            comment_published)

# Modified to yield contents of CSV - Logan Sims
def scrapeFacebookPageFeedComments(stringIO, writer, page_id, access_token, status_id):
    writer.writerow(["comment_id", "status_id", "parent_id", "comment_message",
                     "comment_author", "comment_published"])

    num_processed = 0
    scrape_starttime = datetime.datetime.now()
    after = ''
    base = "https://graph.facebook.com/v2.9"
    parameters = "/?limit={}&access_token={}".format(
        100, access_token)

    print("Scraping {} Comments From Posts: {}\n".format(page_id, scrape_starttime))

    reader = [dict(status_id=status_id)]

    for status in reader:
        has_next_page = True

        while has_next_page:

            node = "/{}/comments".format(status['status_id'])
            after = '' if after == '' else "&after={}".format(after)
            base_url = base + node + parameters + after

            url = getFacebookCommentFeedUrl(base_url)

            data = request_once(url)

            if data is None:
                writer.writerow(['url error'])
                yield stringIO.getvalue()
                stringIO.seek(0)
                stringIO.truncate(0)
                # End the generator with return; raising StopIteration
                # inside a generator is an error on Python 3.7+
                return

            # python 3.6+ decodes automatically
            try:
                comments = json.loads(data)
            except TypeError:
                comments = json.loads(data.decode('utf-8'))

            for comment in comments['data']:
                comment_data = processFacebookComment(
                    comment, status['status_id'])

                # Write one row, hand it to the response, then clear the buffer
                writer.writerow(comment_data)
                yield stringIO.getvalue()
                stringIO.seek(0)
                stringIO.truncate(0)

                if 'comments' in comment:
                    has_next_subpage = True
                    sub_after = ''

                    while has_next_subpage:
                        sub_node = "/{}/comments".format(comment['id'])
                        sub_after = '' if sub_after == '' else "&after={}".format(
                            sub_after)
                        sub_base_url = base + sub_node + parameters + sub_after

                        sub_url = getFacebookCommentFeedUrl(
                            sub_base_url)
                        sub_comments = json.loads(
                            request_once(sub_url))

                        for sub_comment in sub_comments['data']:
                            sub_comment_data = processFacebookComment(
                                sub_comment, status['status_id'], comment['id'])

                            writer.writerow(sub_comment_data)
                            yield stringIO.getvalue()
                            stringIO.seek(0)
                            stringIO.truncate(0)

                            num_processed += 1
                            if num_processed % 100 == 0:
                                print("{} Comments Processed: {}".format(num_processed, datetime.datetime.now()))

                        if 'paging' in sub_comments:
                            if 'next' in sub_comments['paging']:
                                sub_after = sub_comments[
                                    'paging']['cursors']['after']
                            else:
                                has_next_subpage = False
                        else:
                            has_next_subpage = False

                # output progress occasionally to make sure code is not
                # stalling
                num_processed += 1
                if num_processed % 100 == 0:
                    print("{} Comments Processed: {}".format(num_processed, datetime.datetime.now()))

            if 'paging' in comments:
                if 'next' in comments['paging']:
                    after = comments['paging']['cursors']['after']
                else:
                    has_next_page = False
            else:
                has_next_page = False

    print("\nDone!\n{} Comments Processed in {}".format(num_processed, datetime.datetime.now() - scrape_starttime))

--------------------------------------------------------------------------------
/src/static/loader.gif:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/WSDOT/fb-comment-downloader/630c0e5b44a0f17b477df1879ac852b12dbb1060/src/static/loader.gif

--------------------------------------------------------------------------------
/src/static/script.js:
--------------------------------------------------------------------------------
function setup(){

    // Set up loading spinner
    var $body = $("body");
    $(document).on({
        ajaxStart: function() { $body.addClass("loading"); },
        ajaxStop: function() { $body.removeClass("loading"); }
    });

    // Get comments, generate download button when complete
    $( "#url-form" ).submit(function( event ) {

        $("#download").empty()
        $("#error").empty()

        var form = $(this);
        $.ajax({
            url : form.attr('action'),
            type : form.attr('method'),
            data : form.serialize(), // data to be submitted
            success: function(response){
                if (response.error) {
                    console.log(response.error)
                    $("").appendTo("#error");

                } else {
                    var downloadLink = "" +
                        " Download Comments"

                    $(downloadLink).appendTo("#download");
                }
            },
            error: function(error) {
                alert("error")
            }
        });
        event.preventDefault();
    });

}

--------------------------------------------------------------------------------
/src/static/style.css:
--------------------------------------------------------------------------------
h3, p {
    font-family: Arial, sans-serif;
}

input[type=text], select {
    font-family: Arial, sans-serif;
    font-size: 10pt;
    width: 100%;
    padding: 12px 20px;
    margin: 8px 0;
    display: inline-block;
    border: 1px solid #ccc;
    border-radius: 4px;
    box-sizing: border-box;
}

input[type=submit], .download-button {
    font-family: Arial, sans-serif;
    width: 100%;
    color: white;
    padding: 14px 20px;
    font-size: 12pt;
    margin: 8px 0;
    border: none;
    border-radius: 4px;
    cursor: pointer;
}

input[type=submit], .download-button {
    background-color: #00795F;
}

input[type=submit]:hover {
    background-color: #004F50;
}

.download-button {
    background-color: #007A99;
    text-decoration: none;
}

.info-1 {
    font-family: Arial, sans-serif;
    font-size: 12pt;
}

.info-2 {
    font-family: Arial, sans-serif;
    font-size: 10pt;
    padding-bottom: 14px;
}

.error-message {
    color: #af2b2b;
}

.form {
    width: 50%;
    border-radius: 5px;
    background-color: #f2f2f2;
    padding: 20px;
    display: inline-block;
}

.material-icons, .icon-text {
    vertical-align: middle;
}

.wrapper {
    text-align: center;
}

/* loading spinner */
.modal {
    display: none;
    position: fixed;
    z-index: 1000;
    top: 0;
    left: 0;
    height: 100%;
    width: 100%;
    background: rgba( 255, 255, 255, .8 )
                url('loader.gif')
                50% 50%
                no-repeat;
}

body.loading {
    overflow: hidden;
}

body.loading .modal {
    display: block;
}

--------------------------------------------------------------------------------
/src/templates/index.html:
--------------------------------------------------------------------------------
{% block head %}
Click a post's timestamp to get the url
This may take a while for posts with a large number of comments
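
The scraper in get_fb_comments_from_fb.py streams its CSV by writing each row into a shared buffer, yielding the buffer's contents, and then clearing it, while following `paging.cursors.after` until no `next` entry remains. The sketch below isolates that pattern for Python 3. The names `stream_rows`, `fetch_page`, and `FAKE_PAGES` are hypothetical, and the two pages of data are made up to stand in for Graph API responses.

```python
import csv
import io


def stream_rows(fetch_page):
    """Yield CSV text one row at a time while following cursor pagination.

    fetch_page is a hypothetical callable that takes a cursor (or None for
    the first page) and returns a dict shaped like a Graph API response.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    after = None
    while True:
        page = fetch_page(after)
        for item in page["data"]:
            writer.writerow([item["id"], item["message"]])
            # Hand the freshly written row to the consumer, then reset the
            # buffer so the next yield contains only the next row.
            yield buffer.getvalue()
            buffer.seek(0)
            buffer.truncate(0)
        paging = page.get("paging", {})
        if "next" not in paging:
            # A plain return ends the generator cleanly
            return
        after = paging["cursors"]["after"]


# Made-up data: two pages linked by the cursor "abc"
FAKE_PAGES = {
    None: {
        "data": [{"id": "1", "message": "first"}],
        "paging": {"next": "placeholder", "cursors": {"after": "abc"}},
    },
    "abc": {"data": [{"id": "2", "message": "second, with comma"}]},
}

chunks = list(stream_rows(FAKE_PAGES.get))
```

Because the buffer is cleared after every yield, memory use stays flat regardless of how many comments a post has, which is what lets the Flask response stream rows as they arrive.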