├── .gitignore ├── README.md ├── data ├── bbcdebate │ └── speakers │ │ ├── README.md │ │ ├── bbcdebate_log.csv │ │ ├── bbcdebate_log_raw.csv │ │ ├── debate_logger.py │ │ └── process_debate_log.py ├── eu_referendum │ └── electoral_commission │ │ └── results │ │ ├── README.md │ │ ├── raw │ │ └── .gitkeep │ │ └── scripts │ │ └── retrieve.py ├── general_election │ └── electoral_commission │ │ └── results │ │ ├── README.md │ │ ├── clean │ │ └── .gitkeep │ │ ├── raw │ │ └── .gitkeep │ │ └── scripts │ │ ├── process_2010.py │ │ ├── process_2015.py │ │ ├── retrieve_2010.py │ │ └── retrieve_2015.py ├── generate_data.py ├── model │ ├── clean │ │ └── .gitkeep │ └── scripts │ │ └── process.py ├── polls │ ├── README.md │ └── generate_json.py └── push_to_s3.py ├── docs └── setup.md ├── requirements.in └── requirements.txt /.gitignore: -------------------------------------------------------------------------------- 1 | # Python files 2 | *.pyc 3 | 4 | # Datasets 5 | data/**/*.csv 6 | data/**/*.feather 7 | data/**/*.gz 8 | data/**/*.json 9 | data/**/*.pdf 10 | data/**/*.xls 11 | data/**/*.xlsx 12 | 13 | # Notebooks 14 | data/**/*.ipynb 15 | 16 | # Whitelists 17 | !data/bbcdebate/**/*.csv 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # SixFifty Data Pipeline 2 | ETL data pipeline for SixFifty modelling & analytics. 
3 | 4 | ## SixFifty Datasets 5 | 6 | ### Raw 7 | | Dataset | Date | Format | Source | Licence | Download URL | Repo Path | 8 | | -- | -- | -- | -- | -- | -- | -- | 9 | | UK Parliament general election results | 6th May 2010 | XLS | [Electoral Commission](http://www.electoralcommission.org.uk/our-work/our-research/electoral-data) | [Open Government Licence v2.0](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/) | [GE2010-results-flatfile-website.xls](http://www.electoralcommission.org.uk/__data/assets/excel_doc/0003/105726/GE2010-results-flatfile-website.xls) | [/data/general_election/electoral_commission/results/](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/) | 10 | | UK Parliament general election results | 7th May 2015 | CSV - Zip file | [Electoral Commission](http://www.electoralcommission.org.uk/our-work/our-research/electoral-data) | [Open Government Licence v2.0](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/) | [2015-UK-general-election-data-results-WEB.zip](http://www.electoralcommission.org.uk/__data/assets/file/0004/191650/2015-UK-general-election-data-results-WEB.zip) | [/data/general_election/electoral_commission/results/](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/) | 11 | | EU Referendum results | 23rd June 2016 | CSV | [Electoral Commission](http://www.electoralcommission.org.uk/find-information-by-subject/elections-and-referendums/upcoming-elections-and-referendums/eu-referendum/electorate-and-count-information) | [Open Government Licence v2.0](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/) | [EU-referendum-result-data.csv](http://www.electoralcommission.org.uk/__data/assets/file/0014/212135/EU-referendum-result-data.csv) | 
[/data/eu_referendum/electoral_commission/results/](https://github.com/six50/pipeline/tree/master/data/eu_referendum/electoral_commission/results/) | 12 | 13 | ### Processed 14 | We aim to provide our processed datasets in both CSV and [Feather](https://blog.rstudio.org/2016/03/29/feather/) formats. 15 | 16 | | Dataset | Description | Download URL | Repo Path | 17 | | -- | -- | -- | -- | 18 | | [`ge_2010_results`](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/README.md) | Cleaner version of 2010 GE data | [CSV](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2010_results.csv), [Feather](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2010_results.feather) | [data/general_election/electoral_commission/results/clean/ge_2010_results.csv](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/README.md) | 19 | | [`ge_2015_results`](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/README.md) | Cleaner version of 2015 GE data | [CSV](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2015_results.csv), [Feather](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2015_results.feather) | [data/general_election/electoral_commission/results/clean/ge_2015_results.csv](https://github.com/six50/pipeline/tree/master/data/general_election/electoral_commission/results/README.md) | 20 | | [`model_2015`](https://github.com/six50/pipeline/tree/master/data/model/clean/) | Clean version of 2015 GE data along with counties and EU Referendum results at a regional level | [CSV](https://s3-eu-west-1.amazonaws.com/sixfifty/model_2015.csv), [Feather](https://s3-eu-west-1.amazonaws.com/sixfifty/model_2015.feather) | [data/model/clean/model_2015.csv](https://github.com/six50/pipeline/tree/master/data/model/clean/) | 21 | 22 | ### UK Political Polling 23 | A manually curated set of poll results can be downloaded in a variety of formats. 
See [data/polls/](https://github.com/six50/pipeline/tree/master/data/polls/) for more information including a data dictionary. 24 | - [Download as JSON](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.json) 25 | - [Download as CSV](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.csv) 26 | - [Download as Feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.feather) 27 | 28 | ### BBC Debate Logs 29 | Created by SixFifty; records when each person was speaking and for how long. See [data/bbcdebate/speakers/](https://github.com/six50/pipeline/tree/master/data/bbcdebate/speakers/) for more information including a data dictionary. 30 | - [Download as CSV](https://github.com/six50/pipeline/raw/master/data/bbcdebate/speakers/bbcdebate_log.csv) 31 | 32 | --- 33 | 34 | ## Filling this repo with data (short version) 35 | 1. Check you're running Python 3. 36 | 2. Install the Python requirements with `pip install -r requirements.txt`. 37 | 3. Then `cd` into the repo root (where this README is located) and run the following to download the data, populate this repo with it and auto-clean it ready for modelling: 38 | ``` 39 | python data/generate_data.py 40 | ``` 41 | 42 | ## Filling this repo with data (detailed setup instructions) 43 | Please see [these instructions on installing Anaconda + dependencies + configuring S3 tokens](docs/setup.md). 44 | 45 | --- 46 | 47 | ## Licences 48 | | Name | Description | Attribution Statement | 49 | | -- | -- | -- | 50 | | [Open Parliament Licence](http://www.parliament.uk/site-information/copyright/open-parliament-licence/) | Free to copy, publish, distribute, transmit, adapt and exploit commercially or non-commercially. See URL for full details. | Contains Parliamentary information licensed under the Open Parliament Licence v3.0.
| 51 | | [Open Government Licence](http://www.nationalarchives.gov.uk/doc/open-government-licence/version/2/) | Free to copy, publish, distribute, transmit, adapt and exploit commercially and non-commercially. See URL for full details. | Contains public sector information licensed under the Open Government Licence v2.0. | 52 | -------------------------------------------------------------------------------- /data/bbcdebate/speakers/README.md: -------------------------------------------------------------------------------- 1 | # BBC Debate Speaker Log 2 | 3 | If you have any questions about these datasets please [contact us @SixFiftyData](https://twitter.com/SixFiftyData) on Twitter. 4 | 5 | ## Licence 6 | This data is provided under a [CC-BY-SA licence](https://creativecommons.org/licenses/by-sa/3.0/). This means you are free to reuse the data for any reason, as long as you give attribution to "SixFifty.org.uk" and any reuse is shared under the same licence. 7 | 8 | ## 31st May 2017 BBC Debate Speaker Log 9 | The following dataset is available for download from this repo: **[`bbcdebate_log.csv`](https://github.com/six50/pipeline/raw/master/data/bbcdebate/speakers/bbcdebate_log.csv)** 10 | 11 | | Column | Type | Description | 12 | | -- | -- | -- | 13 | | `timestamp` | string | Time at which the speaker started speaking, in the format `%Y-%m-%d %H:%M:%S`; the log starts at 19:30 BST on 31st May 2017 | 14 | | `party` | string | Party name of speaker, e.g. `Conservatives`, `Labour`; blank for the chair, Mishal Husain | 15 | | `speaker` | string | Name of speaker, e.g. `Mishal Husain`, `Caroline Lucas` | 16 | | `section` | string | Section of debate, e.g. `Opening credits`, `Question 1 - Living standards` | 17 | | `counter` | float | Number of seconds for which this person was the dominant speaker (i.e. until interrupted or finished speaking); whole seconds stored as a float, so small floating-point artefacts such as `62.00000000000001` can appear | 18 | | `time_elapsed` | float | Seconds elapsed since the start of the debate at the end of this speaking turn (the running total of `counter`) | 19 | 20 | ## Scripts 21 | This repo contains two scripts: 22 | 23 | 1.
`debate_logger.py` is an extremely simple (~6 lines of code) script for timestamping text records and writing these to a CSV file. This was used to log who was speaking at each time using simple shorthand codes e.g. `l`, `c` for Labour/Conservative, as well as section/question markers. 24 | 25 | 2. `process_debate_log.py` processes this into the clean dataset documented above. 26 | -------------------------------------------------------------------------------- /data/bbcdebate/speakers/bbcdebate_log.csv: -------------------------------------------------------------------------------- 1 | timestamp,party,speaker,section,counter,time_elapsed 2 | 2017-05-31 19:30:00,,Mishal Husain,Opening credits,86.0,86.0 3 | 2017-05-31 19:31:26,Plaid Cymru,Leanne Wood,Opening statements,67.0,153.0 4 | 2017-05-31 19:32:33,,Mishal Husain,Opening statements,3.0,156.0 5 | 2017-05-31 19:32:36,Green,Caroline Lucas,Opening statements,75.0,231.0 6 | 2017-05-31 19:33:51,,Mishal Husain,Opening statements,4.0,235.0 7 | 2017-05-31 19:33:55,Conservatives,Amber Rudd,Opening statements,68.0,303.0 8 | 2017-05-31 19:35:03,,Mishal Husain,Opening statements,1.0,304.0 9 | 2017-05-31 19:35:04,Labour,Jeremy Corbyn,Opening statements,52.0,356.0 10 | 2017-05-31 19:35:56,,Mishal Husain,Opening statements,3.0,359.0 11 | 2017-05-31 19:35:59,UKIP,Paul Nuttall,Opening statements,68.0,427.0 12 | 2017-05-31 19:37:07,,Mishal Husain,Opening statements,1.0,428.0 13 | 2017-05-31 19:37:08,SNP,Angus Robertson,Opening statements,55.0,483.0 14 | 2017-05-31 19:38:03,,Mishal Husain,Opening statements,3.0,486.0 15 | 2017-05-31 19:38:06,Liberal Democrats,Tim Farron,Opening statements,65.0,551.0 16 | 2017-05-31 19:39:11,,Mishal Husain,Opening statements,41.0,592.0 17 | 2017-05-31 19:39:52,Conservatives,Amber Rudd,Question 1 - Living standards,62.00000000000001,654.0 18 | 2017-05-31 19:40:54,,Mishal Husain,Question 1 - Living standards,1.0,655.0 19 | 2017-05-31 19:40:55,Labour,Jeremy Corbyn,Question 1 - Living 
standards,54.0,709.0 20 | 2017-05-31 19:41:49,,Mishal Husain,Question 1 - Living standards,4.0,713.0 21 | 2017-05-31 19:41:53,SNP,Angus Robertson,Question 1 - Living standards,88.0,801.0 22 | 2017-05-31 19:43:21,,Mishal Husain,Question 1 - Living standards,1.0,802.0 23 | 2017-05-31 19:43:22,Conservatives,Amber Rudd,Question 1 - Living standards,21.0,823.0 24 | 2017-05-31 19:43:43,Labour,Jeremy Corbyn,Question 1 - Living standards,6.0,829.0 25 | 2017-05-31 19:43:49,Conservatives,Amber Rudd,Question 1 - Living standards,18.0,847.0 26 | 2017-05-31 19:44:07,,Mishal Husain,Question 1 - Living standards,11.0,858.0 27 | 2017-05-31 19:44:18,Liberal Democrats,Tim Farron,Question 1 - Living standards,85.0,943.0 28 | 2017-05-31 19:45:43,,Mishal Husain,Question 1 - Living standards,1.0,944.0 29 | 2017-05-31 19:45:44,Green,Caroline Lucas,Question 1 - Living standards,81.0,1025.0 30 | 2017-05-31 19:47:05,,Mishal Husain,Question 1 - Living standards,2.0,1027.0 31 | 2017-05-31 19:47:07,Conservatives,Amber Rudd,Question 1 - Living standards,16.0,1043.0 32 | 2017-05-31 19:47:23,Green,Caroline Lucas,Question 1 - Living standards,3.0,1046.0 33 | 2017-05-31 19:47:26,Conservatives,Amber Rudd,Question 1 - Living standards,28.0,1074.0 34 | 2017-05-31 19:47:54,Liberal Democrats,Tim Farron,Question 1 - Living standards,3.0,1077.0 35 | 2017-05-31 19:47:57,,Mishal Husain,Question 1 - Living standards,3.0,1080.0 36 | 2017-05-31 19:48:00,Plaid Cymru,Leanne Wood,Question 1 - Living standards,41.0,1121.0 37 | 2017-05-31 19:48:41,UKIP,Paul Nuttall,Question 1 - Living standards,68.0,1189.0 38 | 2017-05-31 19:49:49,SNP,Angus Robertson,Question 1 - Living standards,1.0,1190.0 39 | 2017-05-31 19:49:50,UKIP,Paul Nuttall,Question 1 - Living standards,1.0,1191.0 40 | 2017-05-31 19:49:51,,Mishal Husain,Question 1 - Living standards,3.0,1194.0 41 | 2017-05-31 19:49:54,Labour,Jeremy Corbyn,Question 1 - Living standards,62.00000000000001,1256.0 42 | 2017-05-31 19:50:56,Conservatives,Amber Rudd,Question 1 - 
Living standards,8.0,1264.0 43 | 2017-05-31 19:51:04,Labour,Jeremy Corbyn,Question 1 - Living standards,3.0,1267.0 44 | 2017-05-31 19:51:07,UKIP,Paul Nuttall,Question 1 - Living standards,10.0,1277.0 45 | 2017-05-31 19:51:17,,Mishal Husain,Question 1 - Living standards,14.0,1291.0 46 | 2017-05-31 19:51:31,Labour,Jeremy Corbyn,Question 1 - Living standards,43.0,1334.0 47 | 2017-05-31 19:52:14,,Mishal Husain,Question 1 - Living standards,3.0,1337.0 48 | 2017-05-31 19:52:17,UKIP,Paul Nuttall,Question 1 - Living standards,8.0,1345.0 49 | 2017-05-31 19:52:25,SNP,Angus Robertson,Question 1 - Living standards,3.0,1348.0 50 | 2017-05-31 19:52:28,UKIP,Paul Nuttall,Question 1 - Living standards,7.0,1355.0 51 | 2017-05-31 19:52:35,Liberal Democrats,Tim Farron,Question 1 - Living standards,34.0,1389.0 52 | 2017-05-31 19:53:09,Conservatives,Amber Rudd,Question 1 - Living standards,35.0,1424.0 53 | 2017-05-31 19:53:44,,Mishal Husain,Question 1 - Living standards,29.000000000000004,1453.0 54 | 2017-05-31 19:54:13,UKIP,Paul Nuttall,Question 2 - Brexit,14.0,1467.0 55 | 2017-05-31 19:54:27,Plaid Cymru,Leanne Wood,Question 2 - Brexit,2.0,1469.0 56 | 2017-05-31 19:54:29,UKIP,Paul Nuttall,Question 2 - Brexit,54.0,1523.0 57 | 2017-05-31 19:55:23,Plaid Cymru,Leanne Wood,Question 2 - Brexit,3.0,1526.0 58 | 2017-05-31 19:55:26,Liberal Democrats,Tim Farron,Question 2 - Brexit,62.00000000000001,1588.0 59 | 2017-05-31 19:56:28,,Mishal Husain,Question 2 - Brexit,2.0,1590.0 60 | 2017-05-31 19:56:30,Conservatives,Amber Rudd,Question 2 - Brexit,36.0,1626.0 61 | 2017-05-31 19:57:06,,Mishal Husain,Question 2 - Brexit,9.0,1635.0 62 | 2017-05-31 19:57:15,Labour,Jeremy Corbyn,Question 2 - Brexit,46.0,1681.0 63 | 2017-05-31 19:58:01,,Mishal Husain,Question 2 - Brexit,3.0,1684.0 64 | 2017-05-31 19:58:04,Labour,Jeremy Corbyn,Question 2 - Brexit,10.0,1694.0 65 | 2017-05-31 19:58:14,,Mishal Husain,Question 2 - Brexit,1.0,1695.0 66 | 2017-05-31 19:58:15,Labour,Jeremy Corbyn,Question 2 - Brexit,11.0,1706.0 
67 | 2017-05-31 19:58:26,Conservatives,Amber Rudd,Question 2 - Brexit,1.0,1707.0 68 | 2017-05-31 19:58:27,Labour,Jeremy Corbyn,Question 2 - Brexit,17.0,1724.0 69 | 2017-05-31 19:58:44,,Mishal Husain,Question 2 - Brexit,8.0,1732.0 70 | 2017-05-31 19:58:52,Plaid Cymru,Leanne Wood,Question 2 - Brexit,1.0,1733.0 71 | 2017-05-31 19:58:53,,Mishal Husain,Question 2 - Brexit,12.0,1745.0 72 | 2017-05-31 19:59:05,Plaid Cymru,Leanne Wood,Question 2 - Brexit,4.0,1749.0 73 | 2017-05-31 19:59:09,,Mishal Husain,Question 2 - Brexit,2.0,1751.0 74 | 2017-05-31 19:59:11,Plaid Cymru,Leanne Wood,Question 2 - Brexit,12.0,1763.0 75 | 2017-05-31 19:59:23,,Mishal Husain,Question 2 - Brexit,3.0,1766.0 76 | 2017-05-31 19:59:26,Plaid Cymru,Leanne Wood,Question 2 - Brexit,5.0,1771.0 77 | 2017-05-31 19:59:31,,Mishal Husain,Question 2 - Brexit,5.0,1776.0 78 | 2017-05-31 19:59:36,Plaid Cymru,Leanne Wood,Question 2 - Brexit,30.000000000000004,1806.0 79 | 2017-05-31 20:00:06,UKIP,Paul Nuttall,Question 2 - Brexit,3.0,1809.0 80 | 2017-05-31 20:00:09,Plaid Cymru,Leanne Wood,Question 2 - Brexit,3.0,1812.0 81 | 2017-05-31 20:00:12,UKIP,Paul Nuttall,Question 2 - Brexit,34.0,1846.0 82 | 2017-05-31 20:00:46,Liberal Democrats,Tim Farron,Question 2 - Brexit,11.0,1857.0 83 | 2017-05-31 20:00:57,SNP,Angus Robertson,Question 2 - Brexit,57.0,1914.0 84 | 2017-05-31 20:01:54,,Mishal Husain,Question 2 - Brexit,2.0,1916.0 85 | 2017-05-31 20:01:56,SNP,Angus Robertson,Question 2 - Brexit,30.0,1946.0 86 | 2017-05-31 20:02:26,,Mishal Husain,Question 2 - Brexit,1.0,1947.0 87 | 2017-05-31 20:02:27,SNP,Angus Robertson,Question 2 - Brexit,4.0,1951.0 88 | 2017-05-31 20:02:31,,Mishal Husain,Question 2 - Brexit,1.0,1952.0 89 | 2017-05-31 20:02:32,Green,Caroline Lucas,Question 2 - Brexit,82.0,2034.0 90 | 2017-05-31 20:03:54,,Mishal Husain,Question 2 - Brexit,3.0,2037.0 91 | 2017-05-31 20:03:57,Labour,Jeremy Corbyn,Question 2 - Brexit,22.0,2059.0 92 | 2017-05-31 20:04:19,Green,Caroline Lucas,Question 2 - Brexit,1.0,2060.0 93 | 
2017-05-31 20:04:20,Labour,Jeremy Corbyn,Question 2 - Brexit,19.0,2079.0 94 | 2017-05-31 20:04:39,SNP,Angus Robertson,Question 2 - Brexit,1.0,2080.0 95 | 2017-05-31 20:04:40,Labour,Jeremy Corbyn,Question 2 - Brexit,3.0,2083.0 96 | 2017-05-31 20:04:43,SNP,Angus Robertson,Question 2 - Brexit,12.0,2095.0 97 | 2017-05-31 20:04:55,Labour,Jeremy Corbyn,Question 2 - Brexit,3.0,2098.0 98 | 2017-05-31 20:04:58,SNP,Angus Robertson,Question 2 - Brexit,10.0,2108.0 99 | 2017-05-31 20:05:08,Labour,Jeremy Corbyn,Question 2 - Brexit,4.0,2112.0 100 | 2017-05-31 20:05:12,Liberal Democrats,Tim Farron,Question 2 - Brexit,23.0,2135.0 101 | 2017-05-31 20:05:35,,Mishal Husain,Question 2 - Brexit,5.0,2140.0 102 | 2017-05-31 20:05:40,Conservatives,Amber Rudd,Question 2 - Brexit,28.0,2168.0 103 | 2017-05-31 20:06:08,Plaid Cymru,Leanne Wood,Question 2 - Brexit,6.0,2174.0 104 | 2017-05-31 20:06:14,Green,Caroline Lucas,Question 2 - Brexit,29.000000000000004,2203.0 105 | 2017-05-31 20:06:43,Conservatives,Amber Rudd,Question 2 - Brexit,23.0,2226.0 106 | 2017-05-31 20:07:06,,Mishal Husain,Question 2 - Brexit,25.0,2251.0 107 | 2017-05-31 20:07:31,Green,Caroline Lucas,Question 3 - Public services,87.0,2338.0 108 | 2017-05-31 20:08:58,,Mishal Husain,Question 3 - Public services,8.0,2346.0 109 | 2017-05-31 20:09:06,Conservatives,Amber Rudd,Question 3 - Public services,11.0,2357.0 110 | 2017-05-31 20:09:17,Labour,Jeremy Corbyn,Question 3 - Public services,1.0,2358.0 111 | 2017-05-31 20:09:18,Conservatives,Amber Rudd,Question 3 - Public services,8.0,2366.0 112 | 2017-05-31 20:09:26,Plaid Cymru,Leanne Wood,Question 3 - Public services,2.0,2368.0 113 | 2017-05-31 20:09:28,Conservatives,Amber Rudd,Question 3 - Public services,38.0,2406.0 114 | 2017-05-31 20:10:06,Plaid Cymru,Leanne Wood,Question 3 - Public services,1.0,2407.0 115 | 2017-05-31 20:10:07,Labour,Jeremy Corbyn,Question 3 - Public services,41.0,2448.0 116 | 2017-05-31 20:10:48,,Mishal Husain,Question 3 - Public services,5.0,2453.0 117 | 
2017-05-31 20:10:53,Labour,Jeremy Corbyn,Question 3 - Public services,28.0,2481.0 118 | 2017-05-31 20:11:21,,Mishal Husain,Question 3 - Public services,1.0,2482.0 119 | 2017-05-31 20:11:22,Labour,Jeremy Corbyn,Question 3 - Public services,25.0,2507.0 120 | 2017-05-31 20:11:47,Plaid Cymru,Leanne Wood,Question 3 - Public services,12.0,2519.0 121 | 2017-05-31 20:11:59,Labour,Jeremy Corbyn,Question 3 - Public services,21.0,2540.0 122 | 2017-05-31 20:12:20,Plaid Cymru,Leanne Wood,Question 3 - Public services,1.0,2541.0 123 | 2017-05-31 20:12:21,Labour,Jeremy Corbyn,Question 3 - Public services,15.000000000000002,2556.0 124 | 2017-05-31 20:12:36,Plaid Cymru,Leanne Wood,Question 3 - Public services,2.0,2558.0 125 | 2017-05-31 20:12:38,,Mishal Husain,Question 3 - Public services,10.0,2568.0 126 | 2017-05-31 20:12:48,SNP,Angus Robertson,Question 3 - Public services,21.0,2589.0 127 | 2017-05-31 20:13:09,Labour,Jeremy Corbyn,Question 3 - Public services,3.0,2592.0 128 | 2017-05-31 20:13:12,SNP,Angus Robertson,Question 3 - Public services,43.0,2635.0 129 | 2017-05-31 20:13:55,,Mishal Husain,Question 3 - Public services,1.0,2636.0 130 | 2017-05-31 20:13:56,SNP,Angus Robertson,Question 3 - Public services,13.0,2649.0 131 | 2017-05-31 20:14:09,Conservatives,Amber Rudd,Question 3 - Public services,3.0,2652.0 132 | 2017-05-31 20:14:12,SNP,Angus Robertson,Question 3 - Public services,1.0,2653.0 133 | 2017-05-31 20:14:13,Conservatives,Amber Rudd,Question 3 - Public services,4.0,2657.0 134 | 2017-05-31 20:14:17,SNP,Angus Robertson,Question 3 - Public services,1.0,2658.0 135 | 2017-05-31 20:14:18,Conservatives,Amber Rudd,Question 3 - Public services,6.0,2664.0 136 | 2017-05-31 20:14:24,Labour,Jeremy Corbyn,Question 3 - Public services,1.0,2665.0 137 | 2017-05-31 20:14:25,Conservatives,Amber Rudd,Question 3 - Public services,10.0,2675.0 138 | 2017-05-31 20:14:35,Labour,Jeremy Corbyn,Question 3 - Public services,1.0,2676.0 139 | 2017-05-31 20:14:36,Conservatives,Amber Rudd,Question 3 - 
Public services,10.0,2686.0 140 | 2017-05-31 20:14:46,SNP,Angus Robertson,Question 3 - Public services,1.0,2687.0 141 | 2017-05-31 20:14:47,Conservatives,Amber Rudd,Question 3 - Public services,5.0,2692.0 142 | 2017-05-31 20:14:52,Labour,Jeremy Corbyn,Question 3 - Public services,2.0,2694.0 143 | 2017-05-31 20:14:54,Conservatives,Amber Rudd,Question 3 - Public services,4.0,2698.0 144 | 2017-05-31 20:14:58,Labour,Jeremy Corbyn,Question 3 - Public services,2.0,2700.0 145 | 2017-05-31 20:15:00,Conservatives,Amber Rudd,Question 3 - Public services,4.0,2704.0 146 | 2017-05-31 20:15:04,Labour,Jeremy Corbyn,Question 3 - Public services,1.0,2705.0 147 | 2017-05-31 20:15:05,Conservatives,Amber Rudd,Question 3 - Public services,4.0,2709.0 148 | 2017-05-31 20:15:09,,Mishal Husain,Question 3 - Public services,12.0,2721.0 149 | 2017-05-31 20:15:21,Conservatives,Amber Rudd,Question 3 - Public services,26.0,2747.0 150 | 2017-05-31 20:15:47,Labour,Jeremy Corbyn,Question 3 - Public services,3.0,2750.0 151 | 2017-05-31 20:15:50,Liberal Democrats,Tim Farron,Question 3 - Public services,53.0,2803.0 152 | 2017-05-31 20:16:43,,Mishal Husain,Question 3 - Public services,4.0,2807.0 153 | 2017-05-31 20:16:47,Liberal Democrats,Tim Farron,Question 3 - Public services,1.0,2808.0 154 | 2017-05-31 20:16:48,,Mishal Husain,Question 3 - Public services,4.0,2812.0 155 | 2017-05-31 20:16:52,Liberal Democrats,Tim Farron,Question 3 - Public services,41.0,2853.0 156 | 2017-05-31 20:17:33,Green,Caroline Lucas,Question 3 - Public services,1.0,2854.0 157 | 2017-05-31 20:17:34,Liberal Democrats,Tim Farron,Question 3 - Public services,14.0,2868.0 158 | 2017-05-31 20:17:48,Green,Caroline Lucas,Question 3 - Public services,1.0,2869.0 159 | 2017-05-31 20:17:49,Liberal Democrats,Tim Farron,Question 3 - Public services,2.0,2871.0 160 | 2017-05-31 20:17:51,,Mishal Husain,Question 3 - Public services,1.0,2872.0 161 | 2017-05-31 20:17:52,Liberal Democrats,Tim Farron,Question 3 - Public services,4.0,2876.0 162 | 
2017-05-31 20:17:56,UKIP,Paul Nuttall,Question 3 - Public services,27.0,2903.0 163 | 2017-05-31 20:18:23,SNP,Angus Robertson,Question 3 - Public services,1.0,2904.0 164 | 2017-05-31 20:18:24,UKIP,Paul Nuttall,Question 3 - Public services,3.0,2907.0 165 | 2017-05-31 20:18:27,SNP,Angus Robertson,Question 3 - Public services,1.0,2908.0 166 | 2017-05-31 20:18:28,UKIP,Paul Nuttall,Question 3 - Public services,9.0,2917.0 167 | 2017-05-31 20:18:37,Liberal Democrats,Tim Farron,Question 3 - Public services,2.0,2919.0 168 | 2017-05-31 20:18:39,UKIP,Paul Nuttall,Question 3 - Public services,15.000000000000002,2934.0 169 | 2017-05-31 20:18:54,,Mishal Husain,Question 3 - Public services,2.0,2936.0 170 | 2017-05-31 20:18:56,UKIP,Paul Nuttall,Question 3 - Public services,20.0,2956.0 171 | 2017-05-31 20:19:16,,Mishal Husain,Question 3 - Public services,38.0,2994.0 172 | 2017-05-31 20:19:54,SNP,Angus Robertson,Question 4 - Security,112.0,3106.0 173 | 2017-05-31 20:21:46,,Mishal Husain,Question 4 - Security,6.0,3112.0 174 | 2017-05-31 20:21:52,Labour,Jeremy Corbyn,Question 4 - Security,81.0,3193.0 175 | 2017-05-31 20:23:13,,Mishal Husain,Question 4 - Security,5.0,3198.0 176 | 2017-05-31 20:23:18,Labour,Jeremy Corbyn,Question 4 - Security,49.0,3247.0 177 | 2017-05-31 20:24:07,,Mishal Husain,Question 4 - Security,12.0,3259.0 178 | 2017-05-31 20:24:19,Labour,Jeremy Corbyn,Question 4 - Security,6.0,3265.0 179 | 2017-05-31 20:24:25,,Mishal Husain,Question 4 - Security,1.0,3266.0 180 | 2017-05-31 20:24:26,Labour,Jeremy Corbyn,Question 4 - Security,21.0,3287.0 181 | 2017-05-31 20:24:47,,Mishal Husain,Question 4 - Security,4.0,3291.0 182 | 2017-05-31 20:24:51,Conservatives,Amber Rudd,Question 4 - Security,46.0,3337.0 183 | 2017-05-31 20:25:37,,Mishal Husain,Question 4 - Security,5.0,3342.0 184 | 2017-05-31 20:25:42,Conservatives,Amber Rudd,Question 4 - Security,44.0,3386.0 185 | 2017-05-31 20:26:26,Labour,Jeremy Corbyn,Question 4 - Security,38.0,3424.0 186 | 2017-05-31 20:27:04,,Mishal 
Husain,Question 4 - Security,2.0,3426.0 187 | 2017-05-31 20:27:06,Conservatives,Amber Rudd,Question 4 - Security,2.0,3428.0 188 | 2017-05-31 20:27:08,,Mishal Husain,Question 4 - Security,1.0,3429.0 189 | 2017-05-31 20:27:09,Labour,Jeremy Corbyn,Question 4 - Security,3.0,3432.0 190 | 2017-05-31 20:27:12,Conservatives,Amber Rudd,Question 4 - Security,1.0,3433.0 191 | 2017-05-31 20:27:13,,Mishal Husain,Question 4 - Security,14.0,3447.0 192 | 2017-05-31 20:27:27,Liberal Democrats,Tim Farron,Question 4 - Security,95.0,3542.0 193 | 2017-05-31 20:29:02,,Mishal Husain,Question 4 - Security,1.0,3543.0 194 | 2017-05-31 20:29:03,UKIP,Paul Nuttall,Question 4 - Security,28.0,3571.0 195 | 2017-05-31 20:29:31,,Mishal Husain,Question 4 - Security,2.0,3573.0 196 | 2017-05-31 20:29:33,UKIP,Paul Nuttall,Question 4 - Security,4.0,3577.0 197 | 2017-05-31 20:29:37,Liberal Democrats,Tim Farron,Question 4 - Security,1.0,3578.0 198 | 2017-05-31 20:29:38,UKIP,Paul Nuttall,Question 4 - Security,59.00000000000001,3637.0 199 | 2017-05-31 20:30:37,Liberal Democrats,Tim Farron,Question 4 - Security,1.0,3638.0 200 | 2017-05-31 20:30:38,UKIP,Paul Nuttall,Question 4 - Security,1.0,3639.0 201 | 2017-05-31 20:30:39,Liberal Democrats,Tim Farron,Question 4 - Security,7.0,3646.0 202 | 2017-05-31 20:30:46,,Mishal Husain,Question 4 - Security,1.0,3647.0 203 | 2017-05-31 20:30:47,Green,Caroline Lucas,Question 4 - Security,31.000000000000004,3678.0 204 | 2017-05-31 20:31:18,UKIP,Paul Nuttall,Question 4 - Security,1.0,3679.0 205 | 2017-05-31 20:31:19,Green,Caroline Lucas,Question 4 - Security,70.0,3749.0 206 | 2017-05-31 20:32:29,,Mishal Husain,Question 4 - Security,1.0,3750.0 207 | 2017-05-31 20:32:30,Conservatives,Amber Rudd,Question 4 - Security,8.0,3758.0 208 | 2017-05-31 20:32:38,Green,Caroline Lucas,Question 4 - Security,2.0,3760.0 209 | 2017-05-31 20:32:40,Conservatives,Amber Rudd,Question 4 - Security,3.0,3763.0 210 | 2017-05-31 20:32:43,SNP,Angus Robertson,Question 4 - Security,1.0,3764.0 211 | 
2017-05-31 20:32:44,Conservatives,Amber Rudd,Question 4 - Security,2.0,3766.0 212 | 2017-05-31 20:32:46,Green,Caroline Lucas,Question 4 - Security,7.0,3773.0 213 | 2017-05-31 20:32:53,Conservatives,Amber Rudd,Question 4 - Security,3.0,3776.0 214 | 2017-05-31 20:32:56,,Mishal Husain,Question 4 - Security,1.0,3777.0 215 | 2017-05-31 20:32:57,Plaid Cymru,Leanne Wood,Question 4 - Security,25.0,3802.0 216 | 2017-05-31 20:33:22,,Mishal Husain,Question 4 - Security,6.0,3808.0 217 | 2017-05-31 20:33:28,Plaid Cymru,Leanne Wood,Question 4 - Security,18.0,3826.0 218 | 2017-05-31 20:33:46,,Mishal Husain,Question 4 - Security,3.0,3829.0 219 | 2017-05-31 20:33:49,Plaid Cymru,Leanne Wood,Question 4 - Security,35.0,3864.0 220 | 2017-05-31 20:34:24,,Mishal Husain,Question 4 - Security,4.0,3868.0 221 | 2017-05-31 20:34:28,Liberal Democrats,Tim Farron,Question 4 - Security,26.0,3894.0 222 | 2017-05-31 20:34:54,,Mishal Husain,Question 4 - Security,1.0,3895.0 223 | 2017-05-31 20:34:55,Liberal Democrats,Tim Farron,Question 4 - Security,2.0,3897.0 224 | 2017-05-31 20:34:57,UKIP,Paul Nuttall,Question 4 - Security,23.0,3920.0 225 | 2017-05-31 20:35:20,,Mishal Husain,Question 4 - Security,1.0,3921.0 226 | 2017-05-31 20:35:21,Green,Caroline Lucas,Question 4 - Security,6.0,3927.0 227 | 2017-05-31 20:35:27,UKIP,Paul Nuttall,Question 4 - Security,14.0,3941.0 228 | 2017-05-31 20:35:41,SNP,Angus Robertson,Question 4 - Security,1.0,3942.0 229 | 2017-05-31 20:35:42,UKIP,Paul Nuttall,Question 4 - Security,6.0,3948.0 230 | 2017-05-31 20:35:48,SNP,Angus Robertson,Question 4 - Security,9.0,3957.0 231 | 2017-05-31 20:35:57,UKIP,Paul Nuttall,Question 4 - Security,1.0,3958.0 232 | 2017-05-31 20:35:58,SNP,Angus Robertson,Question 4 - Security,21.0,3979.0 233 | 2017-05-31 20:36:19,,Mishal Husain,Question 4 - Security,3.0,3982.0 234 | 2017-05-31 20:36:22,Labour,Jeremy Corbyn,Question 4 - Security,15.000000000000002,3997.0 235 | 2017-05-31 20:36:37,UKIP,Paul Nuttall,Question 4 - Security,1.0,3998.0 236 | 
2017-05-31 20:36:38,Labour,Jeremy Corbyn,Question 4 - Security,5.0,4003.0 237 | 2017-05-31 20:36:43,UKIP,Paul Nuttall,Question 4 - Security,2.0,4005.0 238 | 2017-05-31 20:36:45,Labour,Jeremy Corbyn,Question 4 - Security,20.0,4025.0 239 | 2017-05-31 20:37:05,,Mishal Husain,Question 4 - Security,27.0,4052.0 240 | 2017-05-31 20:37:32,Liberal Democrats,Tim Farron,Question 5 - Climate change and Trump,52.0,4104.0 241 | 2017-05-31 20:38:24,,Mishal Husain,Question 5 - Climate change and Trump,2.0,4106.0 242 | 2017-05-31 20:38:26,Green,Caroline Lucas,Question 5 - Climate change and Trump,80.0,4186.0 243 | 2017-05-31 20:39:46,,Mishal Husain,Question 5 - Climate change and Trump,3.0,4189.0 244 | 2017-05-31 20:39:49,UKIP,Paul Nuttall,Question 5 - Climate change and Trump,23.0,4212.0 245 | 2017-05-31 20:40:12,,Mishal Husain,Question 5 - Climate change and Trump,1.0,4213.0 246 | 2017-05-31 20:40:13,UKIP,Paul Nuttall,Question 5 - Climate change and Trump,19.0,4232.0 247 | 2017-05-31 20:40:32,,Mishal Husain,Question 5 - Climate change and Trump,12.0,4244.0 248 | 2017-05-31 20:40:44,Labour,Jeremy Corbyn,Question 5 - Climate change and Trump,46.0,4290.0 249 | 2017-05-31 20:41:30,,Mishal Husain,Question 5 - Climate change and Trump,1.0,4291.0 250 | 2017-05-31 20:41:31,Conservatives,Amber Rudd,Question 5 - Climate change and Trump,5.0,4296.0 251 | 2017-05-31 20:41:36,,Mishal Husain,Question 5 - Climate change and Trump,5.0,4301.0 252 | 2017-05-31 20:41:41,Conservatives,Amber Rudd,Question 5 - Climate change and Trump,25.0,4326.0 253 | 2017-05-31 20:42:06,Plaid Cymru,Leanne Wood,Question 5 - Climate change and Trump,6.0,4332.0 254 | 2017-05-31 20:42:12,Conservatives,Amber Rudd,Question 5 - Climate change and Trump,1.0,4333.0 255 | 2017-05-31 20:42:13,Plaid Cymru,Leanne Wood,Question 5 - Climate change and Trump,1.0,4334.0 256 | 2017-05-31 20:42:14,Conservatives,Amber Rudd,Question 5 - Climate change and Trump,15.000000000000002,4349.0 257 | 2017-05-31 20:42:29,Labour,Jeremy 
Corbyn,Question 5 - Climate change and Trump,1.0,4350.0 258 | 2017-05-31 20:42:30,Plaid Cymru,Leanne Wood,Question 5 - Climate change and Trump,1.0,4351.0 259 | 2017-05-31 20:42:31,Labour,Jeremy Corbyn,Question 5 - Climate change and Trump,3.0,4354.0 260 | 2017-05-31 20:42:34,Plaid Cymru,Leanne Wood,Question 5 - Climate change and Trump,28.0,4382.0 261 | 2017-05-31 20:43:02,,Mishal Husain,Question 5 - Climate change and Trump,2.0,4384.0 262 | 2017-05-31 20:43:04,SNP,Angus Robertson,Question 5 - Climate change and Trump,78.0,4462.0 263 | 2017-05-31 20:44:22,,Mishal Husain,Question 5 - Climate change and Trump,30.000000000000004,4492.0 264 | 2017-05-31 20:44:52,Labour,Jeremy Corbyn,Question 6 - Leadership,65.0,4557.0 265 | 2017-05-31 20:45:57,,Mishal Husain,Question 6 - Leadership,4.0,4561.0 266 | 2017-05-31 20:46:01,UKIP,Paul Nuttall,Question 6 - Leadership,71.0,4632.0 267 | 2017-05-31 20:47:12,Plaid Cymru,Leanne Wood,Question 6 - Leadership,6.0,4638.0 268 | 2017-05-31 20:47:18,UKIP,Paul Nuttall,Question 6 - Leadership,16.0,4654.0 269 | 2017-05-31 20:47:34,Plaid Cymru,Leanne Wood,Question 6 - Leadership,4.0,4658.0 270 | 2017-05-31 20:47:38,UKIP,Paul Nuttall,Question 6 - Leadership,1.0,4659.0 271 | 2017-05-31 20:47:39,Plaid Cymru,Leanne Wood,Question 6 - Leadership,7.0,4666.0 272 | 2017-05-31 20:47:46,UKIP,Paul Nuttall,Question 6 - Leadership,2.0,4668.0 273 | 2017-05-31 20:47:48,Plaid Cymru,Leanne Wood,Question 6 - Leadership,1.0,4669.0 274 | 2017-05-31 20:47:49,UKIP,Paul Nuttall,Question 6 - Leadership,8.0,4677.0 275 | 2017-05-31 20:47:57,,Mishal Husain,Question 6 - Leadership,13.0,4690.0 276 | 2017-05-31 20:48:10,Conservatives,Amber Rudd,Question 6 - Leadership,10.0,4700.0 277 | 2017-05-31 20:48:20,SNP,Angus Robertson,Question 6 - Leadership,1.0,4701.0 278 | 2017-05-31 20:48:21,Conservatives,Amber Rudd,Question 6 - Leadership,8.0,4709.0 279 | 2017-05-31 20:48:29,Labour,Jeremy Corbyn,Question 6 - Leadership,3.0,4712.0 280 | 2017-05-31 20:48:32,Conservatives,Amber 
Rudd,Question 6 - Leadership,46.0,4758.0 281 | 2017-05-31 20:49:18,,Mishal Husain,Question 6 - Leadership,2.0,4760.0 282 | 2017-05-31 20:49:20,Labour,Jeremy Corbyn,Question 6 - Leadership,11.0,4771.0 283 | 2017-05-31 20:49:31,,Mishal Husain,Question 6 - Leadership,1.0,4772.0 284 | 2017-05-31 20:49:32,Green,Caroline Lucas,Question 6 - Leadership,81.0,4853.0 285 | 2017-05-31 20:50:53,,Mishal Husain,Question 6 - Leadership,1.0,4854.0 286 | 2017-05-31 20:50:54,Liberal Democrats,Tim Farron,Question 6 - Leadership,25.0,4879.0 287 | 2017-05-31 20:51:19,,Mishal Husain,Question 6 - Leadership,1.0,4880.0 288 | 2017-05-31 20:51:20,Liberal Democrats,Tim Farron,Question 6 - Leadership,59.00000000000001,4939.0 289 | 2017-05-31 20:52:19,,Mishal Husain,Question 6 - Leadership,6.0,4945.0 290 | 2017-05-31 20:52:25,SNP,Angus Robertson,Question 6 - Leadership,75.0,5020.0 291 | 2017-05-31 20:53:40,,Mishal Husain,Question 6 - Leadership,2.0,5022.0 292 | 2017-05-31 20:53:42,Plaid Cymru,Leanne Wood,Question 6 - Leadership,39.0,5061.0 293 | 2017-05-31 20:54:21,,Mishal Husain,Question 6 - Leadership,32.0,5093.0 294 | 2017-05-31 20:54:53,UKIP,Paul Nuttall,Closing statements,39.0,5132.0 295 | 2017-05-31 20:55:32,,Mishal Husain,Closing statements,2.0,5134.0 296 | 2017-05-31 20:55:34,Green,Caroline Lucas,Closing statements,40.0,5174.0 297 | 2017-05-31 20:56:14,,Mishal Husain,Closing statements,2.0,5176.0 298 | 2017-05-31 20:56:16,Labour,Jeremy Corbyn,Closing statements,31.000000000000004,5207.0 299 | 2017-05-31 20:56:47,,Mishal Husain,Closing statements,2.0,5209.0 300 | 2017-05-31 20:56:49,SNP,Angus Robertson,Closing statements,30.000000000000004,5239.0 301 | 2017-05-31 20:57:19,,Mishal Husain,Closing statements,2.0,5241.0 302 | 2017-05-31 20:57:21,Plaid Cymru,Leanne Wood,Closing statements,36.0,5277.0 303 | 2017-05-31 20:57:57,,Mishal Husain,Closing statements,1.0,5278.0 304 | 2017-05-31 20:57:58,Liberal Democrats,Tim Farron,Closing statements,32.0,5310.0 305 | 2017-05-31 20:58:30,,Mishal 
Husain,Closing statements,3.0,5313.0 306 | 2017-05-31 20:58:33,Conservatives,Amber Rudd,Closing statements,32.0,5345.0 307 | 2017-05-31 20:59:05,,Mishal Husain,Closing statements,58.00000000000001,5403.0 308 | -------------------------------------------------------------------------------- /data/bbcdebate/speakers/bbcdebate_log_raw.csv: -------------------------------------------------------------------------------- 1 | 2017-05-31 21:00:20,start 2 | 2017-05-31 21:00:55,. 3 | 2017-05-31 21:02:20,opening 4 | 2017-05-31 21:02:21,p 5 | 2017-05-31 21:03:28,. 6 | 2017-05-31 21:03:31,g 7 | 2017-05-31 21:04:46,. 8 | 2017-05-31 21:04:50,c 9 | 2017-05-31 21:05:58,. 10 | 2017-05-31 21:05:59,l 11 | 2017-05-31 21:06:51,. 12 | 2017-05-31 21:06:54,u 13 | 2017-05-31 21:08:02,. 14 | 2017-05-31 21:08:02,s 15 | 2017-05-31 21:08:58,. 16 | 2017-05-31 21:09:01,ld 17 | 2017-05-31 21:10:06,. 18 | 2017-05-31 21:10:45,q1 - how are you going to help working people 19 | 2017-05-31 21:10:47,c 20 | 2017-05-31 21:11:49,. 21 | 2017-05-31 21:11:49,l 22 | 2017-05-31 21:12:44,. 23 | 2017-05-31 21:12:48,s 24 | 2017-05-31 21:14:16,. 25 | 2017-05-31 21:14:17,c 26 | 2017-05-31 21:14:38,l 27 | 2017-05-31 21:14:44,c 28 | 2017-05-31 21:15:02,. 29 | 2017-05-31 21:15:13,ld 30 | 2017-05-31 21:16:38,. 31 | 2017-05-31 21:16:39,g 32 | 2017-05-31 21:18:00,. 33 | 2017-05-31 21:18:02,c 34 | 2017-05-31 21:18:18,g 35 | 2017-05-31 21:18:21,c 36 | 2017-05-31 21:18:49,ld 37 | 2017-05-31 21:18:52,. 38 | 2017-05-31 21:18:55,p 39 | 2017-05-31 21:19:36,u 40 | 2017-05-31 21:20:44,s 41 | 2017-05-31 21:20:45,u 42 | 2017-05-31 21:20:46,. 43 | 2017-05-31 21:20:49,l 44 | 2017-05-31 21:21:51,c 45 | 2017-05-31 21:21:59,l 46 | 2017-05-31 21:22:02,u 47 | 2017-05-31 21:22:12,. 48 | 2017-05-31 21:22:26,l 49 | 2017-05-31 21:23:09,. 50 | 2017-05-31 21:23:12,u 51 | 2017-05-31 21:23:20,s 52 | 2017-05-31 21:23:23,u 53 | 2017-05-31 21:23:30,ld 54 | 2017-05-31 21:24:04,c 55 | 2017-05-31 21:24:39,. 
56 | 2017-05-31 21:25:07,q2 - how would we have the workers and skills we need to make uk a success after brexit 57 | 2017-05-31 21:25:08,u 58 | 2017-05-31 21:25:22,p 59 | 2017-05-31 21:25:24,u 60 | 2017-05-31 21:26:18,p 61 | 2017-05-31 21:26:21,ld 62 | 2017-05-31 21:27:23,. 63 | 2017-05-31 21:27:25,c 64 | 2017-05-31 21:28:01,. 65 | 2017-05-31 21:28:10,l 66 | 2017-05-31 21:28:56,. 67 | 2017-05-31 21:28:59,l 68 | 2017-05-31 21:29:09,. 69 | 2017-05-31 21:29:10,l 70 | 2017-05-31 21:29:21,c 71 | 2017-05-31 21:29:22,l 72 | 2017-05-31 21:29:39,. 73 | 2017-05-31 21:29:47,p 74 | 2017-05-31 21:29:48,. 75 | 2017-05-31 21:30:00,p 76 | 2017-05-31 21:30:04,. 77 | 2017-05-31 21:30:06,p 78 | 2017-05-31 21:30:18,. 79 | 2017-05-31 21:30:21,p 80 | 2017-05-31 21:30:26,. 81 | 2017-05-31 21:30:31,p 82 | 2017-05-31 21:31:01,u 83 | 2017-05-31 21:31:04,p 84 | 2017-05-31 21:31:07,u 85 | 2017-05-31 21:31:41,ld 86 | 2017-05-31 21:31:52,s 87 | 2017-05-31 21:32:49,. 88 | 2017-05-31 21:32:51,s 89 | 2017-05-31 21:33:03,pause 90 | 2017-05-31 21:33:23,restart 91 | 2017-05-31 21:33:41,pause 92 | 2017-05-31 21:35:53,restart 93 | 2017-05-31 21:36:13,. 94 | 2017-05-31 21:36:14,s 95 | 2017-05-31 21:36:18,. 96 | 2017-05-31 21:36:19,g 97 | 2017-05-31 21:36:46,pause 98 | 2017-05-31 21:37:41,restart 99 | 2017-05-31 21:38:36,. 100 | 2017-05-31 21:38:39,l 101 | 2017-05-31 21:39:01,g 102 | 2017-05-31 21:39:02,l 103 | 2017-05-31 21:39:21,s 104 | 2017-05-31 21:39:22,l 105 | 2017-05-31 21:39:25,s 106 | 2017-05-31 21:39:37,l 107 | 2017-05-31 21:39:40,s 108 | 2017-05-31 21:39:50,l 109 | 2017-05-31 21:39:54,ld 110 | 2017-05-31 21:40:17,. 111 | 2017-05-31 21:40:22,c 112 | 2017-05-31 21:40:50,p 113 | 2017-05-31 21:40:56,g 114 | 2017-05-31 21:41:25,c 115 | 2017-05-31 21:41:48,. 116 | 2017-05-31 21:42:12,q3 - where is the money coming from for our public services and how can we trust your plans add up 117 | 2017-05-31 21:42:13,g 118 | 2017-05-31 21:43:40,. 
119 | 2017-05-31 21:43:48,c 120 | 2017-05-31 21:43:59,l 121 | 2017-05-31 21:43:59,c 122 | 2017-05-31 21:44:08,p 123 | 2017-05-31 21:44:10,c 124 | 2017-05-31 21:44:48,p 125 | 2017-05-31 21:44:49,l 126 | 2017-05-31 21:45:30,. 127 | 2017-05-31 21:45:35,l 128 | 2017-05-31 21:46:03,. 129 | 2017-05-31 21:46:04,l 130 | 2017-05-31 21:46:29,p 131 | 2017-05-31 21:46:41,l 132 | 2017-05-31 21:47:02,p 133 | 2017-05-31 21:47:03,l 134 | 2017-05-31 21:47:18,p 135 | 2017-05-31 21:47:20,. 136 | 2017-05-31 21:47:30,s 137 | 2017-05-31 21:47:51,l 138 | 2017-05-31 21:47:54,s 139 | 2017-05-31 21:48:37,. 140 | 2017-05-31 21:48:38,s 141 | 2017-05-31 21:48:51,c 142 | 2017-05-31 21:48:54,s 143 | 2017-05-31 21:48:54,c 144 | 2017-05-31 21:48:59,s 145 | 2017-05-31 21:49:00,c 146 | 2017-05-31 21:49:06,l 147 | 2017-05-31 21:49:07,c 148 | 2017-05-31 21:49:17,l 149 | 2017-05-31 21:49:18,c 150 | 2017-05-31 21:49:28,s 151 | 2017-05-31 21:49:29,c 152 | 2017-05-31 21:49:34,l 153 | 2017-05-31 21:49:36,c 154 | 2017-05-31 21:49:40,l 155 | 2017-05-31 21:49:42,c 156 | 2017-05-31 21:49:46,l 157 | 2017-05-31 21:49:46,c 158 | 2017-05-31 21:49:51,. 159 | 2017-05-31 21:50:03,c 160 | 2017-05-31 21:50:29,l 161 | 2017-05-31 21:50:32,ld 162 | 2017-05-31 21:51:25,. 163 | 2017-05-31 21:51:29,ld 164 | 2017-05-31 21:51:29,. 165 | 2017-05-31 21:51:34,ld 166 | 2017-05-31 21:52:15,g 167 | 2017-05-31 21:52:16,ld 168 | 2017-05-31 21:52:30,g 169 | 2017-05-31 21:52:31,ld 170 | 2017-05-31 21:52:33,. 171 | 2017-05-31 21:52:34,ld 172 | 2017-05-31 21:52:38,u 173 | 2017-05-31 21:53:05,s 174 | 2017-05-31 21:53:06,u 175 | 2017-05-31 21:53:09,s 176 | 2017-05-31 21:53:10,u 177 | 2017-05-31 21:53:19,ld 178 | 2017-05-31 21:53:21,u 179 | 2017-05-31 21:53:36,. 180 | 2017-05-31 21:53:38,u 181 | 2017-05-31 21:53:58,. 182 | 2017-05-31 21:54:34,q4 - what are your priorities for making britain a safer country and the world a safer place 183 | 2017-05-31 21:54:36,s 184 | 2017-05-31 21:56:28,. 
185 | 2017-05-31 21:56:34,l 186 | 2017-05-31 21:57:55,. 187 | 2017-05-31 21:58:00,l 188 | 2017-05-31 21:58:49,. 189 | 2017-05-31 21:59:01,l 190 | 2017-05-31 21:59:07,. 191 | 2017-05-31 21:59:08,l 192 | 2017-05-31 21:59:29,. 193 | 2017-05-31 21:59:33,c 194 | 2017-05-31 22:00:19,. 195 | 2017-05-31 22:00:24,c 196 | 2017-05-31 22:01:08,l 197 | 2017-05-31 22:01:46,. 198 | 2017-05-31 22:01:48,c 199 | 2017-05-31 22:01:50,. 200 | 2017-05-31 22:01:51,l 201 | 2017-05-31 22:01:54,c 202 | 2017-05-31 22:01:55,. 203 | 2017-05-31 22:02:09,ld 204 | 2017-05-31 22:03:44,. 205 | 2017-05-31 22:03:44,u 206 | 2017-05-31 22:04:13,. 207 | 2017-05-31 22:04:15,u 208 | 2017-05-31 22:04:19,ld 209 | 2017-05-31 22:04:20,u 210 | 2017-05-31 22:05:19,ld 211 | 2017-05-31 22:05:20,u 212 | 2017-05-31 22:05:21,ld 213 | 2017-05-31 22:05:28,. 214 | 2017-05-31 22:05:29,g 215 | 2017-05-31 22:06:00,u 216 | 2017-05-31 22:06:01,g 217 | 2017-05-31 22:07:11,. 218 | 2017-05-31 22:07:12,c 219 | 2017-05-31 22:07:20,g 220 | 2017-05-31 22:07:22,c 221 | 2017-05-31 22:07:25,s 222 | 2017-05-31 22:07:26,c 223 | 2017-05-31 22:07:28,g 224 | 2017-05-31 22:07:35,c 225 | 2017-05-31 22:07:38,. 226 | 2017-05-31 22:07:39,p 227 | 2017-05-31 22:08:04,. 228 | 2017-05-31 22:08:10,p 229 | 2017-05-31 22:08:28,. 230 | 2017-05-31 22:08:31,p 231 | 2017-05-31 22:09:06,. 232 | 2017-05-31 22:09:10,ld 233 | 2017-05-31 22:09:36,. 234 | 2017-05-31 22:09:37,ld 235 | 2017-05-31 22:09:39,u 236 | 2017-05-31 22:10:02,. 237 | 2017-05-31 22:10:03,g 238 | 2017-05-31 22:10:09,u 239 | 2017-05-31 22:10:23,s 240 | 2017-05-31 22:10:24,u 241 | 2017-05-31 22:10:30,s 242 | 2017-05-31 22:10:39,u 243 | 2017-05-31 22:10:39,s 244 | 2017-05-31 22:11:01,. 245 | 2017-05-31 22:11:04,l 246 | 2017-05-31 22:11:19,u 247 | 2017-05-31 22:11:20,l 248 | 2017-05-31 22:11:25,u 249 | 2017-05-31 22:11:27,l 250 | 2017-05-31 22:11:47,. 
251 | 2017-05-31 22:12:13,q5 - how will panellists deal with trump pulling out of climate change agreement 252 | 2017-05-31 22:12:14,ld 253 | 2017-05-31 22:13:06,. 254 | 2017-05-31 22:13:08,g 255 | 2017-05-31 22:14:28,. 256 | 2017-05-31 22:14:31,u 257 | 2017-05-31 22:14:54,. 258 | 2017-05-31 22:14:55,u 259 | 2017-05-31 22:15:14,. 260 | 2017-05-31 22:15:26,l 261 | 2017-05-31 22:16:08,pause 262 | 2017-05-31 22:16:55,restart 263 | 2017-05-31 22:16:59,. 264 | 2017-05-31 22:17:00,c 265 | 2017-05-31 22:17:05,. 266 | 2017-05-31 22:17:10,c 267 | 2017-05-31 22:17:35,p 268 | 2017-05-31 22:17:41,c 269 | 2017-05-31 22:17:42,p 270 | 2017-05-31 22:17:43,c 271 | 2017-05-31 22:17:58,l 272 | 2017-05-31 22:17:59,p 273 | 2017-05-31 22:18:00,l 274 | 2017-05-31 22:18:03,p 275 | 2017-05-31 22:18:31,. 276 | 2017-05-31 22:18:33,s 277 | 2017-05-31 22:19:51,. 278 | 2017-05-31 22:20:20,q6 - in what way does your leadership have the talent and character needed to take this country forward 279 | 2017-05-31 22:20:21,l 280 | 2017-05-31 22:21:26,. 281 | 2017-05-31 22:21:30,u 282 | 2017-05-31 22:22:41,p 283 | 2017-05-31 22:22:47,u 284 | 2017-05-31 22:23:03,p 285 | 2017-05-31 22:23:07,u 286 | 2017-05-31 22:23:08,p 287 | 2017-05-31 22:23:15,u 288 | 2017-05-31 22:23:17,p 289 | 2017-05-31 22:23:18,u 290 | 2017-05-31 22:23:26,. 291 | 2017-05-31 22:23:39,c 292 | 2017-05-31 22:23:49,s 293 | 2017-05-31 22:23:49,c 294 | 2017-05-31 22:23:58,l 295 | 2017-05-31 22:24:01,c 296 | 2017-05-31 22:24:47,. 297 | 2017-05-31 22:24:49,l 298 | 2017-05-31 22:25:00,. 299 | 2017-05-31 22:25:01,g 300 | 2017-05-31 22:26:22,. 301 | 2017-05-31 22:26:23,ld 302 | 2017-05-31 22:26:48,. 303 | 2017-05-31 22:26:48,ld 304 | 2017-05-31 22:27:48,. 305 | 2017-05-31 22:27:54,s 306 | 2017-05-31 22:27:59,pause 307 | 2017-05-31 23:38:18,test 308 | 2017-05-31 23:39:38,restart 309 | 2017-05-31 23:40:48,. 310 | 2017-05-31 23:40:50,p 311 | 2017-05-31 23:41:29,. 
312 | 2017-05-31 23:41:50,closing remarks 313 | 2017-05-31 23:42:01,u 314 | 2017-05-31 23:42:40,. 315 | 2017-05-31 23:42:42,g 316 | 2017-05-31 23:43:22,. 317 | 2017-05-31 23:43:24,l 318 | 2017-05-31 23:43:55,. 319 | 2017-05-31 23:43:57,s 320 | 2017-05-31 23:44:27,. 321 | 2017-05-31 23:44:29,p 322 | 2017-05-31 23:45:05,. 323 | 2017-05-31 23:45:06,ld 324 | 2017-05-31 23:45:38,. 325 | 2017-05-31 23:45:41,c 326 | 2017-05-31 23:46:13,. 327 | 2017-05-31 23:46:46,credits 328 | 2017-05-31 23:47:11,end 329 | -------------------------------------------------------------------------------- /data/bbcdebate/speakers/debate_logger.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | 4 | if __name__ == "__main__": 5 | while True: 6 | speaker = input("> ") 7 | ts = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S') 8 | out = "{},{}\n".format(ts, speaker) 9 | with open('bbcdebate_log_raw.csv', 'a') as f: 10 | f.write(out) 11 | -------------------------------------------------------------------------------- /data/bbcdebate/speakers/process_debate_log.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | import numpy as np 3 | import pandas as pd 4 | 5 | 6 | # Import raw data 7 | log = pd.read_csv('bbcdebate_log_raw.csv', header=None, names=['timestamp', 'speaking']) 8 | 9 | lookups = { 10 | # Speakers 11 | '.': ('chair', 'BBC', 'Mishal Husain'), 12 | 'p': ('party', 'Plaid Cymru', 'Leanne Wood'), 13 | 'g': ('party', 'Green', 'Caroline Lucas'), 14 | 'c': ('party', 'Conservatives', 'Amber Rudd'), 15 | 'l': ('party', 'Labour', 'Jeremy Corbyn'), 16 | 'u': ('party', 'UKIP', 'Paul Nuttall'), 17 | 's': ('party', 'SNP', 'Angus Robertson'), 18 | 'ld': ('party', 'Liberal Democrats', 'Tim Farron'), 19 | # Sections 20 | 'opening': ('section', 'Opening statements', ''), 21 | 'q1 - how are you going to help working people': ('section', 'Question 1 - Living standards', ''), 22 | 'q2 
- how would we have the workers and skills we need to make uk a success after brexit': ('section', 'Question 2 - Brexit', ''), # noqa 23 | 'q3 - where is the money coming from for our public services and how can we trust your plans add up': ('section', 'Question 3 - Public services', ''), # noqa 24 | 'q4 - what are your priorities for making britain a safer country and the world a safer place': ('section', 'Question 4 - Security', ''), # noqa 25 | 'q5 - how will panellists deal with trump pulling out of climate change agreement': ('section', 'Question 5 - Climate change and Trump', ''), # noqa 26 | 'q6 - in what way does your leadership have the talent and character needed to take this country forward': ('section', 'Question 6 - Leadership', ''), # noqa 27 | 'closing remarks': ('section', 'Closing statements', ''), 28 | 'credits': ('section', 'Credits', ''), 29 | # Meta / markers 30 | 'start': ('marker', '', ''), 31 | 'end': ('marker', '', ''), 32 | 'pause': ('marker', '', ''), 33 | 'restart': ('marker', '', ''), 34 | 'test': ('marker', '', ''), 35 | } 36 | 37 | # Lookup type/party/speaker/section + forward fill section data 38 | log['type'] = log.speaking.apply(lambda x: lookups[x][0]) 39 | log['party'] = log.apply(lambda row: lookups[row['speaking']][1] if row['type'] == 'party' else np.nan, axis=1) 40 | log['speaker'] = log.apply(lambda row: lookups[row['speaking']][2] if row['type'] in ['party', 'chair'] else np.nan, axis=1) # noqa 41 | log['section'] = log.apply(lambda row: lookups[row['speaking']][1] if row['type'] == 'section' else np.nan, axis=1) 42 | log['section'] = log['section'].ffill() 43 | log.loc[log.section.isnull(), 'section'] = 'Opening credits' 44 | 45 | # Remove section records (markers are dropped further down) 46 | log = log[log.type != 'section'].copy().reset_index(drop=True) 47 | 48 | # Enforce unique timestamps - add one second to any record whose timestamp equals that of the record above 49 | log['timestamp'] = pd.to_datetime(log.timestamp) 50 | for i, row in log.iterrows(): 51 | 
if i > 0 and log.iloc[i].timestamp == log.iloc[i-1].timestamp: 52 | log.loc[i, 'timestamp'] += datetime.timedelta(seconds=1) 53 | 54 | # Check 55 | # sum(log.timestamp.duplicated()) 56 | 57 | # Add time counter for each record (time diff between one record and next) 58 | log['counter'] = np.nan 59 | for i, row in log.iterrows(): 60 | if i < len(log) - 1: 61 | log.loc[i, 'counter'] = (log.iloc[i + 1].timestamp - log.iloc[i].timestamp).total_seconds() 62 | 63 | # Raw data includes a number of pause/restart records (I was doing this on a train...!) 64 | log = log[log.speaking != 'pause'].copy().reset_index(drop=True) 65 | log = log[log.speaking != 'test'].copy().reset_index(drop=True) 66 | # Any restart's time needs to be added to the record above 67 | for i, row in log.iterrows(): 68 | if i > 0 and log.iloc[i].speaking == 'restart': 69 | log.loc[i-1, 'counter'] += log.iloc[i].counter 70 | # Drop markers 71 | log = log[log.type != 'marker'].copy().reset_index(drop=True) 72 | 73 | # Check (sums to almost exactly 90 minutes, pretty good in my books) 74 | # log.counter.sum() / 60 75 | 76 | # Drop speaking/type cols 77 | del log['speaking'] 78 | del log['type'] 79 | 80 | # Add total time elapsed since the event began 81 | log['time_elapsed'] = log.counter.cumsum() 82 | 83 | # Recalculate record timestamps assuming a 7:30pm start time 84 | start_time = pd.to_datetime('2017-05-31 19:30:00') 85 | log.loc[0, 'timestamp'] = start_time 86 | log.loc[1:, 'timestamp'] = log.time_elapsed.apply(lambda x: start_time + datetime.timedelta(seconds=x)).iloc[:-1].values 87 | 88 | # Export as CSV 89 | log.to_csv('bbcdebate_log.csv', header=True, index=False) 90 | -------------------------------------------------------------------------------- /data/eu_referendum/electoral_commission/results/README.md: -------------------------------------------------------------------------------- 1 | # EU Referendum results (23rd June 2016) 2 | 3 | ## Source 4 | 
http://www.electoralcommission.org.uk/find-information-by-subject/elections-and-referendums/upcoming-elections-and-referendums/eu-referendum/electorate-and-count-information 5 | 6 | ## Raw files 7 | - `EU-referendum-result-data.csv` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/EU-referendum-result-data.csv)) 8 | 9 | ## Retrieving the data 10 | Run `python retrieve.py` 11 | 12 | ## Cleaning the data 13 | No data cleaning required. 14 | -------------------------------------------------------------------------------- /data/eu_referendum/electoral_commission/results/raw/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/six50/pipeline/6bc7dbb33387ea59f28cdfc9c35e6ef3daedf2a3/data/eu_referendum/electoral_commission/results/raw/.gitkeep -------------------------------------------------------------------------------- /data/eu_referendum/electoral_commission/results/scripts/retrieve.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import requests 3 | 4 | 5 | def main(ROOT_DIR): 6 | """Expects ROOT_DIR to be one level above (i.e. 
/results/)""" 7 | 8 | # Config 9 | ROOT_DIR = Path(ROOT_DIR) 10 | url = 'http://www.electoralcommission.org.uk/__data/assets/file/0014/212135/' 11 | filename = 'EU-referendum-result-data.csv' 12 | target = ROOT_DIR / 'raw' / filename 13 | 14 | # Download URL into local directory 15 | print('Downloading into {}'.format(target.resolve())) 16 | with open(target, 'wb') as f: 17 | response = requests.get(url + filename) 18 | f.write(response.content) 19 | 20 | 21 | # If being run from inside /scripts/ folder 22 | if __name__ == "__main__": 23 | main(ROOT_DIR='../') 24 | -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/README.md: -------------------------------------------------------------------------------- 1 | # UK Parliament general election results 2 | 3 | ## Source 4 | - **6th May 2010:** http://www.electoralcommission.org.uk/our-work/our-research/electoral-data 5 | - **7th May 2015:** http://www.electoralcommission.org.uk/our-work/our-research/electoral-data 6 | 7 | ## Raw files 8 | - **6th May 2010:** 9 | - `GE2010-results-flatfile-website.xls` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/GE2010-results-flatfile-website.xls)) 10 | - **7th May 2015:** 11 | - `RESULTS.csv` 12 | - `RESULTS FOR ANALYSIS.csv` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/CONSTITUENCY.csv)) 13 | - `CONSTITUENCY.csv` 14 | - `PARTY NAMES.csv` 15 | - `NOTES.csv` 16 | 17 | ## Cleaned files 18 | - **6th May 2010:** 19 | - `ge_2010_results.csv` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2010_results.csv)) 20 | - `ge_2010_results.feather` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2010_results.feather)) 21 | - **7th May 2015:** 22 | - `ge_2015_results.csv` ([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2015_results.csv)) 23 | - `ge_2015_results.feather` 
([download from SixFifty S3](https://s3-eu-west-1.amazonaws.com/sixfifty/ge_2015_results.feather)) 24 | 25 | ## Retrieving the data 26 | ``` 27 | python retrieve_2010.py 28 | python retrieve_2015.py 29 | ``` 30 | 31 | ## Cleaning the data 32 | ``` 33 | python process_2010.py 34 | python process_2015.py 35 | ``` 36 | -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/clean/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/six50/pipeline/6bc7dbb33387ea59f28cdfc9c35e6ef3daedf2a3/data/general_election/electoral_commission/results/clean/.gitkeep -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/raw/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/six50/pipeline/6bc7dbb33387ea59f28cdfc9c35e6ef3daedf2a3/data/general_election/electoral_commission/results/raw/.gitkeep -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/scripts/process_2010.py: -------------------------------------------------------------------------------- 1 | import feather 2 | import pandas as pd 3 | from pathlib import Path 4 | 5 | 6 | def main(ROOT_DIR): 7 | """Expects ROOT_DIR to be one level above (i.e. 
/results/)""" 8 | 9 | # Import 10 | print('Read and clean GE2010-results-flatfile-website.xls') 11 | ROOT_DIR = Path(ROOT_DIR) 12 | results = pd.read_excel(ROOT_DIR / 'raw' / 'GE2010-results-flatfile-website.xls', 13 | sheet_name='Party vote share') 14 | 15 | # Remove rows where Constituency Name is blank 16 | blank_rows = results['Constituency Name'].isnull() 17 | results = results[~blank_rows].copy() 18 | 19 | # Set NA vals to zero 20 | for col in results.columns[6:]: 21 | results[col] = results[col].fillna(0) 22 | 23 | # Checks 24 | assert(results.shape == (650, 144)) 25 | 26 | # Export as both CSV and Feather 27 | file_path = ROOT_DIR / 'clean' 28 | print('Exporting to {}'.format(file_path.resolve())) 29 | results.to_csv(file_path / 'ge_2010_results.csv', index=False) 30 | feather.write_dataframe(results, str(file_path / 'ge_2010_results.feather')) 31 | 32 | 33 | # If being run from inside /scripts/ folder 34 | if __name__ == "__main__": 35 | main(ROOT_DIR='../') 36 | -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/scripts/process_2015.py: -------------------------------------------------------------------------------- 1 | import feather 2 | import pandas as pd 3 | from pathlib import Path 4 | 5 | 6 | def main(ROOT_DIR): 7 | """Expects ROOT_DIR to be one level above (i.e. 
/results/)""" 8 | 9 | # Config 10 | ROOT_DIR = Path(ROOT_DIR) 11 | 12 | 13 | # GENERAL ELECTION RESULTS 14 | print('Read and clean RESULTS FOR ANALYSIS.csv') 15 | 16 | # Import general election results 17 | results = pd.read_csv(ROOT_DIR / 'raw' / 'RESULTS FOR ANALYSIS.csv') 18 | 19 | # Remove 'Unnamed: 9' column 20 | del results['Unnamed: 9'] 21 | 22 | # Fix bad column name (' Total number of valid votes counted ' to 'Valid Votes') 23 | results.columns = list(results.columns[:8]) + ['Valid Votes'] + list(results.columns[9:]) 24 | 25 | # Remove rows where Constituency Name is blank 26 | blank_rows = results['Constituency Name'].isnull() 27 | results = results[~blank_rows].copy() 28 | 29 | # Remove commas & coerce Electorate and Valid Votes to float 30 | for col in ['Electorate', 'Valid Votes']: 31 | results[col] = results[col].apply(lambda x: float(x.replace(",", ""))) 32 | 33 | # Set NA vals to zero 34 | for col in results.columns[9:]: 35 | results[col] = results[col].fillna(0) 36 | 37 | # Checks 38 | assert(results.shape == (650, 146)) 39 | 40 | 41 | # CONSTITUENCY DATA 42 | print('Read and clean CONSTITUENCY.csv') 43 | 44 | # Import constituency data 45 | constituency = pd.read_csv(ROOT_DIR / 'raw' / 'CONSTITUENCY.csv', encoding='latin1') 46 | 47 | # Remove rows where Constituency Name is blank 48 | blank_rows = constituency['Constituency Name'].isnull() 49 | constituency = constituency[~blank_rows].copy() 50 | 51 | # Remove 'Unnamed: 6' column 52 | del constituency['Unnamed: 6'] 53 | 54 | # Checks 55 | assert(constituency.shape == (650, 10)) 56 | 57 | 58 | # MERGE 59 | print('Merging and exporting') 60 | 61 | # Pre-merge checks 62 | match_col = 'Constituency ID' 63 | assert(len(set(constituency[match_col]).intersection(set(results[match_col]))) == 650) 64 | assert(len(set(constituency[match_col]).difference(set(results[match_col]))) == 0) 65 | assert(len(set(results[match_col]).difference(set(constituency[match_col]))) == 0) 66 | 67 | # Merge on 
Constituency ID 68 | results = pd.merge( 69 | left=results, 70 | right=constituency[['Constituency ID', 'Region ID', 'County']], 71 | how='left', 72 | on='Constituency ID' 73 | ) 74 | 75 | # EXPORT 76 | column_order = ['Press Association ID Number', 'Constituency ID', 'Constituency Name', 'Constituency Type', 77 | 'County', 'Region ID', 'Region', 'Country', 'Election Year', 'Electorate', 78 | 'Valid Votes'] + list(results.columns[9:146]) 79 | results = results[column_order] 80 | 81 | # Export as both CSV and Feather 82 | file_path = ROOT_DIR / 'clean' 83 | results.to_csv(file_path / 'ge_2015_results.csv', index=False) 84 | feather.write_dataframe(results, str(file_path / 'ge_2015_results.feather')) 85 | 86 | 87 | # If being run from inside /scripts/ folder 88 | if __name__ == "__main__": 89 | main(ROOT_DIR='../') 90 | -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/scripts/retrieve_2010.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | import requests 3 | 4 | 5 | def main(ROOT_DIR): 6 | """Expects ROOT_DIR to be one level above (i.e. 
/results/)""" 7 | 8 | # Config 9 | ROOT_DIR = Path(ROOT_DIR) 10 | url = 'http://www.electoralcommission.org.uk/__data/assets/excel_doc/0003/105726/' 11 | filename = 'GE2010-results-flatfile-website.xls' 12 | target = ROOT_DIR / 'raw' / filename 13 | 14 | # Download URL into local directory 15 | print('Downloading into {}'.format(target.resolve())) 16 | with open(target, 'wb') as f: 17 | response = requests.get(url + filename) 18 | f.write(response.content) 19 | 20 | 21 | # If being run from inside /scripts/ folder 22 | if __name__ == "__main__": 23 | main(ROOT_DIR='../') 24 | -------------------------------------------------------------------------------- /data/general_election/electoral_commission/results/scripts/retrieve_2015.py: -------------------------------------------------------------------------------- 1 | import os 2 | from pathlib import Path 3 | import requests 4 | import zipfile 5 | 6 | 7 | def main(ROOT_DIR): 8 | """Expects ROOT_DIR to be one level above (i.e. /results/)""" 9 | 10 | # Config 11 | ROOT_DIR = Path(ROOT_DIR) 12 | url = 'http://www.electoralcommission.org.uk/__data/assets/file/0004/191650/' 13 | filename = '2015-UK-general-election-data-results-WEB.zip' 14 | target = ROOT_DIR / 'raw' 15 | 16 | # Download URL into local directory 17 | print('Downloading into {}'.format(target.resolve())) 18 | with open(filename, 'wb') as f: 19 | response = requests.get(url + filename) 20 | f.write(response.content) 21 | 22 | # Extract into target location 23 | print('Extracting into {}'.format(target.resolve())) 24 | with zipfile.ZipFile(filename, "r") as f: 25 | f.extractall(target) 26 | 27 | # Delete the .zip file, we don't need it 28 | print('Cleaning up') 29 | os.remove(filename) 30 | 31 | 32 | # If being run from inside /scripts/ folder 33 | if __name__ == "__main__": 34 | main(ROOT_DIR='../') 35 | -------------------------------------------------------------------------------- /data/generate_data.py: 
-------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | 3 | from eu_referendum.electoral_commission.results.scripts import retrieve as eu_retrieve 4 | from general_election.electoral_commission.results.scripts import ( 5 | process_2010, 6 | process_2015, 7 | retrieve_2010, 8 | retrieve_2015, 9 | ) 10 | from model.scripts import process as model_process 11 | 12 | if __name__ == "__main__": 13 | 14 | # Retrieve EU Referendum data 15 | eu_path = Path(".") / "data" / "eu_referendum" / "electoral_commission" / "results" 16 | eu_retrieve.main(eu_path) 17 | 18 | # Retrieve & clean general election data 19 | ge_path = Path(".") / "data" / "general_election" / "electoral_commission" / "results" 20 | retrieve_2010.main(ge_path) 21 | # retrieve_2015.main(ge_path) # TODO: This is broken. 22 | # process_2010.main(ge_path) # TODO: This is broken. 23 | # process_2015.main(ge_path) # TODO: This is broken. 24 | 25 | # Process data ready for modelling 26 | model_path = Path(".") / "data" 27 | # model_process.main(model_path) # TODO: This is broken. 
28 | -------------------------------------------------------------------------------- /data/model/clean/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/six50/pipeline/6bc7dbb33387ea59f28cdfc9c35e6ef3daedf2a3/data/model/clean/.gitkeep -------------------------------------------------------------------------------- /data/model/scripts/process.py: -------------------------------------------------------------------------------- 1 | import feather 2 | import pandas as pd 3 | from pathlib import Path 4 | 5 | 6 | def main(ROOT_DIR): 7 | """Expects ROOT_DIR to be the /data/ directory (two levels above /scripts/)""" 8 | 9 | # Config 10 | ROOT_DIR = Path(ROOT_DIR) 11 | 12 | 13 | # IMPORT 14 | print('Importing data') 15 | 16 | # Import general election results 17 | ge_results = {} 18 | # file_path = ROOT_DIR / 'general_election' / 'electoral_commission' / 'results' / 'clean' 19 | # ge_results['2010'] = pd.read_csv(file_path / 'ge_2010_results.csv') 20 | 21 | file_path = ROOT_DIR / 'general_election' / 'electoral_commission' / 'results' / 'clean' 22 | ge_results['2015'] = pd.read_csv(file_path / 'ge_2015_results.csv') 23 | 24 | # Import EU Referendum results 25 | file_path = ROOT_DIR / 'eu_referendum' / 'electoral_commission' / 'results' / 'raw' 26 | eu_ref = pd.read_csv(file_path / 'EU-referendum-result-data.csv') 27 | 28 | # Aggregate EU Referendum data at Region level and derive `pc_remain` column 29 | eu_by_region = eu_ref.groupby('Region')[['Remain', 'Leave', 'Valid_Votes']].sum().reset_index() 30 | eu_by_region['pc_remain'] = eu_by_region['Remain'] / eu_by_region['Valid_Votes'] 31 | 32 | 33 | # MERGE 34 | print('Merging and exporting') 35 | 36 | # Pre-merge checks 37 | match_col = 'Region' 38 | assert(len(set(eu_by_region[match_col]).intersection(set(ge_results['2015'][match_col]))) == 12) 39 | assert(len(set(eu_by_region[match_col]).difference(set(ge_results['2015'][match_col]))) == 0) 40 | 
assert(len(set(ge_results['2015'][match_col]).difference(set(eu_by_region[match_col]))) == 0) 41 | 42 | # Merge on Region 43 | ge_results['2015'] = pd.merge( 44 | left=ge_results['2015'], 45 | right=eu_by_region[['Region', 'pc_remain']], 46 | how='left', 47 | on='Region' 48 | ) 49 | 50 | # Reorder columns 51 | cols = list(ge_results['2015']) 52 | cols.insert(cols.index('Region') + 1, # insert `pc_remain` col after `Region` 53 | cols.pop(cols.index('pc_remain'))) # remove it from current location 54 | ge_results['2015'] = ge_results['2015'].loc[:, cols] 55 | 56 | 57 | # EXPORT 58 | file_path = ROOT_DIR / 'model' / 'clean' 59 | ge_results['2015'].to_csv(file_path / 'model_2015.csv', index=False) 60 | feather.write_dataframe(ge_results['2015'], str(file_path / 'model_2015.feather')) 61 | 62 | 63 | # If being run from inside /scripts/ folder 64 | if __name__ == "__main__": 65 | main(ROOT_DIR='../../') 66 | -------------------------------------------------------------------------------- /data/polls/README.md: -------------------------------------------------------------------------------- 1 | # UK Polling Data 2 | 3 | If you have any questions about these datasets, please [contact us @SixFiftyData](https://twitter.com/SixFiftyData) on Twitter. 4 | 5 | ## Sources 6 | We source our data directly from the tables published by polling companies. See our post, _[Building SixFifty's Election Tracker](https://sixfifty.org.uk/2017/05/21/building-sixfiftys-election-tracker/)_, for more detail. 7 | 8 | ## SixFifty Cleaned Polling Datasets 9 | 10 | The following cleaned datasets are available for download from S3 in multiple formats. 
11 | 12 | | Region | JSON | CSV | Feather | 13 | | -- | -- | -- | -- | 14 | | **National polling** | [`polls.json`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.json) | [`polls.csv`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.csv) | [`polls.feather`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls.feather) | 15 | | **London-only polling** | [`polls_london.json`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london.json) | [`polls_london.csv`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london.csv) | [`polls_london.feather`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london.feather) | 16 | | **Scotland-only polling** | [`polls_scotland.json`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland.json) | [`polls_scotland.csv`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland.csv) | [`polls_scotland.feather`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland.feather) | 17 | | **Wales-only polling** | [`polls_wales.json`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales.json) | [`polls_wales.csv`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales.csv) | [`polls_wales.feather`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales.feather) | 18 | | **Northern Ireland polling** | [`polls_ni.json`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni.json) | [`polls_ni.csv`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni.csv) | [`polls_ni.feather`](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni.feather) | 19 | 20 | #### Data dictionary 21 | | Column | Type | Description | 22 | | -- | -- | -- | 23 | | `company` | string | Polling company name, e.g. `YouGov` | 24 | | `client` | string | Publisher name, e.g. `Times` | 25 | | `method` | string | Either `Online` or `Phone` | 26 | | `from` | string | Date sampled from, e.g. `2017-05-03` | 27 | | `to` | string | Date sampled to, e.g. `2017-05-05` | 28 | | `sample_size` | float | Sample size of poll, e.g. `1053.0` | 29 | | `party1` (e.g. 
`con`) | float | Proportion considering voting for party 1 (i.e. `con` is Conservative), e.g. `0.48` | 30 | | `party2` (e.g. `lab`) | float | Proportion considering voting for party 2 (i.e. `lab` is Labour), e.g. `0.29` | 31 | | `...` | float | Remaining columns reference party names | 32 | | `pdf` | string | URL of raw data (if available) | 33 | 34 | 35 | ## SixFifty Smoothed Polls Dataset 36 | LOWESS-smoothed poll-of-polls datasets ([details](https://sixfifty.org.uk/2017/05/21/building-sixfiftys-election-tracker/)) are also available for download from S3 in multiple formats. 37 | 38 | | Region | JSON | CSV | Feather | 39 | | -- | -- | -- | -- | 40 | | **National polling** | [polls_smoothed.json](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_smoothed.json) | [polls_smoothed.csv](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_smoothed.csv) | [polls_smoothed.feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_smoothed.feather) | 41 | | **London-only polling** | [polls_london_smoothed.json](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london_smoothed.json) | [polls_london_smoothed.csv](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london_smoothed.csv) | [polls_london_smoothed.feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_london_smoothed.feather) | 42 | | **Scotland-only polling** | [polls_scotland_smoothed.json](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland_smoothed.json) | [polls_scotland_smoothed.csv](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland_smoothed.csv) | [polls_scotland_smoothed.feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_scotland_smoothed.feather) | 43 | | **Wales-only polling** | [polls_wales_smoothed.json](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales_smoothed.json) | [polls_wales_smoothed.csv](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales_smoothed.csv) | [polls_wales_smoothed.feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_wales_smoothed.feather) | 44 | | 
**Northern Ireland polling** | [polls_ni_smoothed.json](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni_smoothed.json) | [polls_ni_smoothed.csv](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni_smoothed.csv) | [polls_ni_smoothed.feather](https://s3-eu-west-1.amazonaws.com/sixfifty/polls_ni_smoothed.feather) | 45 | 46 | #### Data dictionary 47 | | Column | Type | Description | 48 | | -- | -- | -- | 49 | | `date` | string | Date (derived from the `to` column), e.g. `2017-05-03` | 50 | | `party1` (e.g. `con`) | float | Smoothed value for party 1 vote intention (i.e. `con` is Conservative), e.g. `0.4789333335794975` | 51 | | `party2` (e.g. `lab`) | float | Smoothed value for party 2 vote intention (i.e. `lab` is Labour), e.g. `0.28146406464685286` | 52 | | `...` | float | Remaining columns reference party names | 53 | 54 | ## Scripts 55 | Executing `python generate_json.py` from this directory will: 56 | 57 | 1. Pull down the latest polling data from our manually curated Google Spreadsheet. 58 | 2. Clean & process the dataset. 59 | 3. Export the dataset into `data/` in .csv, .feather and .json formats. 60 | 4. Upload the dataset to S3 using the AWS CLI tool (n.b. you must have a valid AWS token with S3 permissions for this to work). 
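As a rough illustration of step 2 above, the sketch below slugifies column headers and converts party percentages into fractions. It is a minimal, standard-library-only sketch and not the code in `generate_json.py` (which uses pandas). The `META_COLS` set and the `slugify` and `clean_polls` helpers are hypothetical names, and the single sample row is assembled from the example values in the data dictionary rather than taken from a real poll.

```python
import csv
import io

# Assumed (hypothetical) set of non-party columns, named after the data dictionary.
META_COLS = {"company", "client", "method", "from", "to", "sample_size", "pdf"}


def slugify(name):
    """Lower-case a column header and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")


def clean_polls(csv_text):
    """Parse CSV text into dict rows with slugified keys and party shares as fractions."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row = {}
        for header, value in raw.items():
            key = slugify(header)
            if key == "sample_size":
                # The data dictionary lists sample_size as a float.
                row[key] = float(value) if value else None
            elif key in META_COLS:
                # Other metadata columns are kept as plain strings.
                row[key] = value
            else:
                # Anything else is treated as a party column: 48 becomes 0.48.
                row[key] = float(value) / 100 if value else None
        rows.append(row)
    return rows


# Illustrative row only, built from the data dictionary's example values.
SAMPLE = """Company,Client,Method,From,To,Sample Size,Con,Lab
YouGov,Times,Online,2017-05-03,2017-05-05,1053,48,29
"""

cleaned = clean_polls(SAMPLE)
print(cleaned[0]["con"], cleaned[0]["lab"], cleaned[0]["sample_size"])
```

Treating every non-metadata column as a party column mirrors the script's approach of ignoring a fixed list of columns, so new parties added to the spreadsheet would be picked up without any change to the code.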
61 | -------------------------------------------------------------------------------- /data/polls/generate_json.py: -------------------------------------------------------------------------------- 1 | import gzip 2 | import json 3 | from pathlib import Path 4 | from subprocess import DEVNULL, STDOUT, check_call 5 | 6 | import pandas as pd 7 | 8 | import feather 9 | import statsmodels.api as sm 10 | 11 | if __name__ == "__main__": 12 | 13 | # Config 14 | DATA_DIR = Path(".") / "data" 15 | base_url = "https://docs.google.com/spreadsheets/d/1CHMArwUdVza-ayOT1aG2tRJJfa1OWvwfMGiCu20X7Ys/export" 16 | urls = { 17 | "uk": base_url + "?gid=1495247060&format=csv", 18 | "london": base_url + "?gid=683561754&format=csv", 19 | "scotland": base_url + "?gid=1896448771&format=csv", 20 | "wales": base_url + "?gid=2059731736&format=csv", 21 | "ni": base_url + "?gid=1413308295&format=csv", 22 | } 23 | cutoff_date = "2017-01-01" 24 | 25 | # RETRIEVE DATA 26 | print("Downloading from Google Sheet") 27 | polls = {} 28 | for geo in urls: 29 | polls[geo] = pd.read_csv(urls[geo], low_memory=False) 30 | 31 | # PROCESS 32 | # Define parties, respect ordering from Google Doc 33 | def get_party_names(input_list): 34 | # Ignore these columns 35 | cols_to_ignore = [ 36 | "Company", 37 | "Client", 38 | "Type", 39 | "Method", 40 | "From", 41 | "To", 42 | "Sample Size", 43 | "Source", 44 | "PDF", 45 | "DataSource", 46 | ] 47 | for col_name in cols_to_ignore: 48 | if col_name in input_list: 49 | input_list.remove(col_name) 50 | # Slugify parties 51 | input_list = [x.lower().replace(" ", "_") for x in input_list] 52 | return input_list 53 | 54 | parties = {} 55 | for geo in urls: 56 | parties[geo] = get_party_names(list(polls[geo].columns)) 57 | 58 | # Formatting 59 | for geo in parties: 60 | # Rename columns 61 | polls[geo].columns = [x.lower().replace(" ", "_") for x in polls[geo].columns] 62 | 63 | # Remove cols we don't want 64 | if "source" in polls[geo].columns: 65 | del polls[geo]["source"] 66 
| 67 | # Format percentages into decimal 68 | for col in parties[geo]: 69 | polls[geo][col] = polls[geo][col].apply(lambda x: x / 100) 70 | 71 | # Process dates 72 | polls[geo]["to"] = pd.to_datetime(polls[geo]["to"]) 73 | polls[geo]["from"] = pd.to_datetime(polls[geo]["from"]) 74 | 75 | # Add LOWESS smoothing for 2017 data only 76 | polls_smoothed = {} 77 | for geo in parties: 78 | polls_smoothed[geo] = polls[geo][polls[geo].to >= cutoff_date].groupby("to").mean().reset_index() 79 | polls_smoothed[geo] = polls_smoothed[geo].ffill(limit=None).bfill(limit=None) 80 | for party in parties[geo]: 81 | polls_smoothed[geo][party + "_smooth"] = sm.nonparametric.lowess( 82 | polls_smoothed[geo][party], polls_smoothed[geo]["to"], frac=0.15 83 | )[:, 1] 84 | polls_smoothed[geo] = polls_smoothed[geo][["to"] + [col + "_smooth" for col in parties[geo]]] 85 | polls_smoothed[geo].columns = ["date"] + parties[geo] 86 | 87 | # EXPORT 88 | def multi_format_export(df, filename): 89 | df.to_json(str(DATA_DIR / (filename + ".json")), orient="records") 90 | df.to_csv(DATA_DIR / (filename + ".csv"), index=False) 91 | feather.write_dataframe(df, str(DATA_DIR / (filename + ".feather"))) 92 | 93 | for geo in parties: 94 | filename = "polls_" + geo if geo != "uk" else "polls" 95 | multi_format_export(polls[geo], filename) 96 | multi_format_export(polls_smoothed[geo], filename + "_smoothed") 97 | 98 | # Combine polls with polls_smoothed 99 | combined = [ 100 | json.loads(polls[geo][polls[geo].to > cutoff_date].to_json(orient="records")), 101 | json.loads(polls_smoothed[geo].to_json(orient="records")), 102 | ] 103 | with open(str(DATA_DIR / (filename + "_tracker" + ".json")), "w") as f: 104 | f.write(json.dumps(combined)) 105 | with gzip.open(str(DATA_DIR / (filename + "_tracker" + ".json.gz")), "w") as f: 106 | f.write(json.dumps(combined).encode("utf-8")) 107 | 108 | # Upload to S3 109 | print("Uploading to S3...") 110 | files_to_upload = [] 111 | for geo in parties: 112 | filename = "polls_" 
+ geo if geo != "uk" else "polls" 113 | for file_format in [".json", ".csv", ".feather"]: 114 | files_to_upload.append(DATA_DIR / (filename + file_format)) 115 | files_to_upload.append(DATA_DIR / (filename + "_smoothed" + file_format)) 116 | files_to_upload.append(DATA_DIR / (filename + "_tracker.json")) 117 | files_to_upload.append(DATA_DIR / (filename + "_tracker.json.gz")) 118 | 119 | for file_path in files_to_upload: 120 | print("\t{}".format(file_path)) 121 | key = file_path.name 122 | body = str(file_path.resolve()) 123 | acl = "public-read" 124 | check_call( 125 | [f"aws s3api put-object --bucket sixfifty --key '{key}' --body '{body}' --acl {acl}"], 126 | stdout=DEVNULL, 127 | stderr=STDOUT, 128 | shell=True, 129 | ) 130 | -------------------------------------------------------------------------------- /data/push_to_s3.py: -------------------------------------------------------------------------------- 1 | from pathlib import Path 2 | from subprocess import DEVNULL, STDOUT, check_call 3 | 4 | 5 | def main(ROOT_DIR): 6 | ROOT_DIR = Path(ROOT_DIR) 7 | 8 | # Upload into bucket 9 | print("Uploading to S3 bucket...") 10 | eu_path = ROOT_DIR / 'data' / 'eu_referendum' / 'electoral_commission' / 'results' 11 | ge_path = ROOT_DIR / 'data' / 'general_election' / 'electoral_commission' / 'results' 12 | model_path = ROOT_DIR / 'data' / 'model' 13 | 14 | # Define files to upload 15 | files_to_upload = [ 16 | # EU Referendum 17 | eu_path / 'raw' / 'EU-referendum-result-data.csv', 18 | # General Election 2010 - RAW 19 | ge_path / 'raw' / 'GE2010-results-flatfile-website.xls', 20 | # General Election 2010 - CLEAN 21 | ge_path / 'clean' / 'ge_2010_results.csv', 22 | ge_path / 'clean' / 'ge_2010_results.feather', 23 | # General Election 2015 - RAW 24 | ge_path / 'raw' / 'RESULTS FOR ANALYSIS.csv', 25 | ge_path / 'raw' / 'CONSTITUENCY.csv', 26 | # General Election 2015 - CLEAN 27 | ge_path / 'clean' / 'ge_2015_results.csv', 28 | ge_path / 'clean' / 'ge_2015_results.feather', 
29 | # Model 30 | model_path / 'clean' / 'model_2015.csv', 31 | model_path / 'clean' / 'model_2015.feather', 32 | ] 33 | 34 | for file_path in files_to_upload: 35 | print("\t{}".format(file_path)) 36 | # boto3 is really slow?! 37 | key = file_path.name # Pathlib attribute 38 | body = str(file_path.resolve()) 39 | acl = 'public-read' 40 | check_call(["aws s3api put-object --bucket sixfifty --key '{0}' --body '{1}' --acl {2}".format(key, body, acl)], 41 | stdout=DEVNULL, stderr=STDOUT, shell=True) 42 | 43 | 44 | if __name__ == '__main__': 45 | main('.') 46 | -------------------------------------------------------------------------------- /docs/setup.md: -------------------------------------------------------------------------------- 1 | # SixFifty Pipeline Requirements 2 | 3 | Follow these steps to get your system set up to run this pipeline: 4 | 5 | 1. If you're not comfortable with configuring a Python environment on your system (i.e. working with virtualenvs and pip in Python 3) then it is strongly recommended that you download the [Anaconda Python 3.6 installer](https://www.continuum.io/downloads). 6 | 2. When this is installed you should be able to open your system shell (Terminal on macOS, or search for "Anaconda Prompt" from the Windows Start menu) and type `conda list` to see installed and available Python packages. You may wish to [read a little more about what Anaconda is](https://docs.continuum.io/anaconda/) at this stage if you are new to it. 7 | 3. Currently [over 450 packages are available to install](https://docs.continuum.io/anaconda/pkg-docs) for Python 3.6 via Anaconda's `conda install package-name` command. Other packages can be installed from the open source [Python Package Index](https://pypi.python.org/pypi) (a.k.a. PyPI) using `pip install package-name`. 
The only difference is that packages installed using `conda` will be downloaded from the Anaconda repository, which has had some level of vetting by the company Continuum Analytics (provider of Anaconda) around security/reliability for use within enterprise. 8 | 4. The following packages you can `conda install` (many will already be installed): 9 | ``` 10 | conda install boto 11 | conda install cython 12 | conda install numpy==1.11.3 13 | conda install pandas==0.19.2 14 | conda install python-dateutil==2.6.0 15 | conda install pyyaml==3.12 16 | conda install requests==2.12.4 17 | conda install xlrd==1.0.0 18 | ``` 19 | 5. The following packages you will have to `pip install`: 20 | ``` 21 | pip install awscli==1.11.82 22 | pip install feather-format==0.3.1 23 | ``` 24 | 6. At this point you should be able to change into the repo root (`cd pipeline`) and run `python data/generate_data.py` to populate this repo with the various datasets. 25 | 7. If you want to also be able to push files to S3, you will need to do a couple more steps: 26 | - You will need to ask John for permissions to write into the SixFifty S3 bucket. He will set up an IAM user account and provide you with an AWS Access Key ID and a Secret Access Key. 27 | - Run `aws configure` from your CLI and enter the provided tokens plus `eu-west-1` for the default region. [Documentation for this step can be found here](http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html). 28 | - You should now be able to run `python data/push_to_s3.py` 29 | 30 | ### TODO 31 | Steps 4 and 5 above can be made a lot easier if we use [conda environments to manage package dependencies](https://conda.io/docs/using/envs.html#export-the-environment-file) split by conda vs pip. These are well documented and can be easily created by running `conda env export > environment.yml`, and activated from the file using `activate environment_name` (Windows) or `source activate environment_name` (macOS, Linux). 
32 | 33 | However, for this project to remain easily accessible to people using either Anaconda OR pip+virtualenv to manage their dependencies, we would need to maintain both an `environment.yml` and a `requirements.txt`, unless someone can figure out a way to get `pip` to use `environment.yml`, or we autogenerate `requirements.txt` from `environment.yml`. This is pretty low down my priority list; the instructions above should suffice for now. 34 | -------------------------------------------------------------------------------- /requirements.in: -------------------------------------------------------------------------------- 1 | awscli==1.17.9 2 | boto==2.49.0 3 | feather-format==0.4.0 4 | ipython==7.12.0 5 | pandas==1.0.0 6 | pip-tools==4.4.1 7 | requests==2.22.0 8 | statsmodels==0.11.0 9 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # 2 | # This file is autogenerated by pip-compile 3 | # To update, run: 4 | # 5 | # pip-compile requirements.in 6 | # 7 | appnope==0.1.0 # via ipython 8 | awscli==1.17.9 9 | backcall==0.1.0 # via ipython 10 | boto==2.49.0 11 | botocore==1.14.9 # via awscli, s3transfer 12 | certifi==2019.11.28 # via requests 13 | chardet==3.0.4 # via requests 14 | click==7.0 # via pip-tools 15 | colorama==0.4.1 # via awscli 16 | decorator==4.4.1 # via ipython, traitlets 17 | docutils==0.15.2 # via awscli, botocore 18 | feather-format==0.4.0 19 | idna==2.8 # via requests 20 | ipython-genutils==0.2.0 # via traitlets 21 | ipython==7.12.0 22 | jedi==0.16.0 # via ipython 23 | jmespath==0.9.4 # via botocore 24 | numpy==1.18.1 # via pandas, patsy, pyarrow, scipy, statsmodels 25 | pandas==1.0.0 26 | parso==0.6.0 # via jedi 27 | patsy==0.5.1 # via statsmodels 28 | pexpect==4.8.0 # via ipython 29 | pickleshare==0.7.5 # via ipython 30 | pip-tools==4.4.1 31 | prompt-toolkit==3.0.3 # via ipython 32 | ptyprocess==0.6.0 # 
via pexpect 33 | pyarrow==0.15.1 # via feather-format 34 | pyasn1==0.4.8 # via rsa 35 | pygments==2.5.2 # via ipython 36 | python-dateutil==2.8.1 # via botocore, pandas 37 | pytz==2019.3 # via pandas 38 | pyyaml==5.2 # via awscli 39 | requests==2.22.0 40 | rsa==3.4.2 # via awscli 41 | s3transfer==0.3.2 # via awscli 42 | scipy==1.4.1 # via statsmodels 43 | six==1.14.0 # via patsy, pip-tools, pyarrow, python-dateutil, traitlets 44 | statsmodels==0.11.0 45 | traitlets==4.3.3 # via ipython 46 | urllib3==1.25.8 # via botocore, requests 47 | wcwidth==0.1.8 # via prompt-toolkit 48 | 49 | # The following packages are considered to be unsafe in a requirements file: 50 | # setuptools 51 | --------------------------------------------------------------------------------