About this Map
49 | 50 | This is a cool map from 51 | www.py4e.com. 52 |
53 | 54 | 55 | -------------------------------------------------------------------------------- /Course 4 - Using databases with Python/ex16/where.js: -------------------------------------------------------------------------------- 1 | myData = [ 2 | [50.06688579999999,19.9136192, 'aleja Adama Mickiewicza 30, 30-059 Kraków, Poland'], 3 | [52.2394019,21.0150792, 'Krakowskie Przedmieście 5, 00-068 Warszawa, Poland'], 4 | [33.4641541,-111.9231478, '1475 N Scottsdale Rd, Scottsdale, AZ 85257, USA'], 5 | [38.0399391,23.8030901, 'Monumental Plaza, Building C, 1st Floor, Leof. Kifisias 44, Marousi 151 25, Greece'], 6 | [28.3639976,75.58696809999999, 'VidyaVihar Campus, Pilani, Rajasthan 333031, India'], 7 | [6.8919631,3.7186605, 'Ilishan Remo Ogun State Nigeria, ILISHAN REMO, Nigeria'], 8 | [25.2677203,82.99125819999999, 'Ajagara, Banaras Hindu University Campus, Varanasi, Uttar Pradesh 221005, India'], 9 | [12.9503878,77.5022224, 'Mysore Road, Jnana Bharathi, Bengaluru, Karnataka 560056, India'], 10 | [31.549841,-97.1143146, '1301 S University Parks Dr, Waco, TX 76706, USA'], 11 | [39.9619537,116.3662615, '19 Xinjiekou Outer St, BeiTaiPingZhuang, Haidian Qu, Beijing Shi, China, 100875'], 12 | [53.8930389,27.5455567, 'praspiekt Niezaliežnasci 4, Minsk, Belarus'], 13 | [44.8184339,20.4575676, 'Studentski trg 1, Beograd, Serbia'], 14 | [42.5030333,-89.0309048, '700 College St, Beloit, WI 53511, USA'], 15 | [53.8930389,27.5455567, 'praspiekt Niezaliežnasci 4, Minsk, Belarus'], 16 | [10.6779085,78.74454879999999, 'Palkalaiperur, Tiruchirappalli, Tamil Nadu 620024, India'], 17 | [42.3504997,-71.1053991, 'Boston, MA 02215, USA'], 18 | [47.486135,19.057964, 'Budapest, Fővám tér 8., 1093 Hungary'], 19 | [35.3050053,-120.6624942, 'San Luis Obispo, CA 93407, USA'] 20 | ]; 21 | -------------------------------------------------------------------------------- /Course 4 - Using databases with Python/ex16/zoomed map with added location.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 4 - Using databases with Python/ex16/zoomed map with added location.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/LICENSE: -------------------------------------------------------------------------------- 1 | Copyright (c) 2012, Michael Bostock 2 | All rights reserved. 3 | 4 | Redistribution and use in source and binary forms, with or without 5 | modification, are permitted provided that the following conditions are met: 6 | 7 | * Redistributions of source code must retain the above copyright notice, this 8 | list of conditions and the following disclaimer. 9 | 10 | * Redistributions in binary form must reproduce the above copyright notice, 11 | this list of conditions and the following disclaimer in the documentation 12 | and/or other materials provided with the distribution. 13 | 14 | * The name Michael Bostock may not be used to endorse or promote products 15 | derived from this software without specific prior written permission. 16 | 17 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" 18 | AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 19 | IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE 20 | DISCLAIMED. 
IN NO EVENT SHALL MICHAEL BOSTOCK BE LIABLE FOR ANY DIRECT, 21 | INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 22 | BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, 23 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY 24 | OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 25 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, 26 | EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 27 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/README.txt: -------------------------------------------------------------------------------- 1 | Simple Python Search Spider, Page Ranker, and Visualizer 2 | 3 | This is a set of programs that emulate some of the functions of a 4 | search engine. They store their data in a SQLITE3 database named 5 | 'spider.sqlite'. This file can be removed at any time to restart the 6 | process. 7 | 8 | You should install the SQLite browser to view and modify 9 | the databases from: 10 | 11 | http://sqlitebrowser.org/ 12 | 13 | This program crawls a web site and pulls a series of pages into the 14 | database, recording the links between pages. 15 | 16 | Note: Windows has difficulty in displaying UTF-8 characters 17 | in the console so for each console window you open, you may need 18 | to type the following command before running this code: 19 | 20 | chcp 65001 21 | 22 | http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how 23 | 24 | Mac: rm spider.sqlite 25 | Mac: python3 spider.py 26 | 27 | Win: del spider.sqlite 28 | Win: spider.py 29 | 30 | Enter web url or enter: http://www.dr-chuck.com/ 31 | ['http://www.dr-chuck.com'] 32 | How many pages:2 33 | 1 http://www.dr-chuck.com/ 12 34 | 2 http://www.dr-chuck.com/csev-blog/ 57 35 | How many pages: 36 | 37 | In this sample run, we told it to crawl a website and retrieve two 38 | pages. If you restart the program again and tell it to crawl more 39 | pages, it will not re-crawl any pages already in the database. Upon 40 | restart it goes to a random non-crawled page and starts there. So 41 | each successive run of spider.py is additive. 42 | 43 | Mac: python3 spider.py 44 | Win: spider.py 45 | 46 | Enter web url or enter: http://www.dr-chuck.com/ 47 | ['http://www.dr-chuck.com'] 48 | How many pages:3 49 | 3 http://www.dr-chuck.com/csev-blog 57 50 | 4 http://www.dr-chuck.com/dr-chuck/resume/speaking.htm 1 51 | 5 http://www.dr-chuck.com/dr-chuck/resume/index.htm 13 52 | How many pages: 53 | 54 | You can have multiple starting points in the same database - 55 | within the program these are called "webs". The spider 56 | chooses randomly amongst all non-visited links across all 57 | the webs. 58 | 59 | If you want to dump the contents of the spider.sqlite file, you can 60 | run spdump.py as follows: 61 | 62 | Mac: python3 spdump.py 63 | Win: spdump.py 64 | 65 | (5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog') 66 | (3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm') 67 | (1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/') 68 | (1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm') 69 | 4 rows. 70 | 71 | This shows the number of incoming links, the old page rank, the new page 72 | rank, the id of the page, and the url of the page. 
The spdump.py program 73 | only shows pages that have at least one incoming link to them. 74 | 75 | Once you have a few pages in the database, you can run Page Rank on the 76 | pages using the sprank.py program. You simply tell it how many Page 77 | Rank iterations to run. A distilled sketch of what one iteration does appears at the end of this README. 78 | 79 | Mac: python3 sprank.py 80 | Win: sprank.py 81 | 82 | How many iterations:2 83 | 1 0.546848992536 84 | 2 0.226714939664 85 | [(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)] 86 | 87 | You can dump the database again to see that page rank has been updated: 88 | 89 | Mac: python3 spdump.py 90 | Win: spdump.py 91 | 92 | (5, 1.0, 0.985, 3, u'http://www.dr-chuck.com/csev-blog') 93 | (3, 1.0, 2.135, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm') 94 | (1, 1.0, 0.659, 2, u'http://www.dr-chuck.com/csev-blog/') 95 | (1, 1.0, 0.659, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm') 96 | 4 rows. 97 | 98 | You can run sprank.py as many times as you like and it will simply refine 99 | the page rank the more times you run it. You can even run sprank.py a few times 100 | and then go spider a few more pages with spider.py and then run sprank.py 101 | to converge the page ranks. 102 | 103 | If you want to restart the Page Rank calculations without re-spidering the 104 | web pages, you can use spreset.py 105 | 106 | Mac: python3 spreset.py 107 | Win: spreset.py 108 | 109 | All pages set to a rank of 1.0 110 | 111 | Mac: python3 sprank.py 112 | Win: sprank.py 113 | 114 | How many iterations:50 115 | 1 0.546848992536 116 | 2 0.226714939664 117 | 3 0.0659516187242 118 | 4 0.0244199333 119 | 5 0.0102096489546 120 | 6 0.00610244329379 121 | ... 122 | 42 0.000109076928206 123 | 43 9.91987599002e-05 124 | 44 9.02151706798e-05 125 | 45 8.20451504471e-05 126 | 46 7.46150183837e-05 127 | 47 6.7857770908e-05 128 | 48 6.17124694224e-05 129 | 49 5.61236959327e-05 130 | 50 5.10410499467e-05 131 | [(512, 0.02963718031139026), (1, 12.790786721866658), (2, 28.939418898678284), (3, 6.808468390725946), (4, 13.469889092397006)] 132 | 133 | For each iteration of the page rank algorithm it prints the average 134 | change per page of the page rank. The network initially is quite 135 | unbalanced and so the individual page ranks are changing wildly. 136 | But in a few short iterations, the page rank converges. You 137 | should run sprank.py long enough that the page ranks converge. 138 | 139 | If you want to visualize the current top pages in terms of page rank, 140 | run spjson.py to write the pages out in JSON format to be viewed in a 141 | web browser. 142 | 143 | Mac: python3 spjson.py 144 | Win: spjson.py 145 | 146 | Creating JSON output on spider.js... 147 | How many nodes? 30 148 | Open force.html in a browser to view the visualization 149 | 150 | You can view this data by opening the file force.html in your web browser. 151 | This shows an automatic layout of the nodes and links. You can click and 152 | drag any node and you can also double click on a node to find the URL 153 | that is represented by the node. 154 | 155 | This visualization is provided using the force layout from: 156 | 157 | http://mbostock.github.com/d3/ 158 | 159 | If you rerun the other utilities and then re-run spjson.py - you merely 160 | have to press refresh in the browser to get the new data from spider.js.
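For reference, the rank update that each sprank.py iteration performs (the full
program appears in the sprank.py listing later in this archive) boils down to a
few lines. The following is a minimal, self-contained sketch of that update; the
links and starting ranks here are made up for illustration, whereas the real
program loads them from spider.sqlite:

    # Sketch of one sprank.py-style iteration with illustrative data.
    links = [(1, 2), (2, 3), (3, 1), (3, 2)]   # (from_id, to_id) pairs
    prev_ranks = {1: 1.0, 2: 1.0, 3: 1.0}      # every page starts at 1.0

    next_ranks = {node: 0.0 for node in prev_ranks}
    total = sum(prev_ranks.values())

    # Each page splits its old rank evenly across its outbound links.
    for node, old_rank in prev_ranks.items():
        give_ids = [to_id for (from_id, to_id) in links if from_id == node]
        if not give_ids:
            continue
        amount = old_rank / len(give_ids)
        for to_id in give_ids:
            next_ranks[to_id] += amount

    # Rank that was not handed out "evaporates" back evenly so the
    # total rank in the system stays constant.
    evap = (total - sum(next_ranks.values())) / len(next_ranks)
    for node in next_ranks:
        next_ranks[node] += evap

    # The average per-page change is the convergence number printed
    # on each line of the sample runs above.
    avediff = sum(abs(prev_ranks[n] - next_ranks[n])
                  for n in prev_ranks) / len(prev_ranks)
    print(avediff)

Run sprank.py until this number gets small; at that point further iterations
barely move the ranks.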
161 | 162 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/__pycache__/spider.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/__pycache__/spider.cpython-36.pyc -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/dr-chuck-site-dump.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/dr-chuck-site-dump.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/dr-chuck-site-top25.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/dr-chuck-site-top25.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/force.css: -------------------------------------------------------------------------------- 1 | circle.node { 2 | stroke: #fff; 3 | stroke-width: 1.5px; 4 | } 5 | 6 | line.link { 7 | stroke: #999; 8 | stroke-opacity: .6; 9 | } 10 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/force.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 |If you don't see a chart above, check the JavaScript console. You may 16 | need to use a different browser.
17 | 18 | 19 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/force.js: -------------------------------------------------------------------------------- 1 | var width = 600, 2 | height = 600; 3 | 4 | var color = d3.scale.category20(); 5 | 6 | var dist = (width + height) / 4; 7 | 8 | var force = d3.layout.force() 9 | .charge(-120) 10 | .linkDistance(dist) 11 | .size([width, height]); 12 | 13 | function getrank(rval) { 14 | return (rval/2.0) + 3; 15 | } 16 | 17 | function getcolor(rval) { 18 | return color(rval); 19 | } 20 | 21 | var svg = d3.select("#chart").append("svg") 22 | .attr("width", width) 23 | .attr("height", height); 24 | 25 | function loadData(json) { 26 | force 27 | .nodes(json.nodes) 28 | .links(json.links); 29 | 30 | var k = Math.sqrt(json.nodes.length / (width * height)); 31 | 32 | force 33 | .charge(-10 / k) 34 | .gravity(100 * k) 35 | .start(); 36 | 37 | var link = svg.selectAll("line.link") 38 | .data(json.links) 39 | .enter().append("line") 40 | .attr("class", "link") 41 | .style("stroke-width", function(d) { return Math.sqrt(d.value); }); 42 | 43 | var node = svg.selectAll("circle.node") 44 | .data(json.nodes) 45 | .enter().append("circle") 46 | .attr("class", "node") 47 | .attr("r", function(d) { return getrank(d.rank); } ) 48 | .style("fill", function(d) { return getcolor(d.rank); }) 49 | .on("dblclick",function(d) { 50 | if ( confirm('Do you want to open '+d.url) ) 51 | window.open(d.url,'_new',''); 52 | d3.event.stopPropagation(); 53 | }) 54 | .call(force.drag); 55 | 56 | node.append("title") 57 | .text(function(d) { return d.url; }); 58 | 59 | force.on("tick", function() { 60 | link.attr("x1", function(d) { return d.source.x; }) 61 | .attr("y1", function(d) { return d.source.y; }) 62 | .attr("x2", function(d) { return d.target.x; }) 63 | .attr("y2", function(d) { return d.target.y; }); 64 | 65 | node.attr("cx", function(d) { return d.x; }) 66 | .attr("cy", function(d) { return d.y; }); 67 | }); 68 | 69 | } 70 | loadData(spiderJson); 71 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spdump.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url 7 | FROM Pages JOIN Links ON Pages.id = Links.to_id 8 | WHERE html IS NOT NULL 9 | GROUP BY id ORDER BY inbound DESC''') 10 | 11 | count = 0 12 | for row in cur : 13 | if count < 50 : print(row) 14 | count = count + 1 15 | print(count, 'rows.') 16 | cur.close() 17 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider-coincube.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider-coincube.sqlite -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider-dr-chuck.sqlite: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider-dr-chuck.sqlite -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider.js: -------------------------------------------------------------------------------- 1 | spiderJson = {"nodes":[ 2 | {"weight":21,"rank":19.0, "id":1, "url":"http://variance.hu"}, 3 | {"weight":27,"rank":15.558784255770082, "id":22, "url":"http://variance.hu/2018/01/03/kilenc-ev"}, 4 | {"weight":22,"rank":19.0, "id":107, "url":"http://variance.hu/tag/ant"}, 5 | {"weight":23,"rank":19.0, "id":150, "url":"http://variance.hu/tag/block"}, 6 | {"weight":22,"rank":19.0, "id":157, "url":"http://variance.hu/tag/bor"}, 7 | {"weight":22,"rank":19.0, "id":213, "url":"http://variance.hu/tag/cyptocurrency"}, 8 | {"weight":20,"rank":15.558784255770082, "id":299, "url":"http://variance.hu/tag/gephaz"}, 9 | {"weight":21,"rank":15.558784255770082, "id":341, "url":"http://variance.hu/tag/jobs"}, 10 | {"weight":21,"rank":15.558784255770082, "id":342, "url":"http://variance.hu/tag/job-seeking"}, 11 | {"weight":22,"rank":15.756062326850063, "id":354, "url":"http://variance.hu/tag/kotveny"}, 12 | {"weight":20,"rank":15.558784255770082, "id":361, "url":"http://variance.hu/tag/linear-regression"}, 13 | {"weight":20,"rank":16.528019996293462, "id":372, "url":"http://variance.hu/tag/manipulation"}, 14 | {"weight":27,"rank":15.609796663166051, "id":469, "url":"http://variance.hu/tag/python"}, 15 | {"weight":20,"rank":15.558784255770082, "id":553, "url":"http://variance.hu/tag/ta"}, 16 | {"weight":20,"rank":15.558784255770082, "id":556, "url":"http://variance.hu/tag/tangle"}, 17 | {"weight":21,"rank":15.558784255770082, "id":602, "url":"http://variance.hu/tag/utxo"}, 18 | {"weight":20,"rank":15.558784255770082, "id":621, "url":"http://variance.hu/tag/whales"}, 19 | {"weight":29,"rank":15.558784255770082, "id":636, "url":"http://variance.hu/tag/zec"}, 20 | {"weight":19,"rank":0.0, "id":661, "url":"http://variance.hu/2017/08/28/ismerkedes-a-bittrex-api-val-python"}, 21 | {"weight":9,"rank":0.00812587020466735, "id":675, "url":"http://variance.hu/2014/01/09/enervalt-piaci-hangulat-mellett-szaguld-az-oko-szektor"}], 22 | "links":[ 23 | {"source":0,"target":0,"value":3}, 24 | {"source":0,"target":1,"value":3}, 25 | {"source":0,"target":2,"value":3}, 26 | {"source":0,"target":3,"value":3}, 27 | {"source":0,"target":4,"value":3}, 28 | {"source":0,"target":5,"value":3}, 29 | {"source":0,"target":6,"value":3}, 30 | {"source":0,"target":7,"value":3}, 31 | {"source":0,"target":8,"value":3}, 32 | {"source":0,"target":9,"value":3}, 33 | {"source":0,"target":10,"value":3}, 34 | {"source":0,"target":11,"value":3}, 35 | {"source":0,"target":12,"value":3}, 36 | {"source":0,"target":13,"value":3}, 37 | {"source":0,"target":14,"value":3}, 38 | {"source":0,"target":15,"value":3}, 39 | {"source":0,"target":16,"value":3}, 40 | {"source":0,"target":17,"value":3}, 41 | {"source":2,"target":0,"value":3}, 42 | {"source":2,"target":2,"value":3}, 43 | {"source":2,"target":3,"value":3}, 44 | {"source":2,"target":4,"value":3}, 45 | {"source":2,"target":5,"value":3}, 46 | {"source":2,"target":6,"value":3}, 47 | {"source":2,"target":7,"value":3}, 48 | {"source":2,"target":8,"value":3}, 49 | {"source":2,"target":9,"value":3}, 50 | {"source":2,"target":10,"value":3}, 51 | 
{"source":2,"target":11,"value":3}, 52 | {"source":2,"target":12,"value":3}, 53 | {"source":2,"target":13,"value":3}, 54 | {"source":2,"target":14,"value":3}, 55 | {"source":2,"target":15,"value":3}, 56 | {"source":2,"target":16,"value":3}, 57 | {"source":2,"target":17,"value":3}, 58 | {"source":2,"target":1,"value":3}, 59 | {"source":13,"target":0,"value":3}, 60 | {"source":13,"target":13,"value":3}, 61 | {"source":13,"target":2,"value":3}, 62 | {"source":13,"target":3,"value":3}, 63 | {"source":13,"target":4,"value":3}, 64 | {"source":13,"target":5,"value":3}, 65 | {"source":13,"target":6,"value":3}, 66 | {"source":13,"target":7,"value":3}, 67 | {"source":13,"target":8,"value":3}, 68 | {"source":13,"target":9,"value":3}, 69 | {"source":13,"target":10,"value":3}, 70 | {"source":13,"target":11,"value":3}, 71 | {"source":13,"target":12,"value":3}, 72 | {"source":13,"target":14,"value":3}, 73 | {"source":13,"target":15,"value":3}, 74 | {"source":13,"target":16,"value":3}, 75 | {"source":13,"target":17,"value":3}, 76 | {"source":13,"target":1,"value":3}, 77 | {"source":5,"target":0,"value":3}, 78 | {"source":5,"target":5,"value":3}, 79 | {"source":5,"target":2,"value":3}, 80 | {"source":5,"target":3,"value":3}, 81 | {"source":5,"target":4,"value":3}, 82 | {"source":5,"target":6,"value":3}, 83 | {"source":5,"target":7,"value":3}, 84 | {"source":5,"target":8,"value":3}, 85 | {"source":5,"target":9,"value":3}, 86 | {"source":5,"target":10,"value":3}, 87 | {"source":5,"target":11,"value":3}, 88 | {"source":5,"target":12,"value":3}, 89 | {"source":5,"target":13,"value":3}, 90 | {"source":5,"target":14,"value":3}, 91 | {"source":5,"target":15,"value":3}, 92 | {"source":5,"target":16,"value":3}, 93 | {"source":5,"target":17,"value":3}, 94 | {"source":5,"target":1,"value":3}, 95 | {"source":15,"target":0,"value":3}, 96 | {"source":15,"target":15,"value":3}, 97 | {"source":15,"target":2,"value":3}, 98 | {"source":15,"target":3,"value":3}, 99 | {"source":15,"target":4,"value":3}, 100 | {"source":15,"target":5,"value":3}, 101 | {"source":15,"target":6,"value":3}, 102 | {"source":15,"target":7,"value":3}, 103 | {"source":15,"target":8,"value":3}, 104 | {"source":15,"target":9,"value":3}, 105 | {"source":15,"target":10,"value":3}, 106 | {"source":15,"target":11,"value":3}, 107 | {"source":15,"target":12,"value":3}, 108 | {"source":15,"target":13,"value":3}, 109 | {"source":15,"target":14,"value":3}, 110 | {"source":15,"target":16,"value":3}, 111 | {"source":15,"target":17,"value":3}, 112 | {"source":15,"target":1,"value":3}, 113 | {"source":4,"target":0,"value":3}, 114 | {"source":4,"target":4,"value":3}, 115 | {"source":4,"target":2,"value":3}, 116 | {"source":4,"target":3,"value":3}, 117 | {"source":4,"target":5,"value":3}, 118 | {"source":4,"target":6,"value":3}, 119 | {"source":4,"target":7,"value":3}, 120 | {"source":4,"target":8,"value":3}, 121 | {"source":4,"target":9,"value":3}, 122 | {"source":4,"target":10,"value":3}, 123 | {"source":4,"target":11,"value":3}, 124 | {"source":4,"target":12,"value":3}, 125 | {"source":4,"target":13,"value":3}, 126 | {"source":4,"target":14,"value":3}, 127 | {"source":4,"target":15,"value":3}, 128 | {"source":4,"target":16,"value":3}, 129 | {"source":4,"target":17,"value":3}, 130 | {"source":4,"target":1,"value":3}, 131 | {"source":14,"target":0,"value":3}, 132 | {"source":14,"target":14,"value":3}, 133 | {"source":14,"target":2,"value":3}, 134 | {"source":14,"target":3,"value":3}, 135 | {"source":14,"target":4,"value":3}, 136 | {"source":14,"target":5,"value":3}, 
137 | {"source":14,"target":6,"value":3}, 138 | {"source":14,"target":7,"value":3}, 139 | {"source":14,"target":8,"value":3}, 140 | {"source":14,"target":9,"value":3}, 141 | {"source":14,"target":10,"value":3}, 142 | {"source":14,"target":11,"value":3}, 143 | {"source":14,"target":12,"value":3}, 144 | {"source":14,"target":13,"value":3}, 145 | {"source":14,"target":15,"value":3}, 146 | {"source":14,"target":16,"value":3}, 147 | {"source":14,"target":17,"value":3}, 148 | {"source":14,"target":1,"value":3}, 149 | {"source":12,"target":0,"value":3}, 150 | {"source":12,"target":12,"value":3}, 151 | {"source":12,"target":18,"value":3}, 152 | {"source":12,"target":2,"value":3}, 153 | {"source":12,"target":3,"value":3}, 154 | {"source":12,"target":4,"value":3}, 155 | {"source":12,"target":5,"value":3}, 156 | {"source":12,"target":6,"value":3}, 157 | {"source":12,"target":7,"value":3}, 158 | {"source":12,"target":8,"value":3}, 159 | {"source":12,"target":9,"value":3}, 160 | {"source":12,"target":10,"value":3}, 161 | {"source":12,"target":11,"value":3}, 162 | {"source":12,"target":13,"value":3}, 163 | {"source":12,"target":14,"value":3}, 164 | {"source":12,"target":15,"value":3}, 165 | {"source":12,"target":16,"value":3}, 166 | {"source":12,"target":17,"value":3}, 167 | {"source":12,"target":1,"value":3}, 168 | {"source":7,"target":0,"value":3}, 169 | {"source":7,"target":7,"value":3}, 170 | {"source":7,"target":8,"value":3}, 171 | {"source":7,"target":2,"value":3}, 172 | {"source":7,"target":3,"value":3}, 173 | {"source":7,"target":4,"value":3}, 174 | {"source":7,"target":5,"value":3}, 175 | {"source":7,"target":6,"value":3}, 176 | {"source":7,"target":9,"value":3}, 177 | {"source":7,"target":10,"value":3}, 178 | {"source":7,"target":11,"value":3}, 179 | {"source":7,"target":12,"value":3}, 180 | {"source":7,"target":13,"value":3}, 181 | {"source":7,"target":14,"value":3}, 182 | {"source":7,"target":15,"value":3}, 183 | {"source":7,"target":16,"value":3}, 184 | {"source":7,"target":17,"value":3}, 185 | {"source":7,"target":1,"value":3}, 186 | {"source":16,"target":0,"value":3}, 187 | {"source":16,"target":16,"value":3}, 188 | {"source":16,"target":2,"value":3}, 189 | {"source":16,"target":3,"value":3}, 190 | {"source":16,"target":4,"value":3}, 191 | {"source":16,"target":5,"value":3}, 192 | {"source":16,"target":6,"value":3}, 193 | {"source":16,"target":7,"value":3}, 194 | {"source":16,"target":8,"value":3}, 195 | {"source":16,"target":9,"value":3}, 196 | {"source":16,"target":10,"value":3}, 197 | {"source":16,"target":11,"value":3}, 198 | {"source":16,"target":12,"value":3}, 199 | {"source":16,"target":13,"value":3}, 200 | {"source":16,"target":14,"value":3}, 201 | {"source":16,"target":15,"value":3}, 202 | {"source":16,"target":17,"value":3}, 203 | {"source":16,"target":1,"value":3}, 204 | {"source":9,"target":0,"value":3}, 205 | {"source":9,"target":9,"value":3}, 206 | {"source":9,"target":19,"value":3}, 207 | {"source":9,"target":2,"value":3}, 208 | {"source":9,"target":3,"value":3}, 209 | {"source":9,"target":4,"value":3}, 210 | {"source":9,"target":5,"value":3}, 211 | {"source":9,"target":6,"value":3}, 212 | {"source":9,"target":7,"value":3}, 213 | {"source":9,"target":8,"value":3}, 214 | {"source":9,"target":10,"value":3}, 215 | {"source":9,"target":11,"value":3}, 216 | {"source":9,"target":12,"value":3}, 217 | {"source":9,"target":13,"value":3}, 218 | {"source":9,"target":14,"value":3}, 219 | {"source":9,"target":15,"value":3}, 220 | {"source":9,"target":16,"value":3}, 221 | 
{"source":9,"target":17,"value":3}, 222 | {"source":9,"target":1,"value":3}, 223 | {"source":19,"target":0,"value":3}, 224 | {"source":19,"target":19,"value":3}, 225 | {"source":19,"target":9,"value":3}, 226 | {"source":19,"target":2,"value":3}, 227 | {"source":19,"target":3,"value":3}, 228 | {"source":19,"target":4,"value":3}, 229 | {"source":19,"target":5,"value":3}, 230 | {"source":18,"target":0,"value":3}, 231 | {"source":18,"target":18,"value":3}, 232 | {"source":18,"target":12,"value":3}, 233 | {"source":18,"target":2,"value":3}, 234 | {"source":18,"target":3,"value":3}, 235 | {"source":18,"target":4,"value":3}, 236 | {"source":18,"target":5,"value":3}, 237 | {"source":18,"target":6,"value":3}, 238 | {"source":18,"target":7,"value":3}, 239 | {"source":18,"target":8,"value":3}, 240 | {"source":18,"target":9,"value":3}, 241 | {"source":18,"target":10,"value":3}, 242 | {"source":18,"target":11,"value":3}, 243 | {"source":18,"target":13,"value":3}, 244 | {"source":18,"target":14,"value":3}, 245 | {"source":18,"target":15,"value":3}, 246 | {"source":18,"target":16,"value":3}, 247 | {"source":18,"target":17,"value":3}, 248 | {"source":18,"target":1,"value":3}, 249 | {"source":3,"target":0,"value":3}, 250 | {"source":3,"target":3,"value":3}, 251 | {"source":3,"target":2,"value":3}, 252 | {"source":3,"target":4,"value":3}, 253 | {"source":3,"target":5,"value":3}, 254 | {"source":3,"target":6,"value":3}, 255 | {"source":3,"target":7,"value":3}, 256 | {"source":3,"target":8,"value":3}, 257 | {"source":3,"target":9,"value":3}, 258 | {"source":3,"target":10,"value":3}, 259 | {"source":3,"target":11,"value":3}, 260 | {"source":3,"target":12,"value":3}, 261 | {"source":3,"target":13,"value":3}, 262 | {"source":3,"target":14,"value":3}, 263 | {"source":3,"target":15,"value":3}, 264 | {"source":3,"target":16,"value":3}, 265 | {"source":3,"target":17,"value":3}, 266 | {"source":3,"target":1,"value":3}, 267 | {"source":1,"target":0,"value":3}, 268 | {"source":1,"target":1,"value":3}, 269 | {"source":1,"target":2,"value":3}, 270 | {"source":1,"target":3,"value":3}, 271 | {"source":1,"target":4,"value":3}, 272 | {"source":1,"target":5,"value":3}, 273 | {"source":1,"target":6,"value":3}, 274 | {"source":1,"target":7,"value":3}, 275 | {"source":1,"target":8,"value":3}, 276 | {"source":1,"target":9,"value":3}, 277 | {"source":1,"target":10,"value":3}, 278 | {"source":1,"target":11,"value":3}, 279 | {"source":1,"target":12,"value":3}, 280 | {"source":1,"target":13,"value":3}, 281 | {"source":1,"target":14,"value":3}, 282 | {"source":1,"target":15,"value":3}, 283 | {"source":1,"target":16,"value":3}, 284 | {"source":1,"target":17,"value":3}, 285 | {"source":10,"target":0,"value":3}, 286 | {"source":10,"target":10,"value":3}, 287 | {"source":10,"target":2,"value":3}, 288 | {"source":10,"target":3,"value":3}, 289 | {"source":10,"target":4,"value":3}, 290 | {"source":10,"target":5,"value":3}, 291 | {"source":10,"target":6,"value":3}, 292 | {"source":10,"target":7,"value":3}, 293 | {"source":10,"target":8,"value":3}, 294 | {"source":10,"target":9,"value":3}, 295 | {"source":10,"target":11,"value":3}, 296 | {"source":10,"target":12,"value":3}, 297 | {"source":10,"target":13,"value":3}, 298 | {"source":10,"target":14,"value":3}, 299 | {"source":10,"target":15,"value":3}, 300 | {"source":10,"target":16,"value":3}, 301 | {"source":10,"target":17,"value":3}, 302 | {"source":10,"target":1,"value":3}, 303 | {"source":17,"target":0,"value":3}, 304 | {"source":17,"target":17,"value":3}, 305 | 
{"source":17,"target":2,"value":3}, 306 | {"source":17,"target":3,"value":3}, 307 | {"source":17,"target":4,"value":3}, 308 | {"source":17,"target":5,"value":3}, 309 | {"source":17,"target":6,"value":3}, 310 | {"source":17,"target":7,"value":3}, 311 | {"source":17,"target":8,"value":3}, 312 | {"source":17,"target":9,"value":3}, 313 | {"source":17,"target":10,"value":3}, 314 | {"source":17,"target":11,"value":3}, 315 | {"source":17,"target":12,"value":3}, 316 | {"source":17,"target":13,"value":3}, 317 | {"source":17,"target":14,"value":3}, 318 | {"source":17,"target":15,"value":3}, 319 | {"source":17,"target":16,"value":3}, 320 | {"source":17,"target":1,"value":3}, 321 | {"source":8,"target":0,"value":3}, 322 | {"source":8,"target":8,"value":3}, 323 | {"source":8,"target":7,"value":3}, 324 | {"source":8,"target":2,"value":3}, 325 | {"source":8,"target":3,"value":3}, 326 | {"source":8,"target":4,"value":3}, 327 | {"source":8,"target":5,"value":3}, 328 | {"source":8,"target":6,"value":3}, 329 | {"source":8,"target":9,"value":3}, 330 | {"source":8,"target":10,"value":3}, 331 | {"source":8,"target":11,"value":3}, 332 | {"source":8,"target":12,"value":3}, 333 | {"source":8,"target":13,"value":3}, 334 | {"source":8,"target":14,"value":3}, 335 | {"source":8,"target":15,"value":3}, 336 | {"source":8,"target":16,"value":3}, 337 | {"source":8,"target":17,"value":3}, 338 | {"source":8,"target":1,"value":3}, 339 | {"source":6,"target":0,"value":3}, 340 | {"source":6,"target":6,"value":3}, 341 | {"source":6,"target":2,"value":3}, 342 | {"source":6,"target":3,"value":3}, 343 | {"source":6,"target":4,"value":3}, 344 | {"source":6,"target":5,"value":3}, 345 | {"source":6,"target":7,"value":3}, 346 | {"source":6,"target":8,"value":3}, 347 | {"source":6,"target":9,"value":3}, 348 | {"source":6,"target":10,"value":3}, 349 | {"source":6,"target":11,"value":3}, 350 | {"source":6,"target":12,"value":3}, 351 | {"source":6,"target":13,"value":3}, 352 | {"source":6,"target":14,"value":3}, 353 | {"source":6,"target":15,"value":3}, 354 | {"source":6,"target":16,"value":3}, 355 | {"source":6,"target":17,"value":3}, 356 | {"source":6,"target":1,"value":3}, 357 | {"source":11,"target":0,"value":3}, 358 | {"source":11,"target":11,"value":3}, 359 | {"source":11,"target":2,"value":3}, 360 | {"source":11,"target":3,"value":3}, 361 | {"source":11,"target":4,"value":3}, 362 | {"source":11,"target":5,"value":3}]}; -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import urllib.error 3 | import ssl 4 | from urllib.parse import urljoin 5 | from urllib.parse import urlparse 6 | from urllib.request import urlopen 7 | from bs4 import BeautifulSoup 8 | 9 | # Ignore SSL certificate errors 10 | ctx = ssl.create_default_context() 11 | ctx.check_hostname = False 12 | ctx.verify_mode = ssl.CERT_NONE 13 | 14 | conn = sqlite3.connect('spider.sqlite') 15 | cur = conn.cursor() 16 | 17 | cur.execute('''CREATE TABLE IF NOT EXISTS Pages 18 | (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT, 19 | error INTEGER, old_rank REAL, new_rank REAL)''') 20 | 21 | cur.execute('''CREATE TABLE IF NOT EXISTS Links 22 | (from_id INTEGER, to_id INTEGER)''') 23 | 24 | cur.execute('''CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE)''') 25 | 26 | # Check to see if we are already in progress... 
27 | cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1') 28 | row = cur.fetchone() 29 | if row is not None: 30 | print("Restarting existing crawl. Remove spider.sqlite to start a fresh crawl.") 31 | else : 32 | starturl = input('Enter web url or enter: ') 33 | if ( len(starturl) < 1 ) : starturl = 'http://www.dr-chuck.com/' 34 | if ( starturl.endswith('/') ) : starturl = starturl[:-1] 35 | web = starturl 36 | if ( starturl.endswith('.htm') or starturl.endswith('.html') ) : 37 | pos = starturl.rfind('/') 38 | web = starturl[:pos] 39 | 40 | if ( len(web) > 1 ) : 41 | cur.execute('INSERT OR IGNORE INTO Webs (url) VALUES ( ? )', ( web, ) ) 42 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( starturl, ) ) 43 | conn.commit() 44 | 45 | # Get the current webs 46 | cur.execute('''SELECT url FROM Webs''') 47 | webs = list() 48 | for row in cur: 49 | webs.append(str(row[0])) 50 | 51 | print(webs) 52 | 53 | many = 0 54 | while True: 55 | if ( many < 1 ) : 56 | sval = input('How many pages:') 57 | if ( len(sval) < 1 ) : break 58 | many = int(sval) 59 | many = many - 1 60 | 61 | cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1') 62 | try: 63 | row = cur.fetchone() 64 | # print row 65 | fromid = row[0] 66 | url = row[1] 67 | except: 68 | print('No unretrieved HTML pages found') 69 | many = 0 70 | break 71 | 72 | print(fromid, url, end=' ') 73 | 74 | # If we are retrieving this page, there should be no links from it 75 | cur.execute('DELETE from Links WHERE from_id=?', (fromid, ) ) 76 | try: 77 | document = urlopen(url, context=ctx) 78 | 79 | html = document.read() 80 | if document.getcode() != 200 : 81 | print("Error on page: ",document.getcode()) 82 | cur.execute('UPDATE Pages SET error=? WHERE url=?', (document.getcode(), url) ) 83 | 84 | if 'text/html' != document.info().get_content_type() : 85 | print("Ignore non text/html page") 86 | cur.execute('DELETE FROM Pages WHERE url=?', ( url, ) ) 87 | cur.execute('UPDATE Pages SET error=0 WHERE url=?', (url, ) ) 88 | conn.commit() 89 | continue 90 | 91 | print('('+str(len(html))+')', end=' ') 92 | 93 | soup = BeautifulSoup(html, "html.parser") 94 | except KeyboardInterrupt: 95 | print('') 96 | print('Program interrupted by user...') 97 | break 98 | except: 99 | print("Unable to retrieve or parse page") 100 | cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url, ) ) 101 | conn.commit() 102 | continue 103 | 104 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( url, ) ) 105 | cur.execute('UPDATE Pages SET html=? 
WHERE url=?', (memoryview(html), url ) ) 106 | conn.commit() 107 | 108 | # Retrieve all of the anchor tags 109 | tags = soup('a') 110 | count = 0 111 | for tag in tags: 112 | href = tag.get('href', None) 113 | if ( href is None ) : continue 114 | # Resolve relative references like href="/contact" 115 | up = urlparse(href) 116 | if ( len(up.scheme) < 1 ) : 117 | href = urljoin(url, href) 118 | ipos = href.find('#') 119 | if ( ipos > 1 ) : href = href[:ipos] 120 | if ( href.endswith('.png') or href.endswith('.jpg') or href.endswith('.gif') ) : continue 121 | if ( href.endswith('/') ) : href = href[:-1] 122 | # print href 123 | if ( len(href) < 1 ) : continue 124 | 125 | # Check if the URL is in any of the webs 126 | found = False 127 | for web in webs: 128 | if ( href.startswith(web) ) : 129 | found = True 130 | break 131 | if not found : continue 132 | 133 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( href, ) ) 134 | count = count + 1 135 | conn.commit() 136 | 137 | cur.execute('SELECT id FROM Pages WHERE url=? LIMIT 1', ( href, )) 138 | try: 139 | row = cur.fetchone() 140 | toid = row[0] 141 | except: 142 | print('Could not retrieve id') 143 | continue 144 | # print fromid, toid 145 | cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES ( ?, ? )', ( fromid, toid ) ) 146 | 147 | 148 | print(count) 149 | 150 | cur.close() 151 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spider.sqlite -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spjson.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | print("Creating JSON output on spider.js...") 7 | howmany = int(input("How many nodes? 
")) 8 | 9 | cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url 10 | FROM Pages JOIN Links ON Pages.id = Links.to_id 11 | WHERE html IS NOT NULL AND ERROR IS NULL 12 | GROUP BY id ORDER BY id,inbound''') 13 | 14 | fhand = open('spider.js','w') 15 | nodes = list() 16 | maxrank = None 17 | minrank = None 18 | for row in cur : 19 | nodes.append(row) 20 | rank = row[2] 21 | if maxrank is None or maxrank < rank: maxrank = rank 22 | if minrank is None or minrank > rank : minrank = rank 23 | if len(nodes) > howmany : break 24 | 25 | if maxrank == minrank or maxrank is None or minrank is None: 26 | print("Error - please run sprank.py to compute page rank") 27 | quit() 28 | 29 | fhand.write('spiderJson = {"nodes":[\n') 30 | count = 0 31 | map = dict() 32 | ranks = dict() 33 | for row in nodes : 34 | if count > 0 : fhand.write(',\n') 35 | # print row 36 | rank = row[2] 37 | rank = 19 * ( (rank - minrank) / (maxrank - minrank) ) 38 | fhand.write('{'+'"weight":'+str(row[0])+',"rank":'+str(rank)+',') 39 | fhand.write(' "id":'+str(row[3])+', "url":"'+row[4]+'"}') 40 | map[row[3]] = count 41 | ranks[row[3]] = rank 42 | count = count + 1 43 | fhand.write('],\n') 44 | 45 | cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''') 46 | fhand.write('"links":[\n') 47 | 48 | count = 0 49 | for row in cur : 50 | # print row 51 | if row[0] not in map or row[1] not in map : continue 52 | if count > 0 : fhand.write(',\n') 53 | rank = ranks[row[0]] 54 | srank = 19 * ( (rank - minrank) / (maxrank - minrank) ) 55 | fhand.write('{"source":'+str(map[row[0]])+',"target":'+str(map[row[1]])+',"value":3}') 56 | count = count + 1 57 | fhand.write(']};') 58 | fhand.close() 59 | cur.close() 60 | 61 | print("Open force.html in a browser to view the visualization") 62 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/sprank.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | # Find the ids that send out page rank - we only are interested 7 | # in pages in the SCC that have in and out links 8 | cur.execute('''SELECT DISTINCT from_id FROM Links''') 9 | from_ids = list() 10 | for row in cur: 11 | from_ids.append(row[0]) 12 | 13 | # Find the ids that receive page rank 14 | to_ids = list() 15 | links = list() 16 | cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''') 17 | for row in cur: 18 | from_id = row[0] 19 | to_id = row[1] 20 | if from_id == to_id : continue 21 | if from_id not in from_ids : continue 22 | if to_id not in from_ids : continue 23 | links.append(row) 24 | if to_id not in to_ids : to_ids.append(to_id) 25 | 26 | # Get latest page ranks for strongly connected component 27 | prev_ranks = dict() 28 | for node in from_ids: 29 | cur.execute('''SELECT new_rank FROM Pages WHERE id = ?''', (node, )) 30 | row = cur.fetchone() 31 | prev_ranks[node] = row[0] 32 | 33 | sval = input('How many iterations:') 34 | many = 1 35 | if ( len(sval) > 0 ) : many = int(sval) 36 | 37 | # Sanity check 38 | if len(prev_ranks) < 1 : 39 | print("Nothing to page rank. 
Check data.") 40 | quit() 41 | 42 | # Lets do Page Rank in memory so it is really fast 43 | for i in range(many): 44 | # print prev_ranks.items()[:5] 45 | next_ranks = dict(); 46 | total = 0.0 47 | for (node, old_rank) in list(prev_ranks.items()): 48 | total = total + old_rank 49 | next_ranks[node] = 0.0 50 | # print total 51 | 52 | # Find the number of outbound links and sent the page rank down each 53 | for (node, old_rank) in list(prev_ranks.items()): 54 | # print node, old_rank 55 | give_ids = list() 56 | for (from_id, to_id) in links: 57 | if from_id != node : continue 58 | # print ' ',from_id,to_id 59 | 60 | if to_id not in to_ids: continue 61 | give_ids.append(to_id) 62 | if ( len(give_ids) < 1 ) : continue 63 | amount = old_rank / len(give_ids) 64 | # print node, old_rank,amount, give_ids 65 | 66 | for id in give_ids: 67 | next_ranks[id] = next_ranks[id] + amount 68 | 69 | newtot = 0 70 | for (node, next_rank) in list(next_ranks.items()): 71 | newtot = newtot + next_rank 72 | evap = (total - newtot) / len(next_ranks) 73 | 74 | # print newtot, evap 75 | for node in next_ranks: 76 | next_ranks[node] = next_ranks[node] + evap 77 | 78 | newtot = 0 79 | for (node, next_rank) in list(next_ranks.items()): 80 | newtot = newtot + next_rank 81 | 82 | # Compute the per-page average change from old rank to new rank 83 | # As indication of convergence of the algorithm 84 | totdiff = 0 85 | for (node, old_rank) in list(prev_ranks.items()): 86 | new_rank = next_ranks[node] 87 | diff = abs(old_rank-new_rank) 88 | totdiff = totdiff + diff 89 | 90 | avediff = totdiff / len(prev_ranks) 91 | print(i+1, avediff) 92 | 93 | # rotate 94 | prev_ranks = next_ranks 95 | 96 | # Put the final ranks back into the database 97 | print(list(next_ranks.items())[:5]) 98 | cur.execute('''UPDATE Pages SET old_rank=new_rank''') 99 | for (id, new_rank) in list(next_ranks.items()) : 100 | cur.execute('''UPDATE Pages SET new_rank=? 
WHERE id=?''', (new_rank, id)) 101 | conn.commit() 102 | cur.close() 103 | 104 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/spreset.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | cur.execute('''UPDATE Pages SET new_rank=1.0, old_rank=0.0''') 7 | conn.commit() 8 | 9 | cur.close() 10 | 11 | print("All pages set to a rank of 1.0") 12 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/variance-site-dump.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/variance-site-dump.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/variance-top25.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex17/variance-top25.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/README.txt: -------------------------------------------------------------------------------- 1 | Analyzing an EMAIL Archive from gmane and visualizing the data 2 | using the D3 JavaScript library 3 | 4 | This is a set of tools that allow you to pull down an archive 5 | of a gmane repository using the instructions at: 6 | 7 | http://gmane.org/export.php 8 | 9 | In order not to overwhelm the gmane.org server, I have put up 10 | my own copy of the messages at: 11 | 12 | http://mbox.dr-chuck.net/ 13 | 14 | This server will be faster and take a lot of load off the 15 | gmane.org server. 16 | 17 | You should install the SQLite browser to view and modify the databases from: 18 | 19 | http://sqlitebrowser.org/ 20 | 21 | The first step is to spider the gmane repository. The base URL 22 | is hard-coded in gmane.py and points to the Sakai 23 | developer list. You can spider another repository by changing that 24 | base url. Make sure to delete the content.sqlite file if you 25 | switch the base url. The gmane.py file operates as a spider in 26 | that it runs slowly and retrieves one mail message per second so 27 | as to avoid getting throttled by gmane.org. It stores all of 28 | its data in a database and can be interrupted and re-started 29 | as often as needed (a rough sketch of this loop appears below). It may take many hours to pull all the data 30 | down. So you may need to restart several times. 31 | 32 | To give you a head-start, I have put up 600MB of pre-spidered Sakai 33 | email here: 34 | 35 | https://online.dr-chuck.com/files/sakai/email/content.sqlite 36 | 37 | If you download this, you can "catch up with the latest" by 38 | running gmane.py.
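The retrieval loop itself has a simple shape: fetch one message per second,
commit each message as it arrives so the run can be interrupted and restarted,
and stop when a page does not look like an mbox message. The sketch below is an
illustration of that pattern only - the Messages table layout is an assumption,
not the schema gmane.py actually creates:

    import sqlite3
    import time
    import urllib.request

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()
    # Hypothetical minimal schema, purely for illustration.
    cur.execute('CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, email TEXT)')

    baseurl = 'http://mbox.dr-chuck.net/sakai.devel/'
    cur.execute('SELECT MAX(id) FROM Messages')
    row = cur.fetchone()
    start = (row[0] or 0) + 1                 # resume after the last spidered message

    for msg_id in range(start, start + 10):   # retrieve a small batch
        url = baseurl + str(msg_id) + '/' + str(msg_id + 1)
        text = urllib.request.urlopen(url).read().decode(errors='replace')
        if not text.startswith('From '):      # mbox messages begin with "From "
            print('Does not start with From')
            break
        cur.execute('INSERT OR IGNORE INTO Messages (id, email) VALUES (?, ?)',
                    (msg_id, text))
        conn.commit()                         # commit per message: safe to interrupt
        time.sleep(1)                         # one message per second, as noted above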
39 | 40 | Navigate to the folder where you extracted the gmane.zip 41 | 42 | Note: Windows has difficulty in displaying UTF-8 characters 43 | in the console so for each console window you open, you may need 44 | to type the following command before running this code: 45 | 46 | chcp 65001 47 | 48 | http://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how 49 | 50 | Here is a run of gmane.py getting the last five messages of the 51 | sakai developer list: 52 | 53 | Mac: python3 gmane.py 54 | Win: gmane.py 55 | 56 | How many messages:10 57 | http://mbox.dr-chuck.net/sakai.devel/1/2 2662 58 | ggolden@umich.edu 2005-12-08T23:34:30-06:00 call for participation: developers documentation 59 | http://mbox.dr-chuck.net/sakai.devel/2/3 2434 60 | csev@umich.edu 2005-12-09T00:58:01-05:00 report from the austin conference: sakai developers break into song 61 | http://mbox.dr-chuck.net/sakai.devel/3/4 3055 62 | kevin.carpenter@rsmart.com 2005-12-09T09:01:49-07:00 cas and sakai 1.5 63 | http://mbox.dr-chuck.net/sakai.devel/4/5 11721 64 | michael.feldstein@suny.edu 2005-12-09T09:43:12-05:00 re: lms/vle rants/comments 65 | http://mbox.dr-chuck.net/sakai.devel/5/6 9443 66 | john@caret.cam.ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments 67 | Does not start with From 68 | 69 | The program scans content.sqlite from 1 up to the first message number not 70 | already spidered and starts spidering at that message. It continues spidering 71 | until it has spidered the desired number of messages or it reaches a page 72 | that does not appear to be a properly formatted message. 73 | 74 | Sometimes gmane.org is missing a message. Perhaps administrators can delete messages 75 | or perhaps they get lost - I don't know. If your spider stops, and it seems it has hit 76 | a missing message, go into the SQLite Manager and add a row with the missing id - leave 77 | all the other fields blank - and then restart gmane.py. This will unstick the 78 | spidering process and allow it to continue. These empty messages will be ignored in the next 79 | phase of the process. 80 | 81 | One nice thing is that once you have spidered all of the messages and have them in 82 | content.sqlite, you can run gmane.py again to get new messages as they get sent to the 83 | list. gmane.py will quickly scan to the end of the already-spidered pages and check 84 | if there are new messages and then quickly retrieve those messages and add them 85 | to content.sqlite. 86 | 87 | The content.sqlite data is pretty raw, with an inefficient data model, and not compressed. 88 | This is intentional as it allows you to look at content.sqlite to debug the process. 89 | It would be a bad idea to run any queries against this database as they would be 90 | slow. 91 | 92 | The second process is running the program gmodel.py. gmodel.py reads the rough/raw 93 | data from content.sqlite and produces a cleaned-up and well-modeled version of the 94 | data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X 95 | smaller) than content.sqlite because it also compresses the header and body text. 96 | 97 | Each time gmodel.py runs - it completely wipes out and re-builds index.sqlite, allowing 98 | you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the 99 | data cleaning process.
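As a rough illustration of why the compressed index.sqlite can come out around
10X smaller, the sketch below compresses a message body before storing it.
Note that zlib and the one-column Texts table are assumptions made for this
example only - the README says the header and body text are compressed, but
not how:

    import sqlite3
    import zlib

    conn = sqlite3.connect('index.sqlite')
    cur = conn.cursor()
    # Hypothetical table purely for illustration, not the gmodel.py schema.
    cur.execute('CREATE TABLE IF NOT EXISTS Texts (id INTEGER PRIMARY KEY, body BLOB)')

    body = 'Subject: re: lms/vle rants/comments\n' + 'the quick brown fox ' * 100
    compressed = zlib.compress(body.encode())
    print(len(body.encode()), '->', len(compressed))  # repetitive mail text shrinks a lot

    cur.execute('INSERT OR REPLACE INTO Texts (id, body) VALUES (?, ?)',
                (1, compressed))
    conn.commit()

    # Reading it back is one decompress away:
    cur.execute('SELECT body FROM Texts WHERE id = ?', (1,))
    print(zlib.decompress(cur.fetchone()[0]).decode()[:35])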
100 | 101 | Running gmodel.py works as follows: 102 | 103 | Mac: python3 gmodel.py 104 | Win: gmodel.py 105 | 106 | Loaded allsenders 1588 and mapping 28 dns mapping 1 107 | 1 2005-12-08T23:34:30-06:00 ggolden22@mac.com 108 | 251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu 109 | 501 2006-01-12T11:17:34-05:00 lance@indiana.edu 110 | 751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu 111 | ... 112 | 113 | The gmodel.py program does a number of data cleaning steps. 114 | 115 | Domain names are truncated to two levels for .com, .org, .edu, and .net; 116 | other domain names are truncated to three levels. So si.umich.edu becomes 117 | umich.edu and caret.cam.ac.uk becomes cam.ac.uk (a short sketch of this rule appears below). Also mail addresses are 118 | forced to lower case and some of the @gmane.org addresses like the following 119 | 120 | arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org 121 | 122 | are converted to the real address whenever there is a matching real email 123 | address elsewhere in the message corpus. 124 | 125 | If you look in the content.sqlite database there are two tables that allow 126 | you to map both domain names and individual email addresses that change over 127 | the lifetime of the email list. For example, Steve Githens used the following 128 | email addresses over the life of the Sakai developer list: 129 | 130 | s-githens@northwestern.edu 131 | sgithens@cam.ac.uk 132 | swgithen@mtu.edu 133 | 134 | We can add two entries to the Mapping table 135 | 136 | s-githens@northwestern.edu -> swgithen@mtu.edu 137 | sgithens@cam.ac.uk -> swgithen@mtu.edu 138 | 139 | And so all the mail messages will be collected under one sender even if 140 | they used several email addresses over the lifetime of the mailing list. 141 | 142 | You can also make similar entries in the DNSMapping table if there are multiple 143 | DNS names you want mapped to a single DNS. In the Sakai data I add the following 144 | mapping: 145 | 146 | iupui.edu -> indiana.edu 147 | 148 | So all the folks from the various Indiana University campuses are tracked together. 149 | 150 | You can re-run the gmodel.py over and over as you look at the data, and add mappings 151 | to make the data cleaner and cleaner. When you are done, you will have a nicely 152 | indexed version of the email in index.sqlite. This is the file to use to do data 153 | analysis. With this file, data analysis will be really quick. 154 | 155 | The first, simplest data analysis is to do a "who does the most" and "which 156 | organization does the most"? This is done using gbasic.py: 157 | 158 | Mac: python3 gbasic.py 159 | Win: gbasic.py 160 | 161 | How many to dump? 5 162 | Loaded messages= 51330 subjects= 25033 senders= 1584 163 | 164 | Top 5 Email list participants 165 | steve.swinsburg@gmail.com 2657 166 | azeckoski@unicon.net 1742 167 | ieb@tfd.co.uk 1591 168 | csev@umich.edu 1304 169 | david.horwitz@uct.ac.za 1184 170 | 171 | Top 5 Email list organizations 172 | gmail.com 7339 173 | umich.edu 6243 174 | uct.ac.za 2451 175 | indiana.edu 2258 176 | unicon.net 2055 177 | 178 | You can look at the data in index.sqlite and if you find a problem, you 179 | can update the Mapping table and DNSMapping table in content.sqlite and 180 | re-run gmodel.py.
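To make the domain-truncation rule above concrete, here is a tiny sketch of it
in Python. This illustrates the rule as stated in this README, not the code
gmodel.py actually uses:

    # Keep two levels for .com/.org/.edu/.net, three for everything else.
    def truncate_domain(domain):
        pieces = domain.lower().split('.')
        keep = 2 if pieces[-1] in ('com', 'org', 'edu', 'net') else 3
        return '.'.join(pieces[-keep:])

    print(truncate_domain('si.umich.edu'))      # umich.edu
    print(truncate_domain('caret.cam.ac.uk'))   # cam.ac.uk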
181 | 182 | There is a simple visualization of the word frequency in the subject lines 183 | in the file gword.py: 184 | 185 | Mac: python3 gword.py 186 | Win: gword.py 187 | 188 | Range of counts: 33229 129 189 | Output written to gword.js 190 | 191 | This produces the file gword.js which you can visualize using the file 192 | gword.htm. 193 | 194 | A second visualization is in gline.py. It visualizes email participation by 195 | organizations over time. 196 | 197 | Mac: python3 gline.py 198 | Win: gline.py 199 | 200 | Loaded messages= 51330 subjects= 25033 senders= 1584 201 | Top 10 Organizations 202 | ['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk'] 203 | Output written to gline.js 204 | 205 | Its output is written to gline.js which is visualized using gline.htm. 206 | 207 | Some URLs for visualization ideas: 208 | 209 | https://developers.google.com/chart/ 210 | 211 | https://developers.google.com/chart/interactive/docs/gallery/motionchart 212 | 213 | https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats 214 | 215 | https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline 216 | 217 | http://bost.ocks.org/mike/uberdata/ 218 | 219 | http://mbostock.github.io/d3/talk/20111018/calendar.html 220 | 221 | http://nltk.org/install.html 222 | 223 | As always - comments welcome. 224 | 225 | -- Dr. Chuck 226 | Sun Sep 29 00:11:01 EDT 2013 227 | 228 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite-journal.temp: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite-journal.temp -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite.first.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/content.sqlite.first.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/d3.layout.cloud.js: -------------------------------------------------------------------------------- 1 | // Word cloud layout by Jason Davies, http://www.jasondavies.com/word-cloud/ 2 | // Algorithm due to Jonathan Feinberg, http://static.mrfeinberg.com/bv_ch03.pdf 3 | (function(exports) { 4 | function cloud() { 5 | var size = [256, 256], 6 | text = cloudText, 7 | font = cloudFont, 8 | fontSize = cloudFontSize, 9 | fontStyle =
cloudFontNormal, 10 | fontWeight = cloudFontNormal, 11 | rotate = cloudRotate, 12 | padding = cloudPadding, 13 | spiral = archimedeanSpiral, 14 | words = [], 15 | timeInterval = Infinity, 16 | event = d3.dispatch("word", "end"), 17 | timer = null, 18 | cloud = {}; 19 | 20 | cloud.start = function() { 21 | var board = zeroArray((size[0] >> 5) * size[1]), 22 | bounds = null, 23 | n = words.length, 24 | i = -1, 25 | tags = [], 26 | data = words.map(function(d, i) { 27 | d.text = text.call(this, d, i); 28 | d.font = font.call(this, d, i); 29 | d.style = fontStyle.call(this, d, i); 30 | d.weight = fontWeight.call(this, d, i); 31 | d.rotate = rotate.call(this, d, i); 32 | d.size = ~~fontSize.call(this, d, i); 33 | d.padding = cloudPadding.call(this, d, i); 34 | return d; 35 | }).sort(function(a, b) { return b.size - a.size; }); 36 | 37 | if (timer) clearInterval(timer); 38 | timer = setInterval(step, 0); 39 | step(); 40 | 41 | return cloud; 42 | 43 | function step() { 44 | var start = +new Date, 45 | d; 46 | while (+new Date - start < timeInterval && ++i < n && timer) { 47 | d = data[i]; 48 | d.x = (size[0] * (Math.random() + .5)) >> 1; 49 | d.y = (size[1] * (Math.random() + .5)) >> 1; 50 | cloudSprite(d, data, i); 51 | if (place(board, d, bounds)) { 52 | tags.push(d); 53 | event.word(d); 54 | if (bounds) cloudBounds(bounds, d); 55 | else bounds = [{x: d.x + d.x0, y: d.y + d.y0}, {x: d.x + d.x1, y: d.y + d.y1}]; 56 | // Temporary hack 57 | d.x -= size[0] >> 1; 58 | d.y -= size[1] >> 1; 59 | } 60 | } 61 | if (i >= n) { 62 | cloud.stop(); 63 | event.end(tags, bounds); 64 | } 65 | } 66 | } 67 | 68 | cloud.stop = function() { 69 | if (timer) { 70 | clearInterval(timer); 71 | timer = null; 72 | } 73 | return cloud; 74 | }; 75 | 76 | cloud.timeInterval = function(x) { 77 | if (!arguments.length) return timeInterval; 78 | timeInterval = x == null ? Infinity : x; 79 | return cloud; 80 | }; 81 | 82 | function place(board, tag, bounds) { 83 | var perimeter = [{x: 0, y: 0}, {x: size[0], y: size[1]}], 84 | startX = tag.x, 85 | startY = tag.y, 86 | maxDelta = Math.sqrt(size[0] * size[0] + size[1] * size[1]), 87 | s = spiral(size), 88 | dt = Math.random() < .5 ? 1 : -1, 89 | t = -dt, 90 | dxdy, 91 | dx, 92 | dy; 93 | 94 | while (dxdy = s(t += dt)) { 95 | dx = ~~dxdy[0]; 96 | dy = ~~dxdy[1]; 97 | 98 | if (Math.min(dx, dy) > maxDelta) break; 99 | 100 | tag.x = startX + dx; 101 | tag.y = startY + dy; 102 | 103 | if (tag.x + tag.x0 < 0 || tag.y + tag.y0 < 0 || 104 | tag.x + tag.x1 > size[0] || tag.y + tag.y1 > size[1]) continue; 105 | // TODO only check for collisions within current bounds. 106 | if (!bounds || !cloudCollide(tag, board, size[0])) { 107 | if (!bounds || collideRects(tag, bounds)) { 108 | var sprite = tag.sprite, 109 | w = tag.width >> 5, 110 | sw = size[0] >> 5, 111 | lx = tag.x - (w << 4), 112 | sx = lx & 0x7f, 113 | msx = 32 - sx, 114 | h = tag.y1 - tag.y0, 115 | x = (tag.y + tag.y0) * sw + (lx >> 5), 116 | last; 117 | for (var j = 0; j < h; j++) { 118 | last = 0; 119 | for (var i = 0; i <= w; i++) { 120 | board[x + i] |= (last << msx) | (i < w ? 
(last = sprite[j * w + i]) >>> sx : 0); 121 | } 122 | x += sw; 123 | } 124 | delete tag.sprite; 125 | return true; 126 | } 127 | } 128 | } 129 | return false; 130 | } 131 | 132 | cloud.words = function(x) { 133 | if (!arguments.length) return words; 134 | words = x; 135 | return cloud; 136 | }; 137 | 138 | cloud.size = function(x) { 139 | if (!arguments.length) return size; 140 | size = [+x[0], +x[1]]; 141 | return cloud; 142 | }; 143 | 144 | cloud.font = function(x) { 145 | if (!arguments.length) return font; 146 | font = d3.functor(x); 147 | return cloud; 148 | }; 149 | 150 | cloud.fontStyle = function(x) { 151 | if (!arguments.length) return fontStyle; 152 | fontStyle = d3.functor(x); 153 | return cloud; 154 | }; 155 | 156 | cloud.fontWeight = function(x) { 157 | if (!arguments.length) return fontWeight; 158 | fontWeight = d3.functor(x); 159 | return cloud; 160 | }; 161 | 162 | cloud.rotate = function(x) { 163 | if (!arguments.length) return rotate; 164 | rotate = d3.functor(x); 165 | return cloud; 166 | }; 167 | 168 | cloud.text = function(x) { 169 | if (!arguments.length) return text; 170 | text = d3.functor(x); 171 | return cloud; 172 | }; 173 | 174 | cloud.spiral = function(x) { 175 | if (!arguments.length) return spiral; 176 | spiral = spirals[x + ""] || x; 177 | return cloud; 178 | }; 179 | 180 | cloud.fontSize = function(x) { 181 | if (!arguments.length) return fontSize; 182 | fontSize = d3.functor(x); 183 | return cloud; 184 | }; 185 | 186 | cloud.padding = function(x) { 187 | if (!arguments.length) return padding; 188 | padding = d3.functor(x); 189 | return cloud; 190 | }; 191 | 192 | return d3.rebind(cloud, event, "on"); 193 | } 194 | 195 | function cloudText(d) { 196 | return d.text; 197 | } 198 | 199 | function cloudFont() { 200 | return "serif"; 201 | } 202 | 203 | function cloudFontNormal() { 204 | return "normal"; 205 | } 206 | 207 | function cloudFontSize(d) { 208 | return Math.sqrt(d.value); 209 | } 210 | 211 | function cloudRotate() { 212 | return (~~(Math.random() * 6) - 3) * 30; 213 | } 214 | 215 | function cloudPadding() { 216 | return 1; 217 | } 218 | 219 | // Fetches a monochrome sprite bitmap for the specified text. 220 | // Load in batches for speed. 
221 | function cloudSprite(d, data, di) { 222 | if (d.sprite) return; 223 | c.clearRect(0, 0, (cw << 5) / ratio, ch / ratio); 224 | var x = 0, 225 | y = 0, 226 | maxh = 0, 227 | n = data.length; 228 | di--; 229 | while (++di < n) { 230 | d = data[di]; 231 | c.save(); 232 | c.font = d.style + " " + d.weight + " " + ~~((d.size + 1) / ratio) + "px " + d.font; 233 | var w = c.measureText(d.text + "m").width * ratio, 234 | h = d.size << 1; 235 | if (d.rotate) { 236 | var sr = Math.sin(d.rotate * cloudRadians), 237 | cr = Math.cos(d.rotate * cloudRadians), 238 | wcr = w * cr, 239 | wsr = w * sr, 240 | hcr = h * cr, 241 | hsr = h * sr; 242 | w = (Math.max(Math.abs(wcr + hsr), Math.abs(wcr - hsr)) + 0x1f) >> 5 << 5; 243 | h = ~~Math.max(Math.abs(wsr + hcr), Math.abs(wsr - hcr)); 244 | } else { 245 | w = (w + 0x1f) >> 5 << 5; 246 | } 247 | if (h > maxh) maxh = h; 248 | if (x + w >= (cw << 5)) { 249 | x = 0; 250 | y += maxh; 251 | maxh = 0; 252 | } 253 | if (y + h >= ch) break; 254 | c.translate((x + (w >> 1)) / ratio, (y + (h >> 1)) / ratio); 255 | if (d.rotate) c.rotate(d.rotate * cloudRadians); 256 | c.fillText(d.text, 0, 0); 257 | c.restore(); 258 | d.width = w; 259 | d.height = h; 260 | d.xoff = x; 261 | d.yoff = y; 262 | d.x1 = w >> 1; 263 | d.y1 = h >> 1; 264 | d.x0 = -d.x1; 265 | d.y0 = -d.y1; 266 | x += w; 267 | } 268 | var pixels = c.getImageData(0, 0, (cw << 5) / ratio, ch / ratio).data, 269 | sprite = []; 270 | while (--di >= 0) { 271 | d = data[di]; 272 | var w = d.width, 273 | w32 = w >> 5, 274 | h = d.y1 - d.y0, 275 | p = d.padding; 276 | // Zero the buffer 277 | for (var i = 0; i < h * w32; i++) sprite[i] = 0; 278 | x = d.xoff; 279 | if (x == null) return; 280 | y = d.yoff; 281 | var seen = 0, 282 | seenRow = -1; 283 | for (var j = 0; j < h; j++) { 284 | for (var i = 0; i < w; i++) { 285 | var k = w32 * j + (i >> 5), 286 | m = pixels[((y + j) * (cw << 5) + (x + i)) << 2] ? 1 << (31 - (i % 32)) : 0; 287 | if (p) { 288 | if (j) sprite[k - w32] |= m; 289 | if (j < w - 1) sprite[k + w32] |= m; 290 | m |= (m << 1) | (m >> 1); 291 | } 292 | sprite[k] |= m; 293 | seen |= m; 294 | } 295 | if (seen) seenRow = j; 296 | else { 297 | d.y0++; 298 | h--; 299 | j--; 300 | y++; 301 | } 302 | } 303 | d.y1 = d.y0 + seenRow; 304 | d.sprite = sprite.slice(0, (d.y1 - d.y0) * w32); 305 | } 306 | } 307 | 308 | // Use mask-based collision detection. 309 | function cloudCollide(tag, board, sw) { 310 | sw >>= 5; 311 | var sprite = tag.sprite, 312 | w = tag.width >> 5, 313 | lx = tag.x - (w << 4), 314 | sx = lx & 0x7f, 315 | msx = 32 - sx, 316 | h = tag.y1 - tag.y0, 317 | x = (tag.y + tag.y0) * sw + (lx >> 5), 318 | last; 319 | for (var j = 0; j < h; j++) { 320 | last = 0; 321 | for (var i = 0; i <= w; i++) { 322 | if (((last << msx) | (i < w ? 
(last = sprite[j * w + i]) >>> sx : 0)) 323 | & board[x + i]) return true; 324 | } 325 | x += sw; 326 | } 327 | return false; 328 | } 329 | 330 | function cloudBounds(bounds, d) { 331 | var b0 = bounds[0], 332 | b1 = bounds[1]; 333 | if (d.x + d.x0 < b0.x) b0.x = d.x + d.x0; 334 | if (d.y + d.y0 < b0.y) b0.y = d.y + d.y0; 335 | if (d.x + d.x1 > b1.x) b1.x = d.x + d.x1; 336 | if (d.y + d.y1 > b1.y) b1.y = d.y + d.y1; 337 | } 338 | 339 | function collideRects(a, b) { 340 | return a.x + a.x1 > b[0].x && a.x + a.x0 < b[1].x && a.y + a.y1 > b[0].y && a.y + a.y0 < b[1].y; 341 | } 342 | 343 | function archimedeanSpiral(size) { 344 | var e = size[0] / size[1]; 345 | return function(t) { 346 | return [e * (t *= .1) * Math.cos(t), t * Math.sin(t)]; 347 | }; 348 | } 349 | 350 | function rectangularSpiral(size) { 351 | var dy = 4, 352 | dx = dy * size[0] / size[1], 353 | x = 0, 354 | y = 0; 355 | return function(t) { 356 | var sign = t < 0 ? -1 : 1; 357 | // See triangular numbers: T_n = n * (n + 1) / 2. 358 | switch ((Math.sqrt(1 + 4 * sign * t) - sign) & 3) { 359 | case 0: x += dx; break; 360 | case 1: y += dy; break; 361 | case 2: x -= dx; break; 362 | default: y -= dy; break; 363 | } 364 | return [x, y]; 365 | }; 366 | } 367 | 368 | // TODO reuse arrays? 369 | function zeroArray(n) { 370 | var a = [], 371 | i = -1; 372 | while (++i < n) a[i] = 0; 373 | return a; 374 | } 375 | 376 | var cloudRadians = Math.PI / 180, 377 | cw = 1 << 11 >> 5, 378 | ch = 1 << 11, 379 | canvas, 380 | ratio = 1; 381 | 382 | if (typeof document !== "undefined") { 383 | canvas = document.createElement("canvas"); 384 | canvas.width = 1; 385 | canvas.height = 1; 386 | ratio = Math.sqrt(canvas.getContext("2d").getImageData(0, 0, 1, 1).data.length >> 2); 387 | canvas.width = (cw << 5) / ratio; 388 | canvas.height = ch / ratio; 389 | } else { 390 | // node-canvas support 391 | var Canvas = require("canvas"); 392 | canvas = new Canvas(cw << 5, ch); 393 | } 394 | 395 | var c = canvas.getContext("2d"), 396 | spirals = { 397 | archimedean: archimedeanSpiral, 398 | rectangular: rectangularSpiral 399 | }; 400 | c.fillStyle = "red"; 401 | c.textAlign = "center"; 402 | 403 | exports.cloud = cloud; 404 | })(typeof exports === "undefined" ? d3.layout || (d3.layout = {}) : exports); 405 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gbasic.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import zlib 4 | 5 | howmany = int(input("How many to dump? 
")) 6 | 7 | conn = sqlite3.connect('index.sqlite') 8 | cur = conn.cursor() 9 | 10 | cur.execute('SELECT id, sender FROM Senders') 11 | senders = dict() 12 | for message_row in cur : 13 | senders[message_row[0]] = message_row[1] 14 | 15 | cur.execute('SELECT id, subject FROM Subjects') 16 | subjects = dict() 17 | for message_row in cur : 18 | subjects[message_row[0]] = message_row[1] 19 | 20 | # cur.execute('SELECT id, guid,sender_id,subject_id,headers,body FROM Messages') 21 | cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages') 22 | messages = dict() 23 | for message_row in cur : 24 | messages[message_row[0]] = (message_row[1],message_row[2],message_row[3],message_row[4]) 25 | 26 | print("Loaded messages=",len(messages),"subjects=",len(subjects),"senders=",len(senders)) 27 | 28 | sendcounts = dict() 29 | sendorgs = dict() 30 | for (message_id, message) in list(messages.items()): 31 | sender = message[1] 32 | sendcounts[sender] = sendcounts.get(sender,0) + 1 33 | pieces = senders[sender].split("@") 34 | if len(pieces) != 2 : continue 35 | dns = pieces[1] 36 | sendorgs[dns] = sendorgs.get(dns,0) + 1 37 | 38 | print('') 39 | print('Top',howmany,'Email list participants') 40 | 41 | x = sorted(sendcounts, key=sendcounts.get, reverse=True) 42 | for k in x[:howmany]: 43 | print(senders[k], sendcounts[k]) 44 | if sendcounts[k] < 10 : break 45 | 46 | print('') 47 | print('Top',howmany,'Email list organizations') 48 | 49 | x = sorted(sendorgs, key=sendorgs.get, reverse=True) 50 | for k in x[:howmany]: 51 | print(k, sendorgs[k]) 52 | if sendorgs[k] < 10 : break 53 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gbasic.py.running.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gbasic.py.running.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gbasic.py.running2.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gbasic.py.running2.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gline.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 19 | 20 | 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gline.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gline.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gline.js: 
-------------------------------------------------------------------------------- 1 | gline = [ ['Year','umich.edu','indiana.edu','ucdavis.edu','ufp.pt','uct.ac.za','berkeley.edu','columbia.edu','etudes.org','gmail.com','mac.com'], 2 | ['2005-12',57,12,11,10,14,12,13,5,10,12], 3 | ['2006-01',93,29,28,29,25,25,22,26,16,12] 4 | ]; 5 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gline.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import zlib 4 | 5 | conn = sqlite3.connect('index.sqlite') 6 | cur = conn.cursor() 7 | 8 | cur.execute('SELECT id, sender FROM Senders') 9 | senders = dict() 10 | for message_row in cur : 11 | senders[message_row[0]] = message_row[1] 12 | 13 | cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages') 14 | messages = dict() 15 | for message_row in cur : 16 | messages[message_row[0]] = (message_row[1],message_row[2],message_row[3],message_row[4]) 17 | 18 | print("Loaded messages=",len(messages),"senders=",len(senders)) 19 | 20 | sendorgs = dict() 21 | for (message_id, message) in list(messages.items()): 22 | sender = message[1] 23 | pieces = senders[sender].split("@") 24 | if len(pieces) != 2 : continue 25 | dns = pieces[1] 26 | sendorgs[dns] = sendorgs.get(dns,0) + 1 27 | 28 | # pick the top schools 29 | orgs = sorted(sendorgs, key=sendorgs.get, reverse=True) 30 | orgs = orgs[:10] 31 | print("Top 10 Organizations") 32 | print(orgs) 33 | 34 | counts = dict() 35 | months = list() 36 | # cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages') 37 | for (message_id, message) in list(messages.items()): 38 | sender = message[1] 39 | pieces = senders[sender].split("@") 40 | if len(pieces) != 2 : continue 41 | dns = pieces[1] 42 | if dns not in orgs : continue 43 | month = message[3][:7] # first seven characters of sent_at are YYYY-MM 44 | if month not in months : months.append(month) 45 | key = (month, dns) 46 | counts[key] = counts.get(key,0) + 1 47 | 48 | months.sort() 49 | # print counts 50 | # print months 51 | 52 | fhand = open('gline.js','w') 53 | fhand.write("gline = [ ['Year'") 54 | for org in orgs: 55 | fhand.write(",'"+org+"'") 56 | fhand.write("]") 57 | 58 | for month in months: 59 | fhand.write(",\n['"+month+"'") 60 | for org in orgs: 61 | key = (month, org) 62 | val = counts.get(key,0) 63 | fhand.write(","+str(val)) 64 | fhand.write("]") 65 | 66 | fhand.write("\n];\n") 67 | fhand.close() 68 | 69 | print("Output written to gline.js") 70 | print("Open gline.htm to visualize the data") 71 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gmane.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import ssl 4 | import urllib.request, urllib.parse, urllib.error 5 | from urllib.parse import urljoin 6 | from urllib.parse import urlparse 7 | import re 8 | from datetime import datetime, timedelta 9 | 10 | # Not all systems have this so conditionally define parser 11 | try: 12 | import dateutil.parser as parser 13 | except: 14 | pass 15 | 16 | def parsemaildate(md) : 17 | # See if we have dateutil 18 | try: 19 | pdate = parser.parse(md) 20 | test_at = pdate.isoformat() 21 | return test_at 22 | except: 23 | pass 24 | 25 | # Non-dateutil version - we try our best 26 | 27 | pieces = md.split() 28 | notz = " ".join(pieces[:4]).strip() 29 | 30 | # Try a bunch of format variations - strptime() is *lame* 31 | dnotz = None 32 | for form in [ '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S', 33 | '%d %b %Y %H:%M', '%d %b %Y %H:%M', '%d %b %y %H:%M:%S', 34 | '%d %b %y %H:%M:%S', '%d %b %y %H:%M', '%d %b %y %H:%M' ] : 35 | try: 36 | dnotz = datetime.strptime(notz, form) 37 | break 38 | except: 39 | continue 40 | 41 | if dnotz is None : 42 | # print('Bad Date:',md) 43 | return None 44 | 45 | iso = dnotz.isoformat() 46 | 47 | tz = "+0000" 48 | try: 49 | tz = pieces[4] 50 | ival = int(tz) # Only want numeric timezone values 51 | if tz == '-0000' : tz = '+0000' 52 | tzh = tz[:3] 53 | tzm = tz[3:] 54 | tz = tzh+":"+tzm 55 | except: 56 | pass 57 | 58 | return iso+tz 59 | 60 | # Ignore SSL certificate errors 61 | ctx = ssl.create_default_context() 62 | ctx.check_hostname = False 63 | ctx.verify_mode = ssl.CERT_NONE 64 | 65 | conn = sqlite3.connect('content.sqlite') 66 | cur = conn.cursor() 67 | 68 | baseurl = "http://mbox.dr-chuck.net/sakai.devel/" 69 | 70 | cur.execute('''CREATE TABLE IF NOT EXISTS Messages 71 | (id INTEGER UNIQUE, email TEXT, sent_at TEXT, 72 | subject TEXT, headers TEXT, body TEXT)''') 73 | 74 | # Pick up where we left off 75 | start = None 76 | cur.execute('SELECT max(id) FROM Messages' ) 77 | try: 78 | row = cur.fetchone() 79 | if row is None : 80 | start = 0 81 | else: 82 | start = row[0] 83 | except: 84 | start = 0 85 | 86 | if start is None : start = 0 87 | 88 | many = 0 89 | count = 0 90 | fail = 0 91 | while True: 92 | if ( many < 1 ) : 93 | conn.commit() 94 | sval = input('How many messages:') 95 | if ( len(sval) < 1 ) : break 96 | many = int(sval) 97 | 98 | start = start + 1 99 | cur.execute('SELECT id FROM Messages WHERE id=?', (start,) ) 100 | try: 101 | row = cur.fetchone() 102 | if row is not None : continue 103 | except: 104 | row = None 105 | 106 | many = many - 1 107 | url = baseurl + str(start) + '/' + str(start + 1) 108 | 109 | text = "None" 110 | try: 111 | # Open with a timeout of 30 seconds 112 | document = urllib.request.urlopen(url, None, 30, context=ctx) 113 | text = document.read().decode() 114 | if document.getcode() != 200 : 115 | print("Error code=",document.getcode(), url) 116 | break 117 | except KeyboardInterrupt: 118 | print('') 119 | print('Program interrupted by user...') 120 | break 121 | except Exception as e: 122 | print("Unable to retrieve or parse page",url) 123 | print("Error",e) 124 | fail = fail + 1 125 | if fail > 5 : break 126 | continue 127 | 128 | print(url,len(text)) 129 | count = count + 1 130 | 131 | if not text.startswith("From "): 132 | print(text) 133 | print("Did not find From ") 134 | fail = fail + 1 135 | if fail > 5 : break 136 | continue 137 | 138 | pos = text.find("\n\n") 139 | if pos > 0 : 140 | hdr = text[:pos] 141 | body = text[pos+2:] 142 | else: 143 | print(text) 144 | print("Could not find break between headers and body") 145 | fail = fail + 1 146 | if fail > 5 : break 147 | continue 148 | 149 | email = None 150 | x = re.findall('\nFrom: .* <(\S+@\S+)>\n', hdr) 151 | if len(x) == 1 : 152 | email = x[0] 153 | email = email.strip().lower() 154 | email = email.replace("<","") 155 | else: 156 | x = re.findall('\nFrom: (\S+@\S+)\n', hdr) 157 | if len(x) == 1 : 158 | email = x[0] 159 | email = email.strip().lower() 160 | email = email.replace("<","") 161 | 162 | date = None 163 | y = re.findall('\nDate: .*, (.*)\n', hdr) 164 | if len(y) == 1 : 165 | tdate = y[0] 166 | tdate = tdate[:26] 167 | try: 168 | sent_at = parsemaildate(tdate) 169 | except: 170 | print(text) 171 | print("Parse fail",tdate) 172 | fail = fail + 1 173 | if fail > 5 : break 174 | continue 175 | 176 | subject = None 177 | z = re.findall('\nSubject: (.*)\n', hdr) 178 | if len(z) == 1 : subject = z[0].strip().lower() 179 | 180 | # Reset the fail counter 181 | fail = 0 182 | print(" ",email,sent_at,subject) 183 | cur.execute('''INSERT OR IGNORE INTO Messages (id, email, sent_at, subject, headers, body) 184 | VALUES ( ?, ?, ?, ?, ?, ? )''', ( start, email, sent_at, subject, hdr, body)) 185 | if count % 50 == 0 : conn.commit() 186 | if count % 100 == 0 : time.sleep(1) 187 | 188 | conn.commit() 189 | cur.close() -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gmodel.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import re 4 | import zlib 5 | from datetime import datetime, timedelta 6 | 7 | # Not all systems have this 8 | try: 9 | import dateutil.parser as parser 10 | except: 11 | pass 12 | 13 | dnsmapping = dict() 14 | mapping = dict() 15 | 16 | def fixsender(sender,allsenders=None) : 17 | global dnsmapping 18 | global mapping 19 | if sender is None : return None 20 | sender = sender.strip().lower() 21 | sender = sender.replace('<','').replace('>','') 22 | 23 | # Check if we have a hacked gmane.org from address 24 | if allsenders is not None and sender.endswith('gmane.org') : 25 | pieces = sender.split('-') 26 | realsender = None 27 | for s in allsenders: 28 | if s.startswith(pieces[0]) : 29 | realsender = sender 30 | sender = s 31 | # print(realsender, sender) 32 | break 33 | if realsender is None : 34 | for s in mapping: 35 | if s.startswith(pieces[0]) : 36 | realsender = sender 37 | sender = mapping[s] 38 | # print(realsender, sender) 39 | break 40 | if realsender is None : sender = pieces[0] 41 | 42 | mpieces = sender.split("@") 43 | if len(mpieces) != 2 : return sender 44 | dns = mpieces[1] 45 | x = dns 46 | pieces = dns.split(".") 47 | if dns.endswith(".edu") or dns.endswith(".com") or dns.endswith(".org") or dns.endswith(".net") : 48 | dns = ".".join(pieces[-2:]) 49 | else: 50 | dns = ".".join(pieces[-3:]) 51 | # if dns != x : print(x,dns) 52 | # if dns != dnsmapping.get(dns,dns) : print(dns,dnsmapping.get(dns,dns)) 53 | dns = dnsmapping.get(dns,dns) 54 | return mpieces[0] + '@' + dns 55 | 56 | def parsemaildate(md) : 57 | # See if we have dateutil 58 | try: 59 | pdate = parser.parse(md) 60 | test_at = pdate.isoformat() 61 | return test_at 62 | except: 63 | pass 64 | 65 | # Non-dateutil version - we try our best 66 | 67 | pieces = md.split() 68 | notz = " ".join(pieces[:4]).strip() 69 | 70 | # Try a bunch of format variations - strptime() is *lame* 71 | dnotz = None 72 | for form in [ '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S', 73 | '%d %b %Y %H:%M', '%d %b %Y %H:%M', '%d %b %y %H:%M:%S', 74 | '%d %b %y %H:%M:%S', '%d %b %y %H:%M', '%d %b %y %H:%M' ] : 75 | try: 76 | dnotz = datetime.strptime(notz, form) 77 | break 78 | except: 79 | continue 80 | 81 | if dnotz is None : 82 | # print('Bad Date:',md) 83 | return None 84 | 85 | iso = dnotz.isoformat() 86 | 87 | tz = "+0000" 88 | try: 89 | tz = pieces[4] 90 | ival = int(tz) # Only want numeric timezone values 91 | if tz == '-0000' : tz = '+0000' 92 | tzh = tz[:3] 93 | tzm = tz[3:] 94 | tz = tzh+":"+tzm 95 | except: 96 | pass 97 | 98 | return iso+tz 99 | 100 | # Parse out the 
info... 101 | def parseheader(hdr, allsenders=None): 102 | if hdr is None or len(hdr) < 1 : return None 103 | sender = None 104 | x = re.findall('\nFrom: .* <(\S+@\S+)>\n', hdr) 105 | if len(x) >= 1 : 106 | sender = x[0] 107 | else: 108 | x = re.findall('\nFrom: (\S+@\S+)\n', hdr) 109 | if len(x) >= 1 : 110 | sender = x[0] 111 | 112 | # normalize the domain name of Email addresses 113 | sender = fixsender(sender, allsenders) 114 | 115 | date = None 116 | y = re.findall('\nDate: .*, (.*)\n', hdr) 117 | sent_at = None 118 | if len(y) >= 1 : 119 | tdate = y[0] 120 | tdate = tdate[:26] 121 | try: 122 | sent_at = parsemaildate(tdate) 123 | except Exception as e: 124 | # print('Date ignored ',tdate, e) 125 | return None 126 | 127 | subject = None 128 | z = re.findall('\nSubject: (.*)\n', hdr) 129 | if len(z) >= 1 : subject = z[0].strip().lower() 130 | 131 | guid = None 132 | z = re.findall('\nMessage-ID: (.*)\n', hdr) 133 | if len(z) >= 1 : guid = z[0].strip().lower() 134 | 135 | if sender is None or sent_at is None or subject is None or guid is None : 136 | return None 137 | return (guid, sender, subject, sent_at) 138 | 139 | conn = sqlite3.connect('index.sqlite') 140 | cur = conn.cursor() 141 | 142 | cur.execute('''DROP TABLE IF EXISTS Messages ''') 143 | cur.execute('''DROP TABLE IF EXISTS Senders ''') 144 | cur.execute('''DROP TABLE IF EXISTS Subjects ''') 145 | cur.execute('''DROP TABLE IF EXISTS Replies ''') 146 | 147 | cur.execute('''CREATE TABLE IF NOT EXISTS Messages 148 | (id INTEGER PRIMARY KEY, guid TEXT UNIQUE, sent_at INTEGER, 149 | sender_id INTEGER, subject_id INTEGER, 150 | headers BLOB, body BLOB)''') 151 | cur.execute('''CREATE TABLE IF NOT EXISTS Senders 152 | (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)''') 153 | cur.execute('''CREATE TABLE IF NOT EXISTS Subjects 154 | (id INTEGER PRIMARY KEY, subject TEXT UNIQUE)''') 155 | cur.execute('''CREATE TABLE IF NOT EXISTS Replies 156 | (from_id INTEGER, to_id INTEGER)''') 157 | 158 | conn_1 = sqlite3.connect('mapping.sqlite') 159 | cur_1 = conn_1.cursor() 160 | 161 | cur_1.execute('''SELECT old,new FROM DNSMapping''') 162 | for message_row in cur_1 : 163 | dnsmapping[message_row[0].strip().lower()] = message_row[1].strip().lower() 164 | 165 | mapping = dict() 166 | cur_1.execute('''SELECT old,new FROM Mapping''') 167 | for message_row in cur_1 : 168 | old = fixsender(message_row[0]) 169 | new = fixsender(message_row[1]) 170 | mapping[old] = fixsender(new) 171 | 172 | # Done with mapping.sqlite 173 | conn_1.close() 174 | 175 | # Open the main content (Read only) 176 | conn_1 = sqlite3.connect('file:content.sqlite?mode=ro', uri=True) 177 | cur_1 = conn_1.cursor() 178 | 179 | allsenders = list() 180 | cur_1.execute('''SELECT email FROM Messages''') 181 | for message_row in cur_1 : 182 | sender = fixsender(message_row[0]) 183 | if sender is None : continue 184 | if 'gmane.org' in sender : continue 185 | if sender in allsenders: continue 186 | allsenders.append(sender) 187 | 188 | print("Loaded allsenders",len(allsenders),"and mapping",len(mapping),"dns mapping",len(dnsmapping)) 189 | 190 | cur_1.execute('''SELECT headers, body, sent_at 191 | FROM Messages ORDER BY sent_at''') 192 | 193 | senders = dict() 194 | subjects = dict() 195 | guids = dict() 196 | 197 | count = 0 198 | 199 | for message_row in cur_1 : 200 | hdr = message_row[0] 201 | parsed = parseheader(hdr, allsenders) 202 | if parsed is None: continue 203 | (guid, sender, subject, sent_at) = parsed 204 | 205 | # Apply the sender mapping 206 | sender = 
mapping.get(sender,sender) 207 | 208 | count = count + 1 209 | if count % 250 == 1 : print(count,sent_at, sender) 210 | # print(guid, sender, subject, sent_at) 211 | 212 | if 'gmane.org' in sender: 213 | print("Error in sender ===", sender) 214 | 215 | sender_id = senders.get(sender,None) 216 | subject_id = subjects.get(subject,None) 217 | guid_id = guids.get(guid,None) 218 | 219 | if sender_id is None : 220 | cur.execute('INSERT OR IGNORE INTO Senders (sender) VALUES ( ? )', ( sender, ) ) 221 | conn.commit() 222 | cur.execute('SELECT id FROM Senders WHERE sender=? LIMIT 1', ( sender, )) 223 | try: 224 | row = cur.fetchone() 225 | sender_id = row[0] 226 | senders[sender] = sender_id 227 | except: 228 | print('Could not retrieve sender id',sender) 229 | break 230 | if subject_id is None : 231 | cur.execute('INSERT OR IGNORE INTO Subjects (subject) VALUES ( ? )', ( subject, ) ) 232 | conn.commit() 233 | cur.execute('SELECT id FROM Subjects WHERE subject=? LIMIT 1', ( subject, )) 234 | try: 235 | row = cur.fetchone() 236 | subject_id = row[0] 237 | subjects[subject] = subject_id 238 | except: 239 | print('Could not retrieve subject id',subject) 240 | break 241 | # print(sender_id, subject_id) 242 | cur.execute('INSERT OR IGNORE INTO Messages (guid,sender_id,subject_id,sent_at,headers,body) VALUES ( ?,?,?,datetime(?),?,? )', 243 | ( guid, sender_id, subject_id, sent_at, 244 | zlib.compress(message_row[0].encode()), zlib.compress(message_row[1].encode())) ) 245 | conn.commit() 246 | cur.execute('SELECT id FROM Messages WHERE guid=? LIMIT 1', ( guid, )) 247 | try: 248 | row = cur.fetchone() 249 | message_id = row[0] 250 | guids[guid] = message_id 251 | except: 252 | print('Could not retrieve guid id',guid) 253 | break 254 | 255 | cur.close() 256 | cur_1.close() 257 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gmodel.py.running.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gmodel.py.running.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gword.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 37 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gword.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gword.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gword.js: -------------------------------------------------------------------------------- 1 | gword = [{text: 'sakai', size: 100}, 2 | {text: 'with', size: 38}, 3 | {text: 'tool', size: 36}, 4 | {text: 'error', size: 35}, 5 | {text: 'webdav', size: 34}, 6 | {text: 'resources', size: 32}, 7 | {text: 'mysql', size: 30}, 8 | {text: 'problems', size: 29}, 9 | {text: 
'changes', size: 28}, 10 | {text: 'problem', size: 26}, 11 | {text: 'working', size: 25}, 12 | {text: 'message', size: 25}, 13 | {text: 'into', size: 24}, 14 | {text: 'content', size: 24}, 15 | {text: 'site', size: 24}, 16 | {text: 'workspace', size: 24}, 17 | {text: 'melete', size: 24}, 18 | {text: 'course', size: 23}, 19 | {text: 'broken', size: 23}, 20 | {text: 'from', size: 23}, 21 | {text: 'password', size: 23}, 22 | {text: 'forgotten', size: 23}, 23 | {text: 'feature', size: 23}, 24 | {text: 'profile', size: 23}, 25 | {text: 'rutgers', size: 23}, 26 | {text: 'accessservlet', size: 23}, 27 | {text: 'aliases', size: 23}, 28 | {text: 'unexpectedly', size: 23}, 29 | {text: 'taken', size: 23}, 30 | {text: 'portalxlogin', size: 23}, 31 | {text: 'samigo', size: 23}, 32 | {text: 'oracle', size: 23}, 33 | {text: 'eclipse', size: 23}, 34 | {text: 'view', size: 23}, 35 | {text: 'tools', size: 23}, 36 | {text: 'update', size: 23}, 37 | {text: 'version', size: 23}, 38 | {text: 'maven', size: 22}, 39 | {text: 'email', size: 22}, 40 | {text: 'center', size: 22}, 41 | {text: 'jforum', size: 22}, 42 | {text: 'files', size: 22}, 43 | {text: 'syllabus', size: 22}, 44 | {text: 'desktop', size: 21}, 45 | {text: 'connection', size: 21}, 46 | {text: 'file', size: 21}, 47 | {text: 'worksite', size: 21}, 48 | {text: 'portal', size: 21}, 49 | {text: 'visual', size: 21}, 50 | {text: 'basic', size: 21}, 51 | {text: 'different', size: 21}, 52 | {text: 'missing', size: 21}, 53 | {text: 'upload', size: 21}, 54 | {text: 'importing', size: 21}, 55 | {text: 'option', size: 21}, 56 | {text: 'information', size: 21}, 57 | {text: 'creating', size: 21}, 58 | {text: 'staleobjectstateexception', size: 21}, 59 | {text: 'updating', size: 21}, 60 | {text: 'sakaiiframemyworkspace', size: 21}, 61 | {text: 'memory', size: 20}, 62 | {text: 'collab', size: 20}, 63 | {text: 'code', size: 20}, 64 | {text: 'section', size: 20}, 65 | {text: 'question', size: 20}, 66 | {text: 'status', size: 20}, 67 | {text: 'production', size: 20}, 68 | {text: 'extending', size: 20}, 69 | {text: 'javaxsqlbasedatasource', size: 20}, 70 | {text: 'apis', size: 20}, 71 | {text: 'wiki', size: 20}, 72 | {text: 'using', size: 20}, 73 | {text: 'tests', size: 20}, 74 | {text: 'branch', size: 20}, 75 | {text: 'permissions', size: 20}, 76 | {text: 'support', size: 20}, 77 | {text: 'size', size: 20}, 78 | {text: 'page', size: 20}, 79 | {text: 'users', size: 20}, 80 | {text: 'sakaiperson', size: 20}, 81 | {text: 'database', size: 20}, 82 | {text: 'casfilter', size: 20}, 83 | {text: 'html', size: 20}, 84 | {text: 'editors', size: 20}, 85 | {text: 'reordering', size: 20}, 86 | {text: 'suppressing', size: 20}, 87 | {text: 'annoying', size: 20}, 88 | {text: 'macos', size: 20}, 89 | {text: 'limit', size: 20}, 90 | {text: 'exceeded', size: 20}, 91 | {text: 'without', size: 20}, 92 | {text: 'uploading', size: 20}, 93 | {text: 'documentation', size: 20}, 94 | {text: 'provider', size: 20}, 95 | {text: 'cannot', size: 20}, 96 | {text: 'development', size: 20}, 97 | {text: 'sakaiscript', size: 20}, 98 | {text: 'again', size: 20}, 99 | {text: 'assigning', size: 20}, 100 | {text: 'quota', size: 20} 101 | ]; 102 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gword.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import zlib 4 | import string 5 | 6 | conn = 
sqlite3.connect('index.sqlite') 7 | cur = conn.cursor() 8 | 9 | cur.execute('SELECT id, subject FROM Subjects') 10 | subjects = dict() 11 | for message_row in cur : 12 | subjects[message_row[0]] = message_row[1] 13 | 14 | # cur.execute('SELECT id, guid,sender_id,subject_id,headers,body FROM Messages') 15 | cur.execute('SELECT subject_id FROM Messages') 16 | counts = dict() 17 | for message_row in cur : 18 | text = subjects[message_row[0]] 19 | text = text.translate(str.maketrans('','',string.punctuation)) 20 | text = text.translate(str.maketrans('','','1234567890')) 21 | text = text.strip() 22 | text = text.lower() 23 | words = text.split() 24 | for word in words: 25 | if len(word) < 4 : continue 26 | counts[word] = counts.get(word,0) + 1 27 | 28 | x = sorted(counts, key=counts.get, reverse=True) 29 | highest = None 30 | lowest = None 31 | for k in x[:100]: 32 | if highest is None or highest < counts[k] : 33 | highest = counts[k] 34 | if lowest is None or lowest > counts[k] : 35 | lowest = counts[k] 36 | print('Range of counts:',highest,lowest) 37 | 38 | # Spread the font sizes across 20-100 based on the count 39 | bigsize = 80 40 | smallsize = 20 41 | 42 | fhand = open('gword.js','w') 43 | fhand.write("gword = [") 44 | first = True 45 | for k in x[:100]: 46 | if not first : fhand.write( ",\n") 47 | first = False 48 | size = counts[k] 49 | size = (size - lowest) / float(highest - lowest) 50 | size = int((size * bigsize) + smallsize) 51 | fhand.write("{text: '"+k+"', size: "+str(size)+"}") 52 | fhand.write( "\n];\n") 53 | fhand.close() 54 | 55 | print("Output written to gword.js") 56 | print("Open gword.htm in a browser to see the visualization") 57 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/gyear.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import urllib.request, urllib.parse, urllib.error 4 | import zlib 5 | 6 | conn = sqlite3.connect('index.sqlite') 7 | cur = conn.cursor() 8 | 9 | cur.execute('SELECT id, sender FROM Senders') 10 | senders = dict() 11 | for message_row in cur : 12 | senders[message_row[0]] = message_row[1] 13 | 14 | cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages') 15 | messages = dict() 16 | for message_row in cur : 17 | messages[message_row[0]] = (message_row[1],message_row[2],message_row[3],message_row[4]) 18 | 19 | print("Loaded messages=",len(messages),"senders=",len(senders)) 20 | 21 | sendorgs = dict() 22 | for (message_id, message) in list(messages.items()): 23 | sender = message[1] 24 | pieces = senders[sender].split("@") 25 | if len(pieces) != 2 : continue 26 | dns = pieces[1] 27 | sendorgs[dns] = sendorgs.get(dns,0) + 1 28 | 29 | # pick the top schools 30 | orgs = sorted(sendorgs, key=sendorgs.get, reverse=True) 31 | orgs = orgs[:10] 32 | print("Top 10 Organizations") 33 | print(orgs) 34 | # orgs = ['total'] + orgs 35 | 36 | counts = dict() 37 | months = list() 38 | # cur.execute('SELECT id, guid,sender_id,subject_id,sent_at FROM Messages') 39 | for (message_id, message) in list(messages.items()): 40 | sender = message[1] 41 | pieces = senders[sender].split("@") 42 | if len(pieces) != 2 : continue 43 | dns = pieces[1] 44 | if dns not in orgs : continue 45 | month = message[3][:4] # first four characters of sent_at are the year 46 | if month not in months : months.append(month) 47 | key = (month, dns) 48 | counts[key] = counts.get(key,0) + 1 49 | tkey = (month, 'total') 50 | 
counts[tkey] = counts.get(tkey,0) + 1 51 | 52 | months.sort() 53 | # print counts 54 | # print months 55 | 56 | fhand = open('gline.js','w') 57 | fhand.write("gline = [ ['Year'") 58 | for org in orgs: 59 | fhand.write(",'"+org+"'") 60 | fhand.write("]") 61 | 62 | for month in months[1:-1]: 63 | fhand.write(",\n['"+month+"'") 64 | for org in orgs: 65 | key = (month, org) 66 | val = counts.get(key,0) 67 | fhand.write(","+str(val)) 68 | fhand.write("]"); 69 | 70 | fhand.write("\n];\n") 71 | fhand.close() 72 | 73 | print("Output written to gline.js") 74 | print("Open gline.htm to visualize the data") 75 | 76 | -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/index.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/index.sqlite -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/index.sqlite.second.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/index.sqlite.second.jpg -------------------------------------------------------------------------------- /Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/mapping.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Course 5 - Capstone - Retrieving, processing and visualising data with Python/ex18/gmane/mapping.sqlite -------------------------------------------------------------------------------- /Python for Everybody.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/thmstm/py4e/1ac22d81d43c30f09afff4ad0b53296f3bcf45f1/Python for Everybody.pdf -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # py4e 2 | Coursera - Python for Everybody codes 3 | https://www.coursera.org/specializations/python 4 | --------------------------------------------------------------------------------