├── .DS_Store ├── 1-Programming-for-Everybody-Getting-Started-with-Python ├── .DS_Store ├── Assignment │ ├── Assignment_2.1.txt │ ├── Assignment_2.2.txt │ ├── Assignment_2.3.txt │ ├── Assignment_3.1.txt │ ├── Assignment_3.3.txt │ ├── Assignment_4.6.txt │ └── Assignment_5.2.txt └── Quiz │ ├── .DS_Store │ ├── Week 3 Chapter 1.txt │ ├── Week 4 Chapter 2.txt │ ├── Week 5 Chapter 3.txt │ ├── Week 6 Chapter 4.txt │ └── Week 7 Chapter 5.txt ├── 2-Python-Data-Structure ├── .DS_Store ├── Assignment │ ├── Assignment 10.2.txt │ ├── Assignment 6.5.txt │ ├── Assignment 7.1.txt │ ├── Assignment 7.2.txt │ ├── Assignment 8.4.txt │ ├── Assignment 8.5.txt │ └── Assignment 9.4.txt └── Quiz │ ├── .DS_Store │ ├── Week 1 Chapter 6.txt │ ├── Week 3 Chapter 7.txt │ ├── Week 4 Chapter 8.txt │ ├── Week 5 Chapter 9.txt │ └── Week 6 Chapter 10.txt ├── 3-Using-Python-To-Access_Web-Data ├── .DS_Store ├── Assignment │ ├── .DS_Store │ ├── Assignment 2 Extracting Data With Regular Expressions.txt │ ├── Assignment 3 Understanding the Request : Response Cycle.txt │ ├── Assignment 4.1 Scraping HTML Data with BeautifulSoup.txt │ ├── Assignment 4.2 Following Links in HTML Using BeautifulSoup.txt │ ├── Assignment 5 Extracting Data from XML.txt │ ├── Assignment 6.1 Extracting Data from JSON.txt │ └── Assignment 6.2 Using the GeoJSON API.txt └── Quiz │ ├── .DS_Store │ ├── Week 2 Regular Expressions.txt │ ├── Week 3 Networks and Sockets.txt │ ├── Week 4 Reading Web Data From Python.txt │ ├── Week 5 eXtensible Markup Language.txt │ └── Week 6 Rest, Json, and APIs.txt ├── 4-Using-Database-With_Python ├── .DS_Store ├── Assignment │ ├── .DS_Store │ ├── Week 2 Assignment 2.1 Our First Database.py │ ├── Week 2 Assignment 2.2 (Counting Email In database) │ │ ├── .mbox.txt.icloud │ │ ├── Week 2 Assignment 2.2 (Counting Email In database).py │ │ └── emaildb.sqlite │ ├── Week 3 Assignment Multi-Table Database - Tracks │ │ ├── Week 3 Multi-Table Database - Tracks.py │ │ ├── tracks.sqlite │ │ └── tracks │ │ │ ├── 
Library.xml │ │ │ ├── README.txt │ │ │ └── tracks.py │ ├── Week 4 Assignment Many Students in Many Courses │ │ ├── .DS_Store │ │ ├── Week 4 Assignment Many Students in Many Courses.py │ │ └── roster_data.json │ └── Week 5 Assignment Databases and Visualization (peer-graded) │ │ ├── .DS_Store │ │ ├── A.1.1. - Geoload running.PNG │ │ ├── A.1.2. - Geodump running.PNG │ │ ├── A.1.3. - My location.PNG │ │ └── geodata │ │ ├── README.txt │ │ ├── geodata.sqlite │ │ ├── geodump.py │ │ ├── geoload.py │ │ ├── where.data │ │ ├── where.html │ │ └── where.js └── Quiz │ ├── .DS_Store │ ├── Week 1.1 Using Encoded Data in Python 3.txt │ ├── Week 1.2 Object Oriented Programming-72.txt │ ├── Week 2 Single-Table SQL.txt │ ├── Week 3 Multi-Table Relational SQL.txt │ └── Week 4 Many-to-Many Relationships and Python.txt └── 5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python ├── .DS_Store ├── Assignment ├── .DS_Store ├── Assignment 1 pagerank │ ├── .DS_Store │ ├── .idea │ │ ├── inspectionProfiles │ │ │ └── Project_Default.xml │ │ ├── misc.xml │ │ ├── modules.xml │ │ ├── pagerank.iml │ │ └── workspace.xml │ ├── BeautifulSoup.py │ ├── BeautifulSoup.pyc │ ├── LICENSE │ ├── README.txt │ ├── d3.v2.js │ ├── force.css │ ├── force.html │ ├── force.js │ ├── outputImages │ │ ├── .DS_Store │ │ ├── force.png │ │ ├── force_oth.png │ │ ├── gmainC.png │ │ └── spdump.png │ ├── pageRank.txt │ ├── spdump.py │ ├── spider.js │ ├── spider.py │ ├── spider.sqlite │ ├── spjson.py │ ├── sprank.py │ └── spreset.py ├── Assignment 2 Spidering and Modeling Email Data │ ├── .DS_Store │ ├── OutputImages │ │ ├── gbasic.png │ │ ├── gline.png │ │ ├── gmain.png │ │ ├── gmodel.png │ │ └── gword.png │ ├── README.txt │ ├── content.sqlite │ ├── d3.layout.cloud.js │ ├── d3.v2.js │ ├── email .txt │ ├── gbasic.py │ ├── gline.htm │ ├── gline.js │ ├── gline.py │ ├── gline2.htm │ ├── gmane.py │ ├── gmodel.py │ ├── gword.htm │ ├── gword.js │ ├── gword.py │ ├── gyear.py │ ├── index.sqlite │ └── mapping.sqlite └── 
Assignment 3 │ ├── .DS_Store │ └── outputImages │ ├── .DS_Store │ ├── blob_serve (1).png │ ├── blob_serve (2).png │ ├── blob_serve (4).png │ └── blob_serve.png └── Quiz └── Week 1 Using Encoded Data in Python 3.txt /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/.DS_Store -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/1-Programming-for-Everybody-Getting-Started-with-Python/.DS_Store -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_2.1.txt: -------------------------------------------------------------------------------- 1 | """You can write any code you like in the window below. 
There are three files loaded and ready for you to open if you want to do file processing: "mbox-short.txt", "romeo.txt", and "words.txt".""" 2 | 3 | fh = open("words.txt", "r") 4 | 5 | count = 0 6 | for line in fh: 7 | print(line.strip()) 8 | count = count + 1 9 | 10 | print(count,"Lines") 11 | 12 | gh = open("romeo.txt", "r") 13 | 14 | count = 0 15 | for line in gh: 16 | print(line.strip()) 17 | count = count + 1 18 | 19 | print(count,"Lines") 20 | 21 | kh = open("mbox-short.txt", "r") 22 | 23 | count = 0 24 | for line in kh: 25 | print(line.strip()) 26 | count = count + 1 27 | 28 | print(count,"Lines") -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_2.2.txt: -------------------------------------------------------------------------------- 1 | #"""2.2 Write a program that uses input to prompt a user for their name and then welcomes them. Note that input will pop up a dialog box. Enter Sarah in the pop-up box when you are prompted so your output will match the desired output.""" 2 | 3 | name = input("Enter Your Name: ") 4 | print("Hello "+name) -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_2.3.txt: -------------------------------------------------------------------------------- 1 | #"""2.3 Write a program to prompt the user for hours and rate per hour using input to compute gross pay. Use 35 hours and a rate of 2.75 per hour to test the program (the pay should be 96.25). You should use input to read a string and float() to convert the string to a number. 
Do not worry about error checking or bad user data.""" 2 | 3 | hrs = input("Enter Hours: ") 4 | rate = input("Enter Rate: ") 5 | pay = float(hrs) * float(rate) 6 | print(pay) -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_3.1.txt: -------------------------------------------------------------------------------- 1 | #""" 3.1 Write a program to prompt the user for hours and rate per hour using input to compute gross pay. Pay the hourly rate for the hours up to 40 and 1.5 times the hourly rate for all hours worked above 40 hours. Use 45 hours and a rate of 10.50 per hour to test the program (the pay should be 498.75). You should use input to read a string and float() to convert the string to a number. Do not worry about error checking the user input - assume the user types numbers properly. 2 | Grade updated on server. """ 3 | 4 | 5 | def computepay(h,r): 6 | if h < 0 or r < 0: 7 | return None 8 | elif h > 40: 9 | return (40*r+(h-40)*1.5*r) 10 | else: 11 | return (h*r) 12 | 13 | try: 14 | hrs = input("Enter Hours:") 15 | hour = float(hrs) 16 | r = input("Enter Rate:") 17 | rate = float(r) 18 | p = computepay(hour,rate) 19 | print(p) 20 | except: 21 | print("Please enter a number") -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_3.3.txt: -------------------------------------------------------------------------------- 1 | #"""3.3 Write a program to prompt for a score between 0.0 and 1.0. If the score is out of range, print an error. 
If the score is between 0.0 and 1.0, print a grade using the following table: 2 | Score Grade 3 | >= 0.9 A 4 | >= 0.8 B 5 | >= 0.7 C 6 | >= 0.6 D 7 | < 0.6 F 8 | If the user enters a value out of range, print a suitable error message and exit. For the test, enter a score of 0.85.""" 9 | 10 | 11 | try: 12 | s = input("Enter score: ") 13 | score = float(s) 14 | if score > 1.0 or score < 0.0: 15 | print("value out of range") 16 | elif score >= 0.9: 17 | print("A") 18 | elif score >= 0.8: 19 | print("B") 20 | elif score >= 0.7: 21 | print("C") 22 | elif score >= 0.6: 23 | print("D") 24 | else: 25 | print("F") 26 | except: 27 | print("Error, please enter a numeric score") -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_4.6.txt: -------------------------------------------------------------------------------- 1 | #"""4.6 Write a program to prompt the user for hours and rate per hour using input to compute gross pay. Award time-and-a-half for the hourly rate for all hours worked above 40 hours. Put the logic to do the computation of time-and-a-half in a function called computepay() and use the function to do the computation. The function should return a value. Use 45 hours and a rate of 10.50 per hour to test the program (the pay should be 498.75). You should use input to read a string and float() to convert the string to a number. Do not worry about error checking the user input unless you want to - you can assume the user types numbers properly. 
Do not name your variable sum or use the sum() function.""" 2 | 3 | def computepay(h,r): 4 | if h < 0 or r < 0: 5 | return None 6 | elif h > 40: 7 | return (40*r+(h-40)*1.5*r) 8 | else: 9 | return (h*r) 10 | 11 | try: 12 | hrs = input("Enter Hours:") 13 | hour = float(hrs) 14 | r = input("please input your rate:") 15 | rate = float(r) 16 | p = computepay(hour,rate) 17 | print(p) 18 | except: 19 | print("Please,input your numberic") -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Assignment/Assignment_5.2.txt: -------------------------------------------------------------------------------- 1 | #"""5.2 Write a program that repeatedly prompts a user for integer numbers until the user enters 'done'. Once 'done' is entered, print out the largest and smallest of the numbers. If the user enters anything other than a valid number catch it with a try/except and put out an appropriate message and ignore the number. 
Enter 7, 2, bob, 10, and 4 and match the output below.""" 2 | 3 | largest = None 4 | smallest = None 5 | 6 | while True: 7 | inp = input("Enter a number: ") 8 | if inp == "done" : break 9 | try: 10 | num = float(inp) 11 | except: 12 | print("Invalid input") 13 | continue 14 | if smallest is None or num < smallest : 15 | smallest = num 16 | if largest is None or num > largest : 17 | largest = num 18 | 19 | def done(largest,smallest): 20 | print("Maximum is", int(largest)) 21 | print("Minimum is", int(smallest)) 22 | 23 | done(largest,smallest) -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/.DS_Store -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/Week 3 Chapter 1.txt: -------------------------------------------------------------------------------- 1 | 1.When Python is running in the interactive mode and displaying the chevron prompt (>>>) - what question is Python asking you? 2 | ==> What Python statement would you like me to run? 3 | 4 | 2.What will the following program print out: 5 | >>> x = 15 6 | >>> x = x + 5 7 | >>> print(x) 8 | ==> 20 9 | 10 | 3.Python scripts (files) have names that end with: 11 | ==>.py 12 | 13 | 4.Which of these words are reserved words in Python ? 14 | ==> 15 | — break 16 | — if 17 | 18 | 5.What is the proper way to say “good-bye” to Python? 19 | ==>quit() 20 | 21 | 6.Which of the parts of a computer actually executes the program instructions? 22 | ==> Central Processing Unit 23 | 24 | 7.What is "code" in the context of this course? 
25 | ==> A sequence of instructions in a programming language 26 | 27 | 8.A USB memory stick is an example of which of the following components of computer architecture? 28 | ==> Secondary Memory 29 | 30 | 9.What is the best way to think about a "Syntax Error" while programming? 31 | ==> The computer did not understand the statement that you entered 32 | 33 | 10.Which of the following is not one of the programming patterns covered in Chapter 1? 34 | ==> Random steps -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/Week 4 Chapter 2.txt: -------------------------------------------------------------------------------- 1 | 1.Which of the following is a comment in Python? 2 | ==> # This is a test 3 | 4 | 2.What does the following code print out? 5 | print("123" + "abc") 6 | ==> 123abc 7 | 8 | 3.Which of the following variables is the "most mnemonic"? 9 | ==> hours 10 | 11 | 4.Which of the following is not a Python reserved word? 12 | ==> spam 13 | 14 | 5.Assume the variable x has been initialized to an integer value (e.g., x = 3). What does the following statement do? 15 | x = x + 2 16 | ==> Retrieve the current value for x, add two to it, and put the sum back into x 17 | 18 | 6.Which of the following elements of a mathematical expression in Python is evaluated first? 19 | ==> Parentheses ( ) 20 | 21 | 7.What is the value of the following expression 22 | 42 % 10 23 | ==> 2 24 | 25 | 8.What will be the value of x after the following statement executes: 26 | x = 1 + 2 * 3 - 8 / 4 27 | ==> 5.0 28 | 29 | 9.What will be the value of x when the following statement is executed: 30 | x = int(98.6) 31 | ==> 98 32 | 33 | 10.What does the Python input() function do? 34 | ==> Pause the program and read data from the user 35 | 36 | 11.In the following code, print(98.6) What is “98.6”? 37 | ==> A constant 38 | 39 | 12.Which of the following is a bad Python variable name? 
40 | ==> spam.23 -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/Week 5 Chapter 3.txt: -------------------------------------------------------------------------------- 1 | 1.What do we do to a Python statement that is immediately after an if statement to indicate that the statement is to be executed only when the if statement is true? 2 | ==> Indent the line below the if statement 3 | 4 | 2.Which of these operators is not a comparison / logical operator? 5 | ==> = 6 | 7 | 3.What is true about the following code segment: 8 | if x == 5 : 9 | print('Is 5') 10 | print('Is Still 5') 11 | print('Third 5') 12 | ==> Depending on the value of x, either all three of the print statements will execute or none of the statements will execute 13 | 14 | 4.When you have multiple lines in an if block, how do you indicate the end of the if block? 15 | ==> You de-indent the next line past the if block to the same level of indent as the original if statement 16 | 17 | 5.You look at the following text: 18 | if x == 6 : 19 | print('Is 6') 20 | print('Is Still 6') 21 | print('Third 6') 22 | It looks perfect but Python is giving you an 'Indentation Error' on the second print statement. What is the most likely reason? 23 | ==> You have mixed tabs and spaces in the file 24 | 25 | 6.What is the Python reserved word that we use in two-way if tests to indicate the block of code that is to be executed if the logical test is false? 26 | ==>else 27 | 28 | 7.What will the following code print out? 29 | x = 0 30 | if x < 2 : 31 | print('Small') 32 | elif x < 10 : 33 | print('Medium') 34 | else : 35 | print('LARGE') 36 | print('All done') 37 | ==> Small 38 | All done 39 | 40 | 8.For the following code, 41 | if x < 2 : 42 | print('Below 2') 43 | elif x >= 2 : 44 | print('Two or more') 45 | else : 46 | print('Something else') 47 | What value of 'x' will cause 'Something else' to print out? 
48 | ==>This code will never print 'Something else' regardless of the value for 'x' 49 | 50 | 9.In the following code (numbers added) - which will be the last line to execute successfully? 51 | (1) astr = 'Hello Bob' 52 | (2) istr = int(astr) 53 | (3) print('First', istr) 54 | (4) astr = '123' 55 | (5) istr = int(astr) 56 | (6) print('Second', istr) 57 | ==> 1 58 | 59 | 10.For the following code: 60 | astr = 'Hello Bob' 61 | istr = 0 62 | try: 63 | istr = int(astr) 64 | except: 65 | istr = -1 66 | What will the value be for istr after this code executes? 67 | ==>-1 -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/Week 6 Chapter 4.txt: -------------------------------------------------------------------------------- 1 | 1.Which Python keyword indicates the start of a function definition? 2 | ==> def 3 | 4 | 2.In Python, how do you indicate the end of the block of code that makes up the function? 5 | ==> You de-indent a line of code to the same indent level as the def keyword 6 | 7 | 3.In Python what is the input() feature best described as? 8 | ==> A built-in function 9 | 10 | 4.What does the following code print out? 11 | def thing(): 12 | print('Hello') 13 | 14 | print('There') 15 | ==> There 16 | 17 | 5.In the following Python code, which of the following is an "argument" to a function? 18 | x = 'banana' 19 | y = max(x) 20 | print(y) 21 | ==> x 22 | 23 | 6.What will the following Python code print out? 24 | def func(x) : 25 | print(x) 26 | 27 | func(10) 28 | func(20) 29 | ==> 10 30 | 20 31 | 32 | 7.Which line of the following Python program will never execute? 33 | def stuff(): 34 | print('Hello') 35 | return 36 | print('World') 37 | 38 | stuff() 39 | ==> print ('World') 40 | 41 | 8.What will the following Python program print out? 
42 | def greet(lang): 43 | if lang == 'es': 44 | return 'Hola' 45 | elif lang == 'fr': 46 | return 'Bonjour' 47 | else: 48 | return 'Hello' 49 | 50 | print(greet('fr'),'Michael') 51 | ==>Bonjour Michael 52 | 53 | 9.What does the following Python code print out? (Note that this is a bit of a trick question and the code has what many would consider to be a flaw/bug - so read carefully). 54 | def addtwo(a, b): 55 | added = a + b 56 | return a 57 | 58 | x = addtwo(2, 7) 59 | print(x) 60 | ==>2 61 | 62 | 10.What is the most important benefit of writing your own functions? 63 | ==>Avoiding writing the same non-trivial code more than once in your program -------------------------------------------------------------------------------- /1-Programming-for-Everybody-Getting-Started-with-Python/Quiz/Week 7 Chapter 5.txt: -------------------------------------------------------------------------------- 1 | 1.What is wrong with this Python loop: 2 | n = 5 3 | while n > 0 : 4 | print(n) 5 | print('All done') 6 | ==> This loop will run forever 7 | 8 | 2.What does the break statement do? 9 | ==> Exits the currently executing loop 10 | 11 | 3.What does the continue statement do? 12 | ==> Jumps to the "top" of the loop and starts the next iteration 13 | 14 | 4.What does the following Python program print out? 15 | tot = 0 16 | for i in [5, 4, 3, 2, 1] : 17 | tot = tot + 1 18 | print(tot) 19 | ==> 5 20 | 21 | 5.What is the iteration variable in the following Python code: 22 | friends = ['Joseph', 'Glenn', 'Sally'] 23 | for friend in friends : 24 | print('Happy New Year:', friend) 25 | print('Done!') 26 | ==> friend 27 | 28 | 6.What is a good description of the following bit of Python code? 29 | zork = 0 30 | for thing in [9, 41, 12, 3, 74, 15] : 31 | zork = zork + thing 32 | print('After', zork) 33 | ==> Sum all the elements of a list 34 | 35 | 7.What will the following code print out? 
36 | smallest_so_far = -1 37 | for the_num in [9, 41, 12, 3, 74, 15] : 38 | if the_num < smallest_so_far : 39 | smallest_so_far = the_num 40 | print(smallest_so_far) 41 | ==> -1 42 | 43 | 8.What is a good statement to describe the is operator as used in the following if statement: 44 | if smallest is None : 45 | smallest = value 46 | ==> matches both type and value 47 | 48 | 9.Which reserved word indicates the start of an "indefinite" loop in Python? 49 | ==> while 50 | 51 | 10.How many times will the body of the following loop be executed? 52 | n = 0 53 | while n > 0 : 54 | print('Lather') 55 | print('Rinse') 56 | print('Dry off!') 57 | ==> 0 -------------------------------------------------------------------------------- /2-Python-Data-Structure/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/2-Python-Data-Structure/.DS_Store -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 10.2.txt: -------------------------------------------------------------------------------- 1 | #"""10.2 Write a program to read through the mbox-short.txt and figure out the distribution by hour of the day for each of the messages. You can pull the hour out from the 'From ' line by finding the time and then splitting the string a second time using a colon. 
2 | From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 3 | Once you have accumulated the counts for each hour, print out the counts, sorted by hour as shown below.""" 4 | 5 | 6 | #Use mbox-short.txt File name 7 | 8 | name = input("Enter file:") 9 | f = open(name) 10 | dic = {} 11 | for i in f: 12 | if i.startswith("From") and len(i.split()) > 2: 13 | line = i.split() 14 | if line[5][:2] not in dic: 15 | dic[line[5][:2]] = 1 16 | else: 17 | dic[line[5][:2]] += 1 18 | 19 | key = sorted(dic) 20 | for i in key: 21 | print(i, dic[i]) -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 6.5.txt: -------------------------------------------------------------------------------- 1 | #"""6.5 Write code using find() and string slicing (see section 6.10) to extract the number at the end of the line below. Convert the extracted value to a floating point number and print it out.""" 2 | 3 | 4 | 5 | text = "X-DSPAM-Confidence: 0.8475" 6 | 7 | spacePos = text.find(" ") 8 | number = text[spacePos:] 9 | #not really necessary but since we are just learning and playing 10 | strippedNumber = number.lstrip() 11 | result = float(strippedNumber) 12 | 13 | def reprint(printed): 14 | print(printed) 15 | 16 | reprint(result) 17 | 18 | -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 7.1.txt: -------------------------------------------------------------------------------- 1 | #"""7.1 Write a program that prompts for a file name, then opens that file and reads through the file, and print the contents of the file in upper case. Use the file words.txt to produce the output below. 
2 | You can download the sample data at http://www.py4e.com/code3/words.txt""" 3 | 4 | 5 | # Use words.txt as the file name
 6 | fname = input("Enter file name: ") 7 | fh = open(fname) 8 | inp = fh.read() 9 | print(inp.rstrip().upper()) -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 7.2.txt: -------------------------------------------------------------------------------- 1 | #"""7.2 Write a program that prompts for a file name, then opens that file and reads through the file, looking for lines of the form: 2 | X-DSPAM-Confidence: 0.8475 3 | Count these lines and extract the floating point values from each of the lines and compute the average of those values and produce an output as shown below. Do not use the sum() function or a variable named sum in your solution. 4 | You can download the sample data at http://www.py4e.com/code3/mbox-short.txt when you are testing below enter mbox-short.txt as the file name.""" 5 | 6 | 7 | 8 | # Use the file name mbox-short.txt as the file name 9 | fname = input("Enter file name: ") 10 | fh = open(fname) 11 | count = 0 12 | s = 0 13 | for line in fh: 14 | if not line.startswith("X-DSPAM-Confidence:") : 15 | continue 16 | pos = line.find('0') 17 | s += float(line[pos:pos+6]) 18 | count += 1 19 | average = s / count 20 | print("Average spam confidence:", average) 21 | 22 | -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 8.4.txt: -------------------------------------------------------------------------------- 1 | #"""8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. 
When the program completes, sort and print the resulting words in alphabetical order. 2 | You can download the sample data at http://www.py4e.com/code3/romeo.txt""" 3 | 4 | 5 | # File name is "romeo.txt" 6 | fajl = input("Enter file name: ") 7 | fajlOpen = open(fajl) 8 | listica = [] 9 | linije = [line.split() for line in fajlOpen] 10 | for i in linije: 11 | for j in i: 12 | if j not in listica: 13 | listica.append(j) 14 | listica.sort() 15 | print(listica) -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 8.5.txt: -------------------------------------------------------------------------------- 1 | #"""8.5 Open the file mbox-short.txt and read it line by line. When you find a line that starts with 'From ' like the following line: 2 | From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 3 | You will parse the From line using split() and print out the second word in the line (i.e. the entire address of the person who sent the message). Then print out a count at the end. 4 | Hint: make sure not to include the lines that start with 'From:'. 5 | 6 | You can download the sample data at http://www.py4e.com/code3/mbox-short.txt""" 7 | 8 | 9 | 10 | #Use mbox-short.txt as File Name 11 | 12 | fname = input("Enter file name: ") 13 | 14 | tekst = open(fname) 15 | count = 0 16 | for linija in tekst: 17 | if linija.startswith("From "): 18 | rijeci = linija.rstrip().split() 19 | email = rijeci[1] 20 | print(email) 21 | count +=1 22 | else: 23 | continue 24 | 25 | print("There were", count, "lines in the file with From as the first word") -------------------------------------------------------------------------------- /2-Python-Data-Structure/Assignment/Assignment 9.4.txt: -------------------------------------------------------------------------------- 1 | #"""9.4 Write a program to read through the mbox-short.txt and figure out who has sent the greatest number of mail messages. 
The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.""" 2 | 3 | 4 | 5 | #Use mbox-short.txt as File name 6 | name = input("Enter file:") 7 | tekst = open(name) 8 | dic = {} 9 | 10 | for lines in tekst: 11 | if lines.startswith("From "): 12 | words = lines.split() 13 | email = words[1] 14 | dic[email] = dic.get(email, 0)+1 15 | 16 | i = None 17 | j = None 18 | 19 | for k, v in dic.items(): 20 | if j is None or j < v: 21 | j = v 22 | i = k 23 | print(i, j) -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/2-Python-Data-Structure/Quiz/.DS_Store -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/Week 1 Chapter 6.txt: -------------------------------------------------------------------------------- 1 | 1.What does the following Python Program print out? 2 | str1 = "Hello" 3 | str2 = 'there' 4 | bob = str1 + str2 5 | print(bob) 6 | ==>Hellothere 7 | 8 | 2.What does the following Python program print out? 9 | x = '40' 10 | y = int(x) + 2 11 | print(y) 12 | ==>42 13 | 14 | 3.How would you use the index operator [] to print out the letter q from the following string? 15 | x = 'From marquard@uct.ac.za' 16 | ==>print(x[8]) 17 | 18 | 4.How would you use string slicing [:] to print out 'uct' from the following string? 
19 | x = 'From marquard@uct.ac.za' 20 | ==>print(x[14:17]) 21 | 22 | 5.What is the iteration variable in the following Python code? 23 | for letter in 'banana' : 24 | print(letter) 25 | ==>letter 26 | 27 | 6.What does the following Python code print out? 28 | print(len('banana')*7) 29 | ==>42 30 | 31 | 7.How would you print out the following variable in all upper case in Python? 32 | greet = 'Hello Bob' 33 | ==>print(greet.upper()) 34 | 35 | 8.Which of the following is not a valid string method in Python? 36 | ==>boldface() 37 | 38 | 9.What will the following Python code print out? 39 | data = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008' 40 | pos = data.find('.') 41 | print(data[pos:pos+3]) 42 | ==>.ma 43 | 44 | 10.Which of the following string methods removes whitespace from both the beginning and end of a string? 45 | ==>strip() -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/Week 3 Chapter 7.txt: -------------------------------------------------------------------------------- 1 | 1.Given the architecture and terminology we introduced in Chapter 1, where are files stored? 2 | ==>Secondary memory 3 | 4 | 2.What is stored in a "file handle" that is returned from a successful open() call? 5 | ==> The handle is a connection to the file's data 6 | 7 | 3.What do we use the second parameter of the open() call to indicate? 8 | ==> Whether we want to read data from the file or write data to the file 9 | 10 | 4.What Python function would you use if you wanted to prompt the user for a file name to open? 11 | ==>input() 12 | 13 | 5.What is the purpose of the newline character in text files? 14 | ==>It indicates the end of one line of text and the beginning of another line of text 15 | 16 | 6.If we open a file as follows: xfile = open('mbox.txt'). What statement would we use to read the file one line at a time? 
17 | ==>for line in xfile: 18 | 19 | 7.What is the purpose of the following Python code? fhand = open('mbox.txt'); x = 0; for line in fhand: x = x + 1; print x 20 | ==> Count the lines in the file 'mbox.txt' 21 | 22 | 8.If you write a Python program to read a text file and you see extra blank lines in the output that are not present in the file input as shown below, what Python string function will likely solve the problem?. 23 | From: stephen.marquard@uct.ac.za; 24 | From: louis@media.berkeley.edu; 25 | From: zqian@umich.edu; 26 | From: rjlowe@iupui.edu ... 27 | ==> rstrip() 28 | 29 | 9.The following code sequence fails with a traceback when the user enters a file that does not exist. How would you avoid the traceback and make it so you could print out your own error message when a bad file name was entered? 30 | fname = raw_input('Enter the file name: '); 31 | fhand = open(fname) 32 | ==> try / except 33 | 34 | 10.What does the following Python code do? 35 | fhand = open('mbox-short.txt'); 36 | inp = fhand.read() 37 | ==>Reads the entire file into the variable inp as a string 38 | -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/Week 4 Chapter 8.txt: -------------------------------------------------------------------------------- 1 | 1.How are "collection" variables different from normal variables? 2 | ==> Collection variables can store multiple values in a single variable 3 | 4 | 2.What are the Python keywords used to construct a loop to iterate through a list? 5 | ==> for/in 6 | 3.For the following list, how would you print out 'Sally'? 7 | friends = [ 'Joseph', 'Glenn', 'Sally'] 8 | ==> print(friends[2]) 9 | 4. fruit = 'Banana' 10 | fruit[0] = 'b'; 11 | print fruit 12 | ==> Nothing would print the program fails with a traceback 13 | 14 | 5.Which of the following Python statements would print out the length of a list stored in the variable data? 
15 | ==> print(len(data)) 16 | 17 | 6.What type of data is produced when you call the range() function? 18 | x = range(5) 19 | ==> A list of integers 20 | 21 | 7.What does the following Python code print out? 22 | a = [1, 2, 3]; 23 | b = [4, 5, 6]; 24 | c = a + b; 25 | print(len(c)) 26 | ==> 6 27 | 28 | 8.Which of the following slicing operations will produce the list [12, 3]? 29 | t = [9, 41, 12, 3, 74, 15] 30 | ==> t[2:4] 31 | 32 | 9.What list method adds a new item to the end of an existing list? 33 | ==> append() 34 | 35 | 10.What will the following Python code print out? 36 | friends = [ 'Joseph', 'Glenn', 'Sally' ]; 37 | friends.sort(); 38 | print(friends[0]) 39 | ==> Glenn -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/Week 5 Chapter 9.txt: -------------------------------------------------------------------------------- 1 | 1.How are Python dictionaries different from Python lists? 2 | ==> Python lists are indexed using integers and dictionaries can use strings as indexes 3 | 4 | 2.What is a term commonly used to describe the Python dictionary feature in other programming languages? 5 | ==> Associative arrays 6 | 7 | 3.What would the following Python code print out? 8 | stuff = dict(); 9 | print(stuff['candy']) 10 | ==> The program would fail with a traceback 11 | 12 | 4.What would the following Python code print out? stuff = dict(); print stuff.get('candy',-1) 13 | ==> -1 14 | 15 | 5.(T/F)When you add items to a dictionary they remain in the order in which you added them. 16 | ==> False 17 | 18 | 6.What is a common use of Python dictionaries in a program? 19 | ==> Building a histogram counting the occurrences of various strings in a file 20 | 21 | 7.Which of the following lines of Python is equivalent to the following sequence of statements assuming that counts is a dictionary? 
22 | if key in counts: 23 | counts[key] = counts[key] + 1 24 | else: 25 | counts[key] = 1 26 | ==> counts[key] = counts.get(key,0) + 1 27 | 28 | 8.In the following Python, what does the for loop iterate through? 29 | x = dict() ... 30 | for y in x : ... 31 | ==> It loops through the keys in the dictionary 32 | 33 | 9.Which method in a dictionary object gives you a list of the values in the dictionary? 34 | ==> values() 35 | 36 | 10.What is the purpose of the second parameter of the get() method for Python dictionaries? 37 | ==> To provide a default value if the key is not found -------------------------------------------------------------------------------- /2-Python-Data-Structure/Quiz/Week 6 Chapter 10.txt: -------------------------------------------------------------------------------- 1 | 1.What is the difference between a Python tuple and Python list? 2 | ==> Lists are mutable and tuples are not mutable 3 | 4 | 2.Which of the following methods work both in Python lists and Python tuples? 5 | ==> index() 6 | 7 | 3.What will end up in the variable y after this code is executed? 8 | x , y = 3, 4 9 | ==> 4 10 | 11 | 4.In the following Python code, what will end up in the variable y? 12 | x = { 'chuck' : 1 , 'fred' : 42, 'jan': 100}; 13 | y = x.items() 14 | ==> A list of tuples 15 | 16 | 5.Which of the following tuples is greater than x in the following Python sequence? 17 | x = (5, 1, 3); 18 | if ??? > x : 19 | ... 20 | ==> (6, 0, 0) 21 | 22 | 6.What does the following Python code accomplish, assuming the c is a non-empty dictionary? 23 | tmp = list(); 24 | for k, v in c.items(): 25 | tmp.append( (v, k)) 26 | ==> It creates a list of tuples where each tuple is a value, key pair 27 | 28 | 7.If the variable data is a Python list, how do we sort it in reverse order? 29 | ==> data.sort(reverse=True) 30 | 31 | 8.Using the following tuple, how would you print 'Wed'? 
32 | days = ('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun') 33 | ==> print(days[2]) 34 | 35 | 9.In the following Python loop, why are there two iteration variables (k and v)? 36 | c = {'a':10, 'b':1, 'c':22}; 37 | for k, v in c.items() : 38 | ... 39 | ==> Because the items() method in dictionaries returns a list of tuples 40 | 41 | 10.Given that Python lists and Python tuples are quite similar - when might you prefer to use a tuple over a list? 42 | ==> For a temporary variable that you will use and discard without modifying -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/3-Using-Python-To-Access_Web-Data/.DS_Store -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/3-Using-Python-To-Access_Web-Data/Assignment/.DS_Store -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/Assignment 2 Extracting Data With Regular Expressions.txt: -------------------------------------------------------------------------------- 1 | #"""Finding Numbers in a Haystack 2 | 3 | In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers. 4 | 5 | Data Files 6 | We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment. 
7 | 8 | Sample data: http://py4e-data.dr-chuck.net/regex_sum_42.txt (There are 90 values with a sum=445833) 9 | Actual data: http://py4e-data.dr-chuck.net/regex_sum_97406.txt (There are 67 values and the sum ends with 785) 10 | These links open in a new window. Make sure to save the file into the same folder as you will be writing your Python program. Note: Each student will have a distinct data file for the assignment - so only use your own data file for analysis.""" 11 | 12 | 13 | 14 | #Answer of this Question is 15 | #305785 16 | #copy all content from "http://py4e-data.dr-chuck.net/regex_sum_97406.txt" into a text file named regex_sum_97406.txt, then run the following code 17 | 18 | 19 | 20 | import re 21 | 22 | sum = 0 23 | 24 | file = open('regex_sum_97406.txt', 'r') 25 | for line in file: 26 | numbers = re.findall('[0-9]+', line) 27 | if not numbers: 28 | continue 29 | else: 30 | for number in numbers: 31 | sum += int(number) 32 | 33 | print(sum) 34 | 35 | 36 | -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/Assignment 3 Understanding the Request : Response Cycle.txt: -------------------------------------------------------------------------------- 1 | #"""Exploring the HyperText Transport Protocol 2 | 3 | You are to retrieve the following document using the HTTP protocol in a way that you can examine the HTTP Response headers. 4 | 5 | http://data.pr4e.org/intro-short.txt 6 | There are three ways that you might retrieve this web page and look at the response headers: 7 | 8 | Preferred: Modify the socket1.py program to retrieve the above URL and print out the headers and data. Make sure to change the code to retrieve the above URL - the values are different for each URL. 9 | Open the URL in a web browser with a developer console or FireBug and manually examine the headers that are returned. 10 | Use the telnet program as shown in lecture to retrieve the headers and content.
11 | Enter the header values in each of the fields below and press "Submit".""" 12 | 13 | 14 | #Server: Apache/2.4.18 (Ubuntu) 15 | #Last-Modified: Sat, 13 May 2017 11:22:22 GMT 16 | #ETag: "1d3-54f6609240717" 17 | #Content-Length: 467 18 | #Cache-Control: max-age=0, no-cache, no-store, must-revalidate 19 | #Content-Type: text/plain 20 | 21 | 22 | import socket 23 | 24 | mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 25 | mysock.connect(('data.pr4e.org', 80)) 26 | # cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode() 27 | 28 | cmd = 'GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n'.encode() 29 | 30 | mysock.send(cmd) 31 | 32 | while True: 33 | data = mysock.recv(512) 34 | if (len(data) < 1): 35 | break 36 | print(data.decode(),end='') 37 | 38 | mysock.close() 39 | -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/Assignment 4.1 Scraping HTML Data with BeautifulSoup.txt: -------------------------------------------------------------------------------- 1 | #"""Scraping Numbers from HTML using BeautifulSoup In this assignment you will write a Python program similar to http://www.py4e.com/code3/urllink2.py. The program will use urllib to read the HTML from the data files below, and parse the data, extracting numbers and compute the sum of the numbers in the file. 2 | 3 | We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment. 4 | 5 | Sample data: http://py4e-data.dr-chuck.net/comments_42.html (Sum=2553) 6 | Actual data: http://py4e-data.dr-chuck.net/comments_97408.html (Sum ends with 93) 7 | You do not need to save these files to your folder since your program will read the data directly from the URL. 
Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.""" 8 | 9 | 10 | #Enter the url to scrape - http://py4e-data.dr-chuck.net/comments_97408.html 11 | #Count 50 12 | #Sum 2893 13 | 14 | 15 | import urllib.request as ur 16 | from bs4 import BeautifulSoup 17 | 18 | url = input('Enter the url to scrape - ') 19 | 20 | html = ur.urlopen(url).read() 21 | soup = BeautifulSoup(html, 'html.parser') 22 | 23 | count_of_spans = 0 24 | sum = 0 25 | 26 | spans = soup('span') 27 | for span in spans: 28 | sum += int(span.contents[0]) 29 | count_of_spans += 1 30 | 31 | print('Count ', count_of_spans) 32 | print('Sum ', sum) 33 | -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/Assignment 4.2 Following Links in HTML Using BeautifulSoup.txt: -------------------------------------------------------------------------------- 1 | #"""Following Links in Python 2 | 3 | In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a number of times and report the last name you find. 4 | 5 | We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment. 6 | 7 | Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html 8 | Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
9 | Sequence of names: Fikret Montgomery Mhairade Butchi Anayah 10 | Last name in sequence: Anayah 11 | Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Annick.html 12 | Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve. 13 | Hint: The first character of the name of the last page that you will load is: M""" 14 | 15 | 16 | # Enter URL: http://py4e-data.dr-chuck.net/known_by_Annick.html 17 | # Enter count: 7 18 | # Enter position: 18 19 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Annick.html 20 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Nicki.html 21 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Peebles.html 22 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Chantelle.html 23 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Kamila.html 24 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Domenico.html 25 | # Retrieving: http://py4e-data.dr-chuck.net/known_by_Nassir.html 26 | # Last Url: http://py4e-data.dr-chuck.net/known_by_Mhea.html 27 | 28 | ###########Final Answer is: 29 | #Name: Mhea 30 | 31 | import urllib.request as ur 32 | from bs4 import * 33 | 34 | current_repeat_count = 0 35 | url = input('Enter URL: ') 36 | repeat_count = int(input('Enter count: ')) 37 | position = int(input('Enter position: ')) 38 | 39 | 40 | def parse_html(url): 41 | html = ur.urlopen(url).read() 42 | soup = BeautifulSoup(html, 'html.parser') 43 | tags = soup('a') 44 | return tags 45 | 46 | while current_repeat_count < repeat_count: 47 | print('Retrieving: ', url) 48 | tags = parse_html(url) 49 | for index, item in enumerate(tags): 50 | if index == position - 1: 51 | url = item.get('href', None) 52 | name = item.contents[0] 53 | break 54 | else: 55 | continue 56 | current_repeat_count += 1 57 | print('Last Url: ', url) -------------------------------------------------------------------------------- 
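The enumerate loop in the solution above simply selects the list element at position - 1; since soup('a') returns an ordinary list, the same selection logic can be exercised offline. A minimal sketch using only the standard library's html.parser on an inline HTML sample (the LinkCollector class and example.com URLs are hypothetical, invented for illustration, not part of the assignment):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from anchor tags, much like soup('a') does."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

html = ('<a href="http://example.com/known_by_A.html">A</a>'
        '<a href="http://example.com/known_by_B.html">B</a>'
        '<a href="http://example.com/known_by_C.html">C</a>')

parser = LinkCollector()
parser.feed(html)

position = 2  # 1-based, as in the assignment
print(parser.links[position - 1])  # http://example.com/known_by_B.html
```

In the real assignment the selected href would then be fetched with urllib and the selection repeated count times.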
/3-Using-Python-To-Access_Web-Data/Assignment/Assignment 5 Extracting Data from XML.txt: -------------------------------------------------------------------------------- 1 | #"""Extracting Data from XML 2 | 3 | In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geoxml.py. The program will prompt for a URL, read the XML data from that URL using urllib and then parse and extract the comment counts from the XML data, compute the sum of the numbers in the file. 4 | 5 | We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment. 6 | 7 | Sample data: http://py4e-data.dr-chuck.net/comments_42.xml (Sum=2553) 8 | Actual data: http://py4e-data.dr-chuck.net/comments_97410.xml (Sum ends with 59) 9 | You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.""" 10 | 11 | 12 | 13 | 14 | #Enter location: http://py4e-data.dr-chuck.net/comments_97410.xml 15 | #Retrieving http://py4e-data.dr-chuck.net/comments_97410.xml 16 | #Retrieved 4220 characters 17 | #Count: 50 18 | #Sum: 2259 19 | 20 | 21 | import urllib.request as ur 22 | import xml.etree.ElementTree as et 23 | 24 | url = input('Enter location: ') 25 | # 'http://python-data.dr-chuck.net/comments_42.xml' 26 | 27 | total_number = 0 28 | sum = 0 29 | 30 | print('Retrieving', url) 31 | xml = ur.urlopen(url).read() 32 | print('Retrieved', len(xml), 'characters') 33 | 34 | tree = et.fromstring(xml) 35 | counts = tree.findall('.//count') 36 | for count in counts: 37 | sum += int(count.text) 38 | total_number += 1 39 | 40 | print('Count:', total_number) 41 | print('Sum:', sum) -------------------------------------------------------------------------------- 
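The findall('.//count') extraction in the XML solution above can be tried offline on an inline sample. A short sketch assuming the feed's commentinfo/comments/comment/count structure, with made-up names and counts:

```python
import xml.etree.ElementTree as et

# Tiny inline sample shaped like the assignment's XML feed (assumed structure).
xml_data = '''<commentinfo>
  <comments>
    <comment><name>Romina</name><count>97</count></comment>
    <comment><name>Laurie</name><count>61</count></comment>
  </comments>
</commentinfo>'''

tree = et.fromstring(xml_data)
# './/count' matches every <count> element at any depth, as in the solution above.
counts = [int(c.text) for c in tree.findall('.//count')]
print('Count:', len(counts))  # Count: 2
print('Sum:', sum(counts))    # Sum: 158
```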
/3-Using-Python-To-Access_Web-Data/Assignment/Assignment 6.1 Extracting Data from JSON.txt: -------------------------------------------------------------------------------- 1 | #"""Extracting Data from JSON 2 | 3 | In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/json2.py. The program will prompt for a URL, read the JSON data from that URL using urllib and then parse and extract the comment counts from the JSON data, compute the sum of the numbers in the file and enter the sum below: 4 | We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment. 5 | 6 | Sample data: http://py4e-data.dr-chuck.net/comments_42.json (Sum=2553) 7 | Actual data: http://py4e-data.dr-chuck.net/comments_97411.json (Sum ends with 65) 8 | You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis. 
9 | """ 10 | 11 | 12 | #Enter location: http://py4e-data.dr-chuck.net/comments_97411.json 13 | #Retrieving http://py4e-data.dr-chuck.net/comments_97411.json 14 | #Retrieved 2711 characters 15 | #Count: 50 16 | #Sum: 2365 17 | 18 | 19 | 20 | import urllib.request as ur 21 | import json 22 | 23 | # json_url = 'http://python-data.dr-chuck.net/comments_42.json' 24 | 25 | json_url = input("Enter location: ") 26 | print("Retrieving ", json_url) 27 | data = ur.urlopen(json_url).read().decode('utf-8') 28 | print('Retrieved', len(data), 'characters') 29 | json_obj = json.loads(data) 30 | 31 | sum = 0 32 | total_number = 0 33 | 34 | for comment in json_obj["comments"]: 35 | sum += int(comment["count"]) 36 | total_number += 1 37 | 38 | print('Count:', total_number) 39 | print('Sum:', sum) -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Assignment/Assignment 6.2 Using the GeoJSON API.txt: -------------------------------------------------------------------------------- 1 | #"""Calling a JSON API 2 | 3 | In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geojson.py. The program will prompt for a location, contact a web service and retrieve JSON for the web service and parse that data, and retrieve the first place_id from the JSON. A place ID is a textual identifier that uniquely identifies a place as within Google Maps. 4 | API End Points 5 | 6 | To complete this assignment, you should use this API endpoint that has a static subset of the Google Data: 7 | 8 | http://py4e-data.dr-chuck.net/geojson? 9 | This API uses the same parameter (address) as the Google API. This API also has no rate limit so you can test as often as you like. If you visit the URL with no parameters, you get a list of all of the address values which can be used with this API. 
10 | To call the API, you need to provide the address that you are requesting as the address= parameter that is properly URL encoded using the urllib.parse.urlencode() function as shown in http://www.py4e.com/code3/geojson.py""" 11 | 12 | 13 | 14 | #Enter location: University of Twente 15 | #Retrieving http://python-data.dr-chuck.net/geojson?sensor=false&address=University+of+Twente 16 | #Retrieved 2124 characters 17 | #Place id ChIJPZ9qp0tvv4cRb5oLVI9wra8 18 | 19 | 20 | 21 | import urllib.request as ur 22 | import urllib.parse as up 23 | import json 24 | 25 | serviceurl = "http://python-data.dr-chuck.net/geojson?" 26 | 27 | address_input = input("Enter location: ") 28 | params = {"sensor": "false", "address": address_input} 29 | url = serviceurl + up.urlencode(params) 30 | print("Retrieving ", url) 31 | data = ur.urlopen(url).read().decode('utf-8') 32 | print('Retrieved', len(data), 'characters') 33 | json_obj = json.loads(data) 34 | 35 | place_id = json_obj["results"][0]["place_id"] 36 | print("Place id", place_id) -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Quiz/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/3-Using-Python-To-Access_Web-Data/Quiz/.DS_Store -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Quiz/Week 2 Regular Expressions.txt: -------------------------------------------------------------------------------- 1 | 1. Which of the following regular expressions would extract 'uct.ac.za' from this string using re.findall? 2 | From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 3 | ==> @(\S+) 4 | 5 | 2.Which of the following is the way we match the "start of a line" in a regular expression? 6 | ==> ^ 7 | 8 | 3.What would the following mean in a regular expression?
[a-z0-9] 9 | ==> Match a lowercase letter or a digit 10 | 11 | 4.What is the type of the return value of the re.findall() method? 12 | ==> A list of strings 13 | 14 | 5.What is the "wild card" character in a regular expression (i.e., the character that matches any character)? 15 | ==> . 16 | 17 | 6.What is the difference between the "+" and "*" character in regular expressions? 18 | ==> The "+" matches at least one character and the "*" matches zero or more characters 19 | 20 | 7.What does the "[0-9]+" match in a regular expression? 21 | ==> One or more digits 22 | 23 | 8.What does the following Python sequence print out? 24 | x = 'From: Using the : character' 25 | y = re.findall('^F.+:', x) 26 | print(y) 27 | ==> ['From: Using the :'] 28 | 29 | 9.What character do you add to the "+" or "*" to indicate that the match is to be done in a non-greedy manner? 30 | ==> ? 31 | 32 | 10.Given the following line of text: 33 | From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 34 | What would the regular expression '\S+?@\S+' match? 35 | ==> stephen.marquard@uct.ac.za 36 | 37 | 11.Which of the following best describes "Regular Expressions"? 38 | ==> A small programming language unto itself 39 | 40 | 12.What will the '\$' regular expression match? 41 | ==> A dollar sign (the backslash escapes the $, so it matches a literal dollar character) 42 | -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Quiz/Week 3 Networks and Sockets.txt: -------------------------------------------------------------------------------- 1 | 1.What do we call it when a browser uses the HTTP protocol to load a file or page from a server and display it in the browser? 2 | ==> The Request/Response Cycle 3 | 4 | 2.Which of the following is most similar to a TCP port number? 5 | ==> A telephone extension 6 | 7 | 3.What must you do in Python before opening a socket?
8 | ==> import socket 9 | 10 | 4.Which of the following TCP sockets is most commonly used for the web protocol (HTTP)? 11 | ==> 80 12 | 13 | 5.Which of the following is most like an open socket in an application? 14 | ==> An "in-progress" phone conversation 15 | 16 | 6.What does the "H" of HTTP stand for? 17 | ==> HyperText 18 | 19 | 7.What is an important aspect of an Application Layer protocol like HTTP? 20 | ==> Which application talks first? The client or server? 21 | 22 | 8.What are the three parts of this URL (Uniform Resource Locator)? 23 | http://www.dr-chuck.com/page1.htm 24 | ==> Protocol, host, and document 25 | 26 | 9.When you click on an anchor tag in a web page like below, what HTTP request is sent to the server? 27 |
<a href="http://www.dr-chuck.com/page1.htm"> Please click here. </a>
28 | ==> GET 29 | 30 | 10.Which organization publishes Internet Protocol Standards? 31 | ==> IETF 32 | 33 | 11. In a client-server application on the web using sockets, which must come up first? 34 | ==> server 35 | 36 | 12.What do we call it when a browser uses the HTTP protocol to load a file or page from a server and display it in the browser? 37 | ==>The Request/Response Cycle -------------------------------------------------------------------------------- /3-Using-Python-To-Access_Web-Data/Quiz/Week 4 Reading Web Data From Python.txt: -------------------------------------------------------------------------------- 1 | 1.Which of the following Python data structures is most similar to the value returned in this line of Python: 2 | x = urllib.request.urlopen('http://data.pr4e.org/romeo.txt') 3 | ==>file handle 4 | 5 | 2.In this Python code, which line actually reads the data? 6 | import socket 7 | 8 | mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 9 | mysock.connect(('data.pr4e.org', 80)) 10 | cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode() 11 | mysock.send(cmd) 12 | 13 | while True: 14 | data = mysock.recv(512) 15 | if (len(data) < 1): 16 | break 17 | print(data.decode()) 18 | mysock.close() 19 | 20 | ==>mysock.recv() 21 | 22 | 3.Which of the following regular expressions would extract the URL from this line of HTML: 23 | <p>Please click <a href="http://www.dr-chuck.com/page1.htm">here</a></p>
24 | ==> href="(.+)" 25 | 26 | 4.In this Python code, which line is most like the open() call to read a file: 27 | import socket 28 | 29 | mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 30 | mysock.connect(('data.pr4e.org', 80)) 31 | cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode() 32 | mysock.send(cmd) 33 | 34 | while True: 35 | data = mysock.recv(512) 36 | if (len(data) < 1): 37 | break 38 | print(data.decode()) 39 | mysock.close() 40 | 41 | ==>mysock.connect() 42 | 43 | 5.Which HTTP header tells the browser the kind of document that is being returned? 44 | ==>Content-Type: 45 | 46 | 6.What should you check before scraping a web site? 47 | ==> That the web site allows scraping 48 | 49 | 7.What is the purpose of the BeautifulSoup Python library? 50 | ==>It repairs and parses HTML to make it easier for a program to understand 51 | 52 | 8.What ends up in the "x" variable in the following code: 53 | html = urllib.request.urlopen(url).read() 54 | soup = BeautifulSoup(html, 'html.parser') 55 | x = soup('a') 56 | ==> A list of all the anchor tags (About this Map
47 |48 | This is a cool map from 49 | www.pythonlearn.com. 50 |
51 | 52 | 53 | -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Assignment/Week 5 Assignment Databases and Visualization (peer-graded)/geodata/where.js: -------------------------------------------------------------------------------- 1 | myData = [ 2 | [42.340082,-71.0894884, 'Northeastern, Boston, MA 02115, USA'], 3 | [40.7399972,-74.1775311, 'Bradley Hall, 110 Warren St, Newark, NJ 07102, USA'], 4 | [32.778949,35.019648, 'Technion/ Sports Building, Haifa'], 5 | [42.4036848,-71.120482, 'South Hall Tufts University, 30 Lower Campus Rd, Somerville, MA 02144, USA'], 6 | [-38.1518106,145.1345412, 'Monash University, Frankston VIC 3199, Australia'], 7 | [53.2948229,69.4047872, 'Kokshetau 020000, Kazakhstan'], 8 | [40.7127837,-74.0059413, 'New York, NY, USA'], 9 | [52.2869741,104.3050183, 'Irkutsk, Irkutsk Oblast, Russia'], 10 | [8.481302,4.611479, 'University Rd, Ilorin, Nigeria'], 11 | [-25.7688448,28.199104, 'Unisa Observatory Building, Preller St, Pretoria, 0027, South Africa'], 12 | [47.80949,13.05501, 'Salzburg, Austria'], 13 | [61.4977524,23.7609535, 'Tampere, Finland'], 14 | [27.7518284,-82.6267345, 'St. Petersburg, FL, USA'], 15 | [54.7903112,32.0503663, 'Smolensk, Smolensk Oblast, Russia'], 16 | [24.8614622,67.0099388, 'Karachi, Pakistan'], 17 | [40.506934,-3.3458886, 'Ctra. 
Universidad Complutense, 28805 Alcalá de Henares, Madrid, Spain'], 18 | [51.5266171,-0.1260773, 'University Of London, 1-11 Cartwright Gardens, Kings Cross, London WC1H 9EB, UK'], 19 | [39.5069974,-84.745231, 'Oxford, OH 45056, USA'], 20 | [58.3733281,26.7265098, 'Tartu Ülikooli Füüsikahoone, 50103 Tartu, Estonia'], 21 | [33.6778327,-117.8151285, 'Padua, Irvine, CA 92614, USA'], 22 | [18.5544976,73.8257325, 'Pune University, Ganeshkhind, Pune, Maharashtra, India'], 23 | [37.8805941,-122.2447958, 'Space Sciences Laboratory at University of California, 7 Gauss Way, Berkeley, CA 94720, USA'], 24 | [43.0765915,-89.4052247, 'William H. Sewell Social Sciences Building, 1180 Observatory Dr, Madison, WI 53706, USA'], 25 | [39.9622267,116.3659223, 'Bei Jing Shi Fan Da Xue, BeiTaiPingZhuang, Haidian Qu, Beijing Shi, China, 100875'], 26 | [33.9519347,-83.357567, 'Athens, GA, USA'], 27 | [10.7295115,79.0196067, 'Sastra University Road, Tirumalaisamudram, Tamil Nadu 613401, India'], 28 | [41.9197689,-91.649501, 'Duke St SW, Cedar Rapids, IA 52404, USA'], 29 | [-23.5505199,-46.6333094, 'São Paulo, State of São Paulo, Brazil'], 30 | [30.2850284,-97.7335226, 'University of Texas at Austin, Austin, TX, USA'], 31 | [61.6887271,27.2721457, 'Mikkeli, Finland'], 32 | [32.4204729,-85.0323718, 'H. 
Curtis Pitts Hall, 3413 S Seale Rd, Phenix City, AL 36869, USA'], 33 | [41.557583,-8.397568, 'Universidade do Minho, 4710 Braga, Portugal'], 34 | [51.892316,-8.4951998, 'National Food Biotechnology Centre, Food Science and Technology Building, University College Cork, College Rd, University College, Cork, Ireland'], 35 | [-33.0444219,-71.6066334, 'Pontificia Universidad Catolica De Valparaiso - Gimpert, Valparaíso, Región de Valparaíso, Chile'], 36 | [40.6331249,-89.3985283, 'Illinois, USA'], 37 | [30.0180285,31.5032758, 'AUC Sports Center, Cairo Governorate, Egypt'], 38 | [55.1170375,36.5970818, 'Obninsk, Kaluga Oblast, Russia'], 39 | [31.767879,-106.440736, 'Washington, El Paso, TX 79905, USA'], 40 | [49.9935,36.230383, 'Kharkiv, Kharkiv Oblast, Ukraine'], 41 | [43.8562586,18.4130763, 'Sarajevo, Bosnia and Herzegovina'], 42 | [3.4321247,-76.5461709, 'Parqueadero Universidad Del Valle, Cali, Valle del Cauca, Colombia'], 43 | [40.0082221,-105.2591119, 'Colorado Ave & University Heights, Boulder, CO 80302, USA'], 44 | [53.4129429,59.0016233, 'Magnitogorsk, Chelyabinsk Oblast, Russia'], 45 | [27.5695246,-99.4350626, 'Senator Judith Zaffirini Student Success Center, Laredo, TX 78041, USA'], 46 | [52.124815,-106.589195, 'Simon Fraser Crescent, Saskatoon, SK S7H, Canada'], 47 | [40.807722,-73.96411, '116 St - Columbia University, New York, NY 10027, USA'], 48 | [34.1036186,-117.2914463, 'American Heritage University of Southern California, 255 N D St, San Bernardino, CA 92401, USA'], 49 | [43.1827984,-77.5993071, 'Warsaw St, Rochester, NY 14621, USA'], 50 | [52.2296756,21.0122287, 'Warsaw, Poland'], 51 | [-40.900557,174.885971, 'New Zealand'], 52 | [-40.3850866,175.6140639, 'Massey University, Palmerston North, New Zealand'], 53 | [35.1924456,-97.4432884, 'University of Oklahoma, Norman, OK 73072, USA'], 54 | [45.1847248,9.1582069, '27100 Pavia PV, Italy'], 55 | [38.6598662,-90.3123536, 'Columbia Ave, University City, MO 63130, USA'], 56 | [50.0755381,14.4378005, 
'Prague, Czech Republic'], 57 | [41.8313852,-87.6272216, 'Iit Tower, 10 W 35th St, Chicago, IL 60616, USA'], 58 | [40.7933949,-77.8600012, 'State College, PA, USA'], 59 | [40.7609264,-111.8270486, 'University, Salt Lake City, UT, USA'], 60 | [39.4813156,-0.3505, 'Universitat Politècnica, 46022 Valencia, Spain'], 61 | [33.6140008,-117.8440006, 'Vienna, Newport Beach, CA 92660, USA'], 62 | [44.4267674,26.1025384, 'Bucharest, Romania'], 63 | [33.7063317,-117.7733121, 'New Haven, Irvine, CA 92620, USA'], 64 | [47.761605,-122.19303, 'UW Bothell & Cascadia College, Bothell, WA 98011, USA'], 65 | [38.6679152,-90.3322259, 'Drexel Dr, University City, MO 63130, USA'], 66 | [42.320138,-83.230993, 'University of Michigan, Dearborn, MI 48128, USA'], 67 | [40.4432289,-79.9441368, 'Carnegie Mellon University, Pausch Bridge, Pittsburgh, PA 15213, USA'], 68 | [55.8304307,49.0660806, 'Kazan, Tatarstan, Russia'], 69 | [12.0263438,79.8492812, 'Pondicherry University, Kalapet, Puducherry 605014, India'], 70 | [30.7897514,120.7760636, 'Jia Xing Nan Yang Zhi Ye Ji Shu Xue Yuan, Xiuzhou Qu, Jiaxing Shi, Zhejiang Sheng, China, 314000'], 71 | [35.712815,135.9711705, 'Nyu, Mihama, Mikata District, Fukui Prefecture 919-1201, Japan'], 72 | [-23.5431786,-46.6291845, 'State of São Paulo, Brazil'], 73 | [47.5584793,21.620443, 'Debrecen, Debrecen University-Botanical Garden, 4032 Hungary'], 74 | [34.0705324,-117.2957813, 'San Bernardino Fwy, San Bernardino, CA 92408, USA'], 75 | [50.4501,30.5234, 'Kiev, Ukraine, 02000'], 76 | [46.4618977,-80.9664534, 'University Laurentian, Copper Cliff, ON P0M 1N0, Canada'], 77 | [55.755826,37.6173, 'Moscow, Russia'], 78 | [52.2016671,0.1177882, 'University Of Cambridge, Cambridge CB2, UK'], 79 | [35.246756,33.0307541, 'ODTÜ Misafirhane, Kalkanlı'], 80 | [46.5189865,6.5676007, 'EPFL, 1015 Lausanne, Switzerland'], 81 | [45.2671352,19.8335496, 'Novi Sad, Serbia'], 82 | [57.6954209,11.9853213, 'Göteborgs universitetsbibliotek, Renströmsgatan 4, 412 55 Göteborg, 
Sweden'], 83 | [22.4828735,88.394867, 'Jadavpur University Lake, Sahid Smirity Colony, Pancha Sayar, Kolkata, West Bengal 700094'], 84 | [26.1529683,91.6639235, 'Gauhati University, Jalukbari, Guwahati, Assam, India'], 85 | [-34.5101473,-58.6864035, 'Universidad de Buenos Aires, Villa de Mayo, Buenos Aires, Argentina'], 86 | [44.4046049,8.9311653, 'Centro servizi bibliotecari di architettura Nino Carboneri dellUniversità degli studi di Genova, Stradone di SantAgostino, 37, 16123 Genova, Italy'], 87 | [4.8602595,-74.0333032, 'Universidad De La Sabana, Chía, Cundinamarca, Colombia'], 88 | [43.4553461,-76.5104973, 'Oswego, NY, USA'], 89 | [16.9785466,82.2406733, 'Jawaharlal Nehru Technological University, Kakinada, Andhra Pradesh 533003, India'], 90 | [50.503887,4.469936, 'Belgium'], 91 | [51.4925846,-0.1852592, 'Boston University, 43 Harrington Gardens, Kensington, London SW7 4JU, UK'], 92 | [64.9078809,-147.7117155, 'Manchester Loop, Fairbanks, AK 99712, USA'], 93 | [51.1877226,6.7938734, 'Fachhochschule Düsseldorf, 40225 Düsseldorf, Germany'], 94 | [39.18625,-86.5345967, 'Indiana 45 46 Bypass & N College Ave, Bloomington, IN 47408, USA'], 95 | [18.9331831,72.8341894, 'KP Shethi Building, Janmabhoomi Marg, Kala Ghoda, Fort, Mumbai, Maharashtra 400001, India'], 96 | [45.4248599,-75.6828, 'University of Ottawa Press, 542 King Edward Ave, Ottawa, ON K1N 6N5, Canada'], 97 | [28.3580163,75.5887989, 'BITS, Pilani, Rajasthan 333031, India'], 98 | [38.0517783,-84.4923513, 'Lucille C. Little Theater, Lexington, KY 40508, USA'], 99 | [25.25968,82.989115, 'IIT Gymkhana, RR 11, Banaras Hindu University Campus, Varanasi, Uttar Pradesh 221001, India'], 100 | [50.862282,-2.4998561, 'E M Mitchell & Sons, Hermitage, Dorchester DT2 7BB, UK'], 101 | [10.1464162,-64.6955802, 'Universidad Central de Venezuela EUS Educación Barcelona, Av Centurión, Barcelona, Anzoátegui, Venezuela'], 102 | [-9.9541653,-67.8384015, 'Tv. 
Paraíba - Geraldo Fleming, Rio Branco - AC, Brazil'], 103 | [47.497912,19.040235, 'Budapest, Hungary'], 104 | [55.755826,37.6173, 'Moscow, Russia'], 105 | [27.7518284,-82.6267345, 'St. Petersburg, FL, USA'], 106 | [41.7508391,-88.1535352, 'Naperville, IL, USA'], 107 | [37.424106,-122.1660756, 'Stanford, CA, USA'], 108 | [29.1891714,-81.0469168, 'Lehman Engineering & Technology Center, 600 S Clyde Morris Blvd, Daytona Beach, FL 32114, USA'], 109 | [-35.417,149.1, 'Monash ACT 2904, Australia'], 110 | [19.3188895,-99.1843676, 'National Autonomous University of Mexico, Mexico City, Mexico'], 111 | [35.7058075,51.4020909, 'Tehran University, Tehran, Iran'], 112 | [36.8838957,-76.3040214, 'Old Dominion University, 5115 Hampton Blvd, Norfolk, VA 23508, USA'], 113 | [50.4501,30.5234, 'Kiev, Ukraine, 02000'], 114 | [40.0997009,-88.2209362, 'Babcock Hall, 906 W College Ct, Urbana, IL 61801, USA'], 115 | [40.0024922,-83.0524629, 'Essex Rd, Columbus, OH 43221, USA'], 116 | [49.9935,36.230383, 'Kharkiv, Kharkiv Oblast, Ukraine'], 117 | [27.6027172,-99.4687146, 'Buenos Aires Dr, Laredo, TX 78045, USA'], 118 | [42.5030209,-89.0295642, 'College St, Beloit, WI 53511, USA'], 119 | [40.5382913,-78.3528584, 'Ucla Ln, Altoona, PA 16602, USA'], 120 | [41.7857416,-87.5903039, 'The University of Chicago Press, 1427 E 60th St, Chicago, IL 60637, USA'], 121 | [30.5848529,31.4843221, 'Rd inside Zagazig University, Shaibet an Nakareyah, Markaz El-Zakazik, Ash Sharqia Governorate, Egypt'], 122 | [53.4943212,-113.5490268, 'University of Alberta Farm, Edmonton, AB T6H, Canada'], 123 | [28.0735403,-82.4373589, 'University, FL, USA'], 124 | [8.5053554,76.9484624, 'University of Kerala Senate House Campus, Palayam, Thiruvananthapuram, Kerala, India'], 125 | [45.4723514,9.1964401, 'Via del Vecchio Politecnico, 20121 Milano, Italy'], 126 | [54.6871555,25.2796514, 'Vilnius, Lithuania'], 127 | [20.593684,78.96288, 'India'], 128 | [-33.8812733,18.6264694, 'Stellenbosch University, Cape Town, 7530, South 
Africa'], 129 | [28.6777345,77.4504666, 'IMT Rd, Block 14, Sector 10, Raj Nagar, Ghaziabad, Uttar Pradesh 201002, India'], 130 | [41.2033216,-77.1945247, 'Pennsylvania, USA'], 131 | [31.3260152,75.5761829, 'Jalandhar, Punjab 144001, India'], 132 | [36.8743583,-76.1745441, 'Virginia Tech Trail, Virginia Beach, VA 23455, USA'], 133 | [33.4205343,-111.9339825, 'Old Main at Arizona State University, 400 E Tyler Mall, Tempe, AZ 85281, USA'], 134 | [22.2567635,-97.8345654, 'Guatemala, Cd Madero, Tamps., Mexico'], 135 | [54.6871555,25.2796514, 'Vilnius, Lithuania'], 136 | [1.2246216,19.7878159, 'Basankusu Airport (BSU), N22, Basankusu, Democratic Republic of the Congo'], 137 | [51.165691,10.451526, 'Germany'], 138 | [27.7518284,-82.6267345, 'St. Petersburg, FL, USA'], 139 | [33.952602,-84.5499327, 'Marietta, GA, USA'], 140 | [42.9097484,-85.7630885, 'Grandville, MI, USA'], 141 | [34.3020001,48.8145943, 'Malayer, Hamadan, Iran'], 142 | [39.4813156,-0.3505, 'Universitat Politècnica, 46022 Valencia, Spain'] 143 | ]; 144 | -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/4-Using-Database-With_Python/Quiz/.DS_Store -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/Week 1.1 Using Encoded Data in Python 3.txt: -------------------------------------------------------------------------------- 1 | 1.What is the most common Unicode encoding when moving data between systems? 2 | ==> UTF-8 3 | 4 | 2.What is the decimal (Base-10) numeric value for the upper case letter "G" in the ASCII character set? 
5 | ==> 71 6 | 7 | 3.What word does the following sequence of numbers represent in ASCII: 8 | 108, 105, 115, 116 9 | ==> list 10 | 11 | 4.How are strings stored internally in Python 3? 12 | ==> Unicode 13 | 14 | 5.When reading data across the network (i.e. from a URL) in Python 3, what method must be used to convert it to the internal format used by strings? 15 | ==> decode() -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/Week 1.2 Object Oriented Programming-72.txt: -------------------------------------------------------------------------------- 1 | 1.Which came first, the instance or the class? 2 | ==> class 3 | 4 | 2.In Object Oriented Programming, what is another name for the "attributes" of an object? 5 | ==>fields 6 | 7 | 3.At the moment of creation of a new object, Python looks at the _________ definition to define the structure and capabilities of the newly created object. 8 | ==>class 9 | 10 | 4.Which of the following is NOT a good synonym for "class" in Python? 11 | ==>direction 12 | 13 | 5.What does this Python statement do if PartyAnimal is a class? 14 | zap = PartyAnimal() 15 | ==>Use the PartyAnimal template to make a new object and assign it to zap 16 | 17 | 6.What is the syntax to look up the fullname attribute in an object stored in the variable colleen? 18 | ==> colleen.fullname 19 | 20 | 7.Which of these statements is used to indicate that class A will inherit all the features of class B? 21 | ==>class A(B) : 22 | 23 | 8.What keyword is used to indicate the start of a method in a Python class? 24 | ==>def 25 | 26 | 9.What is "self" typically used for in a Python method within a class? 27 | ==>To refer to the instance in which the method is being called 28 | 29 | 10.What does the Python dir() function show when we pass an object into it as a parameter? 
30 | ==> It shows the methods and attributes of the object 31 | 32 | 11.Which of the following is rarely used in Object Oriented Programming? 33 | ==>Destructor -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/Week 2 Single-Table SQL.txt: -------------------------------------------------------------------------------- 1 | 1.Structured Query Language (SQL) is used to (check all that apply) 2 | ==> 3 | - Create a table 4 | - Delete data 5 | - Insert data 6 | 7 | 2.Which of these is the right syntax to make a new table? 8 | ==>CREATE TABLE people; 9 | 10 | 3.Which SQL command is used to insert a new row into a table? 11 | ==> INSERT INTO 12 | 13 | 4.Which command is used to retrieve all records from a table? 14 | ==>SELECT * FROM Users 15 | 16 | 5.Which keyword will cause the results of the query to be displayed in sorted order? 17 | ==>ORDER BY 18 | 19 | 6.In database terminology, another word for table is 20 | ==>relation 21 | 22 | 7.In a typical online production environment, who has direct access to the production database? 23 | ==>Database Administrator 24 | 25 | 8.Which of the following is the database software used in this class? 26 | ==>SQLite 27 | 28 | 9.What happens if a DELETE command is run on a table without a WHERE clause? 29 | ==>All the rows in the table are deleted 30 | 31 | 10.Which of the following commands would update a column named "name" in a table named "Users"? 32 | ==>UPDATE Users SET name='new name' WHERE ... 33 | 34 | 11.What does this SQL command do? 
35 | SELECT COUNT(*) FROM Users 36 | Hint: This is not from the lecture 37 | ==>It counts the rows in the table Users -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/Week 3 Multi-Table Relational SQL.txt: -------------------------------------------------------------------------------- 1 | 1.What is the primary added value of relational databases over flat files? 2 | ==>Ability to scan large amounts of data quickly 3 | 4 | 2.What is the purpose of a primary key? 5 | ==>To look up a particular row in a table very quickly 6 | 7 | 3.Which of the following is NOT a good rule to follow when developing a database model? 8 | ==>Use a person's email address as their primary key 9 | 10 | 4.If our user interface (i.e., like iTunes) has repeated strings on one column of the user interface, how should we model this properly in a database? 11 | ==>Make a table that maps the strings in the column to numbers and then use those numbers in the column 12 | 13 | 5.Which of the following is the label we give a column that the "outside world" uses to look up a particular row? 14 | ==>Logical key 15 | 16 | 6.What is the label we give to a column that is an integer and used to point to a row in a different table? 17 | ==>Foreign key 18 | 19 | 7.What SQLite keyword is added to primary keys in a CREATE TABLE statement to indicate that the database is to provide a value for the column when records are inserted? 20 | ==>AUTOINCREMENT 21 | 22 | 8.What is the SQL keyword that reconnects rows that have foreign keys with the corresponding data in the table that the foreign key points to? 23 | ==>JOIN 24 | 25 | 9.What happens when you JOIN two tables together without an ON clause? 
26 | ==>The number of rows you get is the number of rows in the first table times the number of rows in the second table 27 | 28 | 10.When you are doing a SELECT with a JOIN across multiple tables with identical column names, how do you distinguish the column names? 29 | ==>tablename.columnname -------------------------------------------------------------------------------- /4-Using-Database-With_Python/Quiz/Week 4 Many-to-Many Relationships and Python.txt: -------------------------------------------------------------------------------- 1 | 1.How do we model a many-to-many relationship between two database tables? 2 | ==>We add a table with two foreign keys 3 | 4 | 2.In Python, what is a database "cursor" most like? 5 | ==>A file handle 6 | 7 | 3.What method do you call in an SQLite cursor object in Python to run an SQL command? 8 | ==>execute() 9 | 10 | 4.In the following SQL, 11 | cur.execute('SELECT count FROM Counts WHERE org = ? ', (org, )) 12 | what is the purpose of the "?"? 13 | ==>It is a placeholder for the contents of the "org" variable 14 | 15 | 5.In the following Python code sequence (assuming cur is a SQLite cursor object), 16 | cur.execute('SELECT count FROM Counts WHERE org = ? ', (org, )) 17 | row = cur.fetchone() 18 | what is the value in row if no rows match the WHERE clause? 19 | ==>None 20 | 21 | 6.What does the LIMIT clause in the following SQL accomplish? 22 | SELECT org, count FROM Counts 23 | ORDER BY count DESC LIMIT 10 24 | ==>It only retrieves the first 10 rows from the table 25 | 26 | 7.What does the executescript() method in the Python SQLite cursor object do that the normal execute() method does not do? 27 | ==>It allows multiple SQL statements separated by semicolons 28 | 29 | 8.What is the purpose of "OR IGNORE" in the following SQL: 30 | INSERT OR IGNORE INTO Course (title) VALUES ( ?
) 31 | ==>It makes sure that if a particular title is already in the table, there are no duplicate rows inserted 32 | 33 | 9.For the following Python code to work, what must be added to the title column in the CREATE TABLE statement for the Course table: 34 | cur.execute('''INSERT OR IGNORE INTO Course (title) 35 | VALUES ( ? )''', ( title, ) ) 36 | cur.execute('SELECT id FROM Course WHERE title = ? ', 37 | (title, )) 38 | course_id = cur.fetchone()[0] 39 | ==>A UNIQUE constraint -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/.DS_Store -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/.DS_Store -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/.DS_Store -------------------------------------------------------------------------------- 
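The Week 4 answers above (a junction table with two foreign keys, `?` placeholders, OR IGNORE against a UNIQUE column, and executescript() for multi-statement scripts) can be exercised end to end in a few lines of Python 3. The Student/Course/Member names below are illustrative, not the graded assignment's actual schema:

```python
import sqlite3

# A junction table holding two foreign keys models the many-to-many
# relationship between students and courses.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# executescript() allows multiple semicolon-separated SQL statements
cur.executescript('''
CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE Course  (id INTEGER PRIMARY KEY, title TEXT UNIQUE);
CREATE TABLE Member  (student_id INTEGER, course_id INTEGER,
                      PRIMARY KEY (student_id, course_id));
''')

enrollments = [('Chuck', 'si106'), ('Colleen', 'si106'), ('Chuck', 'si110')]
for name, title in enrollments:
    # OR IGNORE plus the UNIQUE constraint keeps duplicate rows out
    cur.execute('INSERT OR IGNORE INTO Student (name) VALUES ( ? )', (name,))
    cur.execute('SELECT id FROM Student WHERE name = ? ', (name,))
    student_id = cur.fetchone()[0]
    cur.execute('INSERT OR IGNORE INTO Course (title) VALUES ( ? )', (title,))
    cur.execute('SELECT id FROM Course WHERE title = ? ', (title,))
    course_id = cur.fetchone()[0]
    cur.execute('INSERT OR IGNORE INTO Member (student_id, course_id) VALUES ( ?, ? )',
                (student_id, course_id))
conn.commit()

student_count = cur.execute('SELECT COUNT(*) FROM Student').fetchone()[0]
member_count = cur.execute('SELECT COUNT(*) FROM Member').fetchone()[0]
print(student_count, member_count)   # 2 distinct students, 3 enrollments
```

Running the loop a second time would change nothing: every INSERT is OR IGNORE against a UNIQUE column or the composite primary key, so no duplicate rows can appear.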
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/.idea/inspectionProfiles/Project_Default.xml: -------------------------------------------------------------------------------- 1 |If you don't see a chart above, check the JavaScript console. You may 16 | need to use a different browser.
17 | 18 | 19 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/force.js: -------------------------------------------------------------------------------- 1 | var width = 600, 2 | height = 600; 3 | 4 | var color = d3.scale.category20(); 5 | 6 | var dist = (width + height) / 4; 7 | 8 | var force = d3.layout.force() 9 | .charge(-120) 10 | .linkDistance(dist) 11 | .size([width, height]); 12 | 13 | function getrank(rval) { 14 | return (rval/2.0) + 3; 15 | } 16 | 17 | function getcolor(rval) { 18 | return color(rval); 19 | } 20 | 21 | var svg = d3.select("#chart").append("svg") 22 | .attr("width", width) 23 | .attr("height", height); 24 | 25 | function loadData(json) { 26 | force 27 | .nodes(json.nodes) 28 | .links(json.links); 29 | 30 | var k = Math.sqrt(json.nodes.length / (width * height)); 31 | 32 | force 33 | .charge(-10 / k) 34 | .gravity(100 * k) 35 | .start(); 36 | 37 | var link = svg.selectAll("line.link") 38 | .data(json.links) 39 | .enter().append("line") 40 | .attr("class", "link") 41 | .style("stroke-width", function(d) { return Math.sqrt(d.value); }); 42 | 43 | var node = svg.selectAll("circle.node") 44 | .data(json.nodes) 45 | .enter().append("circle") 46 | .attr("class", "node") 47 | .attr("r", function(d) { return getrank(d.rank); } ) 48 | .style("fill", function(d) { return getcolor(d.rank); }) 49 | .on("dblclick",function(d) { 50 | if ( confirm('Do you want to open '+d.url) ) 51 | window.open(d.url,'_new',''); 52 | d3.event.stopPropagation(); 53 | }) 54 | .call(force.drag); 55 | 56 | node.append("title") 57 | .text(function(d) { return d.url; }); 58 | 59 | force.on("tick", function() { 60 | link.attr("x1", function(d) { return d.source.x; }) 61 | .attr("y1", function(d) { return d.source.y; }) 62 | .attr("x2", function(d) { return d.target.x; }) 63 | .attr("y2", function(d) { return d.target.y; }); 64 | 65 | 
node.attr("cx", function(d) { return d.x; }) 66 | .attr("cy", function(d) { return d.y; }); 67 | }); 68 | 69 | } 70 | loadData(spiderJson); 71 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/.DS_Store -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/force.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/force.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/force_oth.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/force_oth.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/gmainC.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/gmainC.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/spdump.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/outputImages/spdump.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/pageRank.txt: -------------------------------------------------------------------------------- 1 | Simple Python Search Spider, Page Ranker, and Visualizer 2 | 3 | This is a set of programs that emulate some of the functions of a 4 | search engine. They store their data in a SQLITE3 database named 5 | 'spider.sqlite'. This file can be removed at any time to restart the 6 | process. 7 | 8 | You should install the SQLite browser to view and modify 9 | the databases from: 10 | 11 | http://sqlitebrowser.org/ 12 | 13 | This program crawls a web site and pulls a series of pages into the 14 | database, recording the links between pages. 
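The crawl data lives in two tables: Pages, one row per fetched URL, and Links, one row per from/to edge (spider.py later in this repository creates the full schema). A minimal Python 3 sketch that creates the same core tables and records a single link, using an in-memory database as a stand-in for spider.sqlite:

```python
import sqlite3

# Core schema used by spider.py: Pages keyed by id with a UNIQUE url,
# Links recording (from_id, to_id) edges between pages.
# ':memory:' is a stand-in here; the real tools open 'spider.sqlite'.
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT,
     error INTEGER, old_rank REAL, new_rank REAL)''')
cur.execute('CREATE TABLE IF NOT EXISTS Links (from_id INTEGER, to_id INTEGER)')

# Record that the start page links to the blog page; new pages enter
# with a rank of 1.0, just as the crawler inserts them.
for url in ('http://www.dr-chuck.com', 'http://www.dr-chuck.com/csev-blog'):
    cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )',
                (url,))
from_id = cur.execute('SELECT id FROM Pages WHERE url = ?',
                      ('http://www.dr-chuck.com',)).fetchone()[0]
to_id = cur.execute('SELECT id FROM Pages WHERE url = ?',
                    ('http://www.dr-chuck.com/csev-blog',)).fetchone()[0]
cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES ( ?, ? )',
            (from_id, to_id))
conn.commit()
print(from_id, to_id)   # ids assigned in insertion order: 1 2
```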
15 | 16 | Mac: rm spider.sqlite 17 | Mac: python spider.py 18 | 19 | Win: del spider.sqlite 20 | Win: spider.py 21 | 22 | Enter web url or enter: http://www.dr-chuck.com/ 23 | ['http://www.dr-chuck.com'] 24 | How many pages:2 25 | 1 http://www.dr-chuck.com/ 12 26 | 2 http://www.dr-chuck.com/csev-blog/ 57 27 | How many pages: 28 | 29 | In this sample run, we told it to crawl a website and retrieve two 30 | pages. If you restart the program again and tell it to crawl more 31 | pages, it will not re-crawl any pages already in the database. Upon 32 | restart it goes to a random non-crawled page and starts there. So 33 | each successive run of spider.py is additive. 34 | 35 | Mac: python spider.py 36 | Win: spider.py 37 | 38 | Enter web url or enter: http://www.dr-chuck.com/ 39 | ['http://www.dr-chuck.com'] 40 | How many pages:3 41 | 3 http://www.dr-chuck.com/csev-blog 57 42 | 4 http://www.dr-chuck.com/dr-chuck/resume/speaking.htm 1 43 | 5 http://www.dr-chuck.com/dr-chuck/resume/index.htm 13 44 | How many pages: 45 | 46 | You can have multiple starting points in the same database - 47 | within the program these are called "webs". The spider 48 | chooses randomly amongst all non-visited links across all 49 | the webs. 50 | 51 | If your code fails complaining about certificate problems, 52 | there is some code (SSL) that can be un-commented to work 53 | around certificate problems. 54 | 55 | If you want to dump the contents of the spider.sqlite file, you can 56 | run spdump.py as follows: 57 | 58 | Mac: python spdump.py 59 | Win: spdump.py 60 | 61 | (5, None, 1.0, 3, u'http://www.dr-chuck.com/csev-blog') 62 | (3, None, 1.0, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm') 63 | (1, None, 1.0, 2, u'http://www.dr-chuck.com/csev-blog/') 64 | (1, None, 1.0, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm') 65 | 4 rows. 66 | 67 | This shows the number of incoming links, the old page rank, the new page 68 | rank, the id of the page, and the url of the page.
The spdump.py program 69 | only shows pages that have at least one incoming link to them. 70 | 71 | Once you have a few pages in the database, you can run Page Rank on the 72 | pages using the sprank.py program. You simply tell it how many Page 73 | Rank iterations to run. 74 | 75 | Mac: python sprank.py 76 | Win: sprank.py 77 | 78 | How many iterations:2 79 | 1 0.546848992536 80 | 2 0.226714939664 81 | [(1, 0.559), (2, 0.659), (3, 0.985), (4, 2.135), (5, 0.659)] 82 | 83 | You can dump the database again to see that page rank has been updated: 84 | 85 | Mac: python spdump.py 86 | Win: spdump.py 87 | 88 | (5, 1.0, 0.985, 3, u'http://www.dr-chuck.com/csev-blog') 89 | (3, 1.0, 2.135, 4, u'http://www.dr-chuck.com/dr-chuck/resume/speaking.htm') 90 | (1, 1.0, 0.659, 2, u'http://www.dr-chuck.com/csev-blog/') 91 | (1, 1.0, 0.659, 5, u'http://www.dr-chuck.com/dr-chuck/resume/index.htm') 92 | 4 rows. 93 | 94 | You can run sprank.py as many times as you like and it will simply refine 95 | the page rank the more times you run it. You can even run sprank.py a few times 96 | and then go spider a few more pages with spider.py and then run sprank.py 97 | to converge the page ranks. 98 | 99 | If you want to restart the Page Rank calculations without re-spidering the 100 | web pages, you can use spreset.py 101 | 102 | Mac: python spreset.py 103 | Win: spreset.py 104 | 105 | All pages set to a rank of 1.0 106 | 107 | Mac: python sprank.py 108 | Win: sprank.py 109 | 110 | How many iterations:50 111 | 1 0.546848992536 112 | 2 0.226714939664 113 | 3 0.0659516187242 114 | 4 0.0244199333 115 | 5 0.0102096489546 116 | 6 0.00610244329379 117 | ...
118 | 42 0.000109076928206 119 | 43 9.91987599002e-05 120 | 44 9.02151706798e-05 121 | 45 8.20451504471e-05 122 | 46 7.46150183837e-05 123 | 47 6.7857770908e-05 124 | 48 6.17124694224e-05 125 | 49 5.61236959327e-05 126 | 50 5.10410499467e-05 127 | [(512, 0.02963718031139026), (1, 12.790786721866658), (2, 28.939418898678284), (3, 6.808468390725946), (4, 13.469889092397006)] 128 | 129 | For each iteration of the page rank algorithm it prints the average 130 | change per page of the page rank. The network initially is quite 131 | unbalanced and so the individual page ranks are changing wildly. 132 | But in a few short iterations, the page rank converges. You 133 | should run sprank.py long enough that the page ranks converge. 134 | 135 | If you want to visualize the current top pages in terms of page rank, 136 | run spjson.py to write the pages out in JSON format to be viewed in a 137 | web browser. 138 | 139 | Mac: python spjson.py 140 | Win: spjson.py 141 | 142 | Creating JSON output on spider.js... 143 | How many nodes? 30 144 | Open force.html in a browser to view the visualization 145 | 146 | You can view this data by opening the file force.html in your web browser. 147 | This shows an automatic layout of the nodes and links. You can click and 148 | drag any node and you can also double click on a node to find the URL 149 | that is represented by the node. 150 | 151 | This visualization is provided using the force layout from: 152 | 153 | http://mbostock.github.com/d3/ 154 | 155 | If you rerun the other utilities and then re-run spjson.py - you merely 156 | have to press refresh in the browser to get the new data from spider.js.
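sprank.py itself is not included in this dump, but the "average change per page" numbers in the transcripts above are what the standard PageRank update produces. A self-contained Python 3 sketch of that iteration; the three-node graph and the 0.85 damping factor are illustrative assumptions, not values taken from sprank.py:

```python
# Simplified PageRank pass over an in-memory link graph.
# The node ids and damping factor below are illustrative assumptions.
links = {1: [2, 3], 2: [3], 3: [1]}      # from_id -> list of to_ids
ranks = {node: 1.0 for node in links}    # every page starts at rank 1.0

def pagerank_step(links, ranks, d=0.85):
    """One iteration; returns (new ranks, average change per page)."""
    # Each page keeps a (1 - d) baseline and receives d * rank/outdegree
    # from every page that links to it.
    new_ranks = {node: (1.0 - d) for node in ranks}
    for from_id, to_ids in links.items():
        share = d * ranks[from_id] / len(to_ids)
        for to_id in to_ids:
            new_ranks[to_id] += share
    avg_change = sum(abs(new_ranks[n] - ranks[n]) for n in ranks) / len(ranks)
    return new_ranks, avg_change

for _ in range(50):
    ranks, avg_change = pagerank_step(links, ranks)
print(round(sum(ranks.values()), 2))   # total rank is conserved: 3.0
```

As the README says, the changes are large on the first iterations and shrink steadily as the ranks converge.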
-------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spdump.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url 7 | FROM Pages JOIN Links ON Pages.id = Links.to_id 8 | WHERE html IS NOT NULL 9 | GROUP BY id ORDER BY inbound DESC''') 10 | 11 | count = 0 12 | for row in cur : 13 | if count < 50 : print row 14 | count = count + 1 15 | print count, 'rows.' 16 | cur.close() 17 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spider.js: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spider.js -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spider.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import urllib 3 | import ssl 4 | from urlparse import urljoin 5 | from urlparse import urlparse 6 | from BeautifulSoup import * 7 | 8 | # Deal with SSL certificate anomalies Python > 2.7 9 | # scontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1) 10 | scontext = None 11 | 12 | conn = sqlite3.connect('spider.sqlite') 13 | cur = conn.cursor() 14 | 15 | cur.execute('''CREATE TABLE IF NOT EXISTS Pages 16 | (id INTEGER PRIMARY KEY, url TEXT UNIQUE, html TEXT, 17 | error INTEGER, old_rank REAL, 
new_rank REAL)''') 18 | 19 | cur.execute('''CREATE TABLE IF NOT EXISTS Links 20 | (from_id INTEGER, to_id INTEGER)''') 21 | 22 | cur.execute('''CREATE TABLE IF NOT EXISTS Webs (url TEXT UNIQUE)''') 23 | 24 | # Check to see if we are already in progress... 25 | cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1') 26 | row = cur.fetchone() 27 | if row is not None: 28 | print "Restarting existing crawl. Remove spider.sqlite to start a fresh crawl." 29 | else : 30 | starturl = raw_input('Enter web url or enter: ') 31 | if ( len(starturl) < 1 ) : starturl = 'http://python-data.dr-chuck.net/' 32 | if ( starturl.endswith('/') ) : starturl = starturl[:-1] 33 | web = starturl 34 | if ( starturl.endswith('.htm') or starturl.endswith('.html') ) : 35 | pos = starturl.rfind('/') 36 | web = starturl[:pos] 37 | 38 | if ( len(web) > 1 ) : 39 | cur.execute('INSERT OR IGNORE INTO Webs (url) VALUES ( ? )', ( web, ) ) 40 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( starturl, ) ) 41 | conn.commit() 42 | # http://www.dr-chuck.com/ 43 | # Get the current webs 44 | cur.execute('''SELECT url FROM Webs''') 45 | webs = list() 46 | for row in cur: 47 | webs.append(str(row[0])) 48 | 49 | print webs 50 | 51 | many = 0 52 | while True: 53 | if ( many < 1 ) : 54 | sval = raw_input('How many pages:') 55 | if ( len(sval) < 1 ) : break 56 | many = int(sval) 57 | many = many - 1 58 | 59 | cur.execute('SELECT id,url FROM Pages WHERE html is NULL and error is NULL ORDER BY RANDOM() LIMIT 1') 60 | try: 61 | row = cur.fetchone() 62 | # print row 63 | fromid = row[0] 64 | url = row[1] 65 | except: 66 | print 'No unretrieved HTML pages found' 67 | many = 0 68 | break 69 | 70 | print fromid, url, 71 | 72 | # If we are retrieving this page, there should be no links from it 73 | cur.execute('DELETE from Links WHERE from_id=?', (fromid, ) ) 74 | try: 75 | # Deal with SSL certificate anomalies Python > 2.7 76 | # 
scontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1) 77 | # document = urllib.urlopen(url, context=scontext) 78 | 79 | # Normal Unless you encounter certificate problems 80 | document = urllib.urlopen(url) 81 | 82 | html = document.read() 83 | if document.getcode() != 200 : 84 | print "Error on page: ",document.getcode() 85 | cur.execute('UPDATE Pages SET error=? WHERE url=?', (document.getcode(), url) ) 86 | 87 | if 'text/html' != document.info().gettype() : 88 | print "Ignore non text/html page" 89 | cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url, ) ) 90 | conn.commit() 91 | continue 92 | 93 | print '('+str(len(html))+')', 94 | 95 | soup = BeautifulSoup(html) 96 | except KeyboardInterrupt: 97 | print '' 98 | print 'Program interrupted by user...' 99 | break 100 | except: 101 | print "Unable to retrieve or parse page" 102 | cur.execute('UPDATE Pages SET error=-1 WHERE url=?', (url, ) ) 103 | conn.commit() 104 | continue 105 | 106 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( url, ) ) 107 | cur.execute('UPDATE Pages SET html=? 
WHERE url=?', (buffer(html), url ) ) 108 | conn.commit() 109 | 110 | # Retrieve all of the anchor tags 111 | tags = soup('a') 112 | count = 0 113 | for tag in tags: 114 | href = tag.get('href', None) 115 | if ( href is None ) : continue 116 | # Resolve relative references like href="/contact" 117 | up = urlparse(href) 118 | if ( len(up.scheme) < 1 ) : 119 | href = urljoin(url, href) 120 | ipos = href.find('#') 121 | if ( ipos > 1 ) : href = href[:ipos] 122 | if ( href.endswith('.png') or href.endswith('.jpg') or href.endswith('.gif') ) : continue 123 | if ( href.endswith('/') ) : href = href[:-1] 124 | # print href 125 | if ( len(href) < 1 ) : continue 126 | 127 | # Check if the URL is in any of the webs 128 | found = False 129 | for web in webs: 130 | if ( href.startswith(web) ) : 131 | found = True 132 | break 133 | if not found : continue 134 | 135 | cur.execute('INSERT OR IGNORE INTO Pages (url, html, new_rank) VALUES ( ?, NULL, 1.0 )', ( href, ) ) 136 | count = count + 1 137 | conn.commit() 138 | 139 | cur.execute('SELECT id FROM Pages WHERE url=? LIMIT 1', ( href, )) 140 | try: 141 | row = cur.fetchone() 142 | toid = row[0] 143 | except: 144 | print 'Could not retrieve id' 145 | continue 146 | # print fromid, toid 147 | cur.execute('INSERT OR IGNORE INTO Links (from_id, to_id) VALUES ( ?, ? 
)', ( fromid, toid ) ) 148 | 149 | 150 | print count 151 | 152 | cur.close() 153 | 154 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spider.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spider.sqlite -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spjson.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | print "Creating JSON output on spider.js..." 7 | howmany = int(raw_input("How many nodes? 
")) 8 | 9 | cur.execute('''SELECT COUNT(from_id) AS inbound, old_rank, new_rank, id, url 10 | FROM Pages JOIN Links ON Pages.id = Links.to_id 11 | WHERE html IS NOT NULL AND ERROR IS NULL 12 | GROUP BY id ORDER BY id,inbound''') 13 | 14 | fhand = open('spider.js','w') 15 | nodes = list() 16 | maxrank = None 17 | minrank = None 18 | for row in cur : 19 | nodes.append(row) 20 | rank = row[2] 21 | if maxrank < rank or maxrank is None : maxrank = rank 22 | if minrank > rank or minrank is None : minrank = rank 23 | if len(nodes) > howmany : break 24 | 25 | if maxrank == minrank or maxrank is None or minrank is None: 26 | print "Error - please run sprank.py to compute page rank" 27 | quit() 28 | 29 | fhand.write('spiderJson = {"nodes":[\n') 30 | count = 0 31 | map = dict() 32 | ranks = dict() 33 | for row in nodes : 34 | if count > 0 : fhand.write(',\n') 35 | # print row 36 | rank = row[2] 37 | rank = 19 * ( (rank - minrank) / (maxrank - minrank) ) 38 | fhand.write('{'+'"weight":'+str(row[0])+',"rank":'+str(rank)+',') 39 | fhand.write(' "id":'+str(row[3])+', "url":"'+row[4]+'"}') 40 | map[row[3]] = count 41 | ranks[row[3]] = rank 42 | count = count + 1 43 | fhand.write('],\n') 44 | 45 | cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''') 46 | fhand.write('"links":[\n') 47 | 48 | count = 0 49 | for row in cur : 50 | # print row 51 | if row[0] not in map or row[1] not in map : continue 52 | if count > 0 : fhand.write(',\n') 53 | rank = ranks[row[0]] 54 | srank = 19 * ( (rank - minrank) / (maxrank - minrank) ) 55 | fhand.write('{"source":'+str(map[row[0]])+',"target":'+str(map[row[1]])+',"value":3}') 56 | count = count + 1 57 | fhand.write(']};') 58 | fhand.close() 59 | cur.close() 60 | 61 | print "Open force.html in a browser to view the visualization" 62 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/sprank.py: 
-------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | # Find the ids that send out page rank - we only are interested 7 | # in pages in the SCC that have in and out links 8 | cur.execute('''SELECT DISTINCT from_id FROM Links''') 9 | from_ids = list() 10 | for row in cur: 11 | from_ids.append(row[0]) 12 | 13 | # Find the ids that receive page rank 14 | to_ids = list() 15 | links = list() 16 | cur.execute('''SELECT DISTINCT from_id, to_id FROM Links''') 17 | for row in cur: 18 | from_id = row[0] 19 | to_id = row[1] 20 | if from_id == to_id : continue 21 | if from_id not in from_ids : continue 22 | if to_id not in from_ids : continue 23 | links.append(row) 24 | if to_id not in to_ids : to_ids.append(to_id) 25 | 26 | # Get latest page ranks for strongly connected component 27 | prev_ranks = dict() 28 | for node in from_ids: 29 | cur.execute('''SELECT new_rank FROM Pages WHERE id = ?''', (node, )) 30 | row = cur.fetchone() 31 | prev_ranks[node] = row[0] 32 | 33 | sval = raw_input('How many iterations:') 34 | many = 1 35 | if ( len(sval) > 0 ) : many = int(sval) 36 | 37 | # Sanity check 38 | if len(prev_ranks) < 1 : 39 | print "Nothing to page rank. Check data." 
40 | quit() 41 | 42 | # Let's do Page Rank in memory so it is really fast 43 | for i in range(many): 44 | # print prev_ranks.items()[:5] 45 | next_ranks = dict() 46 | total = 0.0 47 | for (node, old_rank) in prev_ranks.items(): 48 | total = total + old_rank 49 | next_ranks[node] = 0.0 50 | # print total 51 | 52 | # Find the number of outbound links and send the page rank down each 53 | for (node, old_rank) in prev_ranks.items(): 54 | # print node, old_rank 55 | give_ids = list() 56 | for (from_id, to_id) in links: 57 | if from_id != node : continue 58 | # print ' ',from_id,to_id 59 | 60 | if to_id not in to_ids: continue 61 | give_ids.append(to_id) 62 | if ( len(give_ids) < 1 ) : continue 63 | amount = old_rank / len(give_ids) 64 | # print node, old_rank,amount, give_ids 65 | 66 | for id in give_ids: 67 | next_ranks[id] = next_ranks[id] + amount 68 | 69 | newtot = 0 70 | for (node, next_rank) in next_ranks.items(): 71 | newtot = newtot + next_rank 72 | evap = (total - newtot) / len(next_ranks) 73 | 74 | # print newtot, evap 75 | for node in next_ranks: 76 | next_ranks[node] = next_ranks[node] + evap 77 | 78 | newtot = 0 79 | for (node, next_rank) in next_ranks.items(): 80 | newtot = newtot + next_rank 81 | 82 | # Compute the per-page average change from old rank to new rank 83 | # as an indication of convergence of the algorithm 84 | totdiff = 0 85 | for (node, old_rank) in prev_ranks.items(): 86 | new_rank = next_ranks[node] 87 | diff = abs(old_rank-new_rank) 88 | totdiff = totdiff + diff 89 | 90 | avediff = totdiff / len(prev_ranks) 91 | print i+1, avediff 92 | 93 | # rotate 94 | prev_ranks = next_ranks 95 | 96 | # Put the final ranks back into the database 97 | print next_ranks.items()[:5] 98 | cur.execute('''UPDATE Pages SET old_rank=new_rank''') 99 | for (id, new_rank) in next_ranks.items() : 100 | cur.execute('''UPDATE Pages SET new_rank=?
WHERE id=?''', (new_rank, id)) 101 | conn.commit() 102 | cur.close() 103 | 104 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 1 pagerank/spreset.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | 3 | conn = sqlite3.connect('spider.sqlite') 4 | cur = conn.cursor() 5 | 6 | cur.execute('''UPDATE Pages SET new_rank=1.0, old_rank=0.0''') 7 | conn.commit() 8 | 9 | cur.close() 10 | 11 | print "All pages set to a rank of 1.0" 12 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/.DS_Store -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gbasic.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gbasic.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gline.png: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gline.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gmain.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gmain.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gmodel.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gmodel.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gword.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/OutputImages/gword.png -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/README.txt: -------------------------------------------------------------------------------- 1 | Analyzing an EMAIL Archive and visualizing the data using the 2 | D3 JavaScript library 3 | 4 | Here is a copy of the Sakai Developer Mailing list from 2006-2014. 5 | 6 | http://mbox.dr-chuck.net/ 7 | 8 | You should install the SQLite browser to view and modify the databases from: 9 | 10 | http://sqlitebrowser.org/ 11 | 12 | The base URL is hard-coded in gmane.py. Make sure to delete the 13 | content.sqlite file if you switch the base URL. The gmane.py file 14 | operates as a spider in that it runs slowly and retrieves one mail 15 | message per second so as to avoid getting throttled. It stores all of 16 | its data in a database and can be interrupted and re-started 17 | as often as needed. It may take many hours to pull all the data 18 | down, so you may need to restart several times. 19 | 20 | To give you a head start, I have put up 600MB of pre-spidered Sakai 21 | email here: 22 | 23 | https://online.dr-chuck.com/files/sakai/email/content.sqlite.zip 24 | 25 | If you download and unzip this, you can "catch up with the 26 | latest" by running gmane.py.
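The restart behaviour described above — gmane.py scans content.sqlite from message 1 upward and resumes spidering at the first id it has not yet retrieved — can be sketched in a few lines. This is a Python 3 sketch under assumed names (a `Messages` table with an integer `id` column), not the actual gmane.py schema:

```python
import sqlite3

def resume_point(conn):
    """Return the first message id not yet spidered, scanning from 1 upward.

    Mirrors the restart behaviour described in the README: walk the stored
    ids in order until the first gap, then resume spidering there.
    (Table and column names here are illustrative, not the gmane.py schema.)
    """
    cur = conn.cursor()
    cur.execute('SELECT id FROM Messages ORDER BY id')
    expected = 1
    for (mid,) in cur:
        if mid != expected:
            break  # found a gap - this is where spidering resumes
        expected = expected + 1
    return expected

# Tiny in-memory demo: ids 1-3 are spidered, 4 is missing
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Messages (id INTEGER PRIMARY KEY)')
conn.executemany('INSERT INTO Messages (id) VALUES (?)', [(1,), (2,), (3,), (5,)])
print(resume_point(conn))  # 4
```

This also shows why a missing message "sticks" the spider: the scan stops at the first gap, which is why the README suggests inserting an empty row for a missing id.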
27 | 28 | Navigate to the folder where you extracted the gmane.zip 29 | 30 | Here is a run of gmane.py getting the last five messages of the 31 | sakai developer list: 32 | 33 | Mac: python gmane.py 34 | Win: gmane.py 35 | 36 | How many messages:10 37 | http://mbox.dr-chuck.net/sakai.devel/5/6 9443 38 | john@caret.cam.ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments 39 | http://mbox.dr-chuck.net/sakai.devel/6/7 3586 40 | s-githens@northwestern.edu 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense 41 | http://mbox.dr-chuck.net/sakai.devel/7/8 10600 42 | john@caret.cam.ac.uk 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments 43 | 44 | The program scans content.sqlite from 1 up to the first message number not 45 | already spidered and starts spidering at that message. It continues spidering 46 | until it has spidered the desired number of messages or it reaches a page 47 | that does not appear to be a properly formatted message. 48 | 49 | Sometimes a message is missing. Perhaps administrators can delete messages 50 | or perhaps they get lost - I don't know. If your spider stops, and it seems it has hit 51 | a missing message, go into the SQLite Manager and add a row with the missing id - leave 52 | all the other fields blank - and then restart gmane.py. This will unstick the 53 | spidering process and allow it to continue. These empty messages will be ignored in the next 54 | phase of the process. 55 | 56 | One nice thing is that once you have spidered all of the messages and have them in 57 | content.sqlite, you can run gmane.py again to get new messages as they get sent to the 58 | list. gmane.py will quickly scan to the end of the already-spidered pages and check 59 | if there are new messages and then quickly retrieve those messages and add them 60 | to content.sqlite. 61 | 62 | The content.sqlite data is pretty raw, with an inefficient data model, and not compressed.
63 | This is intentional as it allows you to look at content.sqlite to debug the process. 64 | It would be a bad idea to run any queries against this database as they would be 65 | slow. 66 | 67 | The second process is running the program gmodel.py. gmodel.py reads the rough/raw 68 | data from content.sqlite and produces a cleaned-up and well-modeled version of the 69 | data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X 70 | smaller) than content.sqlite because it also compresses the header and body text. 71 | 72 | Each time gmodel.py runs, it completely wipes out and re-builds index.sqlite, allowing 73 | you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the 74 | data cleaning process. 75 | 76 | Running gmodel.py works as follows: 77 | 78 | Mac: python gmodel.py 79 | Win: gmodel.py 80 | 81 | Loaded allsenders 1588 and mapping 28 dns mapping 1 82 | 1 2005-12-08T23:34:30-06:00 ggolden22@mac.com 83 | 251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu 84 | 501 2006-01-12T11:17:34-05:00 lance@indiana.edu 85 | 751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu 86 | ... 87 | 88 | The gmodel.py program does a number of data cleaning steps: 89 | 90 | Domain names are truncated to two levels for .com, .org, .edu, and .net; 91 | other domain names are truncated to three levels. So si.umich.edu becomes 92 | umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also mail addresses are 93 | forced to lower case, and some of the @gmane.org addresses like the following 94 | 95 | arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org 96 | 97 | are converted to the real address whenever there is a matching real email 98 | address elsewhere in the message corpus. 99 | 100 | If you look in the content.sqlite database there are two tables that allow 101 | you to map both domain names and individual email addresses that change over 102 | the lifetime of the email list.
For example, Steve Githens used the following 103 | email addresses over the life of the Sakai developer list: 104 | 105 | s-githens@northwestern.edu 106 | sgithens@cam.ac.uk 107 | swgithen@mtu.edu 108 | 109 | We can add two entries to the Mapping table: 110 | 111 | s-githens@northwestern.edu -> swgithen@mtu.edu 112 | sgithens@cam.ac.uk -> swgithen@mtu.edu 113 | 114 | And so all the mail messages will be collected under one sender even if 115 | they used several email addresses over the lifetime of the mailing list. 116 | 117 | You can also make similar entries in the DNSMapping table if there are multiple 118 | DNS names you want mapped to a single DNS name. In the Sakai data I add the following 119 | mapping: 120 | 121 | iupui.edu -> indiana.edu 122 | 123 | So all the folks from the various Indiana University campuses are tracked together. 124 | 125 | You can re-run gmodel.py over and over as you look at the data, and add mappings 126 | to make the data cleaner and cleaner. When you are done, you will have a nicely 127 | indexed version of the email in index.sqlite. This is the file to use to do data 128 | analysis. With this file, data analysis will be really quick. 129 | 130 | The first, simplest data analysis is to do a "who does the most" and "which 131 | organization does the most"? This is done using gbasic.py: 132 | 133 | Mac: python gbasic.py 134 | Win: gbasic.py 135 | 136 | How many to dump?
5 137 | Loaded messages= 51330 subjects= 25033 senders= 1584 138 | 139 | Top 5 Email list participants 140 | steve.swinsburg@gmail.com 2657 141 | azeckoski@unicon.net 1742 142 | ieb@tfd.co.uk 1591 143 | csev@umich.edu 1304 144 | david.horwitz@uct.ac.za 1184 145 | 146 | Top 5 Email list organizations 147 | gmail.com 7339 148 | umich.edu 6243 149 | uct.ac.za 2451 150 | indiana.edu 2258 151 | unicon.net 2055 152 | 153 | You can look at the data in index.sqlite and if you find a problem, you 154 | can update the Mapping table and DNSMapping table in content.sqlite and 155 | re-run gmodel.py. 156 | 157 | There is a simple visualization of the word frequency in the subject lines 158 | in the file gword.py: 159 | 160 | Mac: python gword.py 161 | Win: gword.py 162 | 163 | Range of counts: 33229 129 164 | Output written to gword.js 165 | 166 | This produces the file gword.js which you can visualize using the file 167 | gword.htm. 168 | 169 | A second visualization is in gline.py. It visualizes email participation by 170 | organizations over time. 171 | 172 | Mac: python gline.py 173 | Win: gline.py 174 | 175 | Loaded messages= 51330 subjects= 25033 senders= 1584 176 | Top 10 Oranizations 177 | ['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk'] 178 | Output written to gline.js 179 | 180 | Its output is written to gline.js which is visualized using gline.htm.
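The two tallies gbasic.py prints — messages per sender and messages per organization — boil down to counting. Here is a Python 3 sketch using collections.Counter, with made-up sample addresses rather than the real index.sqlite data (the actual program reads its senders out of the database):

```python
from collections import Counter

def top_counts(senders, howmany):
    """Count messages per sender and per organization (the domain part
    of the address), the same two tallies gbasic.py prints."""
    sender_counts = Counter(senders)
    org_counts = Counter(addr.split('@')[1] for addr in senders)
    return sender_counts.most_common(howmany), org_counts.most_common(howmany)

# Hypothetical sample data, not taken from the Sakai archive
senders = ['a@umich.edu', 'b@gmail.com', 'a@umich.edu', 'c@umich.edu']
top_senders, top_orgs = top_counts(senders, 2)
print(top_senders)  # [('a@umich.edu', 2), ('b@gmail.com', 1)]
print(top_orgs)     # [('umich.edu', 3), ('gmail.com', 1)]
```

The organization tally is why the Mapping and DNSMapping tables matter: without them, one person posting from three addresses shows up as three senders.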
181 | 182 | Some URLs for visualization ideas: 183 | 184 | https://developers.google.com/chart/ 185 | 186 | https://developers.google.com/chart/interactive/docs/gallery/motionchart 187 | 188 | https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats 189 | 190 | https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline 191 | 192 | http://bost.ocks.org/mike/uberdata/ 193 | 194 | http://mbostock.github.io/d3/talk/20111018/calendar.html 195 | 196 | http://nltk.org/install.html 197 | 198 | As always - comments welcome. 199 | 200 | -- Dr. Chuck 201 | Sun Sep 29 00:11:01 EDT 2013 202 | 203 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/content.sqlite: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/content.sqlite -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/d3.layout.cloud.js: -------------------------------------------------------------------------------- 1 | // Word cloud layout by Jason Davies, http://www.jasondavies.com/word-cloud/ 2 | // Algorithm due to Jonathan Feinberg, http://static.mrfeinberg.com/bv_ch03.pdf 3 | (function(exports) { 4 | function cloud() { 5 | var size = [256, 256], 6 | text = cloudText, 7 | font = cloudFont, 8 | fontSize = cloudFontSize, 9 | fontStyle = cloudFontNormal, 10 | fontWeight = cloudFontNormal, 11 | rotate = cloudRotate, 12 | padding = cloudPadding, 13 | spiral = archimedeanSpiral, 14 | words = [], 
15 | timeInterval = Infinity, 16 | event = d3.dispatch("word", "end"), 17 | timer = null, 18 | cloud = {}; 19 | 20 | cloud.start = function() { 21 | var board = zeroArray((size[0] >> 5) * size[1]), 22 | bounds = null, 23 | n = words.length, 24 | i = -1, 25 | tags = [], 26 | data = words.map(function(d, i) { 27 | d.text = text.call(this, d, i); 28 | d.font = font.call(this, d, i); 29 | d.style = fontStyle.call(this, d, i); 30 | d.weight = fontWeight.call(this, d, i); 31 | d.rotate = rotate.call(this, d, i); 32 | d.size = ~~fontSize.call(this, d, i); 33 | d.padding = cloudPadding.call(this, d, i); 34 | return d; 35 | }).sort(function(a, b) { return b.size - a.size; }); 36 | 37 | if (timer) clearInterval(timer); 38 | timer = setInterval(step, 0); 39 | step(); 40 | 41 | return cloud; 42 | 43 | function step() { 44 | var start = +new Date, 45 | d; 46 | while (+new Date - start < timeInterval && ++i < n && timer) { 47 | d = data[i]; 48 | d.x = (size[0] * (Math.random() + .5)) >> 1; 49 | d.y = (size[1] * (Math.random() + .5)) >> 1; 50 | cloudSprite(d, data, i); 51 | if (place(board, d, bounds)) { 52 | tags.push(d); 53 | event.word(d); 54 | if (bounds) cloudBounds(bounds, d); 55 | else bounds = [{x: d.x + d.x0, y: d.y + d.y0}, {x: d.x + d.x1, y: d.y + d.y1}]; 56 | // Temporary hack 57 | d.x -= size[0] >> 1; 58 | d.y -= size[1] >> 1; 59 | } 60 | } 61 | if (i >= n) { 62 | cloud.stop(); 63 | event.end(tags, bounds); 64 | } 65 | } 66 | } 67 | 68 | cloud.stop = function() { 69 | if (timer) { 70 | clearInterval(timer); 71 | timer = null; 72 | } 73 | return cloud; 74 | }; 75 | 76 | cloud.timeInterval = function(x) { 77 | if (!arguments.length) return timeInterval; 78 | timeInterval = x == null ? 
Infinity : x; 79 | return cloud; 80 | }; 81 | 82 | function place(board, tag, bounds) { 83 | var perimeter = [{x: 0, y: 0}, {x: size[0], y: size[1]}], 84 | startX = tag.x, 85 | startY = tag.y, 86 | maxDelta = Math.sqrt(size[0] * size[0] + size[1] * size[1]), 87 | s = spiral(size), 88 | dt = Math.random() < .5 ? 1 : -1, 89 | t = -dt, 90 | dxdy, 91 | dx, 92 | dy; 93 | 94 | while (dxdy = s(t += dt)) { 95 | dx = ~~dxdy[0]; 96 | dy = ~~dxdy[1]; 97 | 98 | if (Math.min(dx, dy) > maxDelta) break; 99 | 100 | tag.x = startX + dx; 101 | tag.y = startY + dy; 102 | 103 | if (tag.x + tag.x0 < 0 || tag.y + tag.y0 < 0 || 104 | tag.x + tag.x1 > size[0] || tag.y + tag.y1 > size[1]) continue; 105 | // TODO only check for collisions within current bounds. 106 | if (!bounds || !cloudCollide(tag, board, size[0])) { 107 | if (!bounds || collideRects(tag, bounds)) { 108 | var sprite = tag.sprite, 109 | w = tag.width >> 5, 110 | sw = size[0] >> 5, 111 | lx = tag.x - (w << 4), 112 | sx = lx & 0x7f, 113 | msx = 32 - sx, 114 | h = tag.y1 - tag.y0, 115 | x = (tag.y + tag.y0) * sw + (lx >> 5), 116 | last; 117 | for (var j = 0; j < h; j++) { 118 | last = 0; 119 | for (var i = 0; i <= w; i++) { 120 | board[x + i] |= (last << msx) | (i < w ? 
(last = sprite[j * w + i]) >>> sx : 0); 121 | } 122 | x += sw; 123 | } 124 | delete tag.sprite; 125 | return true; 126 | } 127 | } 128 | } 129 | return false; 130 | } 131 | 132 | cloud.words = function(x) { 133 | if (!arguments.length) return words; 134 | words = x; 135 | return cloud; 136 | }; 137 | 138 | cloud.size = function(x) { 139 | if (!arguments.length) return size; 140 | size = [+x[0], +x[1]]; 141 | return cloud; 142 | }; 143 | 144 | cloud.font = function(x) { 145 | if (!arguments.length) return font; 146 | font = d3.functor(x); 147 | return cloud; 148 | }; 149 | 150 | cloud.fontStyle = function(x) { 151 | if (!arguments.length) return fontStyle; 152 | fontStyle = d3.functor(x); 153 | return cloud; 154 | }; 155 | 156 | cloud.fontWeight = function(x) { 157 | if (!arguments.length) return fontWeight; 158 | fontWeight = d3.functor(x); 159 | return cloud; 160 | }; 161 | 162 | cloud.rotate = function(x) { 163 | if (!arguments.length) return rotate; 164 | rotate = d3.functor(x); 165 | return cloud; 166 | }; 167 | 168 | cloud.text = function(x) { 169 | if (!arguments.length) return text; 170 | text = d3.functor(x); 171 | return cloud; 172 | }; 173 | 174 | cloud.spiral = function(x) { 175 | if (!arguments.length) return spiral; 176 | spiral = spirals[x + ""] || x; 177 | return cloud; 178 | }; 179 | 180 | cloud.fontSize = function(x) { 181 | if (!arguments.length) return fontSize; 182 | fontSize = d3.functor(x); 183 | return cloud; 184 | }; 185 | 186 | cloud.padding = function(x) { 187 | if (!arguments.length) return padding; 188 | padding = d3.functor(x); 189 | return cloud; 190 | }; 191 | 192 | return d3.rebind(cloud, event, "on"); 193 | } 194 | 195 | function cloudText(d) { 196 | return d.text; 197 | } 198 | 199 | function cloudFont() { 200 | return "serif"; 201 | } 202 | 203 | function cloudFontNormal() { 204 | return "normal"; 205 | } 206 | 207 | function cloudFontSize(d) { 208 | return Math.sqrt(d.value); 209 | } 210 | 211 | function cloudRotate() { 212 | 
return (~~(Math.random() * 6) - 3) * 30; 213 | } 214 | 215 | function cloudPadding() { 216 | return 1; 217 | } 218 | 219 | // Fetches a monochrome sprite bitmap for the specified text. 220 | // Load in batches for speed. 221 | function cloudSprite(d, data, di) { 222 | if (d.sprite) return; 223 | c.clearRect(0, 0, (cw << 5) / ratio, ch / ratio); 224 | var x = 0, 225 | y = 0, 226 | maxh = 0, 227 | n = data.length; 228 | di--; 229 | while (++di < n) { 230 | d = data[di]; 231 | c.save(); 232 | c.font = d.style + " " + d.weight + " " + ~~((d.size + 1) / ratio) + "px " + d.font; 233 | var w = c.measureText(d.text + "m").width * ratio, 234 | h = d.size << 1; 235 | if (d.rotate) { 236 | var sr = Math.sin(d.rotate * cloudRadians), 237 | cr = Math.cos(d.rotate * cloudRadians), 238 | wcr = w * cr, 239 | wsr = w * sr, 240 | hcr = h * cr, 241 | hsr = h * sr; 242 | w = (Math.max(Math.abs(wcr + hsr), Math.abs(wcr - hsr)) + 0x1f) >> 5 << 5; 243 | h = ~~Math.max(Math.abs(wsr + hcr), Math.abs(wsr - hcr)); 244 | } else { 245 | w = (w + 0x1f) >> 5 << 5; 246 | } 247 | if (h > maxh) maxh = h; 248 | if (x + w >= (cw << 5)) { 249 | x = 0; 250 | y += maxh; 251 | maxh = 0; 252 | } 253 | if (y + h >= ch) break; 254 | c.translate((x + (w >> 1)) / ratio, (y + (h >> 1)) / ratio); 255 | if (d.rotate) c.rotate(d.rotate * cloudRadians); 256 | c.fillText(d.text, 0, 0); 257 | c.restore(); 258 | d.width = w; 259 | d.height = h; 260 | d.xoff = x; 261 | d.yoff = y; 262 | d.x1 = w >> 1; 263 | d.y1 = h >> 1; 264 | d.x0 = -d.x1; 265 | d.y0 = -d.y1; 266 | x += w; 267 | } 268 | var pixels = c.getImageData(0, 0, (cw << 5) / ratio, ch / ratio).data, 269 | sprite = []; 270 | while (--di >= 0) { 271 | d = data[di]; 272 | var w = d.width, 273 | w32 = w >> 5, 274 | h = d.y1 - d.y0, 275 | p = d.padding; 276 | // Zero the buffer 277 | for (var i = 0; i < h * w32; i++) sprite[i] = 0; 278 | x = d.xoff; 279 | if (x == null) return; 280 | y = d.yoff; 281 | var seen = 0, 282 | seenRow = -1; 283 | for (var j = 0; j < h; 
j++) { 284 | for (var i = 0; i < w; i++) { 285 | var k = w32 * j + (i >> 5), 286 | m = pixels[((y + j) * (cw << 5) + (x + i)) << 2] ? 1 << (31 - (i % 32)) : 0; 287 | if (p) { 288 | if (j) sprite[k - w32] |= m; 289 | if (j < w - 1) sprite[k + w32] |= m; 290 | m |= (m << 1) | (m >> 1); 291 | } 292 | sprite[k] |= m; 293 | seen |= m; 294 | } 295 | if (seen) seenRow = j; 296 | else { 297 | d.y0++; 298 | h--; 299 | j--; 300 | y++; 301 | } 302 | } 303 | d.y1 = d.y0 + seenRow; 304 | d.sprite = sprite.slice(0, (d.y1 - d.y0) * w32); 305 | } 306 | } 307 | 308 | // Use mask-based collision detection. 309 | function cloudCollide(tag, board, sw) { 310 | sw >>= 5; 311 | var sprite = tag.sprite, 312 | w = tag.width >> 5, 313 | lx = tag.x - (w << 4), 314 | sx = lx & 0x7f, 315 | msx = 32 - sx, 316 | h = tag.y1 - tag.y0, 317 | x = (tag.y + tag.y0) * sw + (lx >> 5), 318 | last; 319 | for (var j = 0; j < h; j++) { 320 | last = 0; 321 | for (var i = 0; i <= w; i++) { 322 | if (((last << msx) | (i < w ? (last = sprite[j * w + i]) >>> sx : 0)) 323 | & board[x + i]) return true; 324 | } 325 | x += sw; 326 | } 327 | return false; 328 | } 329 | 330 | function cloudBounds(bounds, d) { 331 | var b0 = bounds[0], 332 | b1 = bounds[1]; 333 | if (d.x + d.x0 < b0.x) b0.x = d.x + d.x0; 334 | if (d.y + d.y0 < b0.y) b0.y = d.y + d.y0; 335 | if (d.x + d.x1 > b1.x) b1.x = d.x + d.x1; 336 | if (d.y + d.y1 > b1.y) b1.y = d.y + d.y1; 337 | } 338 | 339 | function collideRects(a, b) { 340 | return a.x + a.x1 > b[0].x && a.x + a.x0 < b[1].x && a.y + a.y1 > b[0].y && a.y + a.y0 < b[1].y; 341 | } 342 | 343 | function archimedeanSpiral(size) { 344 | var e = size[0] / size[1]; 345 | return function(t) { 346 | return [e * (t *= .1) * Math.cos(t), t * Math.sin(t)]; 347 | }; 348 | } 349 | 350 | function rectangularSpiral(size) { 351 | var dy = 4, 352 | dx = dy * size[0] / size[1], 353 | x = 0, 354 | y = 0; 355 | return function(t) { 356 | var sign = t < 0 ? 
-1 : 1; 357 | // See triangular numbers: T_n = n * (n + 1) / 2. 358 | switch ((Math.sqrt(1 + 4 * sign * t) - sign) & 3) { 359 | case 0: x += dx; break; 360 | case 1: y += dy; break; 361 | case 2: x -= dx; break; 362 | default: y -= dy; break; 363 | } 364 | return [x, y]; 365 | }; 366 | } 367 | 368 | // TODO reuse arrays? 369 | function zeroArray(n) { 370 | var a = [], 371 | i = -1; 372 | while (++i < n) a[i] = 0; 373 | return a; 374 | } 375 | 376 | var cloudRadians = Math.PI / 180, 377 | cw = 1 << 11 >> 5, 378 | ch = 1 << 11, 379 | canvas, 380 | ratio = 1; 381 | 382 | if (typeof document !== "undefined") { 383 | canvas = document.createElement("canvas"); 384 | canvas.width = 1; 385 | canvas.height = 1; 386 | ratio = Math.sqrt(canvas.getContext("2d").getImageData(0, 0, 1, 1).data.length >> 2); 387 | canvas.width = (cw << 5) / ratio; 388 | canvas.height = ch / ratio; 389 | } else { 390 | // node-canvas support 391 | var Canvas = require("canvas"); 392 | canvas = new Canvas(cw << 5, ch); 393 | } 394 | 395 | var c = canvas.getContext("2d"), 396 | spirals = { 397 | archimedean: archimedeanSpiral, 398 | rectangular: rectangularSpiral 399 | }; 400 | c.fillStyle = "red"; 401 | c.textAlign = "center"; 402 | 403 | exports.cloud = cloud; 404 | })(typeof exports === "undefined" ? d3.layout || (d3.layout = {}) : exports); 405 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/email .txt: -------------------------------------------------------------------------------- 1 | Analyzing an EMAIL Archive vizualizing the data using the 2 | D3 JavaScript library 3 | 4 | Here is a copy of the Sakai Developer Mailing list from 2006-2014. 
5 | 6 | http://mbox.dr-chuck.net/ 7 | 8 | You should install the SQLite browser to view and modify the databases from: 9 | 10 | http://sqlitebrowser.org/ 11 | 12 | The base URL is hard-coded in gmane.py. Make sure to delete the 13 | content.sqlite file if you switch the base URL. The gmane.py file 14 | operates as a spider in that it runs slowly and retrieves one mail 15 | message per second so as to avoid getting throttled. It stores all of 16 | its data in a database and can be interrupted and re-started 17 | as often as needed. It may take many hours to pull all the data 18 | down, so you may need to restart several times. 19 | 20 | To give you a head start, I have put up 600MB of pre-spidered Sakai 21 | email here: 22 | 23 | https://online.dr-chuck.com/files/sakai/email/content.sqlite.zip 24 | 25 | If you download and unzip this, you can "catch up with the 26 | latest" by running gmane.py. 27 | 28 | Navigate to the folder where you extracted gmane.zip. 29 | 30 | Here is a run of gmane.py getting the last few messages of the 31 | sakai developer list: 32 | 33 | Mac: python gmane.py 34 | Win: gmane.py 35 | 36 | How many messages:10 37 | http://mbox.dr-chuck.net/sakai.devel/5/6 9443 38 | john@caret.cam.ac.uk 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments 39 | http://mbox.dr-chuck.net/sakai.devel/6/7 3586 40 | s-githens@northwestern.edu 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense 41 | http://mbox.dr-chuck.net/sakai.devel/7/8 10600 42 | john@caret.cam.ac.uk 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments 43 | 44 | The program scans content.sqlite from 1 up to the first message number not 45 | already spidered and starts spidering at that message. It continues spidering 46 | until it has spidered the desired number of messages or it reaches a page 47 | that does not appear to be a properly formatted message. 48 | 49 | Sometimes a message is missing.
Perhaps administrators can delete messages 50 | or perhaps they get lost - I don't know. If your spider stops and seems to have hit 51 | a missing message, go into the SQLite browser and add a row with the missing id - leave 52 | all the other fields blank - and then restart gmane.py. This will unstick the 53 | spidering process and allow it to continue. These empty messages will be ignored in the next 54 | phase of the process. 55 | 56 | One nice thing is that once you have spidered all of the messages and have them in 57 | content.sqlite, you can run gmane.py again to get new messages as they get sent to the 58 | list. gmane.py will quickly scan to the end of the already-spidered pages, check 59 | for new messages, retrieve them, and add them 60 | to content.sqlite. 61 | 62 | The content.sqlite data is pretty raw, with an inefficient data model, and not compressed. 63 | This is intentional as it allows you to look at content.sqlite to debug the process. 64 | It would be a bad idea to run any queries against this database as they would be 65 | slow. 66 | 67 | The second process is running the program gmodel.py. gmodel.py reads the rough/raw 68 | data from content.sqlite and produces a cleaned-up and well-modeled version of the 69 | data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X 70 | smaller) than content.sqlite because it also compresses the header and body text. 71 | 72 | Each time gmodel.py runs, it completely wipes out and re-builds index.sqlite, allowing 73 | you to adjust its parameters and edit the mapping tables in mapping.sqlite to tweak the 74 | data cleaning process.
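The size difference comes mostly from compression: gmodel.py stores each message's header and body as zlib-compressed BLOBs in index.sqlite. A minimal Python 3 round-trip sketch (the repo's scripts are Python 2; the sample header text here is made up):

```python
import zlib

# Hypothetical header text standing in for one message's headers.
hdr = "From: someone@example.edu\nSubject: re: lms/vle rants/comments\n" * 40

blob = zlib.compress(hdr.encode("utf-8"))         # what gets stored as a BLOB
restored = zlib.decompress(blob).decode("utf-8")  # what analysis code reads back

assert restored == hdr
print(len(hdr.encode("utf-8")), "bytes ->", len(blob), "bytes compressed")
```

Repetitive mail headers compress very well, which is a big part of why index.sqlite can end up roughly 10X smaller than content.sqlite.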
75 | 76 | Running gmodel.py works as follows: 77 | 78 | Mac: python gmodel.py 79 | Win: gmodel.py 80 | 81 | Loaded allsenders 1588 and mapping 28 dns mapping 1 82 | 1 2005-12-08T23:34:30-06:00 ggolden22@mac.com 83 | 251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu 84 | 501 2006-01-12T11:17:34-05:00 lance@indiana.edu 85 | 751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu 86 | ... 87 | 88 | The gmodel.py program does a number of data cleaning steps: 89 | 90 | Domain names are truncated to two levels for .com, .org, .edu, and .net; 91 | other domain names are truncated to three levels. So si.umich.edu becomes 92 | umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Also, mail addresses are 93 | forced to lower case, and some @gmane.org addresses like the following 94 | 95 | arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org 96 | 97 | are converted to the real address whenever there is a matching real email 98 | address elsewhere in the message corpus. 99 | 100 | If you look in the mapping.sqlite database there are two tables that allow 101 | you to map both domain names and individual email addresses that change over 102 | the lifetime of the email list. For example, Steve Githens used the following 103 | email addresses over the life of the Sakai developer list: 104 | 105 | s-githens@northwestern.edu 106 | sgithens@cam.ac.uk 107 | swgithen@mtu.edu 108 | 109 | We can add two entries to the Mapping table: 110 | 111 | s-githens@northwestern.edu -> swgithen@mtu.edu 112 | sgithens@cam.ac.uk -> swgithen@mtu.edu 113 | 114 | That way, all the mail messages will be collected under one sender even if 115 | they used several email addresses over the lifetime of the mailing list. 116 | 117 | You can also make similar entries in the DNSMapping table if there are multiple 118 | DNS names you want mapped to a single DNS name.
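The domain truncation rule described above is small enough to sketch directly (Python 3 here; gmodel.py applies the same rule inside its fixsender function in Python 2):

```python
def truncate_domain(dns):
    """Keep two labels for .com/.org/.edu/.net domains, three for everything else."""
    pieces = dns.lower().split(".")
    if dns.endswith((".com", ".org", ".edu", ".net")):
        return ".".join(pieces[-2:])
    return ".".join(pieces[-3:])

print(truncate_domain("si.umich.edu"))     # umich.edu
print(truncate_domain("caret.cam.ac.uk"))  # cam.ac.uk
```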
In the Sakai data I add the following 119 | mapping: 120 | 121 | iupui.edu -> indiana.edu 122 | 123 | So all the folks from the various Indiana University campuses are tracked together. 124 | 125 | You can re-run gmodel.py over and over as you look at the data, and add mappings 126 | to make the data cleaner and cleaner. When you are done, you will have a nicely 127 | indexed version of the email in index.sqlite. This is the file to use for data 128 | analysis. With this file, data analysis will be really quick. 129 | 130 | The first, simplest data analysis is to ask "who does the most?" and "which 131 | organization does the most?". This is done using gbasic.py: 132 | 133 | Mac: python gbasic.py 134 | Win: gbasic.py 135 | 136 | How many to dump? 5 137 | Loaded messages= 51330 subjects= 25033 senders= 1584 138 | 139 | Top 5 Email list participants 140 | steve.swinsburg@gmail.com 2657 141 | azeckoski@unicon.net 1742 142 | ieb@tfd.co.uk 1591 143 | csev@umich.edu 1304 144 | david.horwitz@uct.ac.za 1184 145 | 146 | Top 5 Email list organizations 147 | gmail.com 7339 148 | umich.edu 6243 149 | uct.ac.za 2451 150 | indiana.edu 2258 151 | unicon.net 2055 152 | 153 | You can look at the data in index.sqlite and if you find a problem, you 154 | can update the Mapping table and DNSMapping table in mapping.sqlite and 155 | re-run gmodel.py. 156 | 157 | There is a simple visualization of the word frequency in the subject lines 158 | in the file gword.py: 159 | 160 | Mac: python gword.py 161 | Win: gword.py 162 | 163 | Range of counts: 33229 129 164 | Output written to gword.js 165 | 166 | This produces the file gword.js which you can visualize using the file 167 | gword.htm. 168 | 169 | A second visualization is in gline.py. It visualizes email participation by 170 | organizations over time.
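The ranking gbasic.py produces is plain dictionary counting over the sender of every message; a Python 3 sketch using a hypothetical sender list in place of the rows from the index.sqlite Messages/Senders join:

```python
# Hypothetical senders standing in for rows from the Messages/Senders join.
senders = [
    "steve.swinsburg@gmail.com", "csev@umich.edu",
    "steve.swinsburg@gmail.com", "david.horwitz@uct.ac.za",
]

sendcounts = {}
sendorgs = {}
for sender in senders:
    sendcounts[sender] = sendcounts.get(sender, 0) + 1
    pieces = sender.split("@")
    if len(pieces) != 2:
        continue                                   # skip malformed addresses
    sendorgs[pieces[1]] = sendorgs.get(pieces[1], 0) + 1

# Sort keys by their counts, highest first, as gbasic.py does.
top = sorted(sendcounts, key=sendcounts.get, reverse=True)
print(top[0], sendcounts[top[0]])  # steve.swinsburg@gmail.com 2
```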
171 | 172 | Mac: python gline.py 173 | Win: gline.py 174 | 175 | Loaded messages= 51330 subjects= 25033 senders= 1584 176 | Top 10 Organizations 177 | ['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk'] 178 | Output written to gline.js 179 | 180 | Its output is written to gline.js, which is visualized using gline.htm. 181 | 182 | Some URLs for visualization ideas: 183 | 184 | https://developers.google.com/chart/ 185 | 186 | https://developers.google.com/chart/interactive/docs/gallery/motionchart 187 | 188 | https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart_time_formats 189 | 190 | https://developers.google.com/chart/interactive/docs/gallery/annotatedtimeline 191 | 192 | http://bost.ocks.org/mike/uberdata/ 193 | 194 | http://mbostock.github.io/d3/talk/20111018/calendar.html 195 | 196 | http://nltk.org/install.html 197 | 198 | As always - comments welcome. 199 | 200 | -- Dr. Chuck 201 | Sun Sep 29 00:11:01 EDT 2013 -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gbasic.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import urllib 4 | import zlib 5 | 6 | howmany = int(raw_input("How many to dump? 
")) 7 | 8 | conn = sqlite3.connect('index.sqlite') 9 | conn.text_factory = str 10 | cur = conn.cursor() 11 | 12 | cur.execute('''SELECT Messages.id, sender FROM Messages 13 | JOIN Senders ON Messages.sender_id = Senders.id''') 14 | 15 | sendcounts = dict() 16 | sendorgs = dict() 17 | for message in cur : 18 | sender = message[1] 19 | sendcounts[sender] = sendcounts.get(sender,0) + 1 20 | pieces = sender.split("@") 21 | if len(pieces) != 2 : continue 22 | dns = pieces[1] 23 | sendorgs[dns] = sendorgs.get(dns,0) + 1 24 | 25 | print '' 26 | print 'Top',howmany,'Email list participants' 27 | 28 | x = sorted(sendcounts, key=sendcounts.get, reverse=True) 29 | for k in x[:howmany]: 30 | print k, sendcounts[k] 31 | if sendcounts[k] < 10 : break 32 | 33 | print '' 34 | print 'Top',howmany,'Email list organizations' 35 | 36 | x = sorted(sendorgs, key=sendorgs.get, reverse=True) 37 | for k in x[:howmany]: 38 | print k, sendorgs[k] 39 | if sendorgs[k] < 10 : break 40 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gline.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 19 | 20 | 21 | 22 | 23 | 24 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gline.js: -------------------------------------------------------------------------------- 1 | gline = [ ['Month','umich.edu','unl.edu','mac.com','columbia.edu','berkeley.edu','unicon.net','virginia.edu','hull.ac.uk','cam.ac.uk','weber.edu'], 2 | ['2005-12',25,12,10,7,6,6,6,6,5,5] 3 | ]; 4 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 
Spidering and Modeling Email Data/gline.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import urllib 4 | import zlib 5 | 6 | conn = sqlite3.connect('index.sqlite') 7 | conn.text_factory = str 8 | cur = conn.cursor() 9 | 10 | # Determine the top ten organizations 11 | cur.execute('''SELECT Messages.id, sender FROM Messages 12 | JOIN Senders ON Messages.sender_id = Senders.id''') 13 | 14 | sendorgs = dict() 15 | for message_row in cur : 16 | sender = message_row[1] 17 | pieces = sender.split("@") 18 | if len(pieces) != 2 : continue 19 | dns = pieces[1] 20 | sendorgs[dns] = sendorgs.get(dns,0) + 1 21 | 22 | # pick the top schools 23 | orgs = sorted(sendorgs, key=sendorgs.get, reverse=True) 24 | orgs = orgs[:10] 25 | print "Top 10 Organizations" 26 | print orgs 27 | # orgs = ['total'] + orgs 28 | 29 | # Read through the messages 30 | counts = dict() 31 | months = list() 32 | 33 | cur.execute('''SELECT Messages.id, sender, sent_at FROM Messages 34 | JOIN Senders ON Messages.sender_id = Senders.id''') 35 | 36 | for message_row in cur : 37 | sender = message_row[1] 38 | pieces = sender.split("@") 39 | if len(pieces) != 2 : continue 40 | dns = pieces[1] 41 | if dns not in orgs : continue 42 | month = message_row[2][:7] 43 | if month not in months : months.append(month) 44 | key = (month, dns) 45 | counts[key] = counts.get(key,0) + 1 46 | tkey = (month, 'total') 47 | counts[tkey] = counts.get(tkey,0) + 1 48 | 49 | months.sort() 50 | print counts 51 | print months 52 | 53 | fhand = open('gline.js','w') 54 | fhand.write("gline = [ ['Month'") 55 | for org in orgs: 56 | fhand.write(",'"+org+"'") 57 | fhand.write("]") 58 | 59 | # for month in months[1:-1]: 60 | for month in months: 61 | fhand.write(",\n['"+month+"'") 62 | for org in orgs: 63 | key = (month, org) 64 | val = counts.get(key,0) 65 | fhand.write(","+str(val)) 66 | fhand.write("]"); 67 | 68 | fhand.write("\n];\n") 69 | 70 | print "Data 
written to gline.js" 71 | print "Open gline.htm in a browser to view" 72 | 73 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gline2.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 6 | 21 | 22 | 23 | 24 | 25 | 26 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gmane.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import sqlite3 3 | import time 4 | import ssl 5 | import urllib 6 | from urlparse import urljoin 7 | from urlparse import urlparse 8 | import re 9 | from datetime import datetime, timedelta 10 | 11 | # Not all systems have this so conditionally define parser 12 | try: 13 | import dateutil.parser as parser 14 | except: 15 | pass 16 | 17 | def parsemaildate(md) : 18 | # See if we have dateutil 19 | try: 20 | pdate = parser.parse(tdate) 21 | test_at = pdate.isoformat() 22 | return test_at 23 | except: 24 | pass 25 | 26 | # Non-dateutil version - we try our best 27 | 28 | pieces = md.split() 29 | notz = " ".join(pieces[:4]).strip() 30 | 31 | # Try a bunch of format variations - strptime() is *lame* 32 | dnotz = None 33 | for form in [ '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S', 34 | '%d %b %Y %H:%M', '%d %b %Y %H:%M', '%d %b %y %H:%M:%S', 35 | '%d %b %y %H:%M:%S', '%d %b %y %H:%M', '%d %b %y %H:%M' ] : 36 | try: 37 | dnotz = datetime.strptime(notz, form) 38 | break 39 | except: 40 | continue 41 | 42 | if dnotz is None : 43 | # print 'Bad Date:',md 44 | return None 45 | 46 | iso = dnotz.isoformat() 47 | 48 | tz = "+0000" 49 | try: 50 | tz = pieces[4] 51 | ival = int(tz) # Only want numeric timezone values 52 | if tz == '-0000' : tz = '+0000' 
53 | tzh = tz[:3] 54 | tzm = tz[3:] 55 | tz = tzh+":"+tzm 56 | except: 57 | pass 58 | 59 | return iso+tz 60 | 61 | conn = sqlite3.connect('content.sqlite') 62 | cur = conn.cursor() 63 | conn.text_factory = str 64 | 65 | baseurl = "http://mbox.dr-chuck.net/sakai.devel/" 66 | 67 | cur.execute('''CREATE TABLE IF NOT EXISTS Messages 68 | (id INTEGER UNIQUE, email TEXT, sent_at TEXT, 69 | subject TEXT, headers TEXT, body TEXT)''') 70 | 71 | start = 0 72 | cur.execute('SELECT max(id) FROM Messages') 73 | try: 74 | row = cur.fetchone() 75 | if row[0] is not None: 76 | start = row[0] 77 | except: 78 | start = 0 79 | row = None 80 | 81 | print start 82 | 83 | many = 0 84 | 85 | # Skip up to five messages 86 | skip = 5 87 | while True: 88 | if ( many < 1 ) : 89 | sval = raw_input('How many messages:') 90 | if ( len(sval) < 1 ) : break 91 | many = int(sval) 92 | 93 | start = start + 1 94 | cur.execute('SELECT id FROM Messages WHERE id=?', (start,) ) 95 | try: 96 | row = cur.fetchone() 97 | if row is not None : continue 98 | except: 99 | row = None 100 | 101 | many = many - 1 102 | url = baseurl + str(start) + '/' + str(start + 1) 103 | 104 | try: 105 | # Deal with SSL certificate anomalies Python > 2.7 106 | # scontext = ssl.SSLContext(ssl.PROTOCOL_TLSv1) 107 | # document = urllib.urlopen(url, context=scontext) 108 | 109 | document = urllib.urlopen(url) 110 | 111 | text = document.read() 112 | if document.getcode() != 200 : 113 | print "Error code=",document.getcode(), url 114 | break 115 | except KeyboardInterrupt: 116 | print '' 117 | print 'Program interrupted by user...' 118 | break 119 | except: 120 | print "Unable to retrieve or parse page",url 121 | print sys.exc_info()[0] 122 | break 123 | 124 | print url,len(text) 125 | 126 | if not text.startswith("From "): 127 | if skip < 1 : 128 | print text 129 | print "End of mail stream reached..." 
130 | quit () 131 | print " Skipping badly formed message" 132 | skip = skip-1 133 | continue 134 | 135 | pos = text.find("\n\n") 136 | if pos > 0 : 137 | hdr = text[:pos] 138 | body = text[pos+2:] 139 | else: 140 | print text 141 | print "Could not find break between headers and body" 142 | break 143 | 144 | skip = 5 # reset skip count 145 | 146 | email = None 147 | x = re.findall('\nFrom: .* <(\S+@\S+)>\n', hdr) 148 | if len(x) == 1 : 149 | email = x[0]; 150 | email = email.strip().lower() 151 | email = email.replace("<","") 152 | else: 153 | x = re.findall('\nFrom: (\S+@\S+)\n', hdr) 154 | if len(x) == 1 : 155 | email = x[0]; 156 | email = email.strip().lower() 157 | email = email.replace("<","") 158 | 159 | date = None 160 | y = re.findall('\Date: .*, (.*)\n', hdr) 161 | if len(y) == 1 : 162 | tdate = y[0] 163 | tdate = tdate[:26] 164 | try: 165 | sent_at = parsemaildate(tdate) 166 | except: 167 | print text 168 | print "Parse fail",tdate 169 | break 170 | 171 | subject = None 172 | z = re.findall('\Subject: (.*)\n', hdr) 173 | if len(z) == 1 : subject = z[0].strip().lower(); 174 | 175 | print " ",email,sent_at,subject 176 | cur.execute('''INSERT OR IGNORE INTO Messages (id, email, sent_at, subject, headers, body) 177 | VALUES ( ?, ?, ?, ?, ?, ? 
)''', ( start, email, sent_at, subject, hdr, body)) 178 | 179 | # Only commit every 50th record 180 | # if (many % 50) == 0 : conn.commit() 181 | time.sleep(1) 182 | 183 | conn.commit() 184 | cur.close() 185 | 186 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gmodel.py: -------------------------------------------------------------------------------- 1 | import sqlite3 2 | import time 3 | import urllib 4 | import re 5 | import zlib 6 | from datetime import datetime, timedelta 7 | # Not all systems have this 8 | try: 9 | import dateutil.parser as parser 10 | except: 11 | pass 12 | 13 | dnsmapping = dict() 14 | mapping = dict() 15 | 16 | def fixsender(sender,allsenders=None) : 17 | global dnsmapping 18 | global mapping 19 | if sender is None : return None 20 | sender = sender.strip().lower() 21 | sender = sender.replace('<','').replace('>','') 22 | 23 | # Check if we have a hacked gmane.org from address 24 | if allsenders is not None and sender.endswith('gmane.org') : 25 | pieces = sender.split('-') 26 | realsender = None 27 | for s in allsenders: 28 | if s.startswith(pieces[0]) : 29 | realsender = sender 30 | sender = s 31 | # print realsender, sender 32 | break 33 | if realsender is None : 34 | for s in mapping: 35 | if s.startswith(pieces[0]) : 36 | realsender = sender 37 | sender = mapping[s] 38 | # print realsender, sender 39 | break 40 | if realsender is None : sender = pieces[0] 41 | 42 | mpieces = sender.split("@") 43 | if len(mpieces) != 2 : return sender 44 | dns = mpieces[1] 45 | x = dns 46 | pieces = dns.split(".") 47 | if dns.endswith(".edu") or dns.endswith(".com") or dns.endswith(".org") or dns.endswith(".net") : 48 | dns = ".".join(pieces[-2:]) 49 | else: 50 | dns = ".".join(pieces[-3:]) 51 | # if dns != x : print x,dns 52 | # if dns != dnsmapping.get(dns,dns) : print 
dns,dnsmapping.get(dns,dns) 53 | dns = dnsmapping.get(dns,dns) 54 | return mpieces[0] + '@' + dns 55 | 56 | def parsemaildate(md) : 57 | # See if we have dateutil 58 | try: 59 | pdate = parser.parse(tdate) 60 | test_at = pdate.isoformat() 61 | return test_at 62 | except: 63 | pass 64 | 65 | # Non-dateutil version - we try our best 66 | 67 | pieces = md.split() 68 | notz = " ".join(pieces[:4]).strip() 69 | 70 | # Try a bunch of format variations - strptime() is *lame* 71 | dnotz = None 72 | for form in [ '%d %b %Y %H:%M:%S', '%d %b %Y %H:%M:%S', 73 | '%d %b %Y %H:%M', '%d %b %Y %H:%M', '%d %b %y %H:%M:%S', 74 | '%d %b %y %H:%M:%S', '%d %b %y %H:%M', '%d %b %y %H:%M' ] : 75 | try: 76 | dnotz = datetime.strptime(notz, form) 77 | break 78 | except: 79 | continue 80 | 81 | if dnotz is None : 82 | # print 'Bad Date:',md 83 | return None 84 | 85 | iso = dnotz.isoformat() 86 | 87 | tz = "+0000" 88 | try: 89 | tz = pieces[4] 90 | ival = int(tz) # Only want numeric timezone values 91 | if tz == '-0000' : tz = '+0000' 92 | tzh = tz[:3] 93 | tzm = tz[3:] 94 | tz = tzh+":"+tzm 95 | except: 96 | pass 97 | 98 | return iso+tz 99 | 100 | # Parse out the info... 
101 | def parseheader(hdr, allsenders=None): 102 | if hdr is None or len(hdr) < 1 : return None 103 | sender = None 104 | x = re.findall('\nFrom: .* <(\S+@\S+)>\n', hdr) 105 | if len(x) >= 1 : 106 | sender = x[0] 107 | else: 108 | x = re.findall('\nFrom: (\S+@\S+)\n', hdr) 109 | if len(x) >= 1 : 110 | sender = x[0] 111 | 112 | # normalize the domain name of Email addresses 113 | sender = fixsender(sender, allsenders) 114 | 115 | date = None 116 | y = re.findall('\nDate: .*, (.*)\n', hdr) 117 | sent_at = None 118 | if len(y) >= 1 : 119 | tdate = y[0] 120 | tdate = tdate[:26] 121 | try: 122 | sent_at = parsemaildate(tdate) 123 | except Exception, e: 124 | # print 'Date ignored ',tdate, e 125 | return None 126 | 127 | subject = None 128 | z = re.findall('\nSubject: (.*)\n', hdr) 129 | if len(z) >= 1 : subject = z[0].strip().lower() 130 | 131 | guid = None 132 | z = re.findall('\nMessage-ID: (.*)\n', hdr) 133 | if len(z) >= 1 : guid = z[0].strip().lower() 134 | 135 | if sender is None or sent_at is None or subject is None or guid is None : 136 | return None 137 | return (guid, sender, subject, sent_at) 138 | 139 | # Open the output database and create empty tables 140 | conn = sqlite3.connect('index.sqlite') 141 | conn.text_factory = str 142 | cur = conn.cursor() 143 | 144 | cur.execute('''DROP TABLE IF EXISTS Messages ''') 145 | cur.execute('''DROP TABLE IF EXISTS Senders ''') 146 | cur.execute('''DROP TABLE IF EXISTS Subjects ''') 147 | cur.execute('''DROP TABLE IF EXISTS Replies ''') 148 | 149 | cur.execute('''CREATE TABLE IF NOT EXISTS Messages 150 | (id INTEGER PRIMARY KEY, guid TEXT UNIQUE, sent_at INTEGER, 151 | sender_id INTEGER, subject_id INTEGER, 152 | headers BLOB, body BLOB)''') 153 | cur.execute('''CREATE TABLE IF NOT EXISTS Senders 154 | (id INTEGER PRIMARY KEY, sender TEXT UNIQUE)''') 155 | cur.execute('''CREATE TABLE IF NOT EXISTS Subjects 156 | (id INTEGER PRIMARY KEY, subject TEXT UNIQUE)''') 157 | cur.execute('''CREATE TABLE IF NOT EXISTS Replies 
158 | (from_id INTEGER, to_id INTEGER)''') 159 | 160 | # Open the mapping information 161 | conn_1 = sqlite3.connect('mapping.sqlite') 162 | conn_1.text_factory = str 163 | cur_1 = conn_1.cursor() 164 | 165 | # Load up the mapping information into memory structures 166 | cur_1.execute('''SELECT old,new FROM DNSMapping''') 167 | for message_row in cur_1 : 168 | dnsmapping[message_row[0].strip().lower()] = message_row[1].strip().lower() 169 | 170 | mapping = dict() 171 | cur_1.execute('''SELECT old,new FROM Mapping''') 172 | for message_row in cur_1 : 173 | old = fixsender(message_row[0]) 174 | new = fixsender(message_row[1]) 175 | mapping[old] = fixsender(new) 176 | 177 | cur_1.close() 178 | 179 | # Open the raw data retrieved from the network 180 | conn_2 = sqlite3.connect('content.sqlite') 181 | conn_2.text_factory = str 182 | cur_2 = conn_2.cursor() 183 | 184 | allsenders = list() 185 | cur_2.execute('''SELECT email FROM Messages''') 186 | for message_row in cur_2 : 187 | sender = fixsender(message_row[0]) 188 | if sender is None : continue 189 | if 'gmane.org' in sender : continue 190 | if sender in allsenders: continue 191 | allsenders.append(sender) 192 | 193 | print "Loaded allsenders",len(allsenders),"and mapping",len(mapping),"dns mapping",len(dnsmapping) 194 | 195 | cur_2.execute('''SELECT headers, body, sent_at 196 | FROM Messages ORDER BY sent_at''') 197 | 198 | senders = dict() 199 | subjects = dict() 200 | guids = dict() 201 | 202 | count = 0 203 | 204 | for message_row in cur_2 : 205 | hdr = message_row[0] 206 | parsed = parseheader(hdr, allsenders) 207 | if parsed is None: continue 208 | (guid, sender, subject, sent_at) = parsed 209 | 210 | # Apply the sender mapping 211 | sender = mapping.get(sender,sender) 212 | 213 | count = count + 1 214 | if count % 250 == 1 : print count,sent_at, sender 215 | # print guid, sender, subject, sent_at 216 | 217 | if 'gmane.org' in sender: 218 | print "Error in sender ===", sender 219 | 220 | sender_id = 
senders.get(sender,None) 221 | subject_id = subjects.get(subject,None) 222 | guid_id = guids.get(guid,None) 223 | 224 | if sender_id is None : 225 | cur.execute('INSERT OR IGNORE INTO Senders (sender) VALUES ( ? )', ( sender, ) ) 226 | conn.commit() 227 | cur.execute('SELECT id FROM Senders WHERE sender=? LIMIT 1', ( sender, )) 228 | try: 229 | row = cur.fetchone() 230 | sender_id = row[0] 231 | senders[sender] = sender_id 232 | except: 233 | print 'Could not retrieve sender id',sender 234 | break 235 | if subject_id is None : 236 | cur.execute('INSERT OR IGNORE INTO Subjects (subject) VALUES ( ? )', ( subject, ) ) 237 | conn.commit() 238 | cur.execute('SELECT id FROM Subjects WHERE subject=? LIMIT 1', ( subject, )) 239 | try: 240 | row = cur.fetchone() 241 | subject_id = row[0] 242 | subjects[subject] = subject_id 243 | except: 244 | print 'Could not retrieve subject id',subject 245 | break 246 | # print sender_id, subject_id 247 | cur.execute('INSERT OR IGNORE INTO Messages (guid,sender_id,subject_id,sent_at,headers,body) VALUES ( ?,?,?,datetime(?),?,? )', 248 | ( guid, sender_id, subject_id, sent_at, zlib.compress(message_row[0]), zlib.compress(message_row[1])) ) 249 | conn.commit() 250 | cur.execute('SELECT id FROM Messages WHERE guid=? 
LIMIT 1', ( guid, )) 251 | try: 252 | row = cur.fetchone() 253 | message_id = row[0] 254 | guids[guid] = message_id 255 | except: 256 | print 'Could not retrieve guid id',guid 257 | break 258 | 259 | # Close the connections 260 | cur.close() 261 | cur_2.close() 262 | 263 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gword.htm: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | 7 | 37 | -------------------------------------------------------------------------------- /5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gword.js: -------------------------------------------------------------------------------- 1 | gword = [{text: 'sakai', size: 100}, 2 | {text: 'with', size: 57}, 3 | {text: 'error', size: 51}, 4 | {text: 'password', size: 45}, 5 | {text: 'forgotten', size: 45}, 6 | {text: 'feature', size: 45}, 7 | {text: 'mysql', size: 38}, 8 | {text: 'apis', size: 31}, 9 | {text: 'section', size: 30}, 10 | {text: 'problem', size: 30}, 11 | {text: 'site', size: 30}, 12 | {text: 'collab', size: 30}, 13 | {text: 'webdav', size: 28}, 14 | {text: 'memory', size: 28}, 15 | {text: 'taxonomy', size: 27}, 16 | {text: 'worksite', size: 27}, 17 | {text: 'nosuchbeandefinitionexception', size: 25}, 18 | {text: 'resources', size: 25}, 19 | {text: 'sectionmanager', size: 25}, 20 | {text: 'creating', size: 25}, 21 | {text: 'maven', size: 25}, 22 | {text: 'austin', size: 25}, 23 | {text: 'tool', size: 25}, 24 | {text: 'manager', size: 24}, 25 | {text: 'provider', size: 24}, 26 | {text: 'regarding', size: 24}, 27 | {text: 'level', size: 24}, 28 | {text: 'schedule', size: 24}, 29 | {text: 'question', size: 24}, 30 | {text: 'sakaiportallogin', size: 24}, 31 | {text: 'related', size: 24}, 32 | {text: 
'high', size: 24}, 33 | {text: 'other', size: 24}, 34 | {text: 'presense', size: 24}, 35 | {text: 'displayed', size: 22}, 36 | {text: 'cannot', size: 22}, 37 | {text: 'document', size: 22}, 38 | {text: 'page', size: 22}, 39 | {text: 'examples', size: 22}, 40 | {text: 'tools', size: 22}, 41 | {text: 'internet', size: 22}, 42 | {text: 'email', size: 22}, 43 | {text: 'accessing', size: 22}, 44 | {text: 'lmsvle', size: 22}, 45 | {text: 'recordings', size: 22}, 46 | {text: 'address', size: 22}, 47 | {text: 'configuration', size: 22}, 48 | {text: 'presentations', size: 22}, 49 | {text: 'samigo', size: 22}, 50 | {text: 'rantscomments', size: 22}, 51 | {text: 'username', size: 22}, 52 | {text: 'http', size: 22}, 53 | {text: 'problems', size: 22}, 54 | {text: 'oracle', size: 22}, 55 | {text: 'audio', size: 22}, 56 | {text: 'planning', size: 21}, 57 | {text: 'converting', size: 21}, 58 | {text: 'tables', size: 21}, 59 | {text: 'breakage', size: 21}, 60 | {text: 'stovepipe', size: 21}, 61 | {text: 'picker', size: 21}, 62 | {text: 'denied', size: 21}, 63 | {text: 'nonlegacy', size: 21}, 64 | {text: 'update', size: 21}, 65 | {text: 'news', size: 21}, 66 | {text: 'urls', size: 21}, 67 | {text: 'wiki', size: 21}, 68 | {text: 'firefox', size: 21}, 69 | {text: 'conference', size: 21}, 70 | {text: 'from', size: 21}, 71 | {text: 'anyone', size: 21}, 72 | {text: 'translation', size: 21}, 73 | {text: 'future', size: 21}, 74 | {text: 'file', size: 21}, 75 | {text: 'conversion', size: 21}, 76 | {text: 'permission', size: 21}, 77 | {text: 'developers', size: 21}, 78 | {text: 'explorer', size: 21}, 79 | {text: 'myfaces', size: 21}, 80 | {text: 'jira', size: 20}, 81 | {text: 'code', size: 20}, 82 | {text: 'courserosteruser', size: 20}, 83 | {text: 'entity', size: 20}, 84 | {text: 'group', size: 20}, 85 | {text: 'clarification', size: 20}, 86 | {text: 'ldap', size: 20}, 87 | {text: 'song', size: 20}, 88 | {text: 'dynamic', size: 20}, 89 | {text: 'break', size: 20}, 90 | {text: 'report', 
size: 20},
  {text: 'renamed', size: 20},
  {text: 'release', size: 20},
  {text: 'simplified', size: 20},
  {text: 'direct', size: 20},
  {text: 'library', size: 20},
  {text: 'zero', size: 20},
  {text: 'export', size: 20},
  {text: 'logo', size: 20},
  {text: 'preferences', size: 20},
  {text: 'import', size: 20}
];
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gword.py:
--------------------------------------------------------------------------------
import sqlite3
import time
import urllib
import zlib
import string

conn = sqlite3.connect('index.sqlite')
conn.text_factory = str
cur = conn.cursor()

cur.execute('''SELECT subject_id,subject FROM Messages
    JOIN Subjects ON Messages.subject_id = Subjects.id''')

counts = dict()
for message_row in cur :
    text = message_row[1]
    text = text.translate(None, string.punctuation)
    text = text.translate(None, '1234567890')
    text = text.strip()
    text = text.lower()
    words = text.split()
    for word in words:
        if len(word) < 4 : continue
        counts[word] = counts.get(word,0) + 1

# Find the top 100 words
words = sorted(counts, key=counts.get, reverse=True)
highest = None
lowest = None
for w in words[:100]:
    if highest is None or highest < counts[w] :
        highest = counts[w]
    if lowest is None or lowest > counts[w] :
        lowest = counts[w]
print 'Range of counts:',highest,lowest

# Spread the font sizes across 20-100 based on the count
bigsize = 80
smallsize = 20

fhand = open('gword.js','w')
fhand.write("gword = [")
first = True
for k in words[:100]:
    if not first : fhand.write( ",\n")
    first = False
    size = counts[k]
    size = (size - lowest) / float(highest - lowest)
    size = int((size * bigsize) + smallsize)
    fhand.write("{text: '"+k+"', size: "+str(size)+"}")
fhand.write( "\n];\n")

print "Output written to gword.js"
print "Open gword.htm in a browser to view"
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/gyear.py:
--------------------------------------------------------------------------------
import sqlite3
import time
import urllib
import zlib

conn = sqlite3.connect('index.sqlite')
conn.text_factory = str
cur = conn.cursor()

# Determine the top ten organizations
cur.execute('''SELECT Messages.id, sender FROM Messages
    JOIN Senders ON Messages.sender_id = Senders.id''')

sendorgs = dict()
for message_row in cur :
    sender = message_row[1]
    pieces = sender.split("@")
    if len(pieces) != 2 : continue
    dns = pieces[1]
    sendorgs[dns] = sendorgs.get(dns,0) + 1

# pick the top schools
orgs = sorted(sendorgs, key=sendorgs.get, reverse=True)
orgs = orgs[:10]
print "Top 10 Organizations"
print orgs
# orgs = ['total'] + orgs

# Read through the messages
counts = dict()
years = list()

cur.execute('''SELECT Messages.id, sender, sent_at FROM Messages
    JOIN Senders ON Messages.sender_id = Senders.id''')

for message_row in cur :
    sender = message_row[1]
    pieces = sender.split("@")
    if len(pieces) != 2 : continue
    dns = pieces[1]
    if dns not in orgs : continue
    year = message_row[2][:4]
    if year not in years : years.append(year)
    key = (year, dns)
    counts[key] = counts.get(key,0) + 1
    tkey = (year, 'total')
    counts[tkey] = counts.get(tkey,0) + 1

years.sort()
print counts
print years

fhand = open('gline.js','w')
fhand.write("gline = [ ['Year'")
for org in orgs:
    fhand.write(",'"+org+"'")
fhand.write("]")

# for year in years[1:-1]:
for year in years:
    fhand.write(",\n['"+year+"'")
    for org in orgs:
        key = (year, org)
        val = counts.get(key,0)
        fhand.write(","+str(val))
    fhand.write("]");

fhand.write("\n];\n")

print "Data written to gline.js"
print "Open gline.htm in a browser to view"
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/index.sqlite:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/index.sqlite
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/mapping.sqlite:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 2 Spidering and Modeling Email Data/mapping.sqlite
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/.DS_Store
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/.DS_Store:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/.DS_Store
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (1).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (1).png
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (2).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (2).png
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (4).png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve (4).png
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Kalpesh14m/Python-For-Everybody-Answers/4cd08bcbca30fe3d54c7a6e957243d2e47ab76d3/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Assignment/Assignment 3/outputImages/blob_serve.png
--------------------------------------------------------------------------------
/5-Capstone-Retrieving-Processing-And-Visualizing-Data-with-Python/Quiz/Week 1 Using Encoded Data in Python 3.txt:
--------------------------------------------------------------------------------
1. What is the most common Unicode encoding when moving data between systems?
==> UTF-8

2. What is the ASCII character that is associated with the decimal value 42?
==> *

3. What word does the following sequence of numbers represent in ASCII:
108, 105, 115, 116
==> list

4. How are strings stored internally in Python 3?
==> Unicode

5. When reading data across the network (i.e. from a URL) in Python 3, what string method must be used to convert it to the internal format used by strings?
==> decode()
--------------------------------------------------------------------------------
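The quiz answers above can be checked directly in a Python 3 interpreter; a minimal sketch (the variable names and the sample `b'Hello world'` bytes are illustrative, not from the course):

```python
# Q2: the ASCII character for decimal value 42 is '*'
assert chr(42) == '*'

# Q3: the sequence 108, 105, 115, 116 spells "list"
word = ''.join(chr(n) for n in [108, 105, 115, 116])
assert word == 'list'

# Q4/Q5: data read from the network arrives as bytes; decode() converts it
# to Python 3's internal Unicode str, and UTF-8 (Q1) is the usual encoding.
data = b'Hello world'          # stand-in for bytes read from a socket or URL
text = data.decode('utf-8')
assert isinstance(text, str)

print(word)  # → list
```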