├── Chapter 11 - Regular Expressions
    ├── chapter11.py
    ├── regex_sum_201873.txt
    └── regex_sum_42.txt
├── Chapter 12 - Networks and Sockets
    ├── chapter12.py
    └── examples.py
├── Chapter 12 - Programs that Surf the Web
    ├── BeautifulSoup.py
    ├── chapter12_2.py
    └── chapter12_3.py
├── Chapter 13 - JSON and the REST Architecture
    ├── geo_assignment.py
    └── json_assignment.py
├── Chapter 13 - Web Services and XML
    └── xml_assignment.py
└── README.md

/Chapter 11 - Regular Expressions/chapter11.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python

# Chapter 11 Assignment: Extracting Data With Regular Expressions

# Finding Numbers in a Haystack

# In this assignment you will read through and parse a file with text and
# numbers. You will extract all the numbers in the file and compute the sum of
# the numbers.

# Data Files
# We provide two files for this assignment. One is a sample file where we give
# you the sum for your testing and the other is the actual data you need to
# process for the assignment.

# Sample data: http://python-data.dr-chuck.net/regex_sum_42.txt
# (There are 87 values with a sum=445822)
# Actual data: http://python-data.dr-chuck.net/regex_sum_201873.txt
# (There are 96 values and the sum ends with 156)

# These links open in a new window. Make sure to save the file into the same
# folder as you will be writing your Python program. Note: Each student will
# have a distinct data file for the assignment - so only use your own data file
# for analysis.

# Data Format
# The file contains much of the text from the introduction of the textbook
# except that random numbers are inserted throughout the text. Here is a sample
# of the output you might see:

'''
Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity. You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem. This book assumes that
everyone needs to know how to program ...
'''

# The sum for the sample text above is 27486. The numbers can appear anywhere
# in the line. There can be any number of numbers in each line (including none).

# Handling The Data
# The basic outline of this problem is to read the file, look for integers
# using re.findall() with the regular expression '[0-9]+', then convert the
# extracted strings to integers and sum them.

import re

# Running total of every integer found in the file.
# (Named `total` so the built-in sum() is not shadowed.)
total = 0

# The with-block closes the file automatically once the loop finishes.
with open('regex_sum_201873.txt', 'r') as handle:
    for line in handle:
        # findall returns each run of digits on the line as a string.
        numbers = re.findall('[0-9]+', line)
        for number in numbers:
            total = total + int(number)

# print(...) with a single argument works in both Python 2 and Python 3.
print(total)

--------------------------------------------------------------------------------
/Chapter 11 - Regular Expressions/regex_sum_201873.txt:
--------------------------------------------------------------------------------
This file contains the actual data for your assignment - good luck!


Why should you learn to write programs?

Writing programs (or programming) is a very creative
and rewarding activity. You can write programs for
many reasons, ranging from making your living to solving
a difficult data analysis problem to having fun to helping
someone else solve a problem. This book assumes that
everyone needs to know how to program, and that once
you know how to program you will figure out what you want
to 6812 do 8128 with 5282 your newfound skills.
7266
4340 We are surrounded in our daily lives with computers ranging
from laptops to cell phones. We can think of these computers
as our personal assistants who can take care of many things
on our behalf. The hardware in our current-day computers
is essentially built to continuously ask us the question,
What would you like me to do next?

1013 Programmers add an operating system and a set of applications
to the hardware and we end up with a Personal Digital
Assistant that is quite helpful and capable of helping
6818 us do many different things. 905

Our computers are fast and have vast amounts of memory and
could be very helpful to us if we only knew the language to
speak to explain to the computer what we would like it to
do next. If we knew this language, we could tell the
computer to do tasks on our behalf that were repetitive.
9504 Interestingly, the kinds of things computers can do best
are often the kinds of things that we humans find boring
and mind-numbing.

For example, look at the first three paragraphs of this
chapter and tell me the most commonly used word and how
many 4425 times 3448 the 79 word is used. While you were able to read
and understand the words in a few seconds, counting them
is almost painful because it is not the kind of problem
that human minds are designed to solve. For a computer
the opposite is true, reading and understanding text
112 from a piece of paper is hard for a computer to do
but counting the words and telling you how many times
the most used word was used is very easy for the
computer:

Our personal information analysis assistant quickly
told 8980 us 1151 that 6621 the word to was used sixteen times in the
first three paragraphs of this chapter.

This 6885 very 9707 fact 4618 that computers are good at things
that humans are not is why you need to become
skilled at talking computer language. Once you learn
this new language, you can delegate mundane tasks
to your partner (the computer), leaving more time
for you to do the
things that you are uniquely suited for. You bring
4130 creativity, intuition, and inventiveness to this
partnership.

Creativity and motivation

While this book is not intended for professional programmers, professional
programming can be a very rewarding job both financially and personally.
Building useful, elegant, and clever programs for others to use is a very
creative 1483 activity. 4097 2097 Your computer or Personal Digital Assistant (PDA)
usually contains many different programs from many different groups of
programmers, each competing for your attention and interest. They try
their best to meet your needs and give you a great user experience in the
process. In some situations, when you choose a piece of software, the
programmers 6518 are 6625 directly 2923 compensated because of your choice.

If we think of programs as the creative output of groups of programmers,
perhaps the following figure is a more sensible version of our PDA:

For now, our primary motivation is not to make money or please end users, but
instead for us to be more productive in handling the data and
information 9001 that 9633 we 3868 will encounter in our lives.
When you first start, you will be both the programmer and the end user of
your programs. As you gain skill as a programmer and
programming feels more creative to you, your thoughts may turn
toward developing programs for others.

Computer hardware architecture

3818 Before we start learning the language we
speak to give instructions to computers to
develop software, we need to learn a small amount about
4005 how computers are built.

Central Processing Unit (or CPU) is
the part of the computer that is built to be obsessed
with what is next? If your computer is rated
at three Gigahertz, it means that the CPU will ask What next?
three billion times per second. You are going to have to
learn how to talk fast to keep up with the CPU.

2138 Main Memory is used to store information 1235
that the CPU needs in a hurry. The main memory is nearly as
fast as the CPU. But the information stored in the main
memory vanishes when the computer is turned off.

356 Secondary Memory is also used to store
information, but it is much slower than the main memory.
The 4849 advantage 4991 of 4945 the secondary memory is that it can
store information even when there is no power to the
computer. Examples of secondary memory are disk drives
or flash memory (typically found in USB sticks and portable
music players).

Input and Output Devices are simply our
screen, keyboard, mouse, microphone, speaker, touchpad, etc.
They are all of the ways we interact with the computer.

These days, most computers also have a
Network Connection to retrieve information over a network.
3453 We can think of the network as a very slow place to store and
retrieve data that might not always be up. So in a sense,
the 216 network 6976 is 306 a slower and at times unreliable form of
2537 Secondary Memory. 2460

While most of the detail of how these components work is best left
to computer builders, it helps to have some terminology
so we can talk about these different parts as we write our programs.

7470 As a programmer, your job is to use and orchestrate 2286
each of these resources to solve the problem that you need to solve
and analyze the data you get from the solution. As a programmer you will
mostly be talking to the CPU and telling it what to
do next. Sometimes you will tell the CPU to use the main memory,
secondary memory, network, or the input/output devices.

You need to be the person who answers the CPU's What next?
question. But it would be very uncomfortable to shrink you
9055 down to five mm tall and insert you into the computer just so you
could issue a command three billion times per second. So instead,
you must write down your instructions in advance.
We call these stored instructions a program and the act
of writing these instructions down and getting the instructions to
be correct programming.

Understanding programming

In the rest of this book, we will try to turn you into a person
who is skilled in the art of programming. In the end you will be a
programmer --- perhaps not a professional programmer, but
at least you will have the skills to look at a data/information
analysis problem and develop a program to solve the problem.

7578 problem solving 2736

In a sense, you need two skills to be a programmer:

First, you need to know the programming language (Python) -
you need to know the vocabulary and the grammar. You need to be able
1984 to spell the words in this new language properly and know how to construct 391
well-formed 7950 sentences 7775 in 646 this new language.

Second, you need to tell a story. In writing a story,
you combine words and sentences to convey an idea to the reader.
There is a skill and art in constructing the story, and skill in
story writing is improved by doing some writing and getting some
feedback. In programming, our program is the story and the
problem you are trying to solve is the idea.

itemize

Once you learn one programming language such as Python, you will
find it much easier to learn a second programming language such
as JavaScript or C++. The new programming language has very different
vocabulary and grammar but the problem-solving skills
will be the same across all programming languages.

You will learn the vocabulary and sentences of Python pretty quickly.
It will take longer for you to be able to write a coherent program
to solve a brand-new problem. We teach programming much like we teach
233 writing. We start reading and explaining programs, then we write 2072
simple programs, and then we write increasingly complex programs over time.
At some point you get your muse and see the patterns on your own
and can see more naturally how to take a problem and
write a program that solves that problem. And once you get
7356 to that point, programming becomes a very pleasant and creative process. 5137

9004 We start with the vocabulary and structure of Python programs. Be patient 1422
as the simple examples remind you of when you started reading for the first
time.

Words and sentences

Unlike human languages, the Python vocabulary is actually pretty small.
We call this vocabulary the reserved words. These are words that
have very special meaning to Python. When Python sees these words in
a Python program, they have one and only one meaning to Python. Later
as you write programs you will make up your own words that have meaning to
you called variables. You will have great latitude in choosing
your names for your variables, but you cannot use any of Python's
reserved words as a name for a variable.

When we train a dog, we use special words like
sit, stay, and fetch. When you talk to a dog and
don't use any of the reserved words, they just look at you with a
quizzical look on their face until you say a reserved word.
For example, if you say,
I wish more people would walk to improve their overall health,
what most dogs likely hear is,
blah 8934 blah 8389 blah 9878 walk blah blah blah blah.
That is because walk is a reserved word in dog language.

The reserved words in the language where humans talk to
Python 690 include 6621 the 8577 following:

and del from not while
as 3753 5414 955 elif global or with
assert else if pass yield
break except import print
class 6535 6010 7650 exec in raise
continue finally is return
def for lambda try

That is it, and unlike a dog, Python is already completely trained.
When you say try, Python will try every time you say it without
fail.

We will learn these reserved words and how they are used in good time,
but for now we will focus on the Python equivalent of speak (in
human-to-dog language). The nice thing about telling Python to speak
is that we can even tell it what to say by giving it a message in quotes:

And we have even written our first syntactically correct Python sentence.
Our sentence starts with the reserved word print followed
by a string of text of our choosing enclosed in single quotes.

Conversing with Python

Now that we have a word and a simple sentence that we know in Python,
we need to know how to start a conversation with Python to test
our new language skills.

Before you can converse with Python, you must first install the Python
software on your computer and learn how to start Python on your
computer. That is too much detail for this chapter so I suggest
that you consult www.pythonlearn.com where I have detailed
instructions and screencasts of setting up and starting Python
on Macintosh and Windows systems. At some point, you will be in
a terminal or command window and you will type python and
the Python interpreter will start executing in interactive mode
and appear somewhat as follows:
interactive mode

3910 The >>> prompt is the Python interpreter's way of asking you, What
do you want me to do next? Python is ready to have a conversation with
you. All you have to know is how to speak the Python language.

Let's say for example that you did not know even the simplest Python language
words or sentences. You might want to use the standard line that astronauts
use when they land on a faraway planet and try to speak with the inhabitants
of the planet:

This is not going so well. Unless you think of something quickly,
the inhabitants of the planet are likely to stab you with their spears,
put you on a spit, roast you over a fire, and eat you for dinner.

At this point, you should also realize that while Python
is amazingly complex and powerful and very picky about
the syntax you use to communicate with it, Python is
not intelligent. You are really just having a conversation
with yourself, but using proper syntax.

In a sense, when you use a program written by someone else
the conversation is between you and those other
programmers with Python acting as an intermediary. Python
is a way for the creators of programs to express how the
conversation is supposed to proceed. And
in just a few more chapters, you will be one of those
programmers using Python to talk to the users of your program.

Before we leave our first conversation with the Python
interpreter, you should probably know the proper way
to say good-bye when interacting with the inhabitants
of Planet Python:

You will notice that the error is different for the first two
incorrect attempts. The second error is different because
if is a reserved word and Python saw the reserved word
and thought we were trying to say something but got the syntax
of the sentence wrong.

Terminology: interpreter and compiler
5292 1540
Python is a high-level language intended to be relatively
straightforward for humans to read and write and for computers
to read and process. Other high-level languages include Java, C++,
1565 PHP, Ruby, Basic, Perl, JavaScript, and many more. The actual hardware
inside the Central Processing Unit (CPU) does not understand any
of these high-level languages.

The CPU understands a language we call machine language. Machine
language is very simple and frankly very tiresome to write because it
is represented all in zeros and ones.

Machine language seems quite simple on the surface, given that there
are 1668 only 9393 zeros 3135 and ones, but its syntax is even more complex
and far more intricate than Python. So very few programmers ever write
machine language. Instead we build various translators to allow
programmers to write in high-level languages like Python or JavaScript
and these translators convert the programs to machine language for actual
execution by the CPU.

Since machine language is tied to the computer hardware, machine language
is not portable across different types of hardware. Programs written in
high-level languages can be moved between different computers by using a
different 2884 interpreter 7212 on 8076 the new machine or recompiling the code to create
a machine language version of the program for the new machine.
1399
These programming language translators fall into two general categories:
(one) interpreters and (two) compilers.

An interpreter reads the source code of the program as written by the
programmer, parses the source code, and interprets the instructions on the fly.
Python is an interpreter and when we are running Python interactively,
we can type a line of Python (a sentence) and Python processes it immediately
and is ready for us to type another line of Python.

Some of the lines of Python tell Python that you want it to remember some
value for later. We need to pick a name for that value to be remembered and
we can use that symbolic name to retrieve the value later. We use the
term variable to refer to the labels we use to refer to this stored data.

In this example, we ask Python to remember the value six and use the label x
so we can retrieve the value later. We verify that Python has actually remembered
the value using x and multiply
it by seven and put the newly computed value in y. Then we ask Python to print out
4401 the value currently in y. 9505

Even though we are typing these commands into Python one line at a time, Python
is treating them as an ordered sequence of statements with later statements able
to retrieve data created in earlier statements. We are writing our first
simple paragraph with four sentences in a logical and meaningful order.

It is the nature of an interpreter to be able to have an interactive conversation
as shown above. A compiler needs to be handed the entire program in a file, and then
it runs a process to translate the high-level source code into machine language
and then the compiler puts the resulting machine language into a file for later
execution.
119
If you have a Windows system, often these executable machine language programs have a
suffix of .exe or .dll which stand for executable and dynamic link
library respectively. In Linux and Macintosh, there is no suffix that uniquely marks
a file as executable.

1505 If you were to open an executable file in a text editor, it would look
completely crazy and be unreadable:

It is not easy to read or write machine language, so it is nice that we have
compilers that allow us to write in high-level
languages like Python or C.

2832 Now at this point in our discussion of compilers and interpreters, you should
be wondering a bit about the Python interpreter itself. What language is
it written in? Is it written in a compiled language? When we type
9758 python, what exactly is happening? 5097

The Python interpreter is written in a high-level language called C.
You can look at the actual source code for the Python interpreter by
going to www.python.org and working your way to their source code.
So Python is a program itself and it is compiled into machine code.
When you installed Python on your computer (or the vendor installed it),
you copied a machine-code copy of the translated Python program onto your
system. In Windows, the executable machine code for Python itself is likely
in a file.

That is more than you really need to know to be a Python programmer, but
1052 sometimes it pays to answer those little nagging questions right at
the beginning.

Writing a program

Typing commands into the Python interpreter is a great way to experiment
with Python's features, but it is not recommended for solving more complex problems.

When we want to write a program,
we 7972 use 7973 a 3439 text editor to write the Python instructions into a file,
7361 which is called a script. By 6771
convention, Python scripts have names that end with .py.

script

To execute the script, you have to tell the Python interpreter
the name of the file. In a Unix or Windows command window,
you would type python hello.py as follows:

We call the Python interpreter and tell it to read its source code from
the file hello.py instead of prompting us for lines of Python code
interactively.

You will notice that there was no need to have quit() at the end of
the Python program in the file. When Python is reading your source code
from a file, it knows to stop when it reaches the end of the file.

What is a program?

The definition of a program at its most basic is a sequence
of Python statements that have been crafted to do something.
Even our simple hello.py script is a program. It is a one-line
program and is not particularly useful, but in the strictest definition,
it is a Python program.

It might be easiest to understand what a program is by thinking about a problem
that a program might be built to solve, and then looking at a program
that would solve that problem.

Lets say you are doing Social Computing research on Facebook posts and
you are interested in the most frequently used word in a series of posts.
You could print out the stream of Facebook posts and pore over the text
looking for the most common word, but that would take a long time and be very
mistake prone. You would be smart to write a Python program to handle the
task quickly and accurately so you can spend the weekend doing something
fun.

For example, look at the following text about a clown and a car. Look at the
text and figure out the most common word and how many times it occurs.

Then imagine that you are doing this task looking at millions of lines of
text. Frankly it would be quicker for you to learn Python and write a
Python program to count the words than it would be to manually
scan the words.

The even better news is that I already came up with a simple program to
find the most common word in a text file. I wrote it,
tested it, and now I am giving it to you to use so you can save some time.

You don't even need to know Python to use this program. You will need to get through
Chapter ten of this book to fully understand the awesome Python techniques that were
used to make the program. You are the end user, you simply use the program and marvel
at its cleverness and how it saved you so much manual effort.
You simply type the code
into a file called words.py and run it or you download the source
code from http://www.pythonlearn.com/code/ and run it.

This is a good example of how Python and the Python language are acting as an intermediary
between you (the end user) and me (the programmer). Python is a way for us to exchange useful
instruction sequences (i.e., programs) in a common language that can be used by anyone who
installs Python on their computer. So neither of us are talking to Python,
instead we are communicating with each other through Python.

The building blocks of programs

In the next few chapters, we will learn more about the vocabulary, sentence structure,
paragraph structure, and story structure of Python. We will learn about the powerful
capabilities of Python and how to compose those capabilities together to create useful
programs.

There are some low-level conceptual patterns that we use to construct programs. These
constructs are not just for Python programs, they are part of every programming language
from machine language up to the high-level languages.

description

Get data from the outside world. This might be
reading data from a file, or even some kind of sensor like
a microphone or GPS. In our initial programs, our input will come from the user
typing data on the keyboard.

Display the results of the program on a screen
or store them in a file or perhaps write them to a device like a
speaker to play music or speak text.

Perform statements one after
another in the order they are encountered in the script.

Check for certain conditions and
then execute or skip a sequence of statements.

Perform some set of statements
repeatedly, usually with
some variation.

Write a set of instructions once and give them a name
and then reuse those instructions as needed throughout your program.

description

It sounds almost too simple to be true, and of course it is never
so simple. It is like saying that walking is simply
putting one foot in front of the other. The art
of writing a program is composing and weaving these
basic elements together many times over to produce something
that is useful to its users.

The word counting program above directly uses all of
these patterns except for one.

What could possibly go wrong?

As we saw in our earliest conversations with Python, we must
communicate very precisely when we write Python code. The smallest
deviation or mistake will cause Python to give up looking at your
program.

Beginning programmers often take the fact that Python leaves no
room for errors as evidence that Python is mean, hateful, and cruel.
While Python seems to like everyone else, Python knows them
personally and holds a grudge against them. Because of this grudge,
Python takes our perfectly written programs and rejects them as
unfit just to torment us.

There is little to be gained by arguing with Python. It is just a tool.
It has no emotions and it is happy and ready to serve you whenever you
need it. Its error messages sound harsh, but they are just Python's
call for help. It has looked at what you typed, and it simply cannot
understand what you have entered.

Python is much more like a dog, loving you unconditionally, having a few
key words that it understands, looking you with a sweet look on its
face (>>>), and waiting for you to say something it understands.
When Python says SyntaxError: invalid syntax, it is simply wagging
its tail and saying, You seemed to say something but I just don't
understand what you meant, but please keep talking to me (>>>).

As your programs become increasingly sophisticated, you will encounter three
general types of errors:

description

These are the first errors you will make and the easiest
to fix. A syntax error means that you have violated the grammar rules of Python.
Python does its best to point right at the line and character where
it noticed it was confused. The only tricky bit of syntax errors is that sometimes
the mistake that needs fixing is actually earlier in the program than where Python
noticed it was confused. So the line and character that Python indicates in
a syntax error may just be a starting point for your investigation.

A logic error is when your program has good syntax but there is a mistake
in the order of the statements or perhaps a mistake in how the statements relate to one another.
A good example of a logic error might be, take a drink from your water bottle, put it
in your backpack, walk to the library, and then put the top back on the bottle.

A semantic error is when your description of the steps to take
is syntactically perfect and in the right order, but there is simply a mistake in
the program. The program is perfectly correct but it does not do what
you intended for it to do. A simple example would
be if you were giving a person directions to a restaurant and said, ...when you reach
the intersection with the gas station, turn left and go one mile and the restaurant
is a red building on your left. Your friend is very late and calls you to tell you that
they are on a farm and walking around behind a barn, with no sign of a restaurant.
Then you say did you turn left or right at the gas station? and
they say, I followed your directions perfectly, I have
them written down, it says turn left and go one mile at the gas station. Then you say,
I am very sorry, because while my instructions were syntactically correct, they
sadly contained a small but undetected semantic error..

description

Again in all three types of errors, Python is merely trying its hardest to
do exactly what you have asked.

The learning journey

As you progress through the rest of the book, don't be afraid if the concepts
don't seem to fit together well the first time. When you were learning to speak,
it was not a problem for your first few years that you just made cute gurgling noises.
563 | And it was OK if it took six months for you to move from simple vocabulary to 564 | simple sentences and took five or six more years to move from sentences to paragraphs, and a 565 | few more years to be able to write an interesting complete short story on your own. 566 | 567 | We want you to learn Python much more rapidly, so we teach it all at the same time 568 | over the next few chapters. 569 | But it is like learning a new language that takes time to absorb and understand 570 | before it feels natural. 571 | That leads to some confusion as we visit and revisit 572 | topics to try to get you to see the big picture while we are defining the tiny 573 | fragments that make up that big picture. While the book is written linearly, and 574 | if you are taking a course it will progress in a linear fashion, don't hesitate 575 | to be very nonlinear in how you approach the material. Look forwards and backwards 576 | and read with a light touch. By skimming more advanced material without 577 | fully understanding the details, you can get a better understanding of the why? 578 | of programming. By reviewing previous material and even redoing earlier 579 | exercises, you will realize that you actually learned a lot of material even 580 | if the material you are currently staring at seems a bit impenetrable. 581 | 582 | Usually when you are learning your first programming language, there are a few 583 | wonderful Ah Hah! moments where you can look up from pounding away at some rock 584 | with a hammer and chisel and step away and see that you are indeed building 585 | a beautiful sculpture. 586 | 587 | If something seems particularly hard, there is usually no value in staying up all 588 | night and staring at it. Take a break, take a nap, have a snack, explain what you 589 | are having a problem with to someone (or perhaps your dog), and then come back to it with 590 | fresh eyes. 
I assure you that once you learn the programming concepts in the book 591 | you will look back and see that it was all really easy and elegant and it simply 592 | took you a bit of time to absorb it. 593 | 42 594 | The end 595 | -------------------------------------------------------------------------------- /Chapter 11 - Regular Expressions/regex_sum_42.txt: -------------------------------------------------------------------------------- 1 | This file contains the sample data 2 | 3 | 4 | Why should you learn to write programs? 5 | 6 | Writing programs (or programming) is a very creative 7 | and rewarding activity. You can write programs for 8 | 3036 many reasons, ranging from making your living to solving 7209 9 | a difficult data analysis problem to having fun to helping 10 | someone else solve a problem. This book assumes that 11 | everyone needs to know how to program, and that once 12 | you know how to program you will figure out what you want 13 | to do with your newfound skills. 14 | 15 | We are surrounded in our daily lives with computers ranging 16 | from laptops to cell phones. We can think of these computers 17 | as 4497 our 6702 personal 8454 assistants who can take care of many things 18 | 7449 on our behalf. The hardware in our current-day computers 19 | is essentially built to continuously ask us the question, 20 | What would you like me to do next? 21 | 22 | Programmers add an operating system and a set of applications 23 | to the hardware and we end up with a Personal Digital 24 | Assistant that is quite helpful and capable of helping 25 | us do many different things. 26 | 27 | Our computers are fast and have vast amounts of memory and 28 | could be very helpful to us if we only knew the language to 29 | speak to explain to the computer what we would like it to 30 | do 3665 next. 7936 9772 If we knew this language, we could tell the 31 | computer to do tasks on our behalf that were repetitive. 
32 | Interestingly, the kinds of things computers can do best 33 | are often the kinds of things that we humans find boring 34 | and mind-numbing. 35 | 36 | For example, look at the first three paragraphs of this 37 | chapter and tell me the most commonly used word and how 38 | many times the word is used. While you were able to read 39 | and understand the words in a few seconds, counting them 40 | is almost painful because it is not the kind of problem 41 | that human minds are designed to solve. For a computer 42 | the opposite is true, reading and understanding text 43 | from a piece of paper is hard for a computer to do 44 | but counting the words and telling you how many times 45 | the most used word was used is very easy for the 46 | computer: 47 | 7114 48 | Our personal information analysis assistant quickly 49 | told us that the word to was used sixteen times in the 50 | first three paragraphs of this chapter. 51 | 52 | This very fact that computers are good at things 53 | that humans are not is why you need to become 54 | skilled at talking computer language. Once you learn 55 | 956 this new language, you can delegate mundane tasks 2564 56 | to 8003 your 1704 partner 3816 (the computer), leaving more time 57 | for you to do the 58 | things that you are uniquely suited for. You bring 59 | creativity, intuition, and inventiveness to this 60 | partnership. 61 | 62 | Creativity and motivation 63 | 64 | While this book is not intended for professional programmers, professional 65 | programming can be a very rewarding job both financially and personally. 66 | Building useful, elegant, and clever programs for others to use is a very 67 | creative 6662 activity. 5858 7777 Your computer or Personal Digital Assistant (PDA) 68 | usually contains many different programs from many different groups of 69 | programmers, each competing for your attention and interest. They try 70 | their best to meet your needs and give you a great user experience in the 71 | process. 
In some situations, when you choose a piece of software, the 72 | programmers are directly compensated because of your choice. 73 | 74 | If we think of programs as the creative output of groups of programmers, 75 | perhaps the following figure is a more sensible version of our PDA: 76 | 77 | For now, our primary motivation is not to make money or please end users, but 78 | instead for us to be more productive in handling the data and 79 | information that we will encounter in our lives. 80 | When you first start, you will be both the programmer and the end user of 81 | your programs. As you gain skill as a programmer and 82 | programming feels more creative to you, your thoughts may turn 83 | toward developing programs for others. 84 | 85 | Computer hardware architecture 86 | 87 | Before we start learning the language we 88 | speak to give instructions to computers to 89 | develop software, we need to learn a small amount about 90 | how computers are built. 91 | 92 | Central Processing Unit (or CPU) is 93 | the part of the computer that is built to be obsessed 94 | with what is next? If your computer is rated 95 | at three Gigahertz, it means that the CPU will ask What next? 96 | three billion times per second. You are going to have to 97 | learn how to talk fast to keep up with the CPU. 98 | 99 | Main Memory is used to store information 100 | that the CPU needs in a hurry. The main memory is nearly as 101 | fast as the CPU. But the information stored in the main 102 | memory vanishes when the computer is turned off. 103 | 104 | Secondary Memory is also used to store 105 | 6482 information, but it is much slower than the main memory. 106 | The advantage of the secondary memory is that it can 107 | store information even when there is no power to the 108 | computer. Examples of secondary memory are disk drives 109 | or flash memory (typically found in USB sticks and portable 110 | music players). 
111 | 9634 112 | Input and Output Devices are simply our 113 | screen, keyboard, mouse, microphone, speaker, touchpad, etc. 114 | They are all of the ways we interact with the computer. 115 | 116 | These days, most computers also have a 117 | Network Connection to retrieve information over a network. 118 | We can think of the network as a very slow place to store and 119 | 8805 retrieve data that might not always be up. So in a sense, 7123 120 | the network is a slower and at times unreliable form of 121 | 9703 4676 6373 122 | 123 | While most of the detail of how these components work is best left 124 | to computer builders, it helps to have some terminology 125 | so we can talk about these different parts as we write our programs. 126 | 127 | As a programmer, your job is to use and orchestrate 128 | each of these resources to solve the problem that you need to solve 129 | and analyze the data you get from the solution. As a programmer you will 130 | mostly be talking to the CPU and telling it what to 131 | do next. Sometimes you will tell the CPU to use the main memory, 132 | secondary memory, network, or the input/output devices. 133 | 134 | You need to be the person who answers the CPU's What next? 135 | question. But it would be very uncomfortable to shrink you 136 | down to five mm tall and insert you into the computer just so you 137 | could issue a command three billion times per second. So instead, 138 | you must write down your instructions in advance. 139 | We call these stored instructions a program and the act 140 | of writing these instructions down and getting the instructions to 141 | be correct programming. 142 | 143 | Understanding programming 144 | 145 | In the rest of this book, we will try to turn you into a person 146 | who is skilled in the art of programming. 
In the end you will be a 147 | programmer --- perhaps not a professional programmer, but 148 | at least you will have the skills to look at a data/information 149 | analysis problem and develop a program to solve the problem. 150 | 2834 151 | 7221 problem solving 152 | 153 | 2981 In a sense, you need two skills to be a programmer: 154 | 155 | First, you need to know the programming language (Python) - 156 | 5415 you need to know the vocabulary and the grammar. You need to be able 157 | to spell the words in this new language properly and know how to construct 158 | well-formed sentences in this new language. 159 | 160 | Second, you need to tell a story. In writing a story, 161 | you combine words and sentences to convey an idea to the reader. 162 | There is a skill and art in constructing the story, and skill in 163 | story writing is improved by doing some writing and getting some 164 | feedback. In programming, our program is the story and the 165 | problem you are trying to solve is the idea. 166 | 167 | itemize 168 | 169 | Once you learn one programming language such as Python, you will 170 | find it much easier to learn a second programming language such 171 | as JavaScript or C++. The new programming language has very different 172 | vocabulary and grammar but the problem-solving skills 173 | will be the same across all programming languages. 174 | 175 | You will learn the vocabulary and sentences of Python pretty quickly. 176 | It will take longer for you to be able to write a coherent program 177 | to solve a brand-new problem. We teach programming much like we teach 178 | writing. We start reading and explaining programs, then we write 179 | simple programs, and then we write increasingly complex programs over time. 180 | At some point you get your muse and see the patterns on your own 181 | and can see more naturally how to take a problem and 182 | write a program that solves that problem. 
And once you get 183 | 6872 to that point, programming becomes a very pleasant and creative process. 184 | 185 | We start with the vocabulary and structure of Python programs. Be patient 186 | as the simple examples remind you of when you started reading for the first 187 | time. 188 | 189 | Words and sentences 190 | 4806 191 | Unlike human languages, the Python vocabulary is actually pretty small. 192 | We call this vocabulary the reserved words. These are words that 193 | 5460 have very special meaning to Python. When Python sees these words in 8533 194 | 3538 a Python program, they have one and only one meaning to Python. Later 195 | as you write programs you will make up your own words that have meaning to 196 | you called variables. You will have great latitude in choosing 197 | your 9663 names 8001 for 9795 your variables, but you cannot use any of Python's 198 | reserved 8752 words 1117 as 5349 a name for a variable. 199 | 200 | When we train a dog, we use special words like 201 | sit, stay, and fetch. When you talk to a dog and 202 | 4509 don't use any of the reserved words, they just look at you with a 203 | quizzical look on their face until you say a reserved word. 204 | For example, if you say, 205 | I wish more people would walk to improve their overall health, 206 | what most dogs likely hear is, 207 | blah blah blah walk blah blah blah blah. 208 | That is because walk is a reserved word in dog language. 209 | 210 | The reserved words in the language where humans talk to 211 | Python include the following: 212 | 213 | and del from not while 214 | as elif global or with 215 | assert else if pass yield 216 | break except import print 217 | class exec in raise 218 | continue finally is return 219 | def for lambda try 220 | 221 | That is it, and unlike a dog, Python is already completely trained. 222 | When 1004 you 9258 say 4183 try, Python will try every time you say it without 223 | fail. 
224 | 225 | 4034 We will learn these reserved words and how they are used in good time, 3342 226 | but for now we will focus on the Python equivalent of speak (in 227 | human-to-dog language). The nice thing about telling Python to speak 228 | 3482 is that we can even tell it what to say by giving it a message in quotes: 8567 229 | 230 | And we have even written our first syntactically correct Python sentence. 231 | Our sentence starts with the reserved word print followed 232 | by a string of text of our choosing enclosed in single quotes. 233 | 234 | Conversing with Python 235 | 236 | 1052 Now that we have a word and a simple sentence that we know in Python, 8135 237 | we need to know how to start a conversation with Python to test 238 | our new language skills. 239 | 240 | Before 5561 you 517 can 1218 converse with Python, you must first install the Python 241 | software on your computer and learn how to start Python on your 242 | computer. That is too much detail for this chapter so I suggest 243 | that you consult www.pythonlearn.com where I have detailed 244 | instructions and screencasts of setting up and starting Python 245 | on Macintosh and Windows systems. At some point, you will be in 246 | a terminal or command window and you will type python and 247 | 8877 the Python interpreter will start executing in interactive mode 248 | and appear somewhat as follows: 249 | interactive mode 250 | 251 | The >>> prompt is the Python interpreter's way of asking you, What 252 | do you want me to do next? Python is ready to have a conversation with 253 | you. All you have to know is how to speak the Python language. 254 | 255 | Let's say for example that you did not know even the simplest Python language 256 | words or sentences. You might want to use the standard line that astronauts 257 | use when they land on a faraway planet and try to speak with the inhabitants 258 | of the planet: 259 | 260 | This is not going so well. 
Unless you think of something quickly, 261 | the inhabitants of the planet are likely to stab you with their spears, 262 | put you on a spit, roast you over a fire, and eat you for dinner. 263 | 264 | At this point, you should also realize that while Python 265 | is amazingly complex and powerful and very picky about 266 | the syntax you use to communicate with it, Python is 267 | not intelligent. You are really just having a conversation 268 | with yourself, but using proper syntax. 269 | 8062 1720 270 | In a sense, when you use a program written by someone else 271 | the conversation is between you and those other 272 | programmers with Python acting as an intermediary. Python 273 | is a way for the creators of programs to express how the 274 | conversation is supposed to proceed. And 275 | in just a few more chapters, you will be one of those 276 | programmers using Python to talk to the users of your program. 277 | 278 | 279 Before we leave our first conversation with the Python 279 | interpreter, you should probably know the proper way 280 | to say good-bye when interacting with the inhabitants 281 | of Planet Python: 282 | 283 | 2054 You will notice that the error is different for the first two 801 284 | incorrect attempts. The second error is different because 285 | if is a reserved word and Python saw the reserved word 286 | and thought we were trying to say something but got the syntax 287 | of the sentence wrong. 288 | 289 | Terminology: interpreter and compiler 290 | 291 | Python is a high-level language intended to be relatively 292 | straightforward for humans to read and write and for computers 293 | to read and process. Other high-level languages include Java, C++, 294 | 918 PHP, Ruby, Basic, Perl, JavaScript, and many more. The actual hardware 295 | inside the Central Processing Unit (CPU) does not understand any 296 | of these high-level languages. 297 | 298 | The CPU understands a language we call machine language. 
Machine 299 | language is very simple and frankly very tiresome to write because it 300 | is represented all in zeros and ones. 301 | 302 | Machine language seems quite simple on the surface, given that there 303 | are only zeros and ones, but its syntax is even more complex 304 | 8687 and far more intricate than Python. So very few programmers ever write 305 | machine language. Instead we build various translators to allow 306 | programmers to write in high-level languages like Python or JavaScript 307 | and these translators convert the programs to machine language for actual 308 | execution by the CPU. 309 | 310 | Since machine language is tied to the computer hardware, machine language 311 | is not portable across different types of hardware. Programs written in 312 | high-level languages can be moved between different computers by using a 313 | different interpreter on the new machine or recompiling the code to create 314 | a machine language version of the program for the new machine. 315 | 316 | These programming language translators fall into two general categories: 317 | (one) interpreters and (two) compilers. 318 | 7073 1865 7084 319 | An interpreter reads the source code of the program as written by the 320 | programmer, parses the source code, and interprets the instructions on the fly. 321 | Python is an interpreter and when we are running Python interactively, 322 | we can type a line of Python (a sentence) and Python processes it immediately 323 | and is ready for us to type another line of Python. 324 | 2923 63 325 | Some of the lines of Python tell Python that you want it to remember some 326 | value for later. We need to pick a name for that value to be remembered and 327 | we can use that symbolic name to retrieve the value later. We use the 328 | term variable to refer to the labels we use to refer to this stored data. 
329 | 330 | In this example, we ask Python to remember the value six and use the label x 331 | so we can retrieve the value later. We verify that Python has actually remembered 332 | the value using x and multiply 333 | it by seven and put the newly computed value in y. Then we ask Python to print out 334 | 8824 the value currently in y. 335 | 1079 5801 5047 336 | Even though we are typing these commands into Python one line at a time, Python 337 | is treating them as an ordered sequence of statements with later statements able 338 | to retrieve data created in earlier statements. We are writing our first 339 | simple paragraph with four sentences in a logical and meaningful order. 340 | 5 341 | It is the nature of an interpreter to be able to have an interactive conversation 342 | as shown above. A compiler needs to be handed the entire program in a file, and then 343 | it runs a process to translate the high-level source code into machine language 344 | 2572 and then the compiler puts the resulting machine language into a file for later 345 | execution. 346 | 347 | If you have a Windows system, often these executable machine language programs have a 348 | suffix of .exe or .dll which stand for executable and dynamic link 349 | library respectively. In Linux and Macintosh, there is no suffix that uniquely marks 350 | a file as executable. 351 | 352 | If you were to open an executable file in a text editor, it would look 353 | completely crazy and be unreadable: 354 | 355 | It is not easy to read or write machine language, so it is nice that we have 356 | compilers that allow us to write in high-level 357 | languages like Python or C. 358 | 359 | Now at this point in our discussion of compilers and interpreters, you should 360 | be 5616 wondering 171 a 3062 bit about the Python interpreter itself. What language is 361 | 9552 it written in? Is it written in a compiled language? When we type 7655 362 | python, 829 what 6096 exactly 2312 is happening? 
363 | 364 | The Python interpreter is written in a high-level language called C. 365 | You can look at the actual source code for the Python interpreter by 366 | going to www.python.org and working your way to their source code. 367 | So Python is a program itself and it is compiled into machine code. 368 | When you installed Python on your computer (or the vendor installed it), 369 | 6015 you copied a machine-code copy of the translated Python program onto your 7100 370 | system. In Windows, the executable machine code for Python itself is likely 371 | in a file. 372 | 373 | That is more than you really need to know to be a Python programmer, but 374 | sometimes it pays to answer those little nagging questions right at 375 | the beginning. 376 | 377 | Writing a program 378 | 379 | Typing commands into the Python interpreter is a great way to experiment 380 | with Python's features, but it is not recommended for solving more complex problems. 381 | 382 | When we want to write a program, 383 | we use a text editor to write the Python instructions into a file, 384 | which 9548 is 2727 called 1792 a script. By 385 | convention, Python scripts have names that end with .py. 386 | 387 | script 388 | 389 | To execute the script, you have to tell the Python interpreter 390 | the name of the file. In a Unix or Windows command window, 391 | you would type python hello.py as follows: 392 | 393 | We call the Python interpreter and tell it to read its source code from 394 | the file hello.py instead of prompting us for lines of Python code 395 | interactively. 396 | 397 | You will notice that there was no need to have quit() at the end of 398 | the Python program in the file. When Python is reading your source code 399 | from a file, it knows to stop when it reaches the end of the file. 400 | 401 | 8402 What is a program? 402 | 403 | The definition of a program at its most basic is a sequence 404 | of Python statements that have been crafted to do something. 
405 | Even our simple hello.py script is a program. It is a one-line 406 | program and is not particularly useful, but in the strictest definition, 407 | it is a Python program. 408 | 409 | It might be easiest to understand what a program is by thinking about a problem 410 | that a program might be built to solve, and then looking at a program 411 | that would solve that problem. 412 | 413 | Lets say you are doing Social Computing research on Facebook posts and 414 | you are interested in the most frequently used word in a series of posts. 415 | You could print out the stream of Facebook posts and pore over the text 416 | looking for the most common word, but that would take a long time and be very 417 | mistake prone. You would be smart to write a Python program to handle the 418 | task quickly and accurately so you can spend the weekend doing something 419 | fun. 420 | 421 | For example, look at the following text about a clown and a car. Look at the 422 | text and figure out the most common word and how many times it occurs. 423 | 424 | Then imagine that you are doing this task looking at millions of lines of 425 | text. Frankly it would be quicker for you to learn Python and write a 426 | Python program to count the words than it would be to manually 427 | scan the words. 428 | 429 | The even better news is that I already came up with a simple program to 430 | find the most common word in a text file. I wrote it, 431 | tested it, and now I am giving it to you to use so you can save some time. 432 | 433 | You don't even need to know Python to use this program. You will need to get through 434 | Chapter ten of this book to fully understand the awesome Python techniques that were 435 | used to make the program. You are the end user, you simply use the program and marvel 436 | at its cleverness and how it saved you so much manual effort. 
437 | You simply type the code 438 | into a file called words.py and run it or you download the source 439 | code from http://www.pythonlearn.com/code/ and run it. 440 | 441 | This is a good example of how Python and the Python language are acting as an intermediary 442 | between you (the end user) and me (the programmer). Python is a way for us to exchange useful 443 | instruction sequences (i.e., programs) in a common language that can be used by anyone who 444 | installs Python on their computer. So neither of us are talking to Python, 445 | instead we are communicating with each other through Python. 446 | 447 | The building blocks of programs 448 | 449 | In the next few chapters, we will learn more about the vocabulary, sentence structure, 450 | paragraph structure, and story structure of Python. We will learn about the powerful 451 | capabilities of Python and how to compose those capabilities together to create useful 452 | programs. 453 | 454 | There are some low-level conceptual patterns that we use to construct programs. These 455 | constructs are not just for Python programs, they are part of every programming language 456 | from machine language up to the high-level languages. 457 | 458 | description 459 | 460 | Get data from the outside world. This might be 461 | reading data from a file, or even some kind of sensor like 462 | a microphone or GPS. In our initial programs, our input will come from the user 463 | typing data on the keyboard. 464 | 465 | Display the results of the program on a screen 466 | or store them in a file or perhaps write them to a device like a 467 | speaker to play music or speak text. 468 | 469 | Perform statements one after 470 | another in the order they are encountered in the script. 471 | 472 | Check for certain conditions and 473 | then execute or skip a sequence of statements. 474 | 475 | Perform some set of statements 476 | repeatedly, usually with 477 | some variation. 
478 | 479 | Write a set of instructions once and give them a name 480 | and then reuse those instructions as needed throughout your program. 481 | 482 | description 483 | 484 | It sounds almost too simple to be true, and of course it is never 485 | so simple. It is like saying that walking is simply 486 | putting one foot in front of the other. The art 487 | of writing a program is composing and weaving these 488 | basic elements together many times over to produce something 489 | that is useful to its users. 490 | 491 | The word counting program above directly uses all of 492 | these patterns except for one. 493 | 494 | What could possibly go wrong? 495 | 496 | As we saw in our earliest conversations with Python, we must 497 | communicate very precisely when we write Python code. The smallest 498 | deviation or mistake will cause Python to give up looking at your 499 | program. 500 | 501 | Beginning programmers often take the fact that Python leaves no 502 | room for errors as evidence that Python is mean, hateful, and cruel. 503 | While Python seems to like everyone else, Python knows them 504 | personally and holds a grudge against them. Because of this grudge, 505 | Python takes our perfectly written programs and rejects them as 506 | unfit just to torment us. 507 | 508 | There is little to be gained by arguing with Python. It is just a tool. 509 | It has no emotions and it is happy and ready to serve you whenever you 510 | need it. Its error messages sound harsh, but they are just Python's 511 | call for help. It has looked at what you typed, and it simply cannot 512 | understand what you have entered. 513 | 514 | Python is much more like a dog, loving you unconditionally, having a few 515 | key words that it understands, looking you with a sweet look on its 516 | face (>>>), and waiting for you to say something it understands. 
517 | When Python says SyntaxError: invalid syntax, it is simply wagging 518 | its tail and saying, You seemed to say something but I just don't 519 | understand what you meant, but please keep talking to me (>>>). 520 | 521 | As your programs become increasingly sophisticated, you will encounter three 522 | general types of errors: 523 | 524 | description 525 | 526 | These are the first errors you will make and the easiest 527 | to fix. A syntax error means that you have violated the grammar rules of Python. 528 | Python does its best to point right at the line and character where 529 | it noticed it was confused. The only tricky bit of syntax errors is that sometimes 530 | the mistake that needs fixing is actually earlier in the program than where Python 531 | noticed it was confused. So the line and character that Python indicates in 532 | a syntax error may just be a starting point for your investigation. 533 | 534 | A logic error is when your program has good syntax but there is a mistake 535 | in the order of the statements or perhaps a mistake in how the statements relate to one another. 536 | A good example of a logic error might be, take a drink from your water bottle, put it 537 | in your backpack, walk to the library, and then put the top back on the bottle. 538 | 539 | A semantic error is when your description of the steps to take 540 | is syntactically perfect and in the right order, but there is simply a mistake in 541 | the program. The program is perfectly correct but it does not do what 542 | you intended for it to do. A simple example would 543 | be if you were giving a person directions to a restaurant and said, ...when you reach 544 | the intersection with the gas station, turn left and go one mile and the restaurant 545 | is a red building on your left. Your friend is very late and calls you to tell you that 546 | they are on a farm and walking around behind a barn, with no sign of a restaurant. 
547 | Then you say did you turn left or right at the gas station? and 548 | they say, I followed your directions perfectly, I have 549 | them written down, it says turn left and go one mile at the gas station. Then you say, 550 | I am very sorry, because while my instructions were syntactically correct, they 551 | sadly contained a small but undetected semantic error.. 552 | 553 | description 554 | 555 | Again in all three types of errors, Python is merely trying its hardest to 556 | do exactly what you have asked. 557 | 558 | The learning journey 559 | 560 | As you progress through the rest of the book, don't be afraid if the concepts 561 | don't seem to fit together well the first time. When you were learning to speak, 562 | it was not a problem for your first few years that you just made cute gurgling noises. 563 | And it was OK if it took six months for you to move from simple vocabulary to 564 | simple sentences and took five or six more years to move from sentences to paragraphs, and a 565 | few more years to be able to write an interesting complete short story on your own. 566 | 567 | We want you to learn Python much more rapidly, so we teach it all at the same time 568 | over the next few chapters. 569 | But it is like learning a new language that takes time to absorb and understand 570 | before it feels natural. 571 | That leads to some confusion as we visit and revisit 572 | topics to try to get you to see the big picture while we are defining the tiny 573 | fragments that make up that big picture. While the book is written linearly, and 574 | if you are taking a course it will progress in a linear fashion, don't hesitate 575 | to be very nonlinear in how you approach the material. Look forwards and backwards 576 | and read with a light touch. By skimming more advanced material without 577 | fully understanding the details, you can get a better understanding of the why? 578 | of programming. 
By reviewing previous material and even redoing earlier 579 | exercises, you will realize that you actually learned a lot of material even 580 | if the material you are currently staring at seems a bit impenetrable. 581 | 582 | Usually when you are learning your first programming language, there are a few 583 | wonderful Ah Hah! moments where you can look up from pounding away at some rock 584 | with a hammer and chisel and step away and see that you are indeed building 585 | a beautiful sculpture. 586 | 587 | If something seems particularly hard, there is usually no value in staying up all 588 | night and staring at it. Take a break, take a nap, have a snack, explain what you 589 | are having a problem with to someone (or perhaps your dog), and then come back to it with 590 | fresh eyes. I assure you that once you learn the programming concepts in the book 591 | you will look back and see that it was all really easy and elegant and it simply 592 | took you a bit of time to absorb it. 593 | 42 594 | The end 595 | -------------------------------------------------------------------------------- /Chapter 12 - Networks and Sockets/chapter12.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | # Chapter 12 Assignment: Understanding the Request / Response Cycle 4 | 5 | # Exploring the HyperText Transport Protocol 6 | 7 | # You are to retrieve the following document using the HTTP protocol in a way 8 | # that you can examine the HTTP Response headers. 9 | 10 | # http://data.pr4e.org/intro-short.txt 11 | # There are three ways that you might retrieve this web page and look at the 12 | # response headers: 13 | 14 | # Preferred: Modify the socket1.py program to retrieve the above URL and print 15 | # out the headers and data. Make sure to change the code to retrieve the above 16 | # URL - the values are different for each URL. 
17 | 18 | # Open the URL in a web browser with a developer console or FireBug and manually 19 | # examine the headers that are returned. 20 | 21 | # Use the telnet program as shown in lecture to retrieve the headers and content. 22 | 23 | import socket 24 | 25 | mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 26 | mysocket.connect(('www.pythonlearn.com', 80)) 27 | 28 | mysocket.send('GET http://www.pythonlearn.com/code/intro-short.txt HTTP/1.1\r\nHost: www.pythonlearn.com\r\n\r\n') 29 | 30 | while True: 31 | data = mysocket.recv(512) 32 | if (len(data) < 1): 33 | break 34 | print data 35 | 36 | mysocket.close() 37 | -------------------------------------------------------------------------------- /Chapter 12 - Networks and Sockets/examples.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | 3 | # Chapter 3 examples using either the socket library or the urllib library to read a URL. 4 | 5 | import socket 6 | import urllib 7 | 8 | # socket library example: 9 | 10 | print 'Output from the socket library example:\n' 11 | 12 | mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) 13 | mysocket.connect(('www.py4inf.com', 80)) 14 | 15 | mysocket.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n') 16 | 17 | while True: 18 | data = mysocket.recv(512) 19 | if (len(data) < 1): 20 | break 21 | print data 22 | 23 | mysocket.close() 24 | 25 | # urllib library example: 26 | 27 | print '\n\nOutput from the urllib library example:\n' 28 | 29 | myurl = urllib.urlopen('http://www.py4inf.com/code/romeo.txt') 30 | 31 | for line in myurl: 32 | print line.rstrip() 33 | -------------------------------------------------------------------------------- /Chapter 12 - Programs that Surf the Web/BeautifulSoup.py: -------------------------------------------------------------------------------- 1 | """Beautiful Soup 2 | Elixir and Tonic 3 | "The Screen-Scraper's Friend" 4 | 
http://www.crummy.com/software/BeautifulSoup/ 5 | 6 | Beautiful Soup parses a (possibly invalid) XML or HTML document into a 7 | tree representation. It provides methods and Pythonic idioms that make 8 | it easy to navigate, search, and modify the tree. 9 | 10 | A well-formed XML/HTML document yields a well-formed data 11 | structure. An ill-formed XML/HTML document yields a correspondingly 12 | ill-formed data structure. If your document is only locally 13 | well-formed, you can use this library to find and process the 14 | well-formed part of it. 15 | 16 | Beautiful Soup works with Python 2.2 and up. It has no external 17 | dependencies, but you'll have more success at converting data to UTF-8 18 | if you also install these three packages: 19 | 20 | * chardet, for auto-detecting character encodings 21 | http://chardet.feedparser.org/ 22 | * cjkcodecs and iconv_codec, which add more encodings to the ones supported 23 | by stock Python. 24 | http://cjkpython.i18n.org/ 25 | 26 | Beautiful Soup defines classes for two main parsing strategies: 27 | 28 | * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific 29 | language that kind of looks like XML. 30 | 31 | * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid 32 | or invalid. This class has web browser-like heuristics for 33 | obtaining a sensible parse tree in the face of common HTML errors. 34 | 35 | Beautiful Soup also defines a class (UnicodeDammit) for autodetecting 36 | the encoding of an HTML or XML document, and converting it to 37 | Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser. 38 | 39 | For more than you ever wanted to know about Beautiful Soup, see the 40 | documentation: 41 | http://www.crummy.com/software/BeautifulSoup/documentation.html 42 | 43 | Here, have some legalese: 44 | 45 | Copyright (c) 2004-2010, Leonard Richardson 46 | 47 | All rights reserved. 
48 | 49 | Redistribution and use in source and binary forms, with or without 50 | modification, are permitted provided that the following conditions are 51 | met: 52 | 53 | * Redistributions of source code must retain the above copyright 54 | notice, this list of conditions and the following disclaimer. 55 | 56 | * Redistributions in binary form must reproduce the above 57 | copyright notice, this list of conditions and the following 58 | disclaimer in the documentation and/or other materials provided 59 | with the distribution. 60 | 61 | * Neither the name of the the Beautiful Soup Consortium and All 62 | Night Kosher Bakery nor the names of its contributors may be 63 | used to endorse or promote products derived from this software 64 | without specific prior written permission. 65 | 66 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 67 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 68 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 69 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR 70 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 71 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 72 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 73 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 74 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 75 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 76 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT. 
77 | 78 | """ 79 | from __future__ import generators 80 | 81 | __author__ = "Leonard Richardson (leonardr@segfault.org)" 82 | __version__ = "3.0.8.1" 83 | __copyright__ = "Copyright (c) 2004-2010 Leonard Richardson" 84 | __license__ = "New-style BSD" 85 | 86 | from sgmllib import SGMLParser, SGMLParseError 87 | import codecs 88 | import markupbase 89 | import types 90 | import re 91 | import sgmllib 92 | try: 93 | from htmlentitydefs import name2codepoint 94 | except ImportError: 95 | name2codepoint = {} 96 | try: 97 | set 98 | except NameError: 99 | from sets import Set as set 100 | 101 | #These hacks make Beautiful Soup able to parse XML with namespaces 102 | sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*') 103 | markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match 104 | 105 | DEFAULT_OUTPUT_ENCODING = "utf-8" 106 | 107 | def _match_css_class(str): 108 | """Build a RE to match the given CSS class.""" 109 | return re.compile(r"(^|.*\s)%s($|\s)" % str) 110 | 111 | # First, the classes that represent markup elements. 112 | 113 | class PageElement(object): 114 | """Contains the navigational information for some part of the page 115 | (either a tag or a piece of text)""" 116 | 117 | def setup(self, parent=None, previous=None): 118 | """Sets up the initial relations between this element and 119 | other elements.""" 120 | self.parent = parent 121 | self.previous = previous 122 | self.next = None 123 | self.previousSibling = None 124 | self.nextSibling = None 125 | if self.parent and self.parent.contents: 126 | self.previousSibling = self.parent.contents[-1] 127 | self.previousSibling.nextSibling = self 128 | 129 | def replaceWith(self, replaceWith): 130 | oldParent = self.parent 131 | myIndex = self.parent.index(self) 132 | if hasattr(replaceWith, "parent")\ 133 | and replaceWith.parent is self.parent: 134 | # We're replacing this element with one of its siblings. 
135 | index = replaceWith.parent.index(replaceWith) 136 | if index and index < myIndex: 137 | # Furthermore, it comes before this element. That 138 | # means that when we extract it, the index of this 139 | # element will change. 140 | myIndex = myIndex - 1 141 | self.extract() 142 | oldParent.insert(myIndex, replaceWith) 143 | 144 | def replaceWithChildren(self): 145 | myParent = self.parent 146 | myIndex = self.parent.index(self) 147 | self.extract() 148 | reversedChildren = list(self.contents) 149 | reversedChildren.reverse() 150 | for child in reversedChildren: 151 | myParent.insert(myIndex, child) 152 | 153 | def extract(self): 154 | """Destructively rips this element out of the tree.""" 155 | if self.parent: 156 | try: 157 | del self.parent.contents[self.parent.index(self)] 158 | except ValueError: 159 | pass 160 | 161 | #Find the two elements that would be next to each other if 162 | #this element (and any children) hadn't been parsed. Connect 163 | #the two. 164 | lastChild = self._lastRecursiveChild() 165 | nextElement = lastChild.next 166 | 167 | if self.previous: 168 | self.previous.next = nextElement 169 | if nextElement: 170 | nextElement.previous = self.previous 171 | self.previous = None 172 | lastChild.next = None 173 | 174 | self.parent = None 175 | if self.previousSibling: 176 | self.previousSibling.nextSibling = self.nextSibling 177 | if self.nextSibling: 178 | self.nextSibling.previousSibling = self.previousSibling 179 | self.previousSibling = self.nextSibling = None 180 | return self 181 | 182 | def _lastRecursiveChild(self): 183 | "Finds the last element beneath this object to be parsed." 
184 | lastChild = self 185 | while hasattr(lastChild, 'contents') and lastChild.contents: 186 | lastChild = lastChild.contents[-1] 187 | return lastChild 188 | 189 | def insert(self, position, newChild): 190 | if isinstance(newChild, basestring) \ 191 | and not isinstance(newChild, NavigableString): 192 | newChild = NavigableString(newChild) 193 | 194 | position = min(position, len(self.contents)) 195 | if hasattr(newChild, 'parent') and newChild.parent is not None: 196 | # We're 'inserting' an element that's already one 197 | # of this object's children. 198 | if newChild.parent is self: 199 | index = self.index(newChild) 200 | if index > position: 201 | # Furthermore we're moving it further down the 202 | # list of this object's children. That means that 203 | # when we extract this element, our target index 204 | # will jump down one. 205 | position = position - 1 206 | newChild.extract() 207 | 208 | newChild.parent = self 209 | previousChild = None 210 | if position == 0: 211 | newChild.previousSibling = None 212 | newChild.previous = self 213 | else: 214 | previousChild = self.contents[position-1] 215 | newChild.previousSibling = previousChild 216 | newChild.previousSibling.nextSibling = newChild 217 | newChild.previous = previousChild._lastRecursiveChild() 218 | if newChild.previous: 219 | newChild.previous.next = newChild 220 | 221 | newChildsLastElement = newChild._lastRecursiveChild() 222 | 223 | if position >= len(self.contents): 224 | newChild.nextSibling = None 225 | 226 | parent = self 227 | parentsNextSibling = None 228 | while not parentsNextSibling: 229 | parentsNextSibling = parent.nextSibling 230 | parent = parent.parent 231 | if not parent: # This is the last element in the document. 
232 | break 233 | if parentsNextSibling: 234 | newChildsLastElement.next = parentsNextSibling 235 | else: 236 | newChildsLastElement.next = None 237 | else: 238 | nextChild = self.contents[position] 239 | newChild.nextSibling = nextChild 240 | if newChild.nextSibling: 241 | newChild.nextSibling.previousSibling = newChild 242 | newChildsLastElement.next = nextChild 243 | 244 | if newChildsLastElement.next: 245 | newChildsLastElement.next.previous = newChildsLastElement 246 | self.contents.insert(position, newChild) 247 | 248 | def append(self, tag): 249 | """Appends the given tag to the contents of this tag.""" 250 | self.insert(len(self.contents), tag) 251 | 252 | def findNext(self, name=None, attrs={}, text=None, **kwargs): 253 | """Returns the first item that matches the given criteria and 254 | appears after this Tag in the document.""" 255 | return self._findOne(self.findAllNext, name, attrs, text, **kwargs) 256 | 257 | def findAllNext(self, name=None, attrs={}, text=None, limit=None, 258 | **kwargs): 259 | """Returns all items that match the given criteria and appear 260 | after this Tag in the document.""" 261 | return self._findAll(name, attrs, text, limit, self.nextGenerator, 262 | **kwargs) 263 | 264 | def findNextSibling(self, name=None, attrs={}, text=None, **kwargs): 265 | """Returns the closest sibling to this Tag that matches the 266 | given criteria and appears after this Tag in the document.""" 267 | return self._findOne(self.findNextSiblings, name, attrs, text, 268 | **kwargs) 269 | 270 | def findNextSiblings(self, name=None, attrs={}, text=None, limit=None, 271 | **kwargs): 272 | """Returns the siblings of this Tag that match the given 273 | criteria and appear after this Tag in the document.""" 274 | return self._findAll(name, attrs, text, limit, 275 | self.nextSiblingGenerator, **kwargs) 276 | fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x 277 | 278 | def findPrevious(self, name=None, attrs={}, text=None, **kwargs): 279 | 
"""Returns the first item that matches the given criteria and 280 | appears before this Tag in the document.""" 281 | return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs) 282 | 283 | def findAllPrevious(self, name=None, attrs={}, text=None, limit=None, 284 | **kwargs): 285 | """Returns all items that match the given criteria and appear 286 | before this Tag in the document.""" 287 | return self._findAll(name, attrs, text, limit, self.previousGenerator, 288 | **kwargs) 289 | fetchPrevious = findAllPrevious # Compatibility with pre-3.x 290 | 291 | def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs): 292 | """Returns the closest sibling to this Tag that matches the 293 | given criteria and appears before this Tag in the document.""" 294 | return self._findOne(self.findPreviousSiblings, name, attrs, text, 295 | **kwargs) 296 | 297 | def findPreviousSiblings(self, name=None, attrs={}, text=None, 298 | limit=None, **kwargs): 299 | """Returns the siblings of this Tag that match the given 300 | criteria and appear before this Tag in the document.""" 301 | return self._findAll(name, attrs, text, limit, 302 | self.previousSiblingGenerator, **kwargs) 303 | fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x 304 | 305 | def findParent(self, name=None, attrs={}, **kwargs): 306 | """Returns the closest parent of this Tag that matches the given 307 | criteria.""" 308 | # NOTE: We can't use _findOne because findParents takes a different 309 | # set of arguments. 
310 | r = None 311 | l = self.findParents(name, attrs, 1) 312 | if l: 313 | r = l[0] 314 | return r 315 | 316 | def findParents(self, name=None, attrs={}, limit=None, **kwargs): 317 | """Returns the parents of this Tag that match the given 318 | criteria.""" 319 | 320 | return self._findAll(name, attrs, None, limit, self.parentGenerator, 321 | **kwargs) 322 | fetchParents = findParents # Compatibility with pre-3.x 323 | 324 | #These methods do the real heavy lifting. 325 | 326 | def _findOne(self, method, name, attrs, text, **kwargs): 327 | r = None 328 | l = method(name, attrs, text, 1, **kwargs) 329 | if l: 330 | r = l[0] 331 | return r 332 | 333 | def _findAll(self, name, attrs, text, limit, generator, **kwargs): 334 | "Iterates over a generator looking for things that match." 335 | 336 | if isinstance(name, SoupStrainer): 337 | strainer = name 338 | # (Possibly) special case some findAll*(...) searches 339 | elif text is None and not limit and not attrs and not kwargs: 340 | # findAll*(True) 341 | if name is True: 342 | return [element for element in generator() 343 | if isinstance(element, Tag)] 344 | # findAll*('tag-name') 345 | elif isinstance(name, basestring): 346 | return [element for element in generator() 347 | if isinstance(element, Tag) and 348 | element.name == name] 349 | else: 350 | strainer = SoupStrainer(name, attrs, text, **kwargs) 351 | # Build a SoupStrainer 352 | else: 353 | strainer = SoupStrainer(name, attrs, text, **kwargs) 354 | results = ResultSet(strainer) 355 | g = generator() 356 | while True: 357 | try: 358 | i = g.next() 359 | except StopIteration: 360 | break 361 | if i: 362 | found = strainer.search(i) 363 | if found: 364 | results.append(found) 365 | if limit and len(results) >= limit: 366 | break 367 | return results 368 | 369 | #These Generators can be used to navigate starting from both 370 | #NavigableStrings and Tags. 
371 | def nextGenerator(self): 372 | i = self 373 | while i is not None: 374 | i = i.next 375 | yield i 376 | 377 | def nextSiblingGenerator(self): 378 | i = self 379 | while i is not None: 380 | i = i.nextSibling 381 | yield i 382 | 383 | def previousGenerator(self): 384 | i = self 385 | while i is not None: 386 | i = i.previous 387 | yield i 388 | 389 | def previousSiblingGenerator(self): 390 | i = self 391 | while i is not None: 392 | i = i.previousSibling 393 | yield i 394 | 395 | def parentGenerator(self): 396 | i = self 397 | while i is not None: 398 | i = i.parent 399 | yield i 400 | 401 | # Utility methods 402 | def substituteEncoding(self, str, encoding=None): 403 | encoding = encoding or "utf-8" 404 | return str.replace("%SOUP-ENCODING%", encoding) 405 | 406 | def toEncoding(self, s, encoding=None): 407 | """Encodes an object to a string in some encoding, or to Unicode. 408 | .""" 409 | if isinstance(s, unicode): 410 | if encoding: 411 | s = s.encode(encoding) 412 | elif isinstance(s, str): 413 | if encoding: 414 | s = s.encode(encoding) 415 | else: 416 | s = unicode(s) 417 | else: 418 | if encoding: 419 | s = self.toEncoding(str(s), encoding) 420 | else: 421 | s = unicode(s) 422 | return s 423 | 424 | class NavigableString(unicode, PageElement): 425 | 426 | def __new__(cls, value): 427 | """Create a new NavigableString. 428 | 429 | When unpickling a NavigableString, this method is called with 430 | the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be 431 | passed in to the superclass's __new__ or the superclass won't know 432 | how to handle non-ASCII characters. 433 | """ 434 | if isinstance(value, unicode): 435 | return unicode.__new__(cls, value) 436 | return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING) 437 | 438 | def __getnewargs__(self): 439 | return (NavigableString.__str__(self),) 440 | 441 | def __getattr__(self, attr): 442 | """text.string gives you text. 
This is for backwards 443 | compatibility for Navigable*String, but for CData* it lets you 444 | get the string without the CData wrapper.""" 445 | if attr == 'string': 446 | return self 447 | else: 448 | raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr) 449 | 450 | def __unicode__(self): 451 | return str(self).decode(DEFAULT_OUTPUT_ENCODING) 452 | 453 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 454 | if encoding: 455 | return self.encode(encoding) 456 | else: 457 | return self 458 | 459 | class CData(NavigableString): 460 | 461 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 462 | return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding) 463 | 464 | class ProcessingInstruction(NavigableString): 465 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 466 | output = self 467 | if "%SOUP-ENCODING%" in output: 468 | output = self.substituteEncoding(output, encoding) 469 | return "<?%s?>" % self.toEncoding(output, encoding) 470 | 471 | class Comment(NavigableString): 472 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 473 | return "<!--%s-->" % NavigableString.__str__(self, encoding) 474 | 475 | class Declaration(NavigableString): 476 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING): 477 | return "<!%s>" % NavigableString.__str__(self, encoding) 478 | 479 | class Tag(PageElement): 480 | 481 | """Represents a found HTML tag with its attributes and contents.""" 482 | 483 | def _invert(h): 484 | "Cheap function to invert a hash." 485 | i = {} 486 | for k,v in h.items(): 487 | i[v] = k 488 | return i 489 | 490 | XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'", 491 | "quot" : '"', 492 | "amp" : "&", 493 | "lt" : "<", 494 | "gt" : ">" } 495 | 496 | XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS) 497 | 498 | def _convertEntities(self, match): 499 | """Used in a call to re.sub to replace HTML, XML, and numeric 500 | entities with the appropriate Unicode characters.
If HTML 501 | entities are being converted, any unrecognized entities are 502 | escaped.""" 503 | x = match.group(1) 504 | if self.convertHTMLEntities and x in name2codepoint: 505 | return unichr(name2codepoint[x]) 506 | elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS: 507 | if self.convertXMLEntities: 508 | return self.XML_ENTITIES_TO_SPECIAL_CHARS[x] 509 | else: 510 | return u'&%s;' % x 511 | elif len(x) > 0 and x[0] == '#': 512 | # Handle numeric entities 513 | if len(x) > 1 and x[1] == 'x': 514 | return unichr(int(x[2:], 16)) 515 | else: 516 | return unichr(int(x[1:])) 517 | 518 | elif self.escapeUnrecognizedEntities: 519 | return u'&amp;%s;' % x 520 | else: 521 | return u'&%s;' % x 522 | 523 | def __init__(self, parser, name, attrs=None, parent=None, 524 | previous=None): 525 | "Basic constructor." 526 | 527 | # We don't actually store the parser object: that lets extracted 528 | # chunks be garbage-collected 529 | self.parserClass = parser.__class__ 530 | self.isSelfClosing = parser.isSelfClosingTag(name) 531 | self.name = name 532 | if attrs is None: 533 | attrs = [] 534 | self.attrs = attrs 535 | self.contents = [] 536 | self.setup(parent, previous) 537 | self.hidden = False 538 | self.containsSubstitutions = False 539 | self.convertHTMLEntities = parser.convertHTMLEntities 540 | self.convertXMLEntities = parser.convertXMLEntities 541 | self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities 542 | 543 | # Convert any HTML, XML, or numeric entities in the attribute values.
544 | convert = lambda(k, val): (k, 545 | re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);", 546 | self._convertEntities, 547 | val)) 548 | self.attrs = map(convert, self.attrs) 549 | 550 | def getString(self): 551 | if (len(self.contents) == 1 552 | and isinstance(self.contents[0], NavigableString)): 553 | return self.contents[0] 554 | 555 | def setString(self, string): 556 | """Replace the contents of the tag with a string""" 557 | self.clear() 558 | self.append(string) 559 | 560 | string = property(getString, setString) 561 | 562 | def getText(self, separator=u""): 563 | if not len(self.contents): 564 | return u"" 565 | stopNode = self._lastRecursiveChild().next 566 | strings = [] 567 | current = self.contents[0] 568 | while current is not stopNode: 569 | if isinstance(current, NavigableString): 570 | strings.append(current.strip()) 571 | current = current.next 572 | return separator.join(strings) 573 | 574 | text = property(getText) 575 | 576 | def get(self, key, default=None): 577 | """Returns the value of the 'key' attribute for the tag, or 578 | the value given for 'default' if it doesn't have that 579 | attribute.""" 580 | return self._getAttrMap().get(key, default) 581 | 582 | def clear(self): 583 | """Extract all children.""" 584 | for child in self.contents[:]: 585 | child.extract() 586 | 587 | def index(self, element): 588 | for i, child in enumerate(self.contents): 589 | if child is element: 590 | return i 591 | raise ValueError("Tag.index: element not in tag") 592 | 593 | def has_key(self, key): 594 | return self._getAttrMap().has_key(key) 595 | 596 | def __getitem__(self, key): 597 | """tag[key] returns the value of the 'key' attribute for the tag, 598 | and throws an exception if it's not there.""" 599 | return self._getAttrMap()[key] 600 | 601 | def __iter__(self): 602 | "Iterating over a tag iterates over its contents." 603 | return iter(self.contents) 604 | 605 | def __len__(self): 606 | "The length of a tag is the length of its list of contents." 
607 | return len(self.contents) 608 | 609 | def __contains__(self, x): 610 | return x in self.contents 611 | 612 | def __nonzero__(self): 613 | "A tag is non-None even if it has no contents." 614 | return True 615 | 616 | def __setitem__(self, key, value): 617 | """Setting tag[key] sets the value of the 'key' attribute for the 618 | tag.""" 619 | self._getAttrMap() 620 | self.attrMap[key] = value 621 | found = False 622 | for i in range(0, len(self.attrs)): 623 | if self.attrs[i][0] == key: 624 | self.attrs[i] = (key, value) 625 | found = True 626 | if not found: 627 | self.attrs.append((key, value)) 628 | self._getAttrMap()[key] = value 629 | 630 | def __delitem__(self, key): 631 | "Deleting tag[key] deletes all 'key' attributes for the tag." 632 | for item in self.attrs: 633 | if item[0] == key: 634 | self.attrs.remove(item) 635 | #We don't break because bad HTML can define the same 636 | #attribute multiple times. 637 | self._getAttrMap() 638 | if self.attrMap.has_key(key): 639 | del self.attrMap[key] 640 | 641 | def __call__(self, *args, **kwargs): 642 | """Calling a tag like a function is the same as calling its 643 | findAll() method. Eg. tag('a') returns a list of all the A tags 644 | found within this tag.""" 645 | return apply(self.findAll, args, kwargs) 646 | 647 | def __getattr__(self, tag): 648 | #print "Getattr %s.%s" % (self.__class__, tag) 649 | if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3: 650 | return self.find(tag[:-3]) 651 | elif tag.find('__') != 0: 652 | return self.find(tag) 653 | raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag) 654 | 655 | def __eq__(self, other): 656 | """Returns true iff this tag has the same name, the same attributes, 657 | and the same contents (recursively) as the given tag. 658 | 659 | NOTE: right now this will return false if two tags have the 660 | same attributes in a different order. 
Should this be fixed?""" 661 | if other is self: 662 | return True 663 | if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other): 664 | return False 665 | for i in range(0, len(self.contents)): 666 | if self.contents[i] != other.contents[i]: 667 | return False 668 | return True 669 | 670 | def __ne__(self, other): 671 | """Returns true iff this tag is not identical to the other tag, 672 | as defined in __eq__.""" 673 | return not self == other 674 | 675 | def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING): 676 | """Renders this tag as a string.""" 677 | return self.__str__(encoding) 678 | 679 | def __unicode__(self): 680 | return self.__str__(None) 681 | 682 | BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|" 683 | + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)" 684 | + ")") 685 | 686 | def _sub_entity(self, x): 687 | """Used with a regular expression to substitute the 688 | appropriate XML entity for an XML special character.""" 689 | return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";" 690 | 691 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING, 692 | prettyPrint=False, indentLevel=0): 693 | """Returns a string or Unicode representation of this tag and 694 | its contents. To get Unicode, pass None for encoding. 695 | 696 | NOTE: since Python's HTML parser consumes whitespace, this 697 | method is not certain to reproduce the whitespace present in 698 | the original string.""" 699 | 700 | encodedName = self.toEncoding(self.name, encoding) 701 | 702 | attrs = [] 703 | if self.attrs: 704 | for key, val in self.attrs: 705 | fmt = '%s="%s"' 706 | if isinstance(val, basestring): 707 | if self.containsSubstitutions and '%SOUP-ENCODING%' in val: 708 | val = self.substituteEncoding(val, encoding) 709 | 710 | # The attribute value either: 711 | # 712 | # * Contains no embedded double quotes or single quotes. 
713 | # No problem: we enclose it in double quotes. 714 | # * Contains embedded single quotes. No problem: 715 | # double quotes work here too. 716 | # * Contains embedded double quotes. No problem: 717 | # we enclose it in single quotes. 718 | # * Embeds both single _and_ double quotes. This 719 | # can't happen naturally, but it can happen if 720 | # you modify an attribute value after parsing 721 | # the document. Now we have a bit of a 722 | # problem. We solve it by enclosing the 723 | # attribute in single quotes, and escaping any 724 | # embedded single quotes to XML entities. 725 | if '"' in val: 726 | fmt = "%s='%s'" 727 | if "'" in val: 728 | # TODO: replace with apos when 729 | # appropriate. 730 | val = val.replace("'", "&squot;") 731 | 732 | # Now we're okay w/r/t quotes. But the attribute 733 | # value might also contain angle brackets, or 734 | # ampersands that aren't part of entities. We need 735 | # to escape those to XML entities too. 736 | val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val) 737 | 738 | attrs.append(fmt % (self.toEncoding(key, encoding), 739 | self.toEncoding(val, encoding))) 740 | close = '' 741 | closeTag = '' 742 | if self.isSelfClosing: 743 | close = ' /' 744 | else: 745 | closeTag = '</%s>' % encodedName 746 | 747 | indentTag, indentContents = 0, 0 748 | if prettyPrint: 749 | indentTag = indentLevel 750 | space = (' ' * (indentTag-1)) 751 | indentContents = indentTag + 1 752 | contents = self.renderContents(encoding, prettyPrint, indentContents) 753 | if self.hidden: 754 | s = contents 755 | else: 756 | s = [] 757 | attributeString = '' 758 | if attrs: 759 | attributeString = ' ' + ' '.join(attrs) 760 | if prettyPrint: 761 | s.append(space) 762 | s.append('<%s%s%s>' % (encodedName, attributeString, close)) 763 | if prettyPrint: 764 | s.append("\n") 765 | s.append(contents) 766 | if prettyPrint and contents and contents[-1] != "\n": 767 | s.append("\n") 768 | if prettyPrint and closeTag: 769 | s.append(space) 770 |
s.append(closeTag) 771 | if prettyPrint and closeTag and self.nextSibling: 772 | s.append("\n") 773 | s = ''.join(s) 774 | return s 775 | 776 | def decompose(self): 777 | """Recursively destroys the contents of this tree.""" 778 | self.extract() 779 | if len(self.contents) == 0: 780 | return 781 | current = self.contents[0] 782 | while current is not None: 783 | next = current.next 784 | if isinstance(current, Tag): 785 | del current.contents[:] 786 | current.parent = None 787 | current.previous = None 788 | current.previousSibling = None 789 | current.next = None 790 | current.nextSibling = None 791 | current = next 792 | 793 | def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING): 794 | return self.__str__(encoding, True) 795 | 796 | def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING, 797 | prettyPrint=False, indentLevel=0): 798 | """Renders the contents of this tag as a string in the given 799 | encoding. If encoding is None, returns a Unicode string..""" 800 | s=[] 801 | for c in self: 802 | text = None 803 | if isinstance(c, NavigableString): 804 | text = c.__str__(encoding) 805 | elif isinstance(c, Tag): 806 | s.append(c.__str__(encoding, prettyPrint, indentLevel)) 807 | if text and prettyPrint: 808 | text = text.strip() 809 | if text: 810 | if prettyPrint: 811 | s.append(" " * (indentLevel-1)) 812 | s.append(text) 813 | if prettyPrint: 814 | s.append("\n") 815 | return ''.join(s) 816 | 817 | #Soup methods 818 | 819 | def find(self, name=None, attrs={}, recursive=True, text=None, 820 | **kwargs): 821 | """Return only the first child of this Tag matching the given 822 | criteria.""" 823 | r = None 824 | l = self.findAll(name, attrs, recursive, text, 1, **kwargs) 825 | if l: 826 | r = l[0] 827 | return r 828 | findChild = find 829 | 830 | def findAll(self, name=None, attrs={}, recursive=True, text=None, 831 | limit=None, **kwargs): 832 | """Extracts a list of Tag objects that match the given 833 | criteria. 
        You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.recursiveChildGenerator
        if not recursive:
            generator = self.childGenerator
        return self._findAll(name, attrs, text, limit, generator, **kwargs)
    findChildren = findAll

    # Pre-3.x compatibility methods
    first = find
    fetch = findAll

    def fetchText(self, text=None, recursive=True, limit=None):
        return self.findAll(text=text, recursive=recursive, limit=limit)

    def firstText(self, text=None, recursive=True):
        return self.find(text=text, recursive=recursive)

    #Private methods

    def _getAttrMap(self):
        """Initializes a map representation of this tag's attributes,
        if not already initialized."""
        if not getattr(self, 'attrMap'):
            self.attrMap = {}
            for (key, value) in self.attrs:
                self.attrMap[key] = value
        return self.attrMap

    #Generator methods
    def childGenerator(self):
        # Just use the iterator from the contents
        return iter(self.contents)

    def recursiveChildGenerator(self):
        if not len(self.contents):
            raise StopIteration
        stopNode = self._lastRecursiveChild().next
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next


# Next, a couple classes to represent queries and their results.
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isinstance(attrs, basestring):
            kwargs['class'] = _match_css_class(attrs)
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text

    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
                        if hasattr(markupAttrs, 'get'):
                            markupAttrMap = markupAttrs
                        else:
                            markupAttrMap = {}
                            for k,v in markupAttrs:
                                markupAttrMap[k] = v
                    attrValue = markupAttrMap.get(attr)
                    if not self._matches(attrValue, matchAgainst):
                        match = False
                        break
            if match:
                if markup:
                    found = markup
                else:
                    found = markupName
        return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, "__iter__") \
                and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, basestring):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception, "I don't know how to match against a %s" \
                  % markup.__class__
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst is True:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            #Custom match methods take the tag as an argument, but all
            #other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup and not isinstance(markup, basestring):
                markup = unicode(markup)
            #Now we know that chunk is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif hasattr(matchAgainst, '__iter__'): # list-like
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                result = markup.has_key(matchAgainst)
            elif matchAgainst and isinstance(markup, basestring):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        # Initialize this list itself, not a throwaway temporary.
        list.__init__(self)
        self.source = source

# Now, some helper functions.

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    RESET_NESTING_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            #It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif hasattr(portion, '__iter__'): # is a list
            #It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            #It's a scalar. Map it to the default.
            built[portion] = default
    return built

# Now, the parser classes.

class BeautifulStoneSoup(Tag, SGMLParser):

    """This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".

    [Another possible explanation is "<foo><bar /></foo>", but since
    this class defines no SELF_CLOSING_TAGS, it will never use that
    explanation.]

    This class is useful for parsing XML or made-up markup languages,
    or when BeautifulSoup makes an assumption counter to what you were
    expecting."""

    SELF_CLOSING_TAGS = {}
    NESTABLE_TAGS = {}
    RESET_NESTING_TAGS = {}
    QUOTE_TAGS = {}
    PRESERVE_WHITESPACE_TAGS = []

    MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile('<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
            if convertEntities == self.HTML_ENTITIES:
                self.convertXMLEntities = False
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = True
            elif convertEntities == self.XHTML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = True
                self.escapeUnrecognizedEntities = False
            elif convertEntities == self.XML_ENTITIES:
                self.convertXMLEntities = True
                self.convertHTMLEntities = False
                self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)

        if hasattr(markup, 'read'):        # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None                 # The markup can now be GCed

    def convert_charref(self, name):
        """This method fixes a bug in Python's SGMLParser."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127 : # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)

    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit\
                     (markup, [self.fromEncoding, inDocumentEncoding],
                      smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not hasattr(self.markupMassage, "__iter__"):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.reset()

        SGMLParser.feed(self, markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

    def __getattr__(self, methodName):
        """This method routes method call requests to either the SGMLParser
        superclass or the Tag superclass, depending on the method name."""
        #print "__getattr__ called on %s.%s" % (self.__class__, methodName)

        if methodName.startswith('start_') or methodName.startswith('end_') \
               or methodName.startswith('do_'):
            return SGMLParser.__getattr__(self, methodName)
        elif not methodName.startswith('__'):
            return Tag.__getattr__(self, methodName)
        else:
            raise AttributeError

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        SGMLParser.reset(self)
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()

        #print "Pop", tag.name
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)


    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

    def _smartPop(self, name):

        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:

         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers != None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!"
            #    % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!"
            #    % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

    def handle_data(self, data):
        self.currentData.append(data)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.endData()
        self.handle_data(text)
        self.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

    def handle_comment(self, text):
        "Handle comments as Comment objects."
        self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.convertEntities:
            data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)

    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.convertXMLEntities:
            data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if not data and self.convertHTMLEntities and \
            not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
            # TODO: We've got a problem here.
            # We're told this is
            # an entity reference, but it's not an XML entity
            # reference or an HTML entity reference. Nonetheless,
            # the logical thing to do is to pass it through as an
            # unrecognized entity reference.
            #
            # Except: when the input is "&carol;" this function
            # will be called with input "carol". When the input is
            # "AT&T", this function will be called with input
            # "T". We have no way of knowing whether a semicolon
            # was present originally, so we don't know whether
            # this is an unknown entity or just a misplaced
            # ampersand.
            #
            # The more common case is a misplaced ampersand, so I
            # escape the ampersand and omit the trailing semicolon.
            data = "&amp;%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)

    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data.
        Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j

class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (i.e. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if not kwargs.has_key('smartQuotesTo'):
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br' , 'hr', 'input', 'img', 'meta',
                                     'spacer', 'link', 'frame', 'base', 'col'))

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center')

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)

    # Used to detect the charset in a META tag; see start_meta
    CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)

    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
                    pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True

class StopParsing(Exception):
    pass

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first b tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.

    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case.
    This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
      'big')

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)

class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that