├── Chapter 11 - Regular Expressions
│   ├── chapter11.py
│   ├── regex_sum_201873.txt
│   └── regex_sum_42.txt
├── Chapter 12 - Networks and Sockets
│   ├── chapter12.py
│   └── examples.py
├── Chapter 12 - Programs that Surf the Web
│   ├── BeautifulSoup.py
│   ├── chapter12_2.py
│   └── chapter12_3.py
├── Chapter 13 - JSON and the REST Architecture
│   ├── geo_assignment.py
│   └── json_assignment.py
├── Chapter 13 - Web Services and XML
│   └── xml_assignment.py
└── README.md

/Chapter 11 - Regular Expressions/chapter11.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | # Chapter 11 Assignment: Extracting Data With Regular Expressions
4 |
5 | # Finding Numbers in a Haystack
6 |
7 | # In this assignment you will read through and parse a file with text and
8 | # numbers. You will extract all the numbers in the file and compute the sum of
9 | # the numbers.
10 |
11 | # Data Files
12 | # We provide two files for this assignment. One is a sample file where we give
13 | # you the sum for your testing and the other is the actual data you need to
14 | # process for the assignment.
15 |
16 | # Sample data: http://python-data.dr-chuck.net/regex_sum_42.txt
17 | # (There are 87 values with a sum=445822)
18 | # Actual data: http://python-data.dr-chuck.net/regex_sum_201873.txt
19 | # (There are 96 values and the sum ends with 156)
20 |
21 | # These links open in a new window. Make sure to save the file into the same
22 | # folder as you will be writing your Python program. Note: Each student will
23 | # have a distinct data file for the assignment - so only use your own data file
24 | # for analysis.
25 |
26 | # Data Format
27 | # The file contains much of the text from the introduction of the textbook
28 | # except that random numbers are inserted throughout the text. Here is a sample
29 | # of the data you might see:
30 |
31 | '''
32 | Why should you learn to write programs? 7746
33 | 12 1929 8827
34 | Writing programs (or programming) is a very creative
35 | 7 and rewarding activity. You can write programs for
36 | many reasons, ranging from making your living to solving
37 | 8837 a difficult data analysis problem to having fun to helping 128
38 | someone else solve a problem. This book assumes that
39 | everyone needs to know how to program ...
40 | '''
41 |
42 | # The sum for the sample text above is 27486. The numbers can appear anywhere
43 | # in the line. There can be any number of numbers in each line (including none).
44 |
45 | # Handling The Data
46 | # The basic outline of this problem is to read the file, look for integers
47 | # using re.findall() with the regular expression '[0-9]+', then convert the
48 | # extracted strings to integers and sum them.
49 |
50 | import re
51 |
52 | total = 0
53 |
54 | # Read the data file line by line and add up every integer found.
55 | with open('regex_sum_201873.txt', 'r') as handle:
56 |     for line in handle:
57 |         numbers = re.findall('[0-9]+', line)
58 |         for number in numbers:
59 |             total = total + int(number)
60 |
61 | print(total)
62 |
--------------------------------------------------------------------------------
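As a quick sanity check of the approach in chapter11.py, the sketch below applies the same '[0-9]+' pattern to the sample text quoted in the assignment comments, which the assignment says should sum to 27486. The helper name `sum_numbers` and the `SAMPLE` constant are illustrative assumptions and are not part of the original script.

```python
import re

# Sample text copied from the assignment description in chapter11.py.
SAMPLE = """Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity. You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem. This book assumes that
everyone needs to know how to program ...
"""


def sum_numbers(text):
    """Return the sum of every run of digits in text (hypothetical helper)."""
    return sum(int(match) for match in re.findall('[0-9]+', text))


sample_total = sum_numbers(SAMPLE)
print(sample_total)  # prints 27486, matching the sum stated in the assignment
```

Running the pattern over the whole text at once gives the same result as looping line by line, since '[0-9]+' never matches across a newline.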
/Chapter 11 - Regular Expressions/regex_sum_201873.txt:
--------------------------------------------------------------------------------
1 | This file contains the actual data for your assignment - good luck!
2 |
3 |
4 | Why should you learn to write programs?
5 |
6 | Writing programs (or programming) is a very creative
7 | and rewarding activity. You can write programs for
8 | many reasons, ranging from making your living to solving
9 | a difficult data analysis problem to having fun to helping
10 | someone else solve a problem. This book assumes that
11 | everyone needs to know how to program, and that once
12 | you know how to program you will figure out what you want
13 | to 6812 do 8128 with 5282 your newfound skills.
14 | 7266
15 | 4340 We are surrounded in our daily lives with computers ranging
16 | from laptops to cell phones. We can think of these computers
17 | as our personal assistants who can take care of many things
18 | on our behalf. The hardware in our current-day computers
19 | is essentially built to continuously ask us the question,
20 | What would you like me to do next?
21 |
22 | 1013 Programmers add an operating system and a set of applications
23 | to the hardware and we end up with a Personal Digital
24 | Assistant that is quite helpful and capable of helping
25 | 6818 us do many different things. 905
26 |
27 | Our computers are fast and have vast amounts of memory and
28 | could be very helpful to us if we only knew the language to
29 | speak to explain to the computer what we would like it to
30 | do next. If we knew this language, we could tell the
31 | computer to do tasks on our behalf that were repetitive.
32 | 9504 Interestingly, the kinds of things computers can do best
33 | are often the kinds of things that we humans find boring
34 | and mind-numbing.
35 |
36 | For example, look at the first three paragraphs of this
37 | chapter and tell me the most commonly used word and how
38 | many 4425 times 3448 the 79 word is used. While you were able to read
39 | and understand the words in a few seconds, counting them
40 | is almost painful because it is not the kind of problem
41 | that human minds are designed to solve. For a computer
42 | the opposite is true, reading and understanding text
43 | 112 from a piece of paper is hard for a computer to do
44 | but counting the words and telling you how many times
45 | the most used word was used is very easy for the
46 | computer:
47 |
48 | Our personal information analysis assistant quickly
49 | told 8980 us 1151 that 6621 the word to was used sixteen times in the
50 | first three paragraphs of this chapter.
51 |
52 | This 6885 very 9707 fact 4618 that computers are good at things
53 | that humans are not is why you need to become
54 | skilled at talking computer language. Once you learn
55 | this new language, you can delegate mundane tasks
56 | to your partner (the computer), leaving more time
57 | for you to do the
58 | things that you are uniquely suited for. You bring
59 | 4130 creativity, intuition, and inventiveness to this
60 | partnership.
61 |
62 | Creativity and motivation
63 |
64 | While this book is not intended for professional programmers, professional
65 | programming can be a very rewarding job both financially and personally.
66 | Building useful, elegant, and clever programs for others to use is a very
67 | creative 1483 activity. 4097 2097 Your computer or Personal Digital Assistant (PDA)
68 | usually contains many different programs from many different groups of
69 | programmers, each competing for your attention and interest. They try
70 | their best to meet your needs and give you a great user experience in the
71 | process. In some situations, when you choose a piece of software, the
72 | programmers 6518 are 6625 directly 2923 compensated because of your choice.
73 |
74 | If we think of programs as the creative output of groups of programmers,
75 | perhaps the following figure is a more sensible version of our PDA:
76 |
77 | For now, our primary motivation is not to make money or please end users, but
78 | instead for us to be more productive in handling the data and
79 | information 9001 that 9633 we 3868 will encounter in our lives.
80 | When you first start, you will be both the programmer and the end user of
81 | your programs. As you gain skill as a programmer and
82 | programming feels more creative to you, your thoughts may turn
83 | toward developing programs for others.
84 |
85 | Computer hardware architecture
86 |
87 | 3818 Before we start learning the language we
88 | speak to give instructions to computers to
89 | develop software, we need to learn a small amount about
90 | 4005 how computers are built.
91 |
92 | Central Processing Unit (or CPU) is
93 | the part of the computer that is built to be obsessed
94 | with what is next? If your computer is rated
95 | at three Gigahertz, it means that the CPU will ask What next?
96 | three billion times per second. You are going to have to
97 | learn how to talk fast to keep up with the CPU.
98 |
99 | 2138 Main Memory is used to store information 1235
100 | that the CPU needs in a hurry. The main memory is nearly as
101 | fast as the CPU. But the information stored in the main
102 | memory vanishes when the computer is turned off.
103 |
104 | 356 Secondary Memory is also used to store
105 | information, but it is much slower than the main memory.
106 | The 4849 advantage 4991 of 4945 the secondary memory is that it can
107 | store information even when there is no power to the
108 | computer. Examples of secondary memory are disk drives
109 | or flash memory (typically found in USB sticks and portable
110 | music players).
111 |
112 | Input and Output Devices are simply our
113 | screen, keyboard, mouse, microphone, speaker, touchpad, etc.
114 | They are all of the ways we interact with the computer.
115 |
116 | These days, most computers also have a
117 | Network Connection to retrieve information over a network.
118 | 3453 We can think of the network as a very slow place to store and
119 | retrieve data that might not always be up. So in a sense,
120 | the 216 network 6976 is 306 a slower and at times unreliable form of
121 | 2537 Secondary Memory. 2460
122 |
123 | While most of the detail of how these components work is best left
124 | to computer builders, it helps to have some terminology
125 | so we can talk about these different parts as we write our programs.
126 |
127 | 7470 As a programmer, your job is to use and orchestrate 2286
128 | each of these resources to solve the problem that you need to solve
129 | and analyze the data you get from the solution. As a programmer you will
130 | mostly be talking to the CPU and telling it what to
131 | do next. Sometimes you will tell the CPU to use the main memory,
132 | secondary memory, network, or the input/output devices.
133 |
134 | You need to be the person who answers the CPU's What next?
135 | question. But it would be very uncomfortable to shrink you
136 | 9055 down to five mm tall and insert you into the computer just so you
137 | could issue a command three billion times per second. So instead,
138 | you must write down your instructions in advance.
139 | We call these stored instructions a program and the act
140 | of writing these instructions down and getting the instructions to
141 | be correct programming.
142 |
143 | Understanding programming
144 |
145 | In the rest of this book, we will try to turn you into a person
146 | who is skilled in the art of programming. In the end you will be a
147 | programmer --- perhaps not a professional programmer, but
148 | at least you will have the skills to look at a data/information
149 | analysis problem and develop a program to solve the problem.
150 |
151 | 7578 problem solving 2736
152 |
153 | In a sense, you need two skills to be a programmer:
154 |
155 | First, you need to know the programming language (Python) -
156 | you need to know the vocabulary and the grammar. You need to be able
157 | 1984 to spell the words in this new language properly and know how to construct 391
158 | well-formed 7950 sentences 7775 in 646 this new language.
159 |
160 | Second, you need to tell a story. In writing a story,
161 | you combine words and sentences to convey an idea to the reader.
162 | There is a skill and art in constructing the story, and skill in
163 | story writing is improved by doing some writing and getting some
164 | feedback. In programming, our program is the story and the
165 | problem you are trying to solve is the idea.
166 |
167 | itemize
168 |
169 | Once you learn one programming language such as Python, you will
170 | find it much easier to learn a second programming language such
171 | as JavaScript or C++. The new programming language has very different
172 | vocabulary and grammar but the problem-solving skills
173 | will be the same across all programming languages.
174 |
175 | You will learn the vocabulary and sentences of Python pretty quickly.
176 | It will take longer for you to be able to write a coherent program
177 | to solve a brand-new problem. We teach programming much like we teach
178 | 233 writing. We start reading and explaining programs, then we write 2072
179 | simple programs, and then we write increasingly complex programs over time.
180 | At some point you get your muse and see the patterns on your own
181 | and can see more naturally how to take a problem and
182 | write a program that solves that problem. And once you get
183 | 7356 to that point, programming becomes a very pleasant and creative process. 5137
184 |
185 | 9004 We start with the vocabulary and structure of Python programs. Be patient 1422
186 | as the simple examples remind you of when you started reading for the first
187 | time.
188 |
189 | Words and sentences
190 |
191 | Unlike human languages, the Python vocabulary is actually pretty small.
192 | We call this vocabulary the reserved words. These are words that
193 | have very special meaning to Python. When Python sees these words in
194 | a Python program, they have one and only one meaning to Python. Later
195 | as you write programs you will make up your own words that have meaning to
196 | you called variables. You will have great latitude in choosing
197 | your names for your variables, but you cannot use any of Python's
198 | reserved words as a name for a variable.
199 |
200 | When we train a dog, we use special words like
201 | sit, stay, and fetch. When you talk to a dog and
202 | don't use any of the reserved words, they just look at you with a
203 | quizzical look on their face until you say a reserved word.
204 | For example, if you say,
205 | I wish more people would walk to improve their overall health,
206 | what most dogs likely hear is,
207 | blah 8934 blah 8389 blah 9878 walk blah blah blah blah.
208 | That is because walk is a reserved word in dog language.
209 |
210 | The reserved words in the language where humans talk to
211 | Python 690 include 6621 the 8577 following:
212 |
213 | and del from not while
214 | as 3753 5414 955 elif global or with
215 | assert else if pass yield
216 | break except import print
217 | class 6535 6010 7650 exec in raise
218 | continue finally is return
219 | def for lambda try
220 |
221 | That is it, and unlike a dog, Python is already completely trained.
222 | When you say try, Python will try every time you say it without
223 | fail.
224 |
225 | We will learn these reserved words and how they are used in good time,
226 | but for now we will focus on the Python equivalent of speak (in
227 | human-to-dog language). The nice thing about telling Python to speak
228 | is that we can even tell it what to say by giving it a message in quotes:
229 |
230 | And we have even written our first syntactically correct Python sentence.
231 | Our sentence starts with the reserved word print followed
232 | by a string of text of our choosing enclosed in single quotes.
233 |
234 | Conversing with Python
235 |
236 | Now that we have a word and a simple sentence that we know in Python,
237 | we need to know how to start a conversation with Python to test
238 | our new language skills.
239 |
240 | Before you can converse with Python, you must first install the Python
241 | software on your computer and learn how to start Python on your
242 | computer. That is too much detail for this chapter so I suggest
243 | that you consult www.pythonlearn.com where I have detailed
244 | instructions and screencasts of setting up and starting Python
245 | on Macintosh and Windows systems. At some point, you will be in
246 | a terminal or command window and you will type python and
247 | the Python interpreter will start executing in interactive mode
248 | and appear somewhat as follows:
249 | interactive mode
250 |
251 | 3910 The >>> prompt is the Python interpreter's way of asking you, What
252 | do you want me to do next? Python is ready to have a conversation with
253 | you. All you have to know is how to speak the Python language.
254 |
255 | Let's say for example that you did not know even the simplest Python language
256 | words or sentences. You might want to use the standard line that astronauts
257 | use when they land on a faraway planet and try to speak with the inhabitants
258 | of the planet:
259 |
260 | This is not going so well. Unless you think of something quickly,
261 | the inhabitants of the planet are likely to stab you with their spears,
262 | put you on a spit, roast you over a fire, and eat you for dinner.
263 |
264 | At this point, you should also realize that while Python
265 | is amazingly complex and powerful and very picky about
266 | the syntax you use to communicate with it, Python is
267 | not intelligent. You are really just having a conversation
268 | with yourself, but using proper syntax.
269 |
270 | In a sense, when you use a program written by someone else
271 | the conversation is between you and those other
272 | programmers with Python acting as an intermediary. Python
273 | is a way for the creators of programs to express how the
274 | conversation is supposed to proceed. And
275 | in just a few more chapters, you will be one of those
276 | programmers using Python to talk to the users of your program.
277 |
278 | Before we leave our first conversation with the Python
279 | interpreter, you should probably know the proper way
280 | to say good-bye when interacting with the inhabitants
281 | of Planet Python:
282 |
283 | You will notice that the error is different for the first two
284 | incorrect attempts. The second error is different because
285 | if is a reserved word and Python saw the reserved word
286 | and thought we were trying to say something but got the syntax
287 | of the sentence wrong.
288 |
289 | Terminology: interpreter and compiler
290 | 5292 1540
291 | Python is a high-level language intended to be relatively
292 | straightforward for humans to read and write and for computers
293 | to read and process. Other high-level languages include Java, C++,
294 | 1565 PHP, Ruby, Basic, Perl, JavaScript, and many more. The actual hardware
295 | inside the Central Processing Unit (CPU) does not understand any
296 | of these high-level languages.
297 |
298 | The CPU understands a language we call machine language. Machine
299 | language is very simple and frankly very tiresome to write because it
300 | is represented all in zeros and ones.
301 |
302 | Machine language seems quite simple on the surface, given that there
303 | are 1668 only 9393 zeros 3135 and ones, but its syntax is even more complex
304 | and far more intricate than Python. So very few programmers ever write
305 | machine language. Instead we build various translators to allow
306 | programmers to write in high-level languages like Python or JavaScript
307 | and these translators convert the programs to machine language for actual
308 | execution by the CPU.
309 |
310 | Since machine language is tied to the computer hardware, machine language
311 | is not portable across different types of hardware. Programs written in
312 | high-level languages can be moved between different computers by using a
313 | different 2884 interpreter 7212 on 8076 the new machine or recompiling the code to create
314 | a machine language version of the program for the new machine.
315 | 1399
316 | These programming language translators fall into two general categories:
317 | (one) interpreters and (two) compilers.
318 |
319 | An interpreter reads the source code of the program as written by the
320 | programmer, parses the source code, and interprets the instructions on the fly.
321 | Python is an interpreter and when we are running Python interactively,
322 | we can type a line of Python (a sentence) and Python processes it immediately
323 | and is ready for us to type another line of Python.
324 |
325 | Some of the lines of Python tell Python that you want it to remember some
326 | value for later. We need to pick a name for that value to be remembered and
327 | we can use that symbolic name to retrieve the value later. We use the
328 | term variable to refer to the labels we use to refer to this stored data.
329 |
330 | In this example, we ask Python to remember the value six and use the label x
331 | so we can retrieve the value later. We verify that Python has actually remembered
332 | the value using x and multiply
333 | it by seven and put the newly computed value in y. Then we ask Python to print out
334 | 4401 the value currently in y. 9505
335 |
336 | Even though we are typing these commands into Python one line at a time, Python
337 | is treating them as an ordered sequence of statements with later statements able
338 | to retrieve data created in earlier statements. We are writing our first
339 | simple paragraph with four sentences in a logical and meaningful order.
340 |
341 | It is the nature of an interpreter to be able to have an interactive conversation
342 | as shown above. A compiler needs to be handed the entire program in a file, and then
343 | it runs a process to translate the high-level source code into machine language
344 | and then the compiler puts the resulting machine language into a file for later
345 | execution.
346 | 119
347 | If you have a Windows system, often these executable machine language programs have a
348 | suffix of .exe or .dll which stand for executable and dynamic link
349 | library respectively. In Linux and Macintosh, there is no suffix that uniquely marks
350 | a file as executable.
351 |
352 | 1505 If you were to open an executable file in a text editor, it would look
353 | completely crazy and be unreadable:
354 |
355 | It is not easy to read or write machine language, so it is nice that we have
356 | compilers that allow us to write in high-level
357 | languages like Python or C.
358 |
359 | 2832 Now at this point in our discussion of compilers and interpreters, you should
360 | be wondering a bit about the Python interpreter itself. What language is
361 | it written in? Is it written in a compiled language? When we type
362 | 9758 python, what exactly is happening? 5097
363 |
364 | The Python interpreter is written in a high-level language called C.
365 | You can look at the actual source code for the Python interpreter by
366 | going to www.python.org and working your way to their source code.
367 | So Python is a program itself and it is compiled into machine code.
368 | When you installed Python on your computer (or the vendor installed it),
369 | you copied a machine-code copy of the translated Python program onto your
370 | system. In Windows, the executable machine code for Python itself is likely
371 | in a file.
372 |
373 | That is more than you really need to know to be a Python programmer, but
374 | 1052 sometimes it pays to answer those little nagging questions right at
375 | the beginning.
376 |
377 | Writing a program
378 |
379 | Typing commands into the Python interpreter is a great way to experiment
380 | with Python's features, but it is not recommended for solving more complex problems.
381 |
382 | When we want to write a program,
383 | we 7972 use 7973 a 3439 text editor to write the Python instructions into a file,
384 | 7361 which is called a script. By 6771
385 | convention, Python scripts have names that end with .py.
386 |
387 | script
388 |
389 | To execute the script, you have to tell the Python interpreter
390 | the name of the file. In a Unix or Windows command window,
391 | you would type python hello.py as follows:
392 |
393 | We call the Python interpreter and tell it to read its source code from
394 | the file hello.py instead of prompting us for lines of Python code
395 | interactively.
396 |
397 | You will notice that there was no need to have quit() at the end of
398 | the Python program in the file. When Python is reading your source code
399 | from a file, it knows to stop when it reaches the end of the file.
400 |
401 | What is a program?
402 |
403 | The definition of a program at its most basic is a sequence
404 | of Python statements that have been crafted to do something.
405 | Even our simple hello.py script is a program. It is a one-line
406 | program and is not particularly useful, but in the strictest definition,
407 | it is a Python program.
408 |
409 | It might be easiest to understand what a program is by thinking about a problem
410 | that a program might be built to solve, and then looking at a program
411 | that would solve that problem.
412 |
413 | Lets say you are doing Social Computing research on Facebook posts and
414 | you are interested in the most frequently used word in a series of posts.
415 | You could print out the stream of Facebook posts and pore over the text
416 | looking for the most common word, but that would take a long time and be very
417 | mistake prone. You would be smart to write a Python program to handle the
418 | task quickly and accurately so you can spend the weekend doing something
419 | fun.
420 |
421 | For example, look at the following text about a clown and a car. Look at the
422 | text and figure out the most common word and how many times it occurs.
423 |
424 | Then imagine that you are doing this task looking at millions of lines of
425 | text. Frankly it would be quicker for you to learn Python and write a
426 | Python program to count the words than it would be to manually
427 | scan the words.
428 |
429 | The even better news is that I already came up with a simple program to
430 | find the most common word in a text file. I wrote it,
431 | tested it, and now I am giving it to you to use so you can save some time.
432 |
433 | You don't even need to know Python to use this program. You will need to get through
434 | Chapter ten of this book to fully understand the awesome Python techniques that were
435 | used to make the program. You are the end user, you simply use the program and marvel
436 | at its cleverness and how it saved you so much manual effort.
437 | You simply type the code
438 | into a file called words.py and run it or you download the source
439 | code from http://www.pythonlearn.com/code/ and run it.
440 |
441 | This is a good example of how Python and the Python language are acting as an intermediary
442 | between you (the end user) and me (the programmer). Python is a way for us to exchange useful
443 | instruction sequences (i.e., programs) in a common language that can be used by anyone who
444 | installs Python on their computer. So neither of us are talking to Python,
445 | instead we are communicating with each other through Python.
446 |
447 | The building blocks of programs
448 |
449 | In the next few chapters, we will learn more about the vocabulary, sentence structure,
450 | paragraph structure, and story structure of Python. We will learn about the powerful
451 | capabilities of Python and how to compose those capabilities together to create useful
452 | programs.
453 |
454 | There are some low-level conceptual patterns that we use to construct programs. These
455 | constructs are not just for Python programs, they are part of every programming language
456 | from machine language up to the high-level languages.
457 |
458 | description
459 |
460 | Get data from the outside world. This might be
461 | reading data from a file, or even some kind of sensor like
462 | a microphone or GPS. In our initial programs, our input will come from the user
463 | typing data on the keyboard.
464 |
465 | Display the results of the program on a screen
466 | or store them in a file or perhaps write them to a device like a
467 | speaker to play music or speak text.
468 |
469 | Perform statements one after
470 | another in the order they are encountered in the script.
471 |
472 | Check for certain conditions and
473 | then execute or skip a sequence of statements.
474 |
475 | Perform some set of statements
476 | repeatedly, usually with
477 | some variation.
478 |
479 | Write a set of instructions once and give them a name
480 | and then reuse those instructions as needed throughout your program.
481 |
482 | description
483 |
484 | It sounds almost too simple to be true, and of course it is never
485 | so simple. It is like saying that walking is simply
486 | putting one foot in front of the other. The art
487 | of writing a program is composing and weaving these
488 | basic elements together many times over to produce something
489 | that is useful to its users.
490 |
491 | The word counting program above directly uses all of
492 | these patterns except for one.
493 |
494 | What could possibly go wrong?
495 |
496 | As we saw in our earliest conversations with Python, we must
497 | communicate very precisely when we write Python code. The smallest
498 | deviation or mistake will cause Python to give up looking at your
499 | program.
500 |
501 | Beginning programmers often take the fact that Python leaves no
502 | room for errors as evidence that Python is mean, hateful, and cruel.
503 | While Python seems to like everyone else, Python knows them
504 | personally and holds a grudge against them. Because of this grudge,
505 | Python takes our perfectly written programs and rejects them as
506 | unfit just to torment us.
507 |
508 | There is little to be gained by arguing with Python. It is just a tool.
509 | It has no emotions and it is happy and ready to serve you whenever you
510 | need it. Its error messages sound harsh, but they are just Python's
511 | call for help. It has looked at what you typed, and it simply cannot
512 | understand what you have entered.
513 |
514 | Python is much more like a dog, loving you unconditionally, having a few
515 | key words that it understands, looking you with a sweet look on its
516 | face (>>>), and waiting for you to say something it understands.
517 | When Python says SyntaxError: invalid syntax, it is simply wagging
518 | its tail and saying, You seemed to say something but I just don't
519 | understand what you meant, but please keep talking to me (>>>).
520 |
521 | As your programs become increasingly sophisticated, you will encounter three
522 | general types of errors:
523 |
524 | description
525 |
526 | These are the first errors you will make and the easiest
527 | to fix. A syntax error means that you have violated the grammar rules of Python.
528 | Python does its best to point right at the line and character where
529 | it noticed it was confused. The only tricky bit of syntax errors is that sometimes
530 | the mistake that needs fixing is actually earlier in the program than where Python
531 | noticed it was confused. So the line and character that Python indicates in
532 | a syntax error may just be a starting point for your investigation.
533 |
534 | A logic error is when your program has good syntax but there is a mistake
535 | in the order of the statements or perhaps a mistake in how the statements relate to one another.
536 | A good example of a logic error might be, take a drink from your water bottle, put it
537 | in your backpack, walk to the library, and then put the top back on the bottle.
538 |
539 | A semantic error is when your description of the steps to take
540 | is syntactically perfect and in the right order, but there is simply a mistake in
541 | the program. The program is perfectly correct but it does not do what
542 | you intended for it to do. A simple example would
543 | be if you were giving a person directions to a restaurant and said, ...when you reach
544 | the intersection with the gas station, turn left and go one mile and the restaurant
545 | is a red building on your left. Your friend is very late and calls you to tell you that
546 | they are on a farm and walking around behind a barn, with no sign of a restaurant.
547 | Then you say did you turn left or right at the gas station? and
548 | they say, I followed your directions perfectly, I have
549 | them written down, it says turn left and go one mile at the gas station. Then you say,
550 | I am very sorry, because while my instructions were syntactically correct, they
551 | sadly contained a small but undetected semantic error.
552 |
553 | description
554 |
555 | Again in all three types of errors, Python is merely trying its hardest to
556 | do exactly what you have asked.
557 |
558 | The learning journey
559 |
560 | As you progress through the rest of the book, don't be afraid if the concepts
561 | don't seem to fit together well the first time. When you were learning to speak,
562 | it was not a problem for your first few years that you just made cute gurgling noises.
563 | And it was OK if it took six months for you to move from simple vocabulary to
564 | simple sentences and took five or six more years to move from sentences to paragraphs, and a
565 | few more years to be able to write an interesting complete short story on your own.
566 |
567 | We want you to learn Python much more rapidly, so we teach it all at the same time
568 | over the next few chapters.
569 | But it is like learning a new language that takes time to absorb and understand
570 | before it feels natural.
571 | That leads to some confusion as we visit and revisit
572 | topics to try to get you to see the big picture while we are defining the tiny
573 | fragments that make up that big picture. While the book is written linearly, and
574 | if you are taking a course it will progress in a linear fashion, don't hesitate
575 | to be very nonlinear in how you approach the material. Look forwards and backwards
576 | and read with a light touch. By skimming more advanced material without
577 | fully understanding the details, you can get a better understanding of the why?
578 | of programming. By reviewing previous material and even redoing earlier
579 | exercises, you will realize that you actually learned a lot of material even
580 | if the material you are currently staring at seems a bit impenetrable.
581 |
582 | Usually when you are learning your first programming language, there are a few
583 | wonderful Ah Hah! moments where you can look up from pounding away at some rock
584 | with a hammer and chisel and step away and see that you are indeed building
585 | a beautiful sculpture.
586 |
587 | If something seems particularly hard, there is usually no value in staying up all
588 | night and staring at it. Take a break, take a nap, have a snack, explain what you
589 | are having a problem with to someone (or perhaps your dog), and then come back to it with
590 | fresh eyes. I assure you that once you learn the programming concepts in the book
591 | you will look back and see that it was all really easy and elegant and it simply
592 | took you a bit of time to absorb it.
593 | 42
594 | The end
595 |
--------------------------------------------------------------------------------
/Chapter 11 - Regular Expressions/regex_sum_42.txt:
--------------------------------------------------------------------------------
1 | This file contains the sample data
2 |
3 |
4 | Why should you learn to write programs?
5 |
6 | Writing programs (or programming) is a very creative
7 | and rewarding activity. You can write programs for
8 | 3036 many reasons, ranging from making your living to solving 7209
9 | a difficult data analysis problem to having fun to helping
10 | someone else solve a problem. This book assumes that
11 | everyone needs to know how to program, and that once
12 | you know how to program you will figure out what you want
13 | to do with your newfound skills.
14 |
15 | We are surrounded in our daily lives with computers ranging
16 | from laptops to cell phones. We can think of these computers
17 | as 4497 our 6702 personal 8454 assistants who can take care of many things
18 | 7449 on our behalf. The hardware in our current-day computers
19 | is essentially built to continuously ask us the question,
20 | What would you like me to do next?
21 |
22 | Programmers add an operating system and a set of applications
23 | to the hardware and we end up with a Personal Digital
24 | Assistant that is quite helpful and capable of helping
25 | us do many different things.
26 |
27 | Our computers are fast and have vast amounts of memory and
28 | could be very helpful to us if we only knew the language to
29 | speak to explain to the computer what we would like it to
30 | do 3665 next. 7936 9772 If we knew this language, we could tell the
31 | computer to do tasks on our behalf that were repetitive.
32 | Interestingly, the kinds of things computers can do best
33 | are often the kinds of things that we humans find boring
34 | and mind-numbing.
35 |
36 | For example, look at the first three paragraphs of this
37 | chapter and tell me the most commonly used word and how
38 | many times the word is used. While you were able to read
39 | and understand the words in a few seconds, counting them
40 | is almost painful because it is not the kind of problem
41 | that human minds are designed to solve. For a computer
42 | the opposite is true, reading and understanding text
43 | from a piece of paper is hard for a computer to do
44 | but counting the words and telling you how many times
45 | the most used word was used is very easy for the
46 | computer:
47 | 7114
48 | Our personal information analysis assistant quickly
49 | told us that the word to was used sixteen times in the
50 | first three paragraphs of this chapter.
51 |
52 | This very fact that computers are good at things
53 | that humans are not is why you need to become
54 | skilled at talking computer language. Once you learn
55 | 956 this new language, you can delegate mundane tasks 2564
56 | to 8003 your 1704 partner 3816 (the computer), leaving more time
57 | for you to do the
58 | things that you are uniquely suited for. You bring
59 | creativity, intuition, and inventiveness to this
60 | partnership.
61 |
62 | Creativity and motivation
63 |
64 | While this book is not intended for professional programmers, professional
65 | programming can be a very rewarding job both financially and personally.
66 | Building useful, elegant, and clever programs for others to use is a very
67 | creative 6662 activity. 5858 7777 Your computer or Personal Digital Assistant (PDA)
68 | usually contains many different programs from many different groups of
69 | programmers, each competing for your attention and interest. They try
70 | their best to meet your needs and give you a great user experience in the
71 | process. In some situations, when you choose a piece of software, the
72 | programmers are directly compensated because of your choice.
73 |
74 | If we think of programs as the creative output of groups of programmers,
75 | perhaps the following figure is a more sensible version of our PDA:
76 |
77 | For now, our primary motivation is not to make money or please end users, but
78 | instead for us to be more productive in handling the data and
79 | information that we will encounter in our lives.
80 | When you first start, you will be both the programmer and the end user of
81 | your programs. As you gain skill as a programmer and
82 | programming feels more creative to you, your thoughts may turn
83 | toward developing programs for others.
84 |
85 | Computer hardware architecture
86 |
87 | Before we start learning the language we
88 | speak to give instructions to computers to
89 | develop software, we need to learn a small amount about
90 | how computers are built.
91 |
92 | The Central Processing Unit (or CPU) is
93 | the part of the computer that is built to be obsessed
94 | with what is next? If your computer is rated
95 | at three Gigahertz, it means that the CPU will ask What next?
96 | three billion times per second. You are going to have to
97 | learn how to talk fast to keep up with the CPU.
98 |
99 | Main Memory is used to store information
100 | that the CPU needs in a hurry. The main memory is nearly as
101 | fast as the CPU. But the information stored in the main
102 | memory vanishes when the computer is turned off.
103 |
104 | Secondary Memory is also used to store
105 | 6482 information, but it is much slower than the main memory.
106 | The advantage of the secondary memory is that it can
107 | store information even when there is no power to the
108 | computer. Examples of secondary memory are disk drives
109 | or flash memory (typically found in USB sticks and portable
110 | music players).
111 | 9634
112 | Input and Output Devices are simply our
113 | screen, keyboard, mouse, microphone, speaker, touchpad, etc.
114 | They are all of the ways we interact with the computer.
115 |
116 | These days, most computers also have a
117 | Network Connection to retrieve information over a network.
118 | We can think of the network as a very slow place to store and
119 | 8805 retrieve data that might not always be up. So in a sense, 7123
120 | the network is a slower and at times unreliable form of
121 | 9703 4676 6373
122 |
123 | While most of the detail of how these components work is best left
124 | to computer builders, it helps to have some terminology
125 | so we can talk about these different parts as we write our programs.
126 |
127 | As a programmer, your job is to use and orchestrate
128 | each of these resources to solve the problem that you need to solve
129 | and analyze the data you get from the solution. As a programmer you will
130 | mostly be talking to the CPU and telling it what to
131 | do next. Sometimes you will tell the CPU to use the main memory,
132 | secondary memory, network, or the input/output devices.
133 |
134 | You need to be the person who answers the CPU's What next?
135 | question. But it would be very uncomfortable to shrink you
136 | down to five mm tall and insert you into the computer just so you
137 | could issue a command three billion times per second. So instead,
138 | you must write down your instructions in advance.
139 | We call these stored instructions a program and the act
140 | of writing these instructions down and getting the instructions to
141 | be correct programming.
142 |
143 | Understanding programming
144 |
145 | In the rest of this book, we will try to turn you into a person
146 | who is skilled in the art of programming. In the end you will be a
147 | programmer --- perhaps not a professional programmer, but
148 | at least you will have the skills to look at a data/information
149 | analysis problem and develop a program to solve the problem.
150 | 2834
151 | 7221 problem solving
152 |
153 | 2981 In a sense, you need two skills to be a programmer:
154 |
155 | First, you need to know the programming language (Python) -
156 | 5415 you need to know the vocabulary and the grammar. You need to be able
157 | to spell the words in this new language properly and know how to construct
158 | well-formed sentences in this new language.
159 |
160 | Second, you need to tell a story. In writing a story,
161 | you combine words and sentences to convey an idea to the reader.
162 | There is a skill and art in constructing the story, and skill in
163 | story writing is improved by doing some writing and getting some
164 | feedback. In programming, our program is the story and the
165 | problem you are trying to solve is the idea.
166 |
167 | itemize
168 |
169 | Once you learn one programming language such as Python, you will
170 | find it much easier to learn a second programming language such
171 | as JavaScript or C++. The new programming language has very different
172 | vocabulary and grammar but the problem-solving skills
173 | will be the same across all programming languages.
174 |
175 | You will learn the vocabulary and sentences of Python pretty quickly.
176 | It will take longer for you to be able to write a coherent program
177 | to solve a brand-new problem. We teach programming much like we teach
178 | writing. We start reading and explaining programs, then we write
179 | simple programs, and then we write increasingly complex programs over time.
180 | At some point you get your muse and see the patterns on your own
181 | and can see more naturally how to take a problem and
182 | write a program that solves that problem. And once you get
183 | 6872 to that point, programming becomes a very pleasant and creative process.
184 |
185 | We start with the vocabulary and structure of Python programs. Be patient
186 | as the simple examples remind you of when you started reading for the first
187 | time.
188 |
189 | Words and sentences
190 | 4806
191 | Unlike human languages, the Python vocabulary is actually pretty small.
192 | We call this vocabulary the reserved words. These are words that
193 | 5460 have very special meaning to Python. When Python sees these words in 8533
194 | 3538 a Python program, they have one and only one meaning to Python. Later
195 | as you write programs you will make up your own words that have meaning to
196 | you called variables. You will have great latitude in choosing
197 | your 9663 names 8001 for 9795 your variables, but you cannot use any of Python's
198 | reserved 8752 words 1117 as 5349 a name for a variable.
199 |
200 | When we train a dog, we use special words like
201 | sit, stay, and fetch. When you talk to a dog and
202 | 4509 don't use any of the reserved words, they just look at you with a
203 | quizzical look on their face until you say a reserved word.
204 | For example, if you say,
205 | I wish more people would walk to improve their overall health,
206 | what most dogs likely hear is,
207 | blah blah blah walk blah blah blah blah.
208 | That is because walk is a reserved word in dog language.
209 |
210 | The reserved words in the language where humans talk to
211 | Python include the following:
212 |
213 | and del from not while
214 | as elif global or with
215 | assert else if pass yield
216 | break except import print
217 | class exec in raise
218 | continue finally is return
219 | def for lambda try
220 |
221 | That is it, and unlike a dog, Python is already completely trained.
222 | When 1004 you 9258 say 4183 try, Python will try every time you say it without
223 | fail.
224 |
225 | 4034 We will learn these reserved words and how they are used in good time, 3342
226 | but for now we will focus on the Python equivalent of speak (in
227 | human-to-dog language). The nice thing about telling Python to speak
228 | 3482 is that we can even tell it what to say by giving it a message in quotes: 8567
229 |
230 | And we have even written our first syntactically correct Python sentence.
231 | Our sentence starts with the reserved word print followed
232 | by a string of text of our choosing enclosed in single quotes.
233 |
234 | Conversing with Python
235 |
236 | 1052 Now that we have a word and a simple sentence that we know in Python, 8135
237 | we need to know how to start a conversation with Python to test
238 | our new language skills.
239 |
240 | Before 5561 you 517 can 1218 converse with Python, you must first install the Python
241 | software on your computer and learn how to start Python on your
242 | computer. That is too much detail for this chapter so I suggest
243 | that you consult www.pythonlearn.com where I have detailed
244 | instructions and screencasts of setting up and starting Python
245 | on Macintosh and Windows systems. At some point, you will be in
246 | a terminal or command window and you will type python and
247 | 8877 the Python interpreter will start executing in interactive mode
248 | and appear somewhat as follows:
249 | interactive mode
250 |
251 | The >>> prompt is the Python interpreter's way of asking you, What
252 | do you want me to do next? Python is ready to have a conversation with
253 | you. All you have to know is how to speak the Python language.
254 |
255 | Let's say for example that you did not know even the simplest Python language
256 | words or sentences. You might want to use the standard line that astronauts
257 | use when they land on a faraway planet and try to speak with the inhabitants
258 | of the planet:
259 |
260 | This is not going so well. Unless you think of something quickly,
261 | the inhabitants of the planet are likely to stab you with their spears,
262 | put you on a spit, roast you over a fire, and eat you for dinner.
263 |
264 | At this point, you should also realize that while Python
265 | is amazingly complex and powerful and very picky about
266 | the syntax you use to communicate with it, Python is
267 | not intelligent. You are really just having a conversation
268 | with yourself, but using proper syntax.
269 | 8062 1720
270 | In a sense, when you use a program written by someone else
271 | the conversation is between you and those other
272 | programmers with Python acting as an intermediary. Python
273 | is a way for the creators of programs to express how the
274 | conversation is supposed to proceed. And
275 | in just a few more chapters, you will be one of those
276 | programmers using Python to talk to the users of your program.
277 |
278 | 279 Before we leave our first conversation with the Python
279 | interpreter, you should probably know the proper way
280 | to say good-bye when interacting with the inhabitants
281 | of Planet Python:
282 |
283 | 2054 You will notice that the error is different for the first two 801
284 | incorrect attempts. The second error is different because
285 | if is a reserved word and Python saw the reserved word
286 | and thought we were trying to say something but got the syntax
287 | of the sentence wrong.
288 |
289 | Terminology: interpreter and compiler
290 |
291 | Python is a high-level language intended to be relatively
292 | straightforward for humans to read and write and for computers
293 | to read and process. Other high-level languages include Java, C++,
294 | 918 PHP, Ruby, Basic, Perl, JavaScript, and many more. The actual hardware
295 | inside the Central Processing Unit (CPU) does not understand any
296 | of these high-level languages.
297 |
298 | The CPU understands a language we call machine language. Machine
299 | language is very simple and frankly very tiresome to write because it
300 | is represented all in zeros and ones.
301 |
302 | Machine language seems quite simple on the surface, given that there
303 | are only zeros and ones, but its syntax is even more complex
304 | 8687 and far more intricate than Python. So very few programmers ever write
305 | machine language. Instead we build various translators to allow
306 | programmers to write in high-level languages like Python or JavaScript
307 | and these translators convert the programs to machine language for actual
308 | execution by the CPU.
309 |
310 | Since machine language is tied to the computer hardware, machine language
311 | is not portable across different types of hardware. Programs written in
312 | high-level languages can be moved between different computers by using a
313 | different interpreter on the new machine or recompiling the code to create
314 | a machine language version of the program for the new machine.
315 |
316 | These programming language translators fall into two general categories:
317 | (one) interpreters and (two) compilers.
318 | 7073 1865 7084
319 | An interpreter reads the source code of the program as written by the
320 | programmer, parses the source code, and interprets the instructions on the fly.
321 | Python is an interpreter and when we are running Python interactively,
322 | we can type a line of Python (a sentence) and Python processes it immediately
323 | and is ready for us to type another line of Python.
324 | 2923 63
325 | Some of the lines of Python tell Python that you want it to remember some
326 | value for later. We need to pick a name for that value to be remembered and
327 | we can use that symbolic name to retrieve the value later. We use the
328 | term variable to refer to the labels we use to refer to this stored data.
329 |
330 | In this example, we ask Python to remember the value six and use the label x
331 | so we can retrieve the value later. We verify that Python has actually remembered
332 | the value using x and multiply
333 | it by seven and put the newly computed value in y. Then we ask Python to print out
334 | 8824 the value currently in y.
335 | 1079 5801 5047
336 | Even though we are typing these commands into Python one line at a time, Python
337 | is treating them as an ordered sequence of statements with later statements able
338 | to retrieve data created in earlier statements. We are writing our first
339 | simple paragraph with four sentences in a logical and meaningful order.
340 | 5
341 | It is the nature of an interpreter to be able to have an interactive conversation
342 | as shown above. A compiler needs to be handed the entire program in a file, and then
343 | it runs a process to translate the high-level source code into machine language
344 | 2572 and then the compiler puts the resulting machine language into a file for later
345 | execution.
346 |
347 | If you have a Windows system, often these executable machine language programs have a
348 | suffix of .exe or .dll which stand for executable and dynamic link
349 | library respectively. In Linux and Macintosh, there is no suffix that uniquely marks
350 | a file as executable.
351 |
352 | If you were to open an executable file in a text editor, it would look
353 | completely crazy and be unreadable:
354 |
355 | It is not easy to read or write machine language, so it is nice that we have
356 | compilers that allow us to write in high-level
357 | languages like Python or C.
358 |
359 | Now at this point in our discussion of compilers and interpreters, you should
360 | be 5616 wondering 171 a 3062 bit about the Python interpreter itself. What language is
361 | 9552 it written in? Is it written in a compiled language? When we type 7655
362 | python, 829 what 6096 exactly 2312 is happening?
363 |
364 | The Python interpreter is written in a high-level language called C.
365 | You can look at the actual source code for the Python interpreter by
366 | going to www.python.org and working your way to their source code.
367 | So Python is a program itself and it is compiled into machine code.
368 | When you installed Python on your computer (or the vendor installed it),
369 | 6015 you copied a machine-code copy of the translated Python program onto your 7100
370 | system. In Windows, the executable machine code for Python itself is likely
371 | in a file.
372 |
373 | That is more than you really need to know to be a Python programmer, but
374 | sometimes it pays to answer those little nagging questions right at
375 | the beginning.
376 |
377 | Writing a program
378 |
379 | Typing commands into the Python interpreter is a great way to experiment
380 | with Python's features, but it is not recommended for solving more complex problems.
381 |
382 | When we want to write a program,
383 | we use a text editor to write the Python instructions into a file,
384 | which 9548 is 2727 called 1792 a script. By
385 | convention, Python scripts have names that end with .py.
386 |
387 | script
388 |
389 | To execute the script, you have to tell the Python interpreter
390 | the name of the file. In a Unix or Windows command window,
391 | you would type python hello.py as follows:
392 |
393 | We call the Python interpreter and tell it to read its source code from
394 | the file hello.py instead of prompting us for lines of Python code
395 | interactively.
396 |
397 | You will notice that there was no need to have quit() at the end of
398 | the Python program in the file. When Python is reading your source code
399 | from a file, it knows to stop when it reaches the end of the file.
400 |
401 | 8402 What is a program?
402 |
403 | The definition of a program at its most basic is a sequence
404 | of Python statements that have been crafted to do something.
405 | Even our simple hello.py script is a program. It is a one-line
406 | program and is not particularly useful, but in the strictest definition,
407 | it is a Python program.
408 |
409 | It might be easiest to understand what a program is by thinking about a problem
410 | that a program might be built to solve, and then looking at a program
411 | that would solve that problem.
412 |
413 | Let's say you are doing Social Computing research on Facebook posts and
414 | you are interested in the most frequently used word in a series of posts.
415 | You could print out the stream of Facebook posts and pore over the text
416 | looking for the most common word, but that would take a long time and be very
417 | mistake prone. You would be smart to write a Python program to handle the
418 | task quickly and accurately so you can spend the weekend doing something
419 | fun.
420 |
421 | For example, look at the following text about a clown and a car. Look at the
422 | text and figure out the most common word and how many times it occurs.
423 |
424 | Then imagine that you are doing this task looking at millions of lines of
425 | text. Frankly it would be quicker for you to learn Python and write a
426 | Python program to count the words than it would be to manually
427 | scan the words.
428 |
429 | The even better news is that I already came up with a simple program to
430 | find the most common word in a text file. I wrote it,
431 | tested it, and now I am giving it to you to use so you can save some time.
432 |
433 | You don't even need to know Python to use this program. You will need to get through
434 | Chapter ten of this book to fully understand the awesome Python techniques that were
435 | used to make the program. You are the end user, you simply use the program and marvel
436 | at its cleverness and how it saved you so much manual effort.
437 | You simply type the code
438 | into a file called words.py and run it or you download the source
439 | code from http://www.pythonlearn.com/code/ and run it.
440 |
441 | This is a good example of how Python and the Python language are acting as an intermediary
442 | between you (the end user) and me (the programmer). Python is a way for us to exchange useful
443 | instruction sequences (i.e., programs) in a common language that can be used by anyone who
444 | installs Python on their computer. So neither of us is talking to Python,
445 | instead we are communicating with each other through Python.
446 |
447 | The building blocks of programs
448 |
449 | In the next few chapters, we will learn more about the vocabulary, sentence structure,
450 | paragraph structure, and story structure of Python. We will learn about the powerful
451 | capabilities of Python and how to compose those capabilities together to create useful
452 | programs.
453 |
454 | There are some low-level conceptual patterns that we use to construct programs. These
455 | constructs are not just for Python programs, they are part of every programming language
456 | from machine language up to the high-level languages.
457 |
458 | description
459 |
460 | Input: Get data from the outside world. This might be
461 | reading data from a file, or even some kind of sensor like
462 | a microphone or GPS. In our initial programs, our input will come from the user
463 | typing data on the keyboard.
464 |
465 | Output: Display the results of the program on a screen
466 | or store them in a file or perhaps write them to a device like a
467 | speaker to play music or speak text.
468 |
469 | Sequential execution: Perform statements one after
470 | another in the order they are encountered in the script.
471 |
472 | Conditional execution: Check for certain conditions and
473 | then execute or skip a sequence of statements.
474 |
475 | Repeated execution: Perform some set of statements
476 | repeatedly, usually with
477 | some variation.
478 |
479 | Reuse: Write a set of instructions once and give them a name
480 | and then reuse those instructions as needed throughout your program.
481 |
482 | description
483 |
484 | It sounds almost too simple to be true, and of course it is never
485 | so simple. It is like saying that walking is simply
486 | putting one foot in front of the other. The art
487 | of writing a program is composing and weaving these
488 | basic elements together many times over to produce something
489 | that is useful to its users.
490 |
491 | The word counting program above directly uses all of
492 | these patterns except for one.
493 |
494 | What could possibly go wrong?
495 |
496 | As we saw in our earliest conversations with Python, we must
497 | communicate very precisely when we write Python code. The smallest
498 | deviation or mistake will cause Python to give up looking at your
499 | program.
500 |
501 | Beginning programmers often take the fact that Python leaves no
502 | room for errors as evidence that Python is mean, hateful, and cruel.
503 | While Python seems to like everyone else, Python knows them
504 | personally and holds a grudge against them. Because of this grudge,
505 | Python takes our perfectly written programs and rejects them as
506 | unfit just to torment us.
507 |
508 | There is little to be gained by arguing with Python. It is just a tool.
509 | It has no emotions and it is happy and ready to serve you whenever you
510 | need it. Its error messages sound harsh, but they are just Python's
511 | call for help. It has looked at what you typed, and it simply cannot
512 | understand what you have entered.
513 |
514 | Python is much more like a dog, loving you unconditionally, having a few
515 | key words that it understands, looking at you with a sweet look on its
516 | face (>>>), and waiting for you to say something it understands.
517 | When Python says SyntaxError: invalid syntax, it is simply wagging
518 | its tail and saying, You seemed to say something but I just don't
519 | understand what you meant, but please keep talking to me (>>>).
520 |
521 | As your programs become increasingly sophisticated, you will encounter three
522 | general types of errors:
523 |
524 | description
525 |
526 | Syntax errors are the first errors you will make and the easiest
527 | to fix. A syntax error means that you have violated the grammar rules of Python.
528 | Python does its best to point right at the line and character where
529 | it noticed it was confused. The only tricky bit of syntax errors is that sometimes
530 | the mistake that needs fixing is actually earlier in the program than where Python
531 | noticed it was confused. So the line and character that Python indicates in
532 | a syntax error may just be a starting point for your investigation.
533 |
534 | A logic error is when your program has good syntax but there is a mistake
535 | in the order of the statements or perhaps a mistake in how the statements relate to one another.
536 | A good example of a logic error might be, take a drink from your water bottle, put it
537 | in your backpack, walk to the library, and then put the top back on the bottle.
538 |
539 | A semantic error is when your description of the steps to take
540 | is syntactically perfect and in the right order, but there is simply a mistake in
541 | the program. The program is perfectly correct but it does not do what
542 | you intended for it to do. A simple example would
543 | be if you were giving a person directions to a restaurant and said, ...when you reach
544 | the intersection with the gas station, turn left and go one mile and the restaurant
545 | is a red building on your left. Your friend is very late and calls you to tell you that
546 | they are on a farm and walking around behind a barn, with no sign of a restaurant.
547 | Then you say did you turn left or right at the gas station? and
548 | they say, I followed your directions perfectly, I have
549 | them written down, it says turn left and go one mile at the gas station. Then you say,
550 | I am very sorry, because while my instructions were syntactically correct, they
551 | sadly contained a small but undetected semantic error.
552 |
553 | description
554 |
555 | Again in all three types of errors, Python is merely trying its hardest to
556 | do exactly what you have asked.
557 |
558 | The learning journey
559 |
560 | As you progress through the rest of the book, don't be afraid if the concepts
561 | don't seem to fit together well the first time. When you were learning to speak,
562 | it was not a problem for your first few years that you just made cute gurgling noises.
563 | And it was OK if it took six months for you to move from simple vocabulary to
564 | simple sentences and took five or six more years to move from sentences to paragraphs, and a
565 | few more years to be able to write an interesting complete short story on your own.
566 |
567 | We want you to learn Python much more rapidly, so we teach it all at the same time
568 | over the next few chapters.
569 | But it is like learning a new language that takes time to absorb and understand
570 | before it feels natural.
571 | That leads to some confusion as we visit and revisit
572 | topics to try to get you to see the big picture while we are defining the tiny
573 | fragments that make up that big picture. While the book is written linearly, and
574 | if you are taking a course it will progress in a linear fashion, don't hesitate
575 | to be very nonlinear in how you approach the material. Look forwards and backwards
576 | and read with a light touch. By skimming more advanced material without
577 | fully understanding the details, you can get a better understanding of the why?
578 | of programming. By reviewing previous material and even redoing earlier
579 | exercises, you will realize that you actually learned a lot of material even
580 | if the material you are currently staring at seems a bit impenetrable.
581 |
582 | Usually when you are learning your first programming language, there are a few
583 | wonderful Ah Hah! moments where you can look up from pounding away at some rock
584 | with a hammer and chisel and step away and see that you are indeed building
585 | a beautiful sculpture.
586 |
587 | If something seems particularly hard, there is usually no value in staying up all
588 | night and staring at it. Take a break, take a nap, have a snack, explain what you
589 | are having a problem with to someone (or perhaps your dog), and then come back to it with
590 | fresh eyes. I assure you that once you learn the programming concepts in the book
591 | you will look back and see that it was all really easy and elegant and it simply
592 | took you a bit of time to absorb it.
593 | 42
594 | The end
595 |
--------------------------------------------------------------------------------
/Chapter 12 - Networks and Sockets/chapter12.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 |
3 | # Chapter 12 Assignment: Understanding the Request / Response Cycle
4 |
5 | # Exploring the HyperText Transport Protocol
6 |
7 | # You are to retrieve the following document using the HTTP protocol in a way
8 | # that you can examine the HTTP Response headers.
9 |
10 | # http://data.pr4e.org/intro-short.txt
11 | # There are three ways that you might retrieve this web page and look at the
12 | # response headers:
13 |
14 | # Preferred: Modify the socket1.py program to retrieve the above URL and print
15 | # out the headers and data. Make sure to change the code to retrieve the above
16 | # URL - the values are different for each URL.
17 |
18 | # Open the URL in a web browser with a developer console or FireBug and manually
19 | # examine the headers that are returned.
20 |
21 | # Use the telnet program as shown in lecture to retrieve the headers and content.
22 |
23 | import socket
24 |
25 | mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
26 | mysocket.connect(('www.pythonlearn.com', 80))
27 |
28 | mysocket.send('GET http://www.pythonlearn.com/code/intro-short.txt HTTP/1.1\r\nHost: www.pythonlearn.com\r\n\r\n')
29 |
30 | while True:
31 |     data = mysocket.recv(512)
32 |     if (len(data) < 1):
33 |         break
34 |     print data
35 |
36 | mysocket.close()
37 |
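The loop in chapter12.py prints each 512-byte chunk as it arrives, so the response headers and the body run together in the output. Below is a minimal sketch of separating them once the whole response has been collected. The helper name `split_http_response` and the sample bytes are hypothetical and not part of the assignment; the sketch is written for Python 3, whereas the script above is Python 2.

```python
def split_http_response(raw):
    """Split raw HTTP response bytes into (status_line, headers, body).

    Hypothetical helper for illustration: assumes the full response is
    already in memory and that header lines are separated by CRLF.
    """
    # A blank line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition(b'\r\n\r\n')
    lines = head.decode('iso-8859-1').split('\r\n')
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if ':' not in line:
            continue  # skip anything that is not a "Name: value" line
        name, _, value = line.partition(':')
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Made-up sample response, not output captured from a real server.
sample = (b'HTTP/1.1 200 OK\r\n'
          b'Content-Type: text/plain\r\n'
          b'Content-Length: 5\r\n'
          b'\r\n'
          b'hello')

status, headers, body = split_http_response(sample)
print(status)
print(headers['content-length'])
print(body)
```

To use this with the socket loop, the chunks returned by `recv` would be accumulated into one bytes object instead of being printed one at a time.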
--------------------------------------------------------------------------------
/Chapter 12 - Networks and Sockets/examples.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 |
3 | # Chapter 3 examples using either the socket library or the urllib library to read a URL.
4 |
5 | import socket
6 | import urllib
7 |
8 | # socket library example:
9 |
10 | print 'Output from the socket library example:\n'
11 |
12 | mysocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
13 | mysocket.connect(('www.py4inf.com', 80))
14 |
15 | mysocket.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n')
16 |
17 | while True:
18 |     data = mysocket.recv(512)
19 |     if (len(data) < 1):
20 |         break
21 |     print data
22 |
23 | mysocket.close()
24 |
25 | # urllib library example:
26 |
27 | print '\n\nOutput from the urllib library example:\n'
28 |
29 | myurl = urllib.urlopen('http://www.py4inf.com/code/romeo.txt')
30 |
31 | for line in myurl:
32 |     print line.rstrip()
33 |
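The socket example in examples.py ends its request with bare `\n\n`, which many servers tolerate, while the HTTP specification uses CRLF (`\r\n`) line endings, and HTTP/1.1 additionally requires a `Host` header, as chapter12.py sends. Below is a minimal sketch of building such a request as bytes, which is what `socket.send` expects on Python 3. The helper name `build_get_request` is hypothetical and is not part of the course code.

```python
def build_get_request(host, path, version='1.0'):
    """Return the bytes of a minimal HTTP GET request.

    Hypothetical helper for illustration only; it sends no extra headers
    beyond Host and does no validation of its arguments.
    """
    lines = [
        'GET %s HTTP/%s' % (path, version),
        'Host: %s' % host,
        '',  # the two empty strings produce the blank line
        '',  # that terminates the header block
    ]
    return '\r\n'.join(lines).encode('ascii')


request = build_get_request('www.py4inf.com', '/code/romeo.txt')
print(request)
```

The resulting bytes could be passed to `mysocket.send(...)` in place of the hand-written request string.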
--------------------------------------------------------------------------------
/Chapter 12 - Programs that Surf the Web/BeautifulSoup.py:
--------------------------------------------------------------------------------
1 | """Beautiful Soup
2 | Elixir and Tonic
3 | "The Screen-Scraper's Friend"
4 | http://www.crummy.com/software/BeautifulSoup/
5 |
6 | Beautiful Soup parses a (possibly invalid) XML or HTML document into a
7 | tree representation. It provides methods and Pythonic idioms that make
8 | it easy to navigate, search, and modify the tree.
9 |
10 | A well-formed XML/HTML document yields a well-formed data
11 | structure. An ill-formed XML/HTML document yields a correspondingly
12 | ill-formed data structure. If your document is only locally
13 | well-formed, you can use this library to find and process the
14 | well-formed part of it.
15 |
16 | Beautiful Soup works with Python 2.2 and up. It has no external
17 | dependencies, but you'll have more success at converting data to UTF-8
18 | if you also install these three packages:
19 |
20 | * chardet, for auto-detecting character encodings
21 | http://chardet.feedparser.org/
22 | * cjkcodecs and iconv_codec, which add more encodings to the ones supported
23 | by stock Python.
24 | http://cjkpython.i18n.org/
25 |
26 | Beautiful Soup defines classes for two main parsing strategies:
27 |
28 | * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
29 | language that kind of looks like XML.
30 |
31 | * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
32 | or invalid. This class has web browser-like heuristics for
33 | obtaining a sensible parse tree in the face of common HTML errors.
34 |
35 | Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
36 | the encoding of an HTML or XML document, and converting it to
37 | Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
38 |
39 | For more than you ever wanted to know about Beautiful Soup, see the
40 | documentation:
41 | http://www.crummy.com/software/BeautifulSoup/documentation.html
42 |
43 | Here, have some legalese:
44 |
45 | Copyright (c) 2004-2010, Leonard Richardson
46 |
47 | All rights reserved.
48 |
49 | Redistribution and use in source and binary forms, with or without
50 | modification, are permitted provided that the following conditions are
51 | met:
52 |
53 | * Redistributions of source code must retain the above copyright
54 | notice, this list of conditions and the following disclaimer.
55 |
56 | * Redistributions in binary form must reproduce the above
57 | copyright notice, this list of conditions and the following
58 | disclaimer in the documentation and/or other materials provided
59 | with the distribution.
60 |
61 | * Neither the name of the the Beautiful Soup Consortium and All
62 | Night Kosher Bakery nor the names of its contributors may be
63 | used to endorse or promote products derived from this software
64 | without specific prior written permission.
65 |
66 | THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
67 | "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
68 | LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
69 | A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
70 | CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
71 | EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
72 | PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
73 | PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
74 | LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
75 | NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
76 | SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
77 |
78 | """
79 | from __future__ import generators
80 |
81 | __author__ = "Leonard Richardson (leonardr@segfault.org)"
82 | __version__ = "3.0.8.1"
83 | __copyright__ = "Copyright (c) 2004-2010 Leonard Richardson"
84 | __license__ = "New-style BSD"
85 |
86 | from sgmllib import SGMLParser, SGMLParseError
87 | import codecs
88 | import markupbase
89 | import types
90 | import re
91 | import sgmllib
92 | try:
93 |     from htmlentitydefs import name2codepoint
94 | except ImportError:
95 |     name2codepoint = {}
96 | try:
97 |     set
98 | except NameError:
99 |     from sets import Set as set
100 |
101 | #These hacks make Beautiful Soup able to parse XML with namespaces
102 | sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
103 | markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
104 |
105 | DEFAULT_OUTPUT_ENCODING = "utf-8"
106 |
107 | def _match_css_class(str):
108 |     """Build a RE to match the given CSS class."""
109 |     return re.compile(r"(^|.*\s)%s($|\s)" % str)
110 |
111 | # First, the classes that represent markup elements.
112 |
113 | class PageElement(object):
114 |     """Contains the navigational information for some part of the page
115 |     (either a tag or a piece of text)"""
116 |
117 |     def setup(self, parent=None, previous=None):
118 |         """Sets up the initial relations between this element and
119 |         other elements."""
120 |         self.parent = parent
121 |         self.previous = previous
122 |         self.next = None
123 |         self.previousSibling = None
124 |         self.nextSibling = None
125 |         if self.parent and self.parent.contents:
126 |             self.previousSibling = self.parent.contents[-1]
127 |             self.previousSibling.nextSibling = self
128 |
129 |     def replaceWith(self, replaceWith):
130 |         oldParent = self.parent
131 |         myIndex = self.parent.index(self)
132 |         if hasattr(replaceWith, "parent")\
133 |                 and replaceWith.parent is self.parent:
134 |             # We're replacing this element with one of its siblings.
135 |             index = replaceWith.parent.index(replaceWith)
136 |             if index and index < myIndex:
137 |                 # Furthermore, it comes before this element. That
138 |                 # means that when we extract it, the index of this
139 |                 # element will change.
140 |                 myIndex = myIndex - 1
141 |         self.extract()
142 |         oldParent.insert(myIndex, replaceWith)
143 |
144 |     def replaceWithChildren(self):
145 |         myParent = self.parent
146 |         myIndex = self.parent.index(self)
147 |         self.extract()
148 |         reversedChildren = list(self.contents)
149 |         reversedChildren.reverse()
150 |         for child in reversedChildren:
151 |             myParent.insert(myIndex, child)
152 |
153 |     def extract(self):
154 |         """Destructively rips this element out of the tree."""
155 |         if self.parent:
156 |             try:
157 |                 del self.parent.contents[self.parent.index(self)]
158 |             except ValueError:
159 |                 pass
160 |
161 |         #Find the two elements that would be next to each other if
162 |         #this element (and any children) hadn't been parsed. Connect
163 |         #the two.
164 |         lastChild = self._lastRecursiveChild()
165 |         nextElement = lastChild.next
166 |
167 |         if self.previous:
168 |             self.previous.next = nextElement
169 |         if nextElement:
170 |             nextElement.previous = self.previous
171 |         self.previous = None
172 |         lastChild.next = None
173 |
174 |         self.parent = None
175 |         if self.previousSibling:
176 |             self.previousSibling.nextSibling = self.nextSibling
177 |         if self.nextSibling:
178 |             self.nextSibling.previousSibling = self.previousSibling
179 |         self.previousSibling = self.nextSibling = None
180 |         return self
181 |
182 |     def _lastRecursiveChild(self):
183 |         "Finds the last element beneath this object to be parsed."
184 |         lastChild = self
185 |         while hasattr(lastChild, 'contents') and lastChild.contents:
186 |             lastChild = lastChild.contents[-1]
187 |         return lastChild
188 |
189 |     def insert(self, position, newChild):
190 |         if isinstance(newChild, basestring) \
191 |                 and not isinstance(newChild, NavigableString):
192 |             newChild = NavigableString(newChild)
193 |
194 |         position = min(position, len(self.contents))
195 |         if hasattr(newChild, 'parent') and newChild.parent is not None:
196 |             # We're 'inserting' an element that's already one
197 |             # of this object's children.
198 |             if newChild.parent is self:
199 |                 index = self.index(newChild)
200 |                 if index > position:
201 |                     # Furthermore we're moving it further down the
202 |                     # list of this object's children. That means that
203 |                     # when we extract this element, our target index
204 |                     # will jump down one.
205 |                     position = position - 1
206 |             newChild.extract()
207 |
208 |         newChild.parent = self
209 |         previousChild = None
210 |         if position == 0:
211 |             newChild.previousSibling = None
212 |             newChild.previous = self
213 |         else:
214 |             previousChild = self.contents[position-1]
215 |             newChild.previousSibling = previousChild
216 |             newChild.previousSibling.nextSibling = newChild
217 |             newChild.previous = previousChild._lastRecursiveChild()
218 |         if newChild.previous:
219 |             newChild.previous.next = newChild
220 |
221 |         newChildsLastElement = newChild._lastRecursiveChild()
222 |
223 |         if position >= len(self.contents):
224 |             newChild.nextSibling = None
225 |
226 |             parent = self
227 |             parentsNextSibling = None
228 |             while not parentsNextSibling:
229 |                 parentsNextSibling = parent.nextSibling
230 |                 parent = parent.parent
231 |                 if not parent: # This is the last element in the document.
232 |                     break
233 |             if parentsNextSibling:
234 |                 newChildsLastElement.next = parentsNextSibling
235 |             else:
236 |                 newChildsLastElement.next = None
237 |         else:
238 |             nextChild = self.contents[position]
239 |             newChild.nextSibling = nextChild
240 |             if newChild.nextSibling:
241 |                 newChild.nextSibling.previousSibling = newChild
242 |             newChildsLastElement.next = nextChild
243 |
244 |         if newChildsLastElement.next:
245 |             newChildsLastElement.next.previous = newChildsLastElement
246 |         self.contents.insert(position, newChild)
247 |
248 |     def append(self, tag):
249 |         """Appends the given tag to the contents of this tag."""
250 |         self.insert(len(self.contents), tag)
251 |
252 |     def findNext(self, name=None, attrs={}, text=None, **kwargs):
253 |         """Returns the first item that matches the given criteria and
254 |         appears after this Tag in the document."""
255 |         return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
256 |
257 |     def findAllNext(self, name=None, attrs={}, text=None, limit=None,
258 |                     **kwargs):
259 |         """Returns all items that match the given criteria and appear
260 |         after this Tag in the document."""
261 |         return self._findAll(name, attrs, text, limit, self.nextGenerator,
262 |                              **kwargs)
263 |
264 |     def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
265 |         """Returns the closest sibling to this Tag that matches the
266 |         given criteria and appears after this Tag in the document."""
267 |         return self._findOne(self.findNextSiblings, name, attrs, text,
268 |                              **kwargs)
269 |
270 |     def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
271 |                          **kwargs):
272 |         """Returns the siblings of this Tag that match the given
273 |         criteria and appear after this Tag in the document."""
274 |         return self._findAll(name, attrs, text, limit,
275 |                              self.nextSiblingGenerator, **kwargs)
276 |     fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
277 |
278 |     def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
279 |         """Returns the first item that matches the given criteria and
280 |         appears before this Tag in the document."""
281 |         return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
282 |
283 |     def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
284 |                         **kwargs):
285 |         """Returns all items that match the given criteria and appear
286 |         before this Tag in the document."""
287 |         return self._findAll(name, attrs, text, limit, self.previousGenerator,
288 |                              **kwargs)
289 |     fetchPrevious = findAllPrevious # Compatibility with pre-3.x
290 |
291 |     def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
292 |         """Returns the closest sibling to this Tag that matches the
293 |         given criteria and appears before this Tag in the document."""
294 |         return self._findOne(self.findPreviousSiblings, name, attrs, text,
295 |                              **kwargs)
296 |
297 |     def findPreviousSiblings(self, name=None, attrs={}, text=None,
298 |                              limit=None, **kwargs):
299 |         """Returns the siblings of this Tag that match the given
300 |         criteria and appear before this Tag in the document."""
301 |         return self._findAll(name, attrs, text, limit,
302 |                              self.previousSiblingGenerator, **kwargs)
303 |     fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
304 |
305 |     def findParent(self, name=None, attrs={}, **kwargs):
306 |         """Returns the closest parent of this Tag that matches the given
307 |         criteria."""
308 |         # NOTE: We can't use _findOne because findParents takes a different
309 |         # set of arguments.
310 |         r = None
311 |         l = self.findParents(name, attrs, 1)
312 |         if l:
313 |             r = l[0]
314 |         return r
315 |
316 |     def findParents(self, name=None, attrs={}, limit=None, **kwargs):
317 |         """Returns the parents of this Tag that match the given
318 |         criteria."""
319 |
320 |         return self._findAll(name, attrs, None, limit, self.parentGenerator,
321 |                              **kwargs)
322 |     fetchParents = findParents # Compatibility with pre-3.x
323 |
324 |     #These methods do the real heavy lifting.
325 |
326 |     def _findOne(self, method, name, attrs, text, **kwargs):
327 |         r = None
328 |         l = method(name, attrs, text, 1, **kwargs)
329 |         if l:
330 |             r = l[0]
331 |         return r
332 |
333 |     def _findAll(self, name, attrs, text, limit, generator, **kwargs):
334 |         "Iterates over a generator looking for things that match."
335 |
336 |         if isinstance(name, SoupStrainer):
337 |             strainer = name
338 |         # (Possibly) special case some findAll*(...) searches
339 |         elif text is None and not limit and not attrs and not kwargs:
340 |             # findAll*(True)
341 |             if name is True:
342 |                 return [element for element in generator()
343 |                         if isinstance(element, Tag)]
344 |             # findAll*('tag-name')
345 |             elif isinstance(name, basestring):
346 |                 return [element for element in generator()
347 |                         if isinstance(element, Tag) and
348 |                         element.name == name]
349 |             else:
350 |                 strainer = SoupStrainer(name, attrs, text, **kwargs)
351 |         # Build a SoupStrainer
352 |         else:
353 |             strainer = SoupStrainer(name, attrs, text, **kwargs)
354 |         results = ResultSet(strainer)
355 |         g = generator()
356 |         while True:
357 |             try:
358 |                 i = g.next()
359 |             except StopIteration:
360 |                 break
361 |             if i:
362 |                 found = strainer.search(i)
363 |                 if found:
364 |                     results.append(found)
365 |                     if limit and len(results) >= limit:
366 |                         break
367 |         return results
368 |
369 |     #These Generators can be used to navigate starting from both
370 |     #NavigableStrings and Tags.
371 |     def nextGenerator(self):
372 |         i = self
373 |         while i is not None:
374 |             i = i.next
375 |             yield i
376 |
377 |     def nextSiblingGenerator(self):
378 |         i = self
379 |         while i is not None:
380 |             i = i.nextSibling
381 |             yield i
382 |
383 |     def previousGenerator(self):
384 |         i = self
385 |         while i is not None:
386 |             i = i.previous
387 |             yield i
388 |
389 |     def previousSiblingGenerator(self):
390 |         i = self
391 |         while i is not None:
392 |             i = i.previousSibling
393 |             yield i
394 |
395 |     def parentGenerator(self):
396 |         i = self
397 |         while i is not None:
398 |             i = i.parent
399 |             yield i
400 |
401 |     # Utility methods
402 |     def substituteEncoding(self, str, encoding=None):
403 |         encoding = encoding or "utf-8"
404 |         return str.replace("%SOUP-ENCODING%", encoding)
405 |
406 |     def toEncoding(self, s, encoding=None):
407 |         """Encodes an object to a string in some encoding, or to Unicode.
408 |         ."""
409 |         if isinstance(s, unicode):
410 |             if encoding:
411 |                 s = s.encode(encoding)
412 |         elif isinstance(s, str):
413 |             if encoding:
414 |                 s = s.encode(encoding)
415 |             else:
416 |                 s = unicode(s)
417 |         else:
418 |             if encoding:
419 |                 s = self.toEncoding(str(s), encoding)
420 |             else:
421 |                 s = unicode(s)
422 |         return s
423 |
424 | class NavigableString(unicode, PageElement):
425 |
426 |     def __new__(cls, value):
427 |         """Create a new NavigableString.
428 |
429 |         When unpickling a NavigableString, this method is called with
430 |         the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
431 |         passed in to the superclass's __new__ or the superclass won't know
432 |         how to handle non-ASCII characters.
433 |         """
434 |         if isinstance(value, unicode):
435 |             return unicode.__new__(cls, value)
436 |         return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
437 |
438 |     def __getnewargs__(self):
439 |         return (NavigableString.__str__(self),)
440 |
441 |     def __getattr__(self, attr):
442 |         """text.string gives you text. This is for backwards
443 |         compatibility for Navigable*String, but for CData* it lets you
444 |         get the string without the CData wrapper."""
445 |         if attr == 'string':
446 |             return self
447 |         else:
448 |             raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
449 |
450 |     def __unicode__(self):
451 |         return str(self).decode(DEFAULT_OUTPUT_ENCODING)
452 |
453 |     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
454 |         if encoding:
455 |             return self.encode(encoding)
456 |         else:
457 |             return self
458 |
459 | class CData(NavigableString):
460 |
461 |     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
462 |         return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)
463 |
464 | class ProcessingInstruction(NavigableString):
465 |     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
466 |         output = self
467 |         if "%SOUP-ENCODING%" in output:
468 |             output = self.substituteEncoding(output, encoding)
469 |         return "<?%s?>" % self.toEncoding(output, encoding)
470 |
471 | class Comment(NavigableString):
472 |     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
473 |         return "<!--%s-->" % NavigableString.__str__(self, encoding)
474 |
475 | class Declaration(NavigableString):
476 |     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
477 |         return "<!%s>" % NavigableString.__str__(self, encoding)
478 |
479 | class Tag(PageElement):
480 |
481 |     """Represents a found HTML tag with its attributes and contents."""
482 |
483 |     def _invert(h):
484 |         "Cheap function to invert a hash."
485 |         i = {}
486 |         for k,v in h.items():
487 |             i[v] = k
488 |         return i
489 |
490 |     XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
491 |                                       "quot" : '"',
492 |                                       "amp" : "&",
493 |                                       "lt" : "<",
494 |                                       "gt" : ">" }
495 |
496 |     XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
497 |
498 |     def _convertEntities(self, match):
499 |         """Used in a call to re.sub to replace HTML, XML, and numeric
500 |         entities with the appropriate Unicode characters. If HTML
501 |         entities are being converted, any unrecognized entities are
502 |         escaped."""
503 |         x = match.group(1)
504 |         if self.convertHTMLEntities and x in name2codepoint:
505 |             return unichr(name2codepoint[x])
506 |         elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
507 |             if self.convertXMLEntities:
508 |                 return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
509 |             else:
510 |                 return u'&%s;' % x
511 |         elif len(x) > 0 and x[0] == '#':
512 |             # Handle numeric entities
513 |             if len(x) > 1 and x[1] == 'x':
514 |                 return unichr(int(x[2:], 16))
515 |             else:
516 |                 return unichr(int(x[1:]))
517 |
518 |         elif self.escapeUnrecognizedEntities:
519 |             return u'&amp;%s;' % x
520 |         else:
521 |             return u'&%s;' % x
522 |
523 |     def __init__(self, parser, name, attrs=None, parent=None,
524 |                  previous=None):
525 |         "Basic constructor."
526 |
527 |         # We don't actually store the parser object: that lets extracted
528 |         # chunks be garbage-collected
529 |         self.parserClass = parser.__class__
530 |         self.isSelfClosing = parser.isSelfClosingTag(name)
531 |         self.name = name
532 |         if attrs is None:
533 |             attrs = []
534 |         self.attrs = attrs
535 |         self.contents = []
536 |         self.setup(parent, previous)
537 |         self.hidden = False
538 |         self.containsSubstitutions = False
539 |         self.convertHTMLEntities = parser.convertHTMLEntities
540 |         self.convertXMLEntities = parser.convertXMLEntities
541 |         self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
542 |
543 |         # Convert any HTML, XML, or numeric entities in the attribute values.
544 |         convert = lambda(k, val): (k,
545 |                                    re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
546 |                                           self._convertEntities,
547 |                                           val))
548 |         self.attrs = map(convert, self.attrs)
549 |
550 |     def getString(self):
551 |         if (len(self.contents) == 1
552 |             and isinstance(self.contents[0], NavigableString)):
553 |             return self.contents[0]
554 |
555 |     def setString(self, string):
556 |         """Replace the contents of the tag with a string"""
557 |         self.clear()
558 |         self.append(string)
559 |
560 |     string = property(getString, setString)
561 |
562 |     def getText(self, separator=u""):
563 |         if not len(self.contents):
564 |             return u""
565 |         stopNode = self._lastRecursiveChild().next
566 |         strings = []
567 |         current = self.contents[0]
568 |         while current is not stopNode:
569 |             if isinstance(current, NavigableString):
570 |                 strings.append(current.strip())
571 |             current = current.next
572 |         return separator.join(strings)
573 |
574 |     text = property(getText)
575 |
576 |     def get(self, key, default=None):
577 |         """Returns the value of the 'key' attribute for the tag, or
578 |         the value given for 'default' if it doesn't have that
579 |         attribute."""
580 |         return self._getAttrMap().get(key, default)
581 |
582 |     def clear(self):
583 |         """Extract all children."""
584 |         for child in self.contents[:]:
585 |             child.extract()
586 |
587 |     def index(self, element):
588 |         for i, child in enumerate(self.contents):
589 |             if child is element:
590 |                 return i
591 |         raise ValueError("Tag.index: element not in tag")
592 |
593 |     def has_key(self, key):
594 |         return self._getAttrMap().has_key(key)
595 |
596 |     def __getitem__(self, key):
597 |         """tag[key] returns the value of the 'key' attribute for the tag,
598 |         and throws an exception if it's not there."""
599 |         return self._getAttrMap()[key]
600 |
601 |     def __iter__(self):
602 |         "Iterating over a tag iterates over its contents."
603 |         return iter(self.contents)
604 |
605 |     def __len__(self):
606 |         "The length of a tag is the length of its list of contents."
607 |         return len(self.contents)
608 |
609 |     def __contains__(self, x):
610 |         return x in self.contents
611 |
612 |     def __nonzero__(self):
613 |         "A tag is non-None even if it has no contents."
614 |         return True
615 |
616 |     def __setitem__(self, key, value):
617 |         """Setting tag[key] sets the value of the 'key' attribute for the
618 |         tag."""
619 |         self._getAttrMap()
620 |         self.attrMap[key] = value
621 |         found = False
622 |         for i in range(0, len(self.attrs)):
623 |             if self.attrs[i][0] == key:
624 |                 self.attrs[i] = (key, value)
625 |                 found = True
626 |         if not found:
627 |             self.attrs.append((key, value))
628 |         self._getAttrMap()[key] = value
629 |
630 |     def __delitem__(self, key):
631 |         "Deleting tag[key] deletes all 'key' attributes for the tag."
632 |         for item in self.attrs:
633 |             if item[0] == key:
634 |                 self.attrs.remove(item)
635 |                 #We don't break because bad HTML can define the same
636 |                 #attribute multiple times.
637 |             self._getAttrMap()
638 |             if self.attrMap.has_key(key):
639 |                 del self.attrMap[key]
640 |
641 |     def __call__(self, *args, **kwargs):
642 |         """Calling a tag like a function is the same as calling its
643 |         findAll() method. Eg. tag('a') returns a list of all the A tags
644 |         found within this tag."""
645 |         return apply(self.findAll, args, kwargs)
646 |
647 |     def __getattr__(self, tag):
648 |         #print "Getattr %s.%s" % (self.__class__, tag)
649 |         if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
650 |             return self.find(tag[:-3])
651 |         elif tag.find('__') != 0:
652 |             return self.find(tag)
653 |         raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)
654 |
655 |     def __eq__(self, other):
656 |         """Returns true iff this tag has the same name, the same attributes,
657 |         and the same contents (recursively) as the given tag.
658 |
659 |         NOTE: right now this will return false if two tags have the
660 |         same attributes in a different order. Should this be fixed?"""
661 |         if other is self:
662 |             return True
663 |         if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
664 |             return False
665 |         for i in range(0, len(self.contents)):
666 |             if self.contents[i] != other.contents[i]:
667 |                 return False
668 |         return True
669 |
670 |     def __ne__(self, other):
671 |         """Returns true iff this tag is not identical to the other tag,
672 |         as defined in __eq__."""
673 |         return not self == other
674 |
675 |     def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
676 |         """Renders this tag as a string."""
677 |         return self.__str__(encoding)
678 |
679 |     def __unicode__(self):
680 |         return self.__str__(None)
681 |
682 | BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
683 | + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
684 | + ")")
685 |
686 | def _sub_entity(self, x):
687 | """Used with a regular expression to substitute the
688 | appropriate XML entity for an XML special character."""
689 | return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
690 |
691 | def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
692 | prettyPrint=False, indentLevel=0):
693 | """Returns a string or Unicode representation of this tag and
694 | its contents. To get Unicode, pass None for encoding.
695 |
696 | NOTE: since Python's HTML parser consumes whitespace, this
697 | method is not certain to reproduce the whitespace present in
698 | the original string."""
699 |
700 | encodedName = self.toEncoding(self.name, encoding)
701 |
702 | attrs = []
703 | if self.attrs:
704 | for key, val in self.attrs:
705 | fmt = '%s="%s"'
706 | if isinstance(val, basestring):
707 | if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
708 | val = self.substituteEncoding(val, encoding)
709 |
710 | # The attribute value either:
711 | #
712 | # * Contains no embedded double quotes or single quotes.
713 | # No problem: we enclose it in double quotes.
714 | # * Contains embedded single quotes. No problem:
715 | # double quotes work here too.
716 | # * Contains embedded double quotes. No problem:
717 | # we enclose it in single quotes.
718 | # * Embeds both single _and_ double quotes. This
719 | # can't happen naturally, but it can happen if
720 | # you modify an attribute value after parsing
721 | # the document. Now we have a bit of a
722 | # problem. We solve it by enclosing the
723 | # attribute in single quotes, and escaping any
724 | # embedded single quotes to XML entities.
725 | if '"' in val:
726 | fmt = "%s='%s'"
727 | if "'" in val:
728 | # TODO: replace with apos when
729 | # appropriate.
730 | val = val.replace("'", "&squot;")
731 |
732 | # Now we're okay w/r/t quotes. But the attribute
733 | # value might also contain angle brackets, or
734 | # ampersands that aren't part of entities. We need
735 | # to escape those to XML entities too.
736 | val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
737 |
738 | attrs.append(fmt % (self.toEncoding(key, encoding),
739 | self.toEncoding(val, encoding)))
740 | close = ''
741 | closeTag = ''
742 | if self.isSelfClosing:
743 | close = ' /'
744 | else:
745 | closeTag = '%s>' % encodedName
746 |
747 | indentTag, indentContents = 0, 0
748 | if prettyPrint:
749 | indentTag = indentLevel
750 | space = (' ' * (indentTag-1))
751 | indentContents = indentTag + 1
752 | contents = self.renderContents(encoding, prettyPrint, indentContents)
753 | if self.hidden:
754 | s = contents
755 | else:
756 | s = []
757 | attributeString = ''
758 | if attrs:
759 | attributeString = ' ' + ' '.join(attrs)
760 | if prettyPrint:
761 | s.append(space)
762 | s.append('<%s%s%s>' % (encodedName, attributeString, close))
763 | if prettyPrint:
764 | s.append("\n")
765 | s.append(contents)
766 | if prettyPrint and contents and contents[-1] != "\n":
767 | s.append("\n")
768 | if prettyPrint and closeTag:
769 | s.append(space)
770 | s.append(closeTag)
771 | if prettyPrint and closeTag and self.nextSibling:
772 | s.append("\n")
773 | s = ''.join(s)
774 | return s
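# A minimal standalone sketch (Python 3) of the quoting rules described in the
# comments of __str__ above. The names render_attribute, BARE_AMP_OR_BRACKET and
# ESCAPES are hypothetical, and the regular expression is only an assumed
# stand-in for the class's BARE_AMPERSAND_OR_BRACKET pattern, which is defined
# outside this excerpt. Unlike the method above, the sketch escapes single
# quotes as &apos;.

```python
import re

# Assumed stand-in: matches <, >, or an & that does not begin an entity.
BARE_AMP_OR_BRACKET = re.compile(r"[<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)")
ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}


def render_attribute(key, val):
    """Render key="val", choosing a quote style that keeps the markup valid."""
    fmt = '%s="%s"'
    if '"' in val:
        # Embedded double quotes: delimit with single quotes instead.
        fmt = "%s='%s'"
        if "'" in val:
            # Both kinds present: escape the single quotes as entities.
            val = val.replace("'", "&apos;")
    # Escape angle brackets and ampersands that are not part of an entity.
    val = BARE_AMP_OR_BRACKET.sub(lambda m: ESCAPES[m.group(0)], val)
    return fmt % (key, val)


print(render_attribute("title", 'say "hi"'))
print(render_attribute("href", "a&b<c"))
```

# Existing entities such as &amp; pass through untouched because the negative
# lookahead refuses to match an ampersand that already starts an entity.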
775 |
776 | def decompose(self):
777 | """Recursively destroys the contents of this tree."""
778 | self.extract()
779 | if len(self.contents) == 0:
780 | return
781 | current = self.contents[0]
782 | while current is not None:
783 | next = current.next
784 | if isinstance(current, Tag):
785 | del current.contents[:]
786 | current.parent = None
787 | current.previous = None
788 | current.previousSibling = None
789 | current.next = None
790 | current.nextSibling = None
791 | current = next
792 |
793 | def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
794 | return self.__str__(encoding, True)
795 |
796 | def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
797 | prettyPrint=False, indentLevel=0):
798 | """Renders the contents of this tag as a string in the given
799 | encoding. If encoding is None, returns a Unicode string."""
800 | s=[]
801 | for c in self:
802 | text = None
803 | if isinstance(c, NavigableString):
804 | text = c.__str__(encoding)
805 | elif isinstance(c, Tag):
806 | s.append(c.__str__(encoding, prettyPrint, indentLevel))
807 | if text and prettyPrint:
808 | text = text.strip()
809 | if text:
810 | if prettyPrint:
811 | s.append(" " * (indentLevel-1))
812 | s.append(text)
813 | if prettyPrint:
814 | s.append("\n")
815 | return ''.join(s)
816 |
817 | #Soup methods
818 |
819 | def find(self, name=None, attrs={}, recursive=True, text=None,
820 | **kwargs):
821 | """Return only the first child of this Tag matching the given
822 | criteria."""
823 | r = None
824 | l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
825 | if l:
826 | r = l[0]
827 | return r
828 | findChild = find
829 |
830 | def findAll(self, name=None, attrs={}, recursive=True, text=None,
831 | limit=None, **kwargs):
832 | """Extracts a list of Tag objects that match the given
833 | criteria. You can specify the name of the Tag and any
834 | attributes you want the Tag to have.
835 |
836 | The value of a key-value pair in the 'attrs' map can be a
837 | string, a list of strings, a regular expression object, or a
838 | callable that takes a string and returns whether or not the
839 | string matches for some custom definition of 'matches'. The
840 | same is true of the tag name."""
841 | generator = self.recursiveChildGenerator
842 | if not recursive:
843 | generator = self.childGenerator
844 | return self._findAll(name, attrs, text, limit, generator, **kwargs)
845 | findChildren = findAll
846 |
847 | # Pre-3.x compatibility methods
848 | first = find
849 | fetch = findAll
850 |
851 | def fetchText(self, text=None, recursive=True, limit=None):
852 | return self.findAll(text=text, recursive=recursive, limit=limit)
853 |
854 | def firstText(self, text=None, recursive=True):
855 | return self.find(text=text, recursive=recursive)
856 |
857 | #Private methods
858 |
859 | def _getAttrMap(self):
860 | """Initializes a map representation of this tag's attributes,
861 | if not already initialized."""
862 | if not getattr(self, 'attrMap'):
863 | self.attrMap = {}
864 | for (key, value) in self.attrs:
865 | self.attrMap[key] = value
866 | return self.attrMap
867 |
868 | #Generator methods
869 | def childGenerator(self):
870 | # Just use the iterator from the contents
871 | return iter(self.contents)
872 |
873 | def recursiveChildGenerator(self):
874 | if not len(self.contents):
875 | raise StopIteration
876 | stopNode = self._lastRecursiveChild().next
877 | current = self.contents[0]
878 | while current is not stopNode:
879 | yield current
880 | current = current.next
881 |
882 |
883 | # Next, a couple classes to represent queries and their results.
884 | class SoupStrainer:
885 | """Encapsulates a number of ways of matching a markup element (tag or
886 | text)."""
887 |
888 | def __init__(self, name=None, attrs={}, text=None, **kwargs):
889 | self.name = name
890 | if isinstance(attrs, basestring):
891 | kwargs['class'] = _match_css_class(attrs)
892 | attrs = None
893 | if kwargs:
894 | if attrs:
895 | attrs = attrs.copy()
896 | attrs.update(kwargs)
897 | else:
898 | attrs = kwargs
899 | self.attrs = attrs
900 | self.text = text
901 |
902 | def __str__(self):
903 | if self.text:
904 | return self.text
905 | else:
906 | return "%s|%s" % (self.name, self.attrs)
907 |
908 | def searchTag(self, markupName=None, markupAttrs={}):
909 | found = None
910 | markup = None
911 | if isinstance(markupName, Tag):
912 | markup = markupName
913 | markupAttrs = markup
914 | callFunctionWithTagData = callable(self.name) \
915 | and not isinstance(markupName, Tag)
916 |
917 | if (not self.name) \
918 | or callFunctionWithTagData \
919 | or (markup and self._matches(markup, self.name)) \
920 | or (not markup and self._matches(markupName, self.name)):
921 | if callFunctionWithTagData:
922 | match = self.name(markupName, markupAttrs)
923 | else:
924 | match = True
925 | markupAttrMap = None
926 | for attr, matchAgainst in self.attrs.items():
927 | if not markupAttrMap:
928 | if hasattr(markupAttrs, 'get'):
929 | markupAttrMap = markupAttrs
930 | else:
931 | markupAttrMap = {}
932 | for k,v in markupAttrs:
933 | markupAttrMap[k] = v
934 | attrValue = markupAttrMap.get(attr)
935 | if not self._matches(attrValue, matchAgainst):
936 | match = False
937 | break
938 | if match:
939 | if markup:
940 | found = markup
941 | else:
942 | found = markupName
943 | return found
944 |
945 | def search(self, markup):
946 | #print 'looking for %s in %s' % (self, markup)
947 | found = None
948 | # If given a list of items, scan it for a text element that
949 | # matches.
950 | if hasattr(markup, "__iter__") \
951 | and not isinstance(markup, Tag):
952 | for element in markup:
953 | if isinstance(element, NavigableString) \
954 | and self.search(element):
955 | found = element
956 | break
957 | # If it's a Tag, make sure its name or attributes match.
958 | # Don't bother with Tags if we're searching for text.
959 | elif isinstance(markup, Tag):
960 | if not self.text:
961 | found = self.searchTag(markup)
962 | # If it's text, make sure the text matches.
963 | elif isinstance(markup, NavigableString) or \
964 | isinstance(markup, basestring):
965 | if self._matches(markup, self.text):
966 | found = markup
967 | else:
968 | raise Exception, "I don't know how to match against a %s" \
969 | % markup.__class__
970 | return found
971 |
972 | def _matches(self, markup, matchAgainst):
973 | #print "Matching %s against %s" % (markup, matchAgainst)
974 | result = False
975 | if matchAgainst is True:
976 | result = markup is not None
977 | elif callable(matchAgainst):
978 | result = matchAgainst(markup)
979 | else:
980 | #Custom match methods take the tag as an argument, but all
981 | #other ways of matching match the tag name as a string.
982 | if isinstance(markup, Tag):
983 | markup = markup.name
984 | if markup and not isinstance(markup, basestring):
985 | markup = unicode(markup)
986 | #Now we know that markup is either a string, or None.
987 | if hasattr(matchAgainst, 'match'):
988 | # It's a regexp object.
989 | result = markup and matchAgainst.search(markup)
990 | elif hasattr(matchAgainst, '__iter__'): # list-like
991 | result = markup in matchAgainst
992 | elif hasattr(matchAgainst, 'items'):
993 | result = markup.has_key(matchAgainst)
994 | elif matchAgainst and isinstance(markup, basestring):
995 | if isinstance(markup, unicode):
996 | matchAgainst = unicode(matchAgainst)
997 | else:
998 | matchAgainst = str(matchAgainst)
999 |
1000 | if not result:
1001 | result = matchAgainst == markup
1002 | return result
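# A simplified standalone sketch (Python 3) of the dispatch that _matches
# performs: True, a callable, a compiled regular expression, a list-like
# collection, or a plain string. The function name matches is hypothetical, and
# the sketch leaves out the Tag handling and unicode coercion done above.

```python
import re


def matches(value, match_against):
    """Return True if value satisfies the criterion match_against."""
    if match_against is True:
        # True means "any value at all, as long as one is present".
        return value is not None
    if callable(match_against):
        return bool(match_against(value))
    if hasattr(match_against, "search"):
        # A compiled regular expression: search, do not anchor.
        return value is not None and match_against.search(value) is not None
    if isinstance(match_against, (list, tuple, set, frozenset)):
        return value in match_against
    return value == match_against


print(matches("h2", re.compile(r"^h[1-6]$")))
print(matches("td", ["td", "th"]))
```

# The order of the checks matters: True and callables are tested before the
# value is compared as a string, which mirrors the structure of the method.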
1003 |
1004 | class ResultSet(list):
1005 | """A ResultSet is just a list that keeps track of the SoupStrainer
1006 | that created it."""
1007 | def __init__(self, source):
1008 | list.__init__(self)
1009 | self.source = source
1010 |
1011 | # Now, some helper functions.
1012 |
1013 | def buildTagMap(default, *args):
1014 | """Turns a list of maps, lists, or scalars into a single map.
1015 | Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
1016 | NESTING_RESET_TAGS maps out of lists and partial maps."""
1017 | built = {}
1018 | for portion in args:
1019 | if hasattr(portion, 'items'):
1020 | #It's a map. Merge it.
1021 | for k,v in portion.items():
1022 | built[k] = v
1023 | elif hasattr(portion, '__iter__'): # is a list
1024 | #It's a list. Map each item to the default.
1025 | for k in portion:
1026 | built[k] = default
1027 | else:
1028 | #It's a scalar. Map it to the default.
1029 | built[portion] = default
1030 | return built
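# A standalone Python 3 rendering of buildTagMap, offered as a sketch. The name
# build_tag_map is hypothetical. One assumption differs from the code above: in
# Python 3 a str also has __iter__, so the sketch tests for concrete container
# types to keep a bare string in the "scalar" branch.

```python
def build_tag_map(default, *args):
    """Merge maps, lists and scalars into one dict of tag name -> value."""
    built = {}
    for portion in args:
        if hasattr(portion, "items"):
            # A map: merge it as-is.
            built.update(portion)
        elif isinstance(portion, (list, tuple, set, frozenset)):
            # A list: map each item to the default.
            for key in portion:
                built[key] = default
        else:
            # A scalar: map it to the default.
            built[portion] = default
    return built


tag_map = build_tag_map(None, ["br", "hr"], {"table": ["tr"]}, "img")
print(tag_map)
```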
1031 |
1032 | # Now, the parser classes.
1033 |
1034 | class BeautifulStoneSoup(Tag, SGMLParser):
1035 |
1036 | """This class contains the basic parser and search code. It defines
1037 | a parser that knows nothing about tag behavior except for the
1038 | following:
1039 |
1040 | You can't close a tag without closing all the tags it encloses.
1041 | That is, "
1099 | <br/> (No space between name of closing tag and tag close)
1100 | <! --Comment--> (Extraneous whitespace in declaration)
1101 |
1102 | You can pass in a custom list of (RE object, replace method)
1103 | tuples to get Beautiful Soup to scrub your input the way you
1104 | want."""
1105 |
1106 | self.parseOnlyThese = parseOnlyThese
1107 | self.fromEncoding = fromEncoding
1108 | self.smartQuotesTo = smartQuotesTo
1109 | self.convertEntities = convertEntities
1110 | # Set the rules for how we'll deal with the entities we
1111 | # encounter
1112 | if self.convertEntities:
1113 | # It doesn't make sense to convert encoded characters to
1114 | # entities even while you're converting entities to Unicode.
1115 | # Just convert it all to Unicode.
1116 | self.smartQuotesTo = None
1117 | if convertEntities == self.HTML_ENTITIES:
1118 | self.convertXMLEntities = False
1119 | self.convertHTMLEntities = True
1120 | self.escapeUnrecognizedEntities = True
1121 | elif convertEntities == self.XHTML_ENTITIES:
1122 | self.convertXMLEntities = True
1123 | self.convertHTMLEntities = True
1124 | self.escapeUnrecognizedEntities = False
1125 | elif convertEntities == self.XML_ENTITIES:
1126 | self.convertXMLEntities = True
1127 | self.convertHTMLEntities = False
1128 | self.escapeUnrecognizedEntities = False
1129 | else:
1130 | self.convertXMLEntities = False
1131 | self.convertHTMLEntities = False
1132 | self.escapeUnrecognizedEntities = False
1133 |
1134 | self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
1135 | SGMLParser.__init__(self)
1136 |
1137 | if hasattr(markup, 'read'): # It's a file-type object.
1138 | markup = markup.read()
1139 | self.markup = markup
1140 | self.markupMassage = markupMassage
1141 | try:
1142 | self._feed(isHTML=isHTML)
1143 | except StopParsing:
1144 | pass
1145 | self.markup = None # The markup can now be GCed
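# The if/elif chain in __init__ that turns convertEntities into three flags can
# be read as a lookup table. The sketch below (Python 3) shows that reading. The
# string values given to the three mode constants are assumptions, since the
# class defines them outside this excerpt, and entity_flags is a hypothetical
# name.

```python
# Assumed values for the mode constants.
HTML_ENTITIES, XHTML_ENTITIES, XML_ENTITIES = "html", "xhtml", "xml"

# mode -> (convertXMLEntities, convertHTMLEntities, escapeUnrecognizedEntities)
ENTITY_FLAGS = {
    HTML_ENTITIES: (False, True, True),
    XHTML_ENTITIES: (True, True, False),
    XML_ENTITIES: (True, False, False),
}


def entity_flags(convert_entities):
    """Return the three conversion flags for a convertEntities setting."""
    # Any other setting, including None, disables every conversion.
    return ENTITY_FLAGS.get(convert_entities, (False, False, False))


print(entity_flags(HTML_ENTITIES))
print(entity_flags(None))
```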
1146 |
1147 | def convert_charref(self, name):
1148 | """This method fixes a bug in Python's SGMLParser."""
1149 | try:
1150 | n = int(name)
1151 | except ValueError:
1152 | return
1153 | if not 0 <= n <= 127 : # ASCII ends at 127, not 255
1154 | return
1155 | return self.convert_codepoint(n)
1156 |
1157 | def _feed(self, inDocumentEncoding=None, isHTML=False):
1158 | # Convert the document to Unicode.
1159 | markup = self.markup
1160 | if isinstance(markup, unicode):
1161 | if not hasattr(self, 'originalEncoding'):
1162 | self.originalEncoding = None
1163 | else:
1164 | dammit = UnicodeDammit\
1165 | (markup, [self.fromEncoding, inDocumentEncoding],
1166 | smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
1167 | markup = dammit.unicode
1168 | self.originalEncoding = dammit.originalEncoding
1169 | self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
1170 | if markup:
1171 | if self.markupMassage:
1172 | if not hasattr(self.markupMassage, "__iter__"):
1173 | self.markupMassage = self.MARKUP_MASSAGE
1174 | for fix, m in self.markupMassage:
1175 | markup = fix.sub(m, markup)
1176 | # TODO: We get rid of markupMassage so that the
1177 | # soup object can be deepcopied later on. Some
1178 | # Python installations can't copy regexes. If anyone
1179 | # was relying on the existence of markupMassage, this
1180 | # might cause problems.
1181 | del(self.markupMassage)
1182 | self.reset()
1183 |
1184 | SGMLParser.feed(self, markup)
1185 | # Close out any unfinished strings and close all the open tags.
1186 | self.endData()
1187 | while self.currentTag.name != self.ROOT_TAG_NAME:
1188 | self.popTag()
1189 |
1190 | def __getattr__(self, methodName):
1191 | """This method routes method call requests to either the SGMLParser
1192 | superclass or the Tag superclass, depending on the method name."""
1193 | #print "__getattr__ called on %s.%s" % (self.__class__, methodName)
1194 |
1195 | if methodName.startswith('start_') or methodName.startswith('end_') \
1196 | or methodName.startswith('do_'):
1197 | return SGMLParser.__getattr__(self, methodName)
1198 | elif not methodName.startswith('__'):
1199 | return Tag.__getattr__(self, methodName)
1200 | else:
1201 | raise AttributeError
1202 |
1203 | def isSelfClosingTag(self, name):
1204 | """Returns true iff the given string is the name of a
1205 | self-closing tag according to this parser."""
1206 | return self.SELF_CLOSING_TAGS.has_key(name) \
1207 | or self.instanceSelfClosingTags.has_key(name)
1208 |
1209 | def reset(self):
1210 | Tag.__init__(self, self, self.ROOT_TAG_NAME)
1211 | self.hidden = 1
1212 | SGMLParser.reset(self)
1213 | self.currentData = []
1214 | self.currentTag = None
1215 | self.tagStack = []
1216 | self.quoteStack = []
1217 | self.pushTag(self)
1218 |
1219 | def popTag(self):
1220 | tag = self.tagStack.pop()
1221 |
1222 | #print "Pop", tag.name
1223 | if self.tagStack:
1224 | self.currentTag = self.tagStack[-1]
1225 | return self.currentTag
1226 |
1227 | def pushTag(self, tag):
1228 | #print "Push", tag.name
1229 | if self.currentTag:
1230 | self.currentTag.contents.append(tag)
1231 | self.tagStack.append(tag)
1232 | self.currentTag = self.tagStack[-1]
1233 |
1234 | def endData(self, containerClass=NavigableString):
1235 | if self.currentData:
1236 | currentData = u''.join(self.currentData)
1237 | if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
1238 | not set([tag.name for tag in self.tagStack]).intersection(
1239 | self.PRESERVE_WHITESPACE_TAGS)):
1240 | if '\n' in currentData:
1241 | currentData = '\n'
1242 | else:
1243 | currentData = ' '
1244 | self.currentData = []
1245 | if self.parseOnlyThese and len(self.tagStack) <= 1 and \
1246 | (not self.parseOnlyThese.text or \
1247 | not self.parseOnlyThese.search(currentData)):
1248 | return
1249 | o = containerClass(currentData)
1250 | o.setup(self.currentTag, self.previous)
1251 | if self.previous:
1252 | self.previous.next = o
1253 | self.previous = o
1254 | self.currentTag.contents.append(o)
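# A small sketch (Python 3) of the whitespace rule endData applies before it
# creates a string node: text made up only of ASCII whitespace collapses to a
# single newline or space unless an enclosing tag preserves whitespace. The
# table below is an assumed value for STRIP_ASCII_SPACES, which is defined
# outside this excerpt, and collapse_whitespace is a hypothetical name.

```python
# Assumed contents: tab, LF, FF, CR and space are deleted by translate().
STRIP_ASCII_SPACES = {9: None, 10: None, 12: None, 13: None, 32: None}


def collapse_whitespace(text, preserve=False):
    """Collapse all-ASCII-whitespace text to one newline or one space."""
    if not preserve and text.translate(STRIP_ASCII_SPACES) == "":
        return "\n" if "\n" in text else " "
    return text


print(repr(collapse_whitespace(" \n   ")))
print(repr(collapse_whitespace(" a ")))
```

# Non-ASCII spaces such as U+00A0 are not in the table, so text containing them
# is left untouched.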
1255 |
1256 |
1257 | def _popToTag(self, name, inclusivePop=True):
1258 | """Pops the tag stack up to and including the most recent
1259 | instance of the given tag. If inclusivePop is false, pops the tag
1260 | stack up to but *not* including the most recent instance of
1261 | the given tag."""
1262 | #print "Popping to %s" % name
1263 | if name == self.ROOT_TAG_NAME:
1264 | return
1265 |
1266 | numPops = 0
1267 | mostRecentTag = None
1268 | for i in range(len(self.tagStack)-1, 0, -1):
1269 | if name == self.tagStack[i].name:
1270 | numPops = len(self.tagStack)-i
1271 | break
1272 | if not inclusivePop:
1273 | numPops = numPops - 1
1274 |
1275 | for i in range(0, numPops):
1276 | mostRecentTag = self.popTag()
1277 | return mostRecentTag
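# A standalone sketch (Python 3) of the stack arithmetic in _popToTag, using a
# plain list of tag names instead of Tag objects. The function name pop_to_tag
# and the root name "[document]" are assumptions made for illustration.

```python
def pop_to_tag(tag_stack, name, inclusive=True, root="[document]"):
    """Pop tag_stack back to the most recent `name`; return the new top."""
    if name == root:
        return None
    num_pops = 0
    # Scan from the top of the stack downwards, never reaching index 0 (root).
    for i in range(len(tag_stack) - 1, 0, -1):
        if tag_stack[i] == name:
            num_pops = len(tag_stack) - i
            break
    if not inclusive:
        num_pops -= 1
    most_recent = None
    # range() is empty for zero or negative counts, so nothing is popped then.
    for _ in range(num_pops):
        tag_stack.pop()
        most_recent = tag_stack[-1]
    return most_recent


stack = ["[document]", "html", "body", "p", "b"]
print(pop_to_tag(stack, "p"), stack)
```

# When the name is not on the stack the function pops nothing and returns None,
# which is also what the loop above does.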
1278 |
1279 | def _smartPop(self, name):
1280 |
1281 | """We need to pop up to the previous tag of this type, unless
1282 | one of this tag's nesting reset triggers comes between this
1283 | tag and the previous tag of this type, OR unless this tag is a
1284 | generic nesting trigger and another generic nesting trigger
1285 | comes between this tag and the previous tag of this type.
1286 |
1287 | Examples:
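1288 | <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
1289 | <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
1290 | <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
1291 |
1292 | <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
1293 | <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
1294 | <td><tr><td> *<td>* should pop to 'tr', not the first 'td'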
1295 | """
1296 |
1297 | nestingResetTriggers = self.NESTABLE_TAGS.get(name)
1298 | isNestable = nestingResetTriggers != None
1299 | isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
1300 | popTo = None
1301 | inclusive = True
1302 | for i in range(len(self.tagStack)-1, 0, -1):
1303 | p = self.tagStack[i]
1304 | if (not p or p.name == name) and not isNestable:
1305 | #Non-nestable tags get popped to the top or to their
1306 | #last occurrence.
1307 | popTo = name
1308 | break
1309 | if (nestingResetTriggers is not None
1310 | and p.name in nestingResetTriggers) \
1311 | or (nestingResetTriggers is None and isResetNesting
1312 | and self.RESET_NESTING_TAGS.has_key(p.name)):
1313 |
1314 | #If we encounter one of the nesting reset triggers
1315 | #peculiar to this tag, or we encounter another tag
1316 | #that causes nesting to reset, pop up to but not
1317 | #including that tag.
1318 | popTo = p.name
1319 | inclusive = False
1320 | break
1321 | p = p.parent
1322 | if popTo:
1323 | self._popToTag(popTo, inclusive)
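# A simplified sketch (Python 3) of the decision _smartPop makes, working on a
# list of tag names. It returns the tag to pop to and whether the pop is
# inclusive, or None when nothing should be popped. The two tables are small
# hypothetical stand-ins for NESTABLE_TAGS and RESET_NESTING_TAGS, which are
# defined outside this excerpt, and smart_pop_target is a hypothetical name.

```python
# Assumed excerpts of the real tables, just large enough for the examples.
NESTABLE = {"td": ["tr"], "tr": ["table"], "li": ["ul", "ol"], "b": []}
RESET_NESTING = {"p", "table", "tr", "td", "li", "ul", "ol"}


def smart_pop_target(stack, name, nestable=NESTABLE, reset=RESET_NESTING):
    """Return (tag_name, inclusive) to pop to, or None for no pop."""
    triggers = nestable.get(name)
    is_nestable = triggers is not None
    is_reset = name in reset
    for i in range(len(stack) - 1, 0, -1):
        p = stack[i]
        if p == name and not is_nestable:
            # Non-nestable tags pop to their last occurrence, inclusively.
            return (name, True)
        if (triggers is not None and p in triggers) or (
            triggers is None and is_reset and p in reset
        ):
            # A nesting-reset trigger: pop up to but not including it.
            return (p, False)
    return None


print(smart_pop_target(["[document]", "p", "b"], "p"))
print(smart_pop_target(["[document]", "td", "tr", "td"], "td"))
```

# The two calls correspond to the docstring cases that pop to 'p' and to 'tr'.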
1324 |
1325 | def unknown_starttag(self, name, attrs, selfClosing=0):
1326 | #print "Start tag %s: %s" % (name, attrs)
1327 | if self.quoteStack:
1328 | #This is not a real tag.
1329 | #print "<%s> is not real!" % name
1330 | attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
1331 | self.handle_data('<%s%s>' % (name, attrs))
1332 | return
1333 | self.endData()
1334 |
1335 | if not self.isSelfClosingTag(name) and not selfClosing:
1336 | self._smartPop(name)
1337 |
1338 | if self.parseOnlyThese and len(self.tagStack) <= 1 \
1339 | and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
1340 | return
1341 |
1342 | tag = Tag(self, name, attrs, self.currentTag, self.previous)
1343 | if self.previous:
1344 | self.previous.next = tag
1345 | self.previous = tag
1346 | self.pushTag(tag)
1347 | if selfClosing or self.isSelfClosingTag(name):
1348 | self.popTag()
1349 | if name in self.QUOTE_TAGS:
1350 | #print "Beginning quote (%s)" % name
1351 | self.quoteStack.append(name)
1352 | self.literal = 1
1353 | return tag
1354 |
1355 | def unknown_endtag(self, name):
1356 | #print "End tag %s" % name
1357 | if self.quoteStack and self.quoteStack[-1] != name:
1358 | #This is not a real end tag.
1359 | #print "</%s> is not real!" % name
1360 | self.handle_data('</%s>' % name)
1361 | return
1362 | self.endData()
1363 | self._popToTag(name)
1364 | if self.quoteStack and self.quoteStack[-1] == name:
1365 | self.quoteStack.pop()
1366 | self.literal = (len(self.quoteStack) > 0)
1367 |
1368 | def handle_data(self, data):
1369 | self.currentData.append(data)
1370 |
1371 | def _toStringSubclass(self, text, subclass):
1372 | """Adds a certain piece of text to the tree as a NavigableString
1373 | subclass."""
1374 | self.endData()
1375 | self.handle_data(text)
1376 | self.endData(subclass)
1377 |
1378 | def handle_pi(self, text):
1379 | """Handle a processing instruction as a ProcessingInstruction
1380 | object, possibly one with a %SOUP-ENCODING% slot into which an
1381 | encoding will be plugged later."""
1382 | if text[:3] == "xml":
1383 | text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
1384 | self._toStringSubclass(text, ProcessingInstruction)
1385 |
1386 | def handle_comment(self, text):
1387 | "Handle comments as Comment objects."
1388 | self._toStringSubclass(text, Comment)
1389 |
1390 | def handle_charref(self, ref):
1391 | "Handle character references as data."
1392 | if self.convertEntities:
1393 | data = unichr(int(ref))
1394 | else:
1395 | data = '&#%s;' % ref
1396 | self.handle_data(data)
1397 |
1398 | def handle_entityref(self, ref):
1399 | """Handle entity references as data, possibly converting known
1400 | HTML and/or XML entity references to the corresponding Unicode
1401 | characters."""
1402 | data = None
1403 | if self.convertHTMLEntities:
1404 | try:
1405 | data = unichr(name2codepoint[ref])
1406 | except KeyError:
1407 | pass
1408 |
1409 | if not data and self.convertXMLEntities:
1410 | data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
1411 |
1412 | if not data and self.convertHTMLEntities and \
1413 | not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
1414 | # TODO: We've got a problem here. We're told this is
1415 | # an entity reference, but it's not an XML entity
1416 | # reference or an HTML entity reference. Nonetheless,
1417 | # the logical thing to do is to pass it through as an
1418 | # unrecognized entity reference.
1419 | #
1420 | # Except: when the input is "&carol;" this function
1421 | # will be called with input "carol". When the input is
1422 | # "AT&T", this function will be called with input
1423 | # "T". We have no way of knowing whether a semicolon
1424 | # was present originally, so we don't know whether
1425 | # this is an unknown entity or just a misplaced
1426 | # ampersand.
1427 | #
1428 | # The more common case is a misplaced ampersand, so I
1429 | # escape the ampersand and omit the trailing semicolon.
1430 | data = "&amp;%s" % ref
1431 | if not data:
1432 | # This case is different from the one above, because we
1433 | # haven't already gone through a supposedly comprehensive
1434 | # mapping of entities to Unicode characters. We might not
1435 | # have gone through any mapping at all. So the chances are
1436 | # very high that this is a real entity, and not a
1437 | # misplaced ampersand.
1438 | data = "&%s;" % ref
1439 | self.handle_data(data)
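# A standalone sketch (Python 3) of the fallback order in handle_entityref:
# HTML entity table first, then XML entities, then the "stray ampersand" guess,
# and finally passing the reference through unchanged. It uses the standard
# library's html.entities.name2codepoint. The small XML table is an assumed
# value for XML_ENTITIES_TO_SPECIAL_CHARS, and resolve_entity is a hypothetical
# name.

```python
from html.entities import name2codepoint

# Assumed contents: the five predefined XML entities.
XML_ENTITIES_TO_SPECIAL_CHARS = {
    "apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">",
}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text to emit for the entity reference named ref."""
    data = None
    if convert_html:
        codepoint = name2codepoint.get(ref)
        if codepoint is not None:
            data = chr(codepoint)
    if not data and convert_xml:
        data = XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
    if not data and convert_html and ref not in XML_ENTITIES_TO_SPECIAL_CHARS:
        # Unknown name while converting: probably a misplaced ampersand
        # (as in "AT&T"), so escape it and omit the trailing semicolon.
        data = "&amp;%s" % ref
    if not data:
        # No conversion applied: keep it as a real entity reference.
        data = "&%s;" % ref
    return data


print(resolve_entity("eacute", convert_html=True))
print(resolve_entity("T", convert_html=True))
print(resolve_entity("carol"))
```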
1440 |
1441 | def handle_decl(self, data):
1442 | "Handle DOCTYPEs and the like as Declaration objects."
1443 | self._toStringSubclass(data, Declaration)
1444 |
1445 | def parse_declaration(self, i):
1446 | """Treat a bogus SGML declaration as raw data. Treat a CDATA
1447 | declaration as a CData object."""
1448 | j = None
1449 | if self.rawdata[i:i+9] == '<![CDATA[':
1450 | k = self.rawdata.find(']]>', i)
1451 | if k == -1:
1452 | k = len(self.rawdata)
1453 | data = self.rawdata[i+9:k]
1454 | j = k+3
1455 | self._toStringSubclass(data, CData)
1456 | else:
1457 | try:
1458 | j = SGMLParser.parse_declaration(self, i)
1459 | except SGMLParseError:
1460 | toHandle = self.rawdata[i:]
1461 | self.handle_data(toHandle)
1462 | j = i + len(toHandle)
1463 | return j
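# A standalone sketch (Python 3) of the index arithmetic used for CDATA
# sections in parse_declaration: the opening marker is 9 characters long, the
# closing marker is 3, and an unterminated section runs to the end of the
# input. The function name extract_cdata is hypothetical.

```python
def extract_cdata(rawdata, i):
    """Return (data, next_index) for a CDATA section at rawdata[i], else None."""
    if rawdata[i:i + 9] != "<![CDATA[":
        return None
    k = rawdata.find("]]>", i)
    if k == -1:
        # Unterminated section: take everything up to the end of the input.
        k = len(rawdata)
    return rawdata[i + 9:k], k + 3


print(extract_cdata("a<![CDATA[x < y]]>b", 1))
```

# As in the method above, the returned index for an unterminated section points
# past the end of the input, which tells the caller that nothing is left.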
1464 |
1465 | class BeautifulSoup(BeautifulStoneSoup):
1466 |
1467 | """This parser knows the following facts about HTML:
1468 |
1469 | * Some tags have no closing tag and should be interpreted as being
1470 | closed as soon as they are encountered.
1471 |
1472 | * The text inside some tags (e.g. 'script') may contain tags which
1473 | are not really part of the document and which should be parsed
1474 | as text, not tags. If you want to parse the text as tags, you can
1475 | always fetch it and parse it explicitly.
1476 |
1477 | * Tag nesting rules:
1478 |
1479 | Most tags can't be nested at all. For instance, the occurrence of
1480 | a <p> tag should implicitly close the previous <p> tag.
1481 |
1482 | <p>Para1<p>Para2
1483 | should be transformed into:
1484 | <p>Para1</p><p>Para2
1485 |
1486 | Some tags can be nested arbitrarily. For instance, the occurrence
1487 | of a <blockquote> tag should _not_ implicitly close the previous
1488 | <blockquote> tag.
1489 |
1490 | Alice said: <blockquote>Bob said: <blockquote>Blah
1491 | should NOT be transformed into:
1492 | Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
1493 |
1494 | Some tags can be nested, but the nesting is reset by the
1495 | interposition of other tags. For instance, a |