├── README.md
├── class-notes
│   ├── class-1.md
│   ├── class-2.md
│   ├── class-3.md
│   ├── class-4.md
│   ├── class-5.md
│   ├── class-6.md
│   ├── class-7.md
│   └── examples
│       ├── class7
│       │   ├── email.py
│       │   ├── vidbot.py
│       │   └── webapp.py
│       ├── natural-language-processing
│       │   ├── classify.py
│       │   ├── manifesto.txt
│       │   ├── part-of-speech.py
│       │   ├── pride.txt
│       │   ├── regexp.py
│       │   ├── similarity.py
│       │   └── translate.py
│       └── video
│           ├── combine_videos.py
│           ├── random_overlay.py
│           └── randomize.py
├── reader-01-the-command-line.md
└── reader-02-python-basics.md
/README.md:
--------------------------------------------------------------------------------
1 | # Scrapism
2 |
3 | (draft syllabus)
4 |
5 | **Instructor:** [Sam Lavigne](http://lav.io) | [splavigne@gmail.com](mailto:splavigne@gmail.com)
6 | **Teaching Assistant:** TBD
7 | **Track:** Code Poetry, Fall 2018
8 | **Location:** [School for Poetic Computation](http://sfpc.io/) | 155 Bank St, New York, NY 10014
9 | **Time:** Tuesdays 10am to 1pm
10 | **Office Hours:** Tuesdays 2pm to 4pm (or by appointment)
11 | **Class Notes:** [link](https://paper.dropbox.com/folder/show/Class-Notes-e.1gg8YzoPEhbTkrhvQwJ2zz3XJBcZkbceseDnY854qf9k5dPQtUC2)
12 |
13 | Scrapism is the artistic practice of web scraping, or of automatically collecting and transforming found digital material. It hinges upon a combination of curatorial practice, reverse engineering, and hoarding mentality. In this class students will learn how to scrape massive quantities of material from the internet with Python, and then use that material to make poetic, satirical, critical, political projects. Each session we will cover a different web scraping technique, with production assignments relating to text, image and video. We will explore surrealist, dadaist, situationist techniques such as detournement, collage, and cut-ups, and apply them to a contemporary digital context.
14 |
15 | ## Schedule
16 |
17 | ### 1. September 18th
18 |
19 | Introductions. Using the terminal. Basic python. Reading lines.
20 |
21 | #### Readings
22 | * [Intro to the command line](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-01-the-command-line.md)
23 | * [Python basics](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md)
24 | * [Artificial Hells (introduction and chapter 1)](https://selforganizedseminar.files.wordpress.com/2011/08/bishop-claire-artificial-hells-participatory-art-and-politics-spectatorship.pdf) by Claire Bishop
25 | * [A User’s Guide to Détournement](http://www.bopsecrets.org/SI/detourn.htm)
26 |
27 | #### Assignment
28 | * Find three sentences (or phrases) in the wild. Your sentences could come from the internet or the real world, from a book, a store sign, a Facebook post, a news article, product packaging, or from a restaurant menu. Anything is fine, but you must not write them yourself. Be prepared to recite what you have found next week in class.
29 |
30 | ---
31 |
32 | ### 2. September 25th
33 |
34 | Python part 2. Manipulating text. Automating writing.
35 |
36 | #### Readings
37 | * Tech reading 2 TBD
38 | * [The Cut Up Method](http://www.writing.upenn.edu/~afilreis/88v/burroughs-cutup.html) by William Burroughs
39 |
40 | #### Assignment
41 | * Transform a non-poetic text into a poetic text using Python. It is up to you to determine how and why a text is poetic or non-poetic. If you are stuck, try techniques like sorting, randomizing, filtering, deleting, or replacing.
42 |
43 | ---
44 |
45 | ### 3. October 2nd
46 |
47 | Web scraping basics. Making big lists.
48 |
49 | #### Readings
50 | * Tech reading 3
51 | * [Uncreative Writing](https://www.chronicle.com/article/Uncreative-Writing/128908) by Kenneth Goldsmith
52 |
53 | ---
54 |
55 | ### 4. October 9th
56 |
57 | Web scraping part 2. APIs. Advanced text manipulation and parsing.
58 |
59 | #### Readings
60 | * Tech reading 4
61 | * [Digital Divide](https://www.artforum.com/print/201207/digital-divide-contemporary-art-and-new-media-31944) by Claire Bishop
62 | * [Montage](https://lucian.uchicago.edu/blogs/mediatheory/keywords/montage/) by Jared Leibowich
63 |
64 | ---
65 |
66 | ### 5. October 16th
67 |
68 | Automating collage.
69 |
70 | #### Readings
71 | * Tech reading 5
72 | * [Too Much World: Is the Internet Dead?](https://www.e-flux.com/journal/49/60004/too-much-world-is-the-internet-dead/) by Hito Steyerl
73 |
74 | ---
75 |
76 | ### 6. October 23rd
77 |
78 | Automating video.
79 |
80 | #### Readings
81 | * Tech reading 6
82 | * [Surrealism: the Last Snapshot of the European Intelligentsia](https://monoskop.org/images/a/a0/Benjamin_Walter_1929_1978_Surrealism_The_Last_Snapshot_of_the_European_Intelligentsia.pdf) by Walter Benjamin
83 |
84 | ---
85 |
86 | ### 7. October 30th
87 |
88 | Bots and project work.
89 |
90 | ---
91 |
92 |
93 | ## Fun/useful Python Libraries
94 | * [moviepy](http://zulko.github.io/moviepy/) - edit video
95 | * [vidpy](http://antiboredom.github.com/vidpy/) - edit video (my library)
96 | * [videogrep](http://antiboredom.github.com/videogrep/) - make supercuts (my library)
97 | * [youtube-dl](https://rg3.github.io/youtube-dl/) - download videos
98 | * [pillow](https://python-pillow.org/) - edit images
99 | * [flask](http://flask.pocoo.org/) - web server
100 | * [twython](https://github.com/ryanmcgrath/twython) - use the Twitter API
101 | * [spacy](https://spacy.io/) - natural language processing
102 | * [requests](http://docs.python-requests.org/en/master/) - easy http requests
103 | * [envelopes](http://tomekwojcik.github.io/envelopes/) - send email
104 | * [opencv](http://opencv.org/) - computer vision
105 | * [asciimatics](https://github.com/peterbrittain/asciimatics) - text-based interfaces and animation
106 | * [colorama](https://github.com/tartley/colorama) - easy color in the terminal
107 |
108 |
--------------------------------------------------------------------------------
/class-notes/class-1.md:
--------------------------------------------------------------------------------
1 | # Sept 18 - The Command Line
2 | **Instructor**: Sam Lavigne | [splavigne@gmail.com](mailto:splavigne@gmail.com)
3 | **Teaching Assistant**: Fernando Ramallo | [fernando.ramallo@gmail.com](mailto:fernando.ramallo@gmail.com)
4 | **Track**: Code Poetry, Fall 2018
5 | **Location**: School for Poetic Computation | 155 Bank St, New York, NY 10014 **Time**: Tuesdays 10am to 1pm
6 | **Office Hours**: Tuesdays 2pm to 4pm (or by appointment)
7 |
8 | Slack channel: #2018-fall-scrapism
9 | Sam’s office hours Sign-up sheet: [+Sam Office Hours](https://paper.dropbox.com/doc/Sam-Office-Hours-gaKmWg2Qo7jnn2FbO7F5b)
10 | Fernando’s office hours sign-up sheet: [+Fernando (TA) Office Hours](https://paper.dropbox.com/doc/Fernando-TA-Office-Hours-p8FxDav0hzpIjrJ4rtfeX)
11 |
12 |
13 | # Reader
14 | - [Intro to the command line](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-01-the-command-line.md)
15 | - [Python basics](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md)
16 | # Notes
17 |
18 |
19 |
20 | - We all introduced ourselves, again!
21 | - We’re gonna assume no technical knowledge, feel free to reach out for questions.
22 | - Sam will record himself giving the class, put it in a private link
23 |
24 |
25 | ## Sam’s work
26 |
27 | http://lav.io/
28 |
29 | How can we make critical statements without saying specifically what those statements are?
30 |
31 | https://lav.io/projects/white-collar-crime-risk-zones/
32 | https://lav.io/projects/baabaa/ - An index of selected commodities listed for sale on alibaba.com. Items are arranged by price and minimum order quantity and are search results for terms like “riot gear” and “human labor”.
33 | https://lav.io/projects/cspan-5/ - most frequently stated phrases turned into a video
34 |
35 |
36 |
37 | ## Scrapism
38 |
39 | Q of this class: how do we make something new by using material that already exists /
40 | What new things are sayable today? .. by means of these tools that wouldn’t be sayable otherwise
41 |
42 | Objectives
43 |
44 | - learn python
45 | - use it to collect material and manipulate it
46 | - use text: how do we create automatic *poetry*
47 | - image: how do we create automatic *collage*
48 | - video: automatic *montage*
49 |
50 | Look at groups and individuals from the past that used rule-based techniques / almost automatically / surrealists, dadaists, situationists
51 | We’re gonna be making critiques, satires, commentaries, poetry.
52 | Process:
53 |
54 | - find a good source material
55 | - figure out how to get that source material (get a lot of it)
56 | - figure out how to parse it and transform it / take something that is a big mess from the internet, take unstructured information / transform it into something you can use
57 | - figure out how to present what you’ve collected to the world / something new
58 |
59 | We’re gonna treat everything *as a text*, looking at images *as if they were* text, e.g. [C-SPAN5 bot](https://twitter.com/cspanfive) (treating video as text that is cut and put together).
60 |
61 | How do these techniques work in a post-Trump environment?
62 | All information is out in the open, does that make this work superfluous?
63 |
64 | I saw a horrible website today!
65 | https://anti-captcha.com
66 |
67 |
68 | ## Class today
69 |
70 | All the things we’re gonna talk about today are gonna be in these readers:
71 |
72 | - [Intro to the command line](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-01-the-command-line.md)
73 | - [Python basics](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md)
74 |
75 | Every class will have a series of readings (technical and non-technical):
76 |
77 | - Technical readings are what we talked about in the class, for reference / when you forget
78 | - The readings are the ones listed in the [syllabus](https://github.com/antiboredom/sfpc-scrapism), *in the slot for the previous class*
79 |
80 |
81 |
82 | ## The Terminal
83 |
84 | Applications > Utilities > Terminal
85 | Cmd+Space > “Terminal”
86 |
87 |
88 | 
89 |
90 |
91 | The terminal is a text-based way of navigating folders
92 |
93 | **Print the directory you’re in:**
94 |
95 | pwd
96 |
97 |
98 | 
99 |
100 |
101 |
102 |
103 | **See what’s in the folder you’re in**
104 |
105 | ls
106 | 
107 |
108 |
109 | **Change the directory you’re in**
110 |
111 | cd [folder you want to enter]
112 |
113 | cd Desktop
114 |
115 |
116 | 
117 |
118 |
119 | **The terminal treats spaces as separators between arguments. Use quotes " to access folders and files with spaces in their names.**
120 |
121 |
122 | cd Creative Cloud Files # doesn't work
123 | cd "Creative Cloud Files"
124 |
125 |
126 |
127 | **Going up: To go up one directory: cd ..**
128 |
129 | cd .. # goes up to the parent folder (the folder that contains the current one)
130 |
131 |
132 | **Making directories: mkdir**
133 |
134 | mkdir [name of the directory]
135 |
136 | mkdir newfolder # makes a folder called 'newfolder'
137 |
138 |
139 |
140 | **Move files and folders and rename them: mv**
141 |
142 |
143 | mv [old name] [new name]
144 |
145 | mv newfolder/ newnamedfolder #renames folder 'newfolder' to 'newnamedfolder'
146 |
147 | slash means folder, it’s optional
148 |
149 | can be used for moving a file, but also be used for renaming
150 |
151 |
152 | **Creating new files: touch**
153 | Updates the last date modified tag for a file or folder, to be right now.
154 | If that file doesn’t exist, it **creates that file**
155 | a fast way of making files
156 |
157 |
158 | touch [name of file or folder]
159 |
160 | touch coolfile.txt #makes an empty file called 'coolfile.txt'
161 |
162 |
163 |
164 | **Delete**
165 |
166 | rm [name of file]
167 |
168 | rm coolfile.txt
169 |
170 |
171 | **Hit tab to autocomplete a file or folder**
172 |
173 | cd Des[HIT TAB] # autocompletes to cd Desktop
174 |
175 |
176 |
177 |
178 | ## Manipulating text
179 |
180 | **Use gutenberg for source text**
181 | A good external source to work with is [Project Gutenberg](http://gutenberg.org/), which offers over 57,000 free public-domain eBooks.
182 |
183 | - Download files in Plain Text format
184 |
185 | Moby Dick text: https://www.gutenberg.org/cache/epub/15/pg15.txt
186 | The Trial by Kafka: https://www.gutenberg.org/cache/epub/7849/pg7849.txt
187 |
188 | Save file as Plain Text Document (or Page Source in Safari)
189 |
190 | **See information about the file**
191 |
192 | file [name of file]
193 |
194 | file mobydick.txt
195 | Output: mobydick.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators
196 |
197 | **Looking inside the contents of a file**
198 |
199 | cat [name of file] # prints content of the file on the screen
200 |
201 | cat mobydick.txt
202 | # .... will print the entire text
203 |
204 | **Use the ‘more’ command to actually read through the text with scrolling**
205 |
206 | more [name of file]
207 |
208 | more mobydick.txt
209 | # ... scroll through the text
210 | # ... type Q to exit
211 |
212 |
213 |
214 | **Best command: say**
215 |
216 | say hello
217 |
218 | say this is your computer i am going to murder you
219 |
220 |
221 |
222 | All the commands have a structure
223 | ***name of command + argument (usually file or folder)***
224 |
225 | **But most commands have additional options**
226 | Almost every command has a built-in manual. Access it with the **man** command
227 |
228 | man say
229 | # will go to the manual about the say command,
230 | # exit by typing Q
231 |
232 | e.g. -v to change the voice, -f file, -r rate
233 | usually two ways of accessing an option, e.g.
234 |
235 | - -r rate
236 | - --rate=rate
237 | say whatever
238 | # says 'whatever' at normal rate
239 |
240 | say -r 500 whatever
241 | # says 'whatever' at the rate of 500 words per minute
242 |
243 | # use -f option to read a file
244 | say -f mobydick.txt
245 | # says the entirety of Moby Dick out loud. Poetic!
246 |
247 |
248 |
249 |
250 | **To stop a command**
251 |
252 | - Ctrl + C: Stops the command
253 | - Cmd + Q (Alt + F4 in windows): Closes the terminal entirely
254 |
255 |
256 | **Use grep command to print every line of a text file that contains a certain word**
257 | a line is understood as every time there’s a carriage return / breaking point / enter in the text
258 |
259 |
260 | grep trial thetrial.txt
261 | # prints all the lines of the text file that has the word 'trial'
262 |
263 | grep whale mobydick.txt
264 |
265 | # to search for more than one word, put it in quotes
266 | grep "the whale" mobydick.txt
267 |
268 |
269 |
270 |
271 | **Sort command**
272 | sorts every line
273 |
274 | sort thetrial.txt
275 | # returns the trial, alphabetically ordered
276 |
277 | sort -u # only uniques
278 | sort -r # reverse
279 |
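For reference once Python comes into the picture: the same three sorts can be sketched in a few lines. This is only an illustration, not something covered in class. The `lines` list is made-up sample data standing in for a text file, and Python orders strings by character code, so results on real texts can differ slightly from what the `sort` command prints.

```python
# Made-up sample lines standing in for the contents of a text file.
lines = ["the trial", "a door", "the trial", "k. waited"]

alphabetical = sorted(lines)                   # roughly: sort
unique_only = sorted(set(lines))               # roughly: sort -u (duplicates removed)
reversed_order = sorted(lines, reverse=True)   # roughly: sort -r

print(alphabetical)
print(unique_only)
print(reversed_order)
```

`set()` throws away duplicates, which is why it stands in for the `-u` option here.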
280 |
281 |
282 |
283 | **Save the output of the command line to a new file, with the > sign**
284 | *this is called a redirect*
285 |
286 | [command] > [file name to save to]
287 |
288 | sort thetrial.txt > thetrial_sorted.txt
289 | # instead of printing, save whatever output to thetrial_sorted.txt file
290 |
291 |
292 |
293 | **You can combine commands together**
294 | take the output of one command, pipe it to another command, and chain things together
295 | e.g. do the sort and grep at the same time
296 |
297 |
298 | # use the vertical bar character (pipe) | to chain commands
299 |
300 | grep whale mobydick.txt | sort
301 | # take the output of the lines from grep, into the sort command, finally to the screen
302 |
303 | grep whale mobydick.txt | sort > sorted_whales.txt
304 | # make a text file with the lines that include "whale", sorted alphabetically
305 |
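The grep-then-sort chain above can also be sketched in Python, which is where the class is heading. Treat this as a rough equivalent under assumptions: the `grep` function and the `text_lines` sample are invented for the example, and the matching is a plain substring check rather than everything the real `grep` can do.

```python
def grep(word, lines):
    """Return every line that contains `word` (case-sensitive substring match)."""
    return [line for line in lines if word in line]

# Stand-in lines; with a real file you could use
# open("mobydick.txt").read().splitlines() instead.
text_lines = [
    "Call me Ishmael.",
    "the whale surfaced",
    "a whale ship was my Yale College",
    "It was a damp, drizzly November.",
]

# Same idea as: grep whale mobydick.txt | sort
sorted_whales = sorted(grep("whale", text_lines))
for line in sorted_whales:
    print(line)
```

The output of one step feeds the next, just like the pipe: `grep(...)` produces a list and `sorted(...)` receives it.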
306 |
307 |
308 | **Other fun commands**
309 |
310 | **use cut to separate words**
311 |
312 | cut # breaks every line in the file by a delimiter,
313 | # e.g. break the lines by spaces,
314 | # -d delimiter
315 | # -f field
316 |
317 | cut -d " " -f 1 mobydick.txt
318 | # separate the lines by empty spaces (therefore separating each word), get the first field (the first instance, ie. the first word), of mobydick.txt
319 |
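A possible Python counterpart to `cut -d " " -f 1`, as a small sketch. The sample `lines` are invented; the one thing to keep in mind is that `cut` counts fields from 1 while Python counts from 0.

```python
# Invented sample lines, including an empty one.
lines = ["Call me Ishmael.", "Some years ago", "", "never mind how long"]

# split(" ") breaks a line on every single space, like cut -d " ".
# Index 0 is what cut calls field 1.
first_words = [line.split(" ")[0] for line in lines]
print(first_words)
```

An empty line gives back an empty string, which matches how `cut` leaves blank lines blank.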
320 | **use a wildcard to access multiple files**
321 |
322 | ls *.txt
323 | # lists any file that ends with .txt
324 |
325 |
326 | **clear to clear the screen**
327 |
328 | clear
329 | # empties the terminal window
330 |
331 |
332 |
333 |
334 | ## How the file system works
335 |
336 | Files and folders,
337 | Every folder has exactly one parent folder, except the very top (the root)
338 |
339 | The root folder (the hard drive) is described as a forward slash /
340 |
341 | cd /
342 | # goes to the root folder
343 |
344 | Some files and folders are **hidden**
345 |
346 | cd /
347 | ls
348 | # will list all the files and folders in the root, you'll see some that are hidden in the Finder / Folder viewer
349 |
350 | Each file/folder has a unique path
351 | You can go to a specific folder and access a file inside it
352 |
353 | cd /Users/sam/Desktop/
354 | # go to the desktop
355 | more thetrial.txt
356 | # if there's a file called thetrial.txt in Desktop, it gets printed out
357 | # otherwise, an error
358 |
359 | But you can also access a file by its **unique path**, from any other folder
360 |
361 | cd /
362 | more /Users/sam/Desktop/thetrial.txt
363 |
364 | *Tip: Drag a folder or file from the Finder to the terminal and get its unique path without having to type it*
365 |
366 | cd can be used to navigate the file system easily
367 |
368 | cd
369 | # cd with no argument goes to your home folder (e.g. /Users/sam), not the root
370 |
371 | cd ../Documents
372 | # .. means one level up
373 | # goes one level up, and then down into the Documents folder, if it exists
374 | # can be combined:
375 | cd ../../../Desktop #go three levels up and then into Desktop
376 |
377 | cd ./Desktop
378 | # . means the folder we are currently in
379 |
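To see how `.` and `..` resolve without touching any real folders, here is a small sketch using Python's standard `posixpath` module. The starting folder `/Users/sam/Desktop` is just the example path from these notes, and nothing here checks whether the folders exist.

```python
import posixpath

current = "/Users/sam/Desktop"  # example starting folder from the notes

# ".." means one level up, "." means the folder we are already in.
documents = posixpath.normpath(posixpath.join(current, "../Documents"))
same_place = posixpath.normpath(posixpath.join(current, "."))
root = posixpath.normpath(posixpath.join(current, "../../.."))

print(documents)   # /Users/sam/Documents
print(same_place)  # /Users/sam/Desktop
print(root)        # /
```

`normpath` only tidies up the text of the path; it is the same bookkeeping `cd` does when it follows `..` upward.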
380 |
381 | **open** opens a file in its default application
382 |
383 | open mobydick.txt
384 | # opens the text file in TextEdit or notepad
385 |
386 | open .
387 | # opens the folder we currently are in, in the folder viewer (eg. Finder)
388 |
389 |
390 | **Some tricks to move the typing cursor quickly**
391 | **Shortcuts:**
392 |
393 | - Ctrl + A: brings the cursor to the beginning of the line
394 | - Ctrl + E: brings the cursor to the end of the line
395 | - Tab: for autocomplete of commands
396 | - “gr” + Tab: show all commands that start with gr
397 | - Cmd + D: splits screen to have multiple terminals
398 | - Cmd + N: makes a new terminal window
399 | - Cmd + T: makes a new tab
400 |
401 | **Another terminal program**
402 |
403 | - iTerm
404 |
405 |
406 |
407 | ## Install python + text editor
408 |
409 | **Installing python**
410 | Your computer comes with python, but we need a different version.
411 |
412 | There are tons of ways to install python.
413 | We’re gonna use a tool called **brew** to install stuff with:
414 | https://brew.sh/
415 |
416 | Take the main example line, copy and paste it into a Terminal, hit enter.
417 |
418 | /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
419 |
420 | “It should just work”
421 |
422 | Once brew is installed, install python, on a terminal:
423 |
424 | brew install python3
425 |
426 |
427 | **Installing a text editor**
428 |
429 | Doesn’t matter what text editor you use, but a few good ones
430 |
431 | - Sublime https://www.sublimetext.com/ **paid but fast!**
432 | - Visual Studio Code https://code.visualstudio.com/ **free/open source**
433 | - Atom https://atom.io/ **free/open source**
434 |
435 | See [Python basics](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md) for install instructions
436 |
437 | Text editors will color-code a python file to show you different parts.
438 |
439 | You can also edit Python files in an **IDE**, “integrated development environment”, they are full platforms for programming, with lots of features. For the purpose of this class we’ll stick to plain text editors.
440 |
441 | **Using python**
442 |
443 | python is just a command line program (a program that you can use in the Terminal)
444 |
445 | you might have more than one python version,
446 | to use the one we’re using type **python3**
447 |
448 | **Way ONE to use python: without arguments**
449 |
450 | In a terminal window:
451 |
452 | python3
453 |
454 |
455 | 
456 |
457 |
458 |
459 | >>> 2+1
460 | 3 # output
461 |
462 |
463 | To exit the python console, press Ctrl + D or type exit()
464 |
465 | Ctrl + D
466 |
467 | >>> exit()
468 |
469 |
470 | **Way TWO: next week!**
471 |
472 |
473 |
474 |
475 | ## Works to look for / Works we’re basing our work on
476 |
477 | **Allison Parrish**
478 | http://www.decontextualize.com/
479 |
480 | https://twitter.com/everyword
481 | a twitter bot that tweets every single word of the english language in alphabetical order
482 |
483 | / when you make a work for this, what is it you’re doing?
484 | / closer to a performance
485 | / the lens of performance can help us understand this work
486 |
487 | not just about the bot itself, about the reactions to the bot
488 |
489 | / related to Claire Bishop’s reading
490 |
491 | Responses to éclair
492 | [https://twitter.com/everyword/status/475170297776447488](https://twitter.com/everyword/status/475170297776447488)
493 |
494 | **Nick Montfort**
495 | 256-character one-line terminal commands that generate poetry
496 | https://nickm.com/poems/ppg256.html
497 |
498 | **Everest Pipkin -** Cloud OCR
499 | http://ifyoulived.org/translations.html
500 | Misusing image conversion / analysis
501 | https://procedural-generation.tumblr.com/
502 |
503 | what does the cloud say according to the computer
504 | poem
505 |
506 | / it’s broken / a natural lifespan/limit
507 |
508 | **Daniel Temkin - Internet Directory**
509 | http://danieltemkin.com/InternetDirectory
510 | A 37k+ page loose-leaf book containing all 115 million .COM domains in alphabetical order, along with current IP addresses.
511 |
512 |
513 | **Sam’s own - Patent Generator**
514 | http://lav.io/2014/05/transform-any-text-into-a-patent-application/
515 | Output: https://saaaam.s3.amazonaws.com/communist.pdf
516 |
517 |
518 | **Kate Compton - Tracery**
519 | http://www.tracery.io/
520 | Text generation
521 |
522 |
523 | / You can make tools
524 | / You can share those tools, see what other people make with it
525 | / You are making a form / with constraints
526 |
527 |
528 | **Kyle McDonald - Keytweeter**
529 | [https://vimeo.com/9922212](https://vimeo.com/9922212)
530 | Tweets everything you type
531 |
532 |
533 |
534 | **Great book for learning python**
535 | Learn Python the hard way
536 | https://www.learnpythonthehardway.org/
537 |
538 | **Other resources for learning python**
539 | Automate the Boring Stuff
540 | https://automatetheboringstuff.com/
541 |
542 | Python for Everybody
543 | https://books.trinket.io/pfe/
544 |
545 |
546 | ## Assignment for next week
547 |
548 | Look at python basics
549 | Read
550 |
551 | - [Python basics](https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md)
552 | - [Artificial Hells (introduction and chapter 1)](https://selforganizedseminar.files.wordpress.com/2011/08/bishop-claire-artificial-hells-participatory-art-and-politics-spectatorship.pdf) By Claire Bishop
553 | - [A User’s Guide to Détournement](http://www.bopsecrets.org/SI/detourn.htm)
554 |
555 |
556 | **Find 3 sentences**
557 | You’re gonna assign them to the rest of the class
558 | not too long
559 | they can come from anywhere / internet real world facebook post product packaging menu
560 | as long as you don’t write them yourself
561 |
562 | Combine them (one after the other)
563 | either make sense together or not
564 | that creates new possibilities when put together
565 |
566 |
567 |
568 |
569 | ## WordHack this Thursday!
570 |
571 | [[link]](https://www.facebook.com/events/713754025655700/?acontext=%7B%22ref%22%3A%2229%22%2C%22ref_notif_type%22%3A%22event_aggregate%22%2C%22action_history%22%3A%22null%22%7D&notif_id=1537184173953188&notif_t=event_aggregate)
572 | WordHack is a monthly evening of performances and talks exploring the intersection of language and technology. Code poetry, digital literature, e-lit, language games, coders interested in the creative side, writers interested in new forms writing can take, all are welcome here.
573 |
574 | This month we will feature talks and performances by:
575 | JOANNE MCNEIL ([http://www.joannemcneil.com/](http://www.joannemcneil.com/))
576 | MARTIN O'LEARY ([http://mewo2.com/](http://mewo2.com/))
577 | ESTHER SEYFFARTH ([https://user.phil.hhu.de/~seyffarth/index.html](https://user.phil.hhu.de/~seyffarth/index.html))
578 |
579 |
580 | ## Synchrony NYC
581 |
582 | Synchrony NYC
583 | http://synchrony.nyc/2019/index.html
584 | Synchrony is a DEMOPARTY that begins in NEW YORK CITY, continues on an Amtrak train, and concludes in MONTREAL.
585 |
586 | Synchrony is about being creative with computers, and seeing how computers can produce amazing sorts of animation, graphics, music, and other experiences. At the end we have COMPOS (competitions) that are voted on by those who are there at the party. Some people may work on their entries for these compos for months beforehand; some, just on the train ride up. People are welcome to enter remotely, even if they are unable to attend.
587 |
588 |
589 |
--------------------------------------------------------------------------------
/class-notes/class-2.md:
--------------------------------------------------------------------------------
1 | # Sept 25 - Python part 2. Manipulating text. Automating writing
2 |
3 | **Instructor**: Sam Lavigne | [splavigne@gmail.com](mailto:splavigne@gmail.com)
4 | **Teaching Assistant**: Fernando Ramallo | [fernando.ramallo@gmail.com](mailto:fernando.ramallo@gmail.com)
5 | **Track**: Code Poetry, Fall 2018
6 | **Location**: School for Poetic Computation | 155 Bank St, New York, NY 10014 **Time**: Tuesdays 10am to 1pm
7 | **Office Hours**: Tuesdays 2pm to 4pm (or by appointment)
8 |
9 | Syllabus: http://github.com/antiboredom/sfpc-scrapism
10 | Slack channel: #2018-fall-scrapism
11 | Sam’s office hours Sign-up sheet: [+Sam Office Hours](https://paper.dropbox.com/doc/Sam-Office-Hours-gaKmWg2Qo7jnn2FbO7F5b)
12 | Fernando’s office hours sign-up sheet: [+Fernando (TA) Office Hours](https://paper.dropbox.com/doc/Fernando-TA-Office-Hours-p8FxDav0hzpIjrJ4rtfeX)
13 |
14 |
15 | # Notes
16 |
17 |
18 |
19 | Fernando gave a presentation about his work
20 |
21 | - His website http://byfernando.com/
22 | - His games https://fernandoramallo.itch.io/ (get in touch if you want a free copy of any)
23 |
24 | We all went through our assignments
25 |
26 |
27 | Get you in the mood of using language that you find around.
28 | Juxtaposition
29 |
30 |
31 | # Readings
32 |
33 | Claire Bishop
34 |
35 | - Good survey of things that have been done around reutilizing existing language
36 | - attempts to create art that are not commodifiable, a characteristic of social art
37 | - frequently dealing with ethical political concerns
38 | - creating a social space, rather than making an object
39 | - social art isn’t held to same standards as normal art, when judged. is it good art? good activism? sometimes neither. important to take note of / be aware of, when making work that’s aesthetic and activist.
40 | - is making an art project the best way to achieve activist goals?
41 | - is doing an activist project the best way to achieve the artistic goals?
42 | - set your own ideas for how your work is judged / sometimes it’s not quantifiable
43 |
44 |
45 | Detournement
46 |
47 | - different interpretations of it
48 | - what is the source text advocating for / in using it you’re eradicating its context
49 | - you’re renewing its value
50 | - as a practitioner, what would your desired goal be? make a new thing and destroy the old? give new value to the old through that act?
51 |
52 |
53 |
54 | # Python
55 |
56 |
57 | # See Sam’s reader with more examples here:
58 |
59 | https://github.com/antiboredom/sfpc-scrapism/blob/master/reader-02-python-basics.md
60 |
61 |
62 |
63 | ## Using the right version
64 |
65 | when you installed python 3, it didn’t remove your old version
66 | if you type python, sometimes it runs the **older** version that comes with Mac, not the one we’ll use
67 |
68 | Depending on your settings, you might be able to type ***python*** in the terminal and get the right version, but to make sure you can type **python3**
69 |
70 | 
71 |
72 |
73 |
74 | **To exit the python console, press Ctrl + D**
75 |
76 |
77 | ## Creating a file with the Terminal
78 |
79 | On the terminal:
80 |
81 |
82 | 1. Make a new folder with **mkdir python_lesson_1**
83 | 2. Enter it with **cd python_lesson_1**
84 | 3. Create a file with **touch hello.py**
85 | 1. **touch** updates a file’s modified date if it exists, otherwise it creates it
86 | 4. Open the file with the default editor with **open hello.py**
87 | 1. To change the default editor: right click the file in Finder > Get Info > Change the default app in the Open With section
88 |
89 |
90 |
91 | ## Writing our first program
92 |
93 | **Print something to the screen:**
94 |
95 | On the text editor for hello.py:
96 |
97 |
98 | print("a specter is haunting europe")
99 |
100 |
101 |
102 | Hit save on your editor
103 |
104 | Run the program to see its output.
105 | On the terminal:
106 |
107 |
108 | $ python3 hello.py
109 |
110 | 
111 |
112 |
113 |
114 | **Expressions**
115 |
116 | python replaces mathematical operations with the value of that operation
117 |
118 |
119 |
120 | # print mathematical operations
121 | print( 1 + 1 ) # outputs 2
122 | print( 5 / 33 )
123 | print( 1 + 7 / 25 * 5 )
124 |
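A few more expressions to try, sketched here to show the order in which python works things out. The variable names are made up for the example; `//` is an extra operator not covered above.

```python
times_first = 1 + 2 * 3       # multiplication happens before addition: 7
parens_first = (1 + 2) * 3    # parentheses change the order: 9
division = 7 / 2              # / gives a number with decimals: 3.5
whole_division = 7 // 2       # // drops the part after the decimal point: 3

print(times_first, parens_first, division, whole_division)
```

When in doubt about the order, add parentheses; they cost nothing and make the intent obvious.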
125 |
126 |
127 | You can compare expressions
128 |
129 |
130 |
131 | print( 1 == 2 ) # prints True or False depending on whether 1 equals 2 (here: False)
132 |
133 | print( 1 < 2 ) # less than
134 | print( 1 > 2 ) # greater than
135 | print( 1 >= 2 ) # greater than or equal to
136 | print( 1 <= 2 ) # less than or equal to
137 | print( 1 != 2 ) # not equal
138 |
139 |
140 |
141 | You can **comment** parts of code out with # so they’re in your file but they don’t run
142 |
143 |
144 |
145 | print(1+2)
146 | # print("Hello")
147 |
148 |
149 | Some editors let you comment out the code you select with **Cmd + /** (**Ctrl + /** on Windows and Linux)
150 |
151 |
152 | You can save the value of an expression with **variables**, where you assign a name to an expression or value
153 |
154 |
155 |
156 | some_number = 100
157 |
158 |
159 | the value 100 is now stored in the variable some_number. We can see its value with print()
160 |
161 |
162 |
163 | print(some_number)
164 | # Output: 100
165 |
166 |
167 | There are different **kinds** of values:
168 |
169 | - Integer: a whole number (1, 2, 3, 5, 1000)
170 | - Float: a number with decimals (1.55345, 2.0)
171 | - String: a piece of text, defined between quotes (“hello”, “a spectre… “)
172 | - Boolean: True or False
173 | - Lists: a list of items
174 |
175 | some_number = 100
176 | some_float = 10.5
177 | some_string = "a spectre is haunting europe"
178 | some_boolean = False
179 | a_list = [ 1, 100, 20, 25, -305 ] # a list of integers
180 | # You can combine types, not a good idea but..
181 | another_list = [ "hi", 1, 1.53242, False ]
182 |
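If you are ever unsure what kind of value you are holding, python's built-in `type()` will tell you. A small sketch, using sample values like the ones above (the `values` and `kinds` names are made up for the example):

```python
values = [100, 10.5, "a spectre is haunting europe", False, [1, 100, 20]]

# type(v).__name__ is python's own short name for each kind of value
kinds = [type(v).__name__ for v in values]
print(kinds)  # ['int', 'float', 'str', 'bool', 'list']
```

Python shortens the names: `int` for integer, `str` for string, `bool` for boolean.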
183 |
184 | The most important kind for us is going to be strings.
185 |
186 |
187 | ## Strings
188 |
189 | A string is a series of characters
190 |
191 | we can make a variable that stores a string
192 |
193 | we can combine variables to make new values.
194 |
195 | If we add two strings together, it **concatenates** them
196 |
197 |
198 |
199 | first_name = "Karl"
200 | last_name = "Marx"
201 |
202 | full_name = first_name + last_name
203 |
204 | print(full_name) # Output: KarlMarx
205 |
206 | # To put a space between the values
207 | full_name = first_name + " " + last_name
208 | print(full_name) # Output: Karl Marx
209 |
210 |
211 | Each character in our string has a numerical index
212 | If we want the first letter, we **access it with brackets and an index** (starting from zero)
213 |
214 |
215 | first_letter = full_name[0]
216 | second_letter = full_name[1]
217 |
218 | print(first_letter) #Output: K
219 |
220 |
221 | If we use an **index outside of the length of the string**, we get an error
222 |
223 |
224 | print(full_name[1000]) # Output: IndexError: string index out of range
225 |
226 |
227 |
228 | We can use **indices with negative numbers** to start at the end and walk our way backwards:
229 |
230 |
231 |
232 | # Get the last letter
233 | last_letter = full_name[-1]
234 | second_to_last_letter = full_name[-2]
235 |
236 |
237 | We can also get **ranges of characters**, this makes python very powerful for our kind of work
238 |
239 |
240 |
241 | print(full_name[0:3]) # Outputs the first three characters: Kar
242 |
243 |
244 | We can combine everything we've seen so far:
245 |
246 |
247 |
248 | print(full_name[4:-1]) # Gets a range from the fifth character up to, but not including, the last one
249 |
250 |
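To check what a slice actually returns, here is a small sketch that reuses the full_name value from above. The variable names first_three, middle and rest are only for illustration; repr() is used so that the leading space in the result is visible.

```python
full_name = "Karl Marx"

# A slice includes the start index but stops right before the end index
first_three = full_name[0:3]

# -1 points at the last character, so this slice stops just before it
middle = full_name[4:-1]

# Leaving out the end index goes all the way to the end of the string
rest = full_name[4:]

print(repr(first_three))
print(repr(middle))
print(repr(rest))
```

Note that index 4 is the space between the two names, so both of the last two slices start with a space.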
251 |
252 | We can check for the length of a string, with **len()**
253 |
254 |
255 |
256 | total_characters = len(full_name)
257 |
258 |
259 |
260 | I can determine if a string contains another string, using the **in** keyword
261 |
262 | - my_string in another_string: return True or False
263 |
264 | sentence = "A spectre is haunting Europe"
265 |
266 | # is "spectre" inside the sentence?
267 | print("spectre" in sentence) #Output: True
268 |
269 | print("specter" in sentence) #Output: False
270 |
271 |
272 | # it's case-sensitive
273 | print("Spectre" in sentence) #Output: False
274 |
275 | # to make the check case-insensitive, we turn it into lowercase
276 | print("europe" in sentence) #Output: False
277 | print("europe" in sentence.lower()) #Output: True. Note: lower() doesn't modify sentence
278 |
279 |
280 | **String methods** let us manipulate strings in interesting ways:
281 |
282 |
283 |
284 | sentence = "A spectre is haunting Europe"
285 |
286 | # Make every character upper case
287 | print(sentence.upper()) # Outputs: A SPECTRE IS HAUNTING EUROPE
288 |
289 | # or lower case
290 | print(sentence.lower()) #Outputs: a spectre is haunting europe
291 |
292 | # capitalize the first letter of each word
293 | print(sentence.title()) #Outputs: A Spectre Is Haunting Europe
294 |
295 | # Use replace to find a word and replace it with another
296 | print(sentence.replace("is", "was")) #Outputs: A spectre was haunting Europe
297 |
298 | # We can chain these operations together
299 | print(sentence.replace("is", "was").upper()) #Outputs: A SPECTRE WAS HAUNTING EUROPE
300 |
301 |
302 | None of these examples **modify the original value**. If we want to actually change it, we assign the result back to the variable:
303 |
304 |
305 | sentence.upper() # only returns the upper case sentence, doesn't modify the variable
306 | sentence = sentence.upper() # assigns the variable to the newer upper case version
307 |
308 |
309 |
310 | You can go through **more string methods** here:
311 | https://docs.python.org/3.7/library/stdtypes.html#string-methods
312 | like center
313 |
314 |
315 | sentence = sentence.center(30, "*") #puts the character * around the sentence until it's 30 characters long
316 |
317 |
318 |
319 | You can also do fun things like **multiplication**
320 |
321 |
322 | hello = "Hello" * 100
323 | print(hello) # Outputs: HelloHelloHelloHelloHello ...
324 |
325 | hello = "hello" + "o" * 100
326 | print(hello) # Outputs: helloooooooooooooo
327 |
328 | hello = "he" + "l" * 1000 + "o"
329 |
330 |
331 |
332 |
333 | We can **combine different types**, but there are different ways
334 |
335 | The bad way:
336 |
337 |
338 | number = 10
339 | message = "The number is " + number
340 | # This raises a TypeError, because Python won't concatenate a str and an int
341 |
342 |
343 | The OK way, convert a number to a string:
344 |
345 |
346 | message = "The number is " + str(number)
347 |
348 |
349 | The better way if you have lots of numbers, use format, it’ll replace {} with the number
350 |
351 |
352 | # one value
353 | message = "The number is {}".format(number)
354 |
355 | # two values
356 | message = "The number is {} and the 2nd number is {}.".format(number, 100)
357 | print(message) # Outputs: The number is 10 and the 2nd number is 100.
358 |
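format() can also pick its arguments by position or by name, which helps once a message has several values. This is a minimal sketch; the variable names by_position and by_name are made up for the example.

```python
first_name = "Karl"
age = 235

# Numbered placeholders pick arguments by position, so one can be reused
by_position = "{0} {0} {1}".format(first_name, age)

# Named placeholders are matched by keyword instead of by order
by_name = "{name} is {age} years old".format(name=first_name, age=age)

print(by_position)
print(by_name)
```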
359 |
360 |
361 | ## Make the computer say it
362 |
363 | In the terminal
364 |
365 |
366 | python3 strings.py | say
367 |
368 |
369 |
370 |
371 | ## Save the output of a python file to a text file from the terminal
372 |
373 |
374 |
375 | python3 strings.py > strings.txt
376 |
377 |
378 |
379 | ## Lists
380 |
381 | Make an empty lists.py file
382 | In the terminal:
383 |
384 | touch lists.py
385 | open lists.py
386 |
387 |
388 | A lot of methods from strings apply to lists.
389 |
390 |
391 | # Declaring a list
392 | names = ["Marx", "Trotsky", "Lenin", "Engels"]
393 |
394 | # Get the length with len(names)
395 | print("Total names: ", len(names)) # Outputs: Total names:  4
396 |
397 | # Add items to a list
398 | names.append("Stravinsky")
399 |
400 | # We can declare an empty list
401 | some_list = []
402 |
403 | # You can multiply a list
404 | print(names * 10) # Outputs a list with the content of the names list 10 times
405 |
406 | # You can add lists together
407 | print(names + some_list)
408 |
409 | # You can access individual items by their index starting with zero
410 | print(names[0]) # First item
411 | print(names[-1]) # Last item
412 | print(names[0:3]) # A list with the first three items (index 3 is not included)
413 |
414 |
415 |
416 |
417 | We can go through every item in our list, called **iteration,** using the **for** keyword
418 |
419 |
420 | # declare a variable name first, then the list we're going through second
421 | # it'll temporarily store each of the values in the variable name
422 | for name in names:
423 |     print(name)
424 | # Outputs: calls print for every item, outputs its value
425 |
426 | In other languages a *block* is defined with brackets { }, but in python it’s defined by **white space, using indentation**
427 | Anything that shares the same indentation (e.g. a Tab), is part of the same block
428 |
429 |
430 | for name in names:
431 |     print(name)
432 |     print("is a dead white guy") # also inside the loop
433 |
434 |     print("and so is:") # Still inside the loop
435 |
436 | print("That's all the dead white guys in our list") # Outside of the loop
437 |
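One way to see exactly which lines belong to the loop is to collect the results in a list instead of printing them. This is only a sketch; the list name output and the short names list are chosen for the example.

```python
names = ["Marx", "Engels"]
output = []

for name in names:
    output.append(name)              # indented: runs once per name
    output.append("is in the list")  # same indentation, so also inside the loop

output.append("end of list")         # not indented: runs once, after the loop

print(output)
```

The two indented lines repeat for every name, while the last append only happens once.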
438 |
439 |
440 |
441 |
442 | ## More
443 |
444 | We’ll grab Kafka’s Metamorphosis from gutenberg
445 | https://www.gutenberg.org/cache/epub/5200/pg5200.txt
446 |
447 | Save it to a file next to our python script
448 |
449 | We’ll read the text file and store it as a variable
450 | In our python script:
451 |
452 |
453 | text = open("kafka.txt").read() # the name of the file, relative to where the script is
454 |
455 | print(text) # Outputs: the entire text
456 |
457 |
458 |
459 | Now we can do stuff with it
460 |
461 |
462 |
463 | print(text.upper())
464 |
465 |
466 |
467 | To read the file line by line, instead of read() we use readlines()
468 |
469 |
470 | text = open("kafka.txt").readlines()
471 | # text is now a list of string items, with each line from the file
472 |
473 |
474 | Now we can iterate over the lines
475 |
476 |
477 |
478 | for line in text:
479 |     print(line) #Outputs each line
480 |
481 |
482 | The problem: there’s a blank line in between each line.
483 | This is because each line still ends with a **newline character**, and print() adds another line break of its own
484 | We can get rid of the extra one with strip()
485 |
486 |
487 |
488 | for line in text:
489 |     line = line.strip()
490 |     print(line) # Outputs each line without whitespace or extra line breaks
491 |
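The newline character is invisible when printed, so a quick way to see it is repr(). The sketch below uses a hand-written list called lines as a stand-in for what readlines() returns (the sample text is an assumption, not taken from the file), and also shows rstrip("\n"), which removes only the trailing newline.

```python
# Stand-in for the result of readlines(): each item keeps its trailing "\n"
lines = ["One morning,\n", "  when Gregor Samsa woke\n", "\n"]

stripped = []
only_newline_removed = []

for line in lines:
    print(repr(line))  # repr() makes the hidden "\n" visible

    # strip() removes whitespace (spaces, tabs, newlines) from both ends
    stripped.append(line.strip())

    # rstrip("\n") removes only newlines at the end and keeps leading spaces
    only_newline_removed.append(line.rstrip("\n"))

print(stripped)
print(only_newline_removed)
```

If the indentation at the start of a line matters for your piece, rstrip("\n") is the gentler option.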
492 |
493 |
494 | Each of the lines is a string, so we can print parts of each line
495 |
496 |
497 |
498 | for line in text:
499 |     line = line.strip()
500 |     print(line[0:4])
501 |
502 | # Output is the first four characters of each line
503 |
504 |
505 | Or do fun stuff like replacing
506 |
507 |
508 |
509 | for line in text:
510 |     line = line.strip()
511 |     print(line.replace('e', 'eeeeeee'))
512 |
513 |
514 |
515 |
516 | ## Processing text
517 |
518 | We’re gonna use a function called split() to break down a string according to a delimiter character.
519 | You can use split() to return a string as a list separated by a character
520 | You can use join() to join a list back into a string
521 |
522 |
523 | for line in text:
524 |     line = line.strip()
525 |     words = line.split(" ") # Separates the line at each space, getting a list of words
526 |
527 |     print(words[0]) # Outputs the first word of each line
528 |
529 |     # Chain it all together!
530 |     print(words[0].center(30, '~').upper())
531 |
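Here is a small sketch of split() and join() on a single string, so the round trip is easy to follow. The sample sentences are assumptions for the example. It also shows a detail that can surprise: split(" ") produces empty strings when there are several spaces in a row, while split() with no argument treats any run of whitespace as one separator.

```python
line = "One morning, when Gregor Samsa woke"

words = line.split(" ")      # a list of words
rejoined = " ".join(words)   # back to the original string
dashed = "-".join(words)     # any string can be the glue

messy = "a   spectre  is haunting"
with_empties = messy.split(" ")  # repeated spaces give empty strings
clean = messy.split()            # no argument: splits on any run of whitespace

print(words)
print(dashed)
print(with_empties)
print(clean)
```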
532 |
533 |
534 | We can use the **random** methods to do interesting stuff
535 |
536 | Sometimes you have to tell python to add **modules** with the **import** keyword to add functionality you need. Here we’ll import the [random module](https://docs.python.org/3.5/library/random.html).
537 |
538 | - Use the documentation to find what you can do with a module
539 | - Make sure you’re seeing the documentation of the python version you’re using (e.g. 3.5)
540 | # Import the module
541 | import random
542 |
543 | text = open("kafka.txt").readlines()
544 |
545 | for line in text:
546 |     line = line.strip()
547 |     words = line.split(" ")
548 |
549 |     random_word = random.choice(words) #Get a random item from the word list
550 |
551 |     random.shuffle(words) # Randomizes the order of the items in the list
552 |
553 |
554 |
555 | We use the join() method to join the randomized word list in to a string
556 |
557 |
558 |
559 | for line in text:
560 |     line = line.strip()
561 |     words = line.split(" ")
562 |     random.shuffle(words)
563 |
564 |     new_line = " ".join(words) # Joins each element in the list by sticking the space character in between the words, outputs a string
565 |
566 |
567 |
568 | We can sort with sorted()
569 |
570 |
571 |
572 | for line in text:
573 |     line = line.strip()
574 |     words = line.split(" ")
575 |     random.shuffle(words)
576 |
577 |     words = sorted(words) # Sort the words list alphabetically
578 |
579 |     new_line = " ".join(words)
580 |
581 |
582 |
583 | Final script
584 |
585 | # Import the module
586 | import random
587 |
588 | text = open("kafka.txt").readlines()
589 | for line in text:
590 |     line = line.strip()
591 |     words = line.split(" ")
592 |     random.shuffle(words)
593 |     words = sorted(words)
594 |     new_line = " ".join(words)
595 |     print(new_line)
596 |
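As a possible next step, the shuffle part can be wrapped in a function so it works on any string, not just lines from kafka.txt. This is a sketch under a few assumptions: the function name shuffle_words and the sample sentence are made up, and a seeded random.Random object is passed in only so the result is repeatable while you experiment.

```python
import random


def shuffle_words(line, rng=random):
    """Return the words of a line in random order, joined by spaces."""
    words = line.strip().split(" ")
    rng.shuffle(words)  # shuffles the list in place
    return " ".join(words)


original = "One morning, when Gregor Samsa woke from troubled dreams"

# Two generators with the same seed produce the same shuffle
first_try = shuffle_words(original, random.Random(42))
second_try = shuffle_words(original, random.Random(42))

print(first_try)
```

Calling shuffle_words(original) without the second argument uses the random module directly and gives a different order on each run.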
597 |
598 | ## List comprehension
599 |
600 | Make a new file comps.py
601 |
602 |
603 | We can make a list of upper case’d items
604 |
605 | names = ["Trotsky", "Marx", "Lenin", "Engels"]
606 |
607 | uppercase_names = []
608 | for name in names:
609 |     uppercase_names.append(name.upper())
610 |
611 |
612 |
613 | There’s a handier way of doing this in python, called **list comprehension.**
614 | This does the same thing as the example above
615 |
616 | names = ["Trotsky", "Marx", "Lenin", "Engels"]
617 |
618 | uppercase_names = [name.upper() for name in names]
619 |
620 |
621 | It’s saying: for every value in the list **names** temporarily store it as a variable **name**, make that upper case and store it in a new list called **uppercase_names**
622 |
623 |
624 |
625 | names = [name.replace('r', 'arrrrr') for name in names]
626 |
627 |
628 | We can filter too, by adding **if statements** inside too:
629 |
630 |
631 | names = [name for name in names if name[0] == "L"]
632 | # returns elements inside of the list whose first letter is L (the comparison is case-sensitive)
633 |
634 |
635 |
636 | We can add this filtering technique to the words in our previous example
637 |
638 | import random
639 |
640 | text = open("kafka.txt").readlines()
641 | for line in text:
642 |     line = line.strip()
643 |     words = line.split(" ")
644 |
645 |     words = [word for word in words if word.startswith("a")]
646 |
647 |     new_line = " ".join(words)
648 |
649 |     print(new_line)
650 |     # prints all the words that start with a
651 |
652 | OR more:
653 |
654 | words = [word for word in words if len(word) > 5]
655 | # all the words that have more than 5 characters in them
656 |
657 |
658 | words = [word for word in words if word.endswith("ing")]
659 | # all the words that end in ing
660 |
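A comprehension can transform and filter in the same line. The sketch below runs on a made-up sentence instead of the Kafka file so the result is easy to check; long_upper and starts_with_a are illustrative names.

```python
words = "a spectre is haunting Europe and haunting all of us".split(" ")

# Keep only words longer than 5 characters, and upper-case the ones we keep
long_upper = [word.upper() for word in words if len(word) > 5]

# Keep only the words that start with "a"
starts_with_a = [word for word in words if word.startswith("a")]

print(long_upper)
print(starts_with_a)
```

The part before "for" is what goes into the new list, and the part after "if" decides which items are kept.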
661 |
662 |
663 | # Assignment for next week
664 |
665 | Also available in: https://github.com/antiboredom/sfpc-scrapism
666 |
667 | Transform a non-poetic text into a poetic text
668 |
669 | - up to you to determine what’s poetic
670 |
671 | Read some file, or if the text is short you can just put that text directly into python as a variable
672 |
673 | If you don’t know what to do, try stuff like sorting, randomizing, replacing, deleting things
674 |
675 | By taking something that exists and using these methods we can reformat it and rework it, using whatever is at your disposal. You’re not bound by the command line, so you can take the output of that text and format it into something interesting, put it into openFrameworks, whatever you want to do
676 |
677 | Take something that exists, do something that transforms it.
678 |
679 | If you’re more advanced, you can start to get into using third party libraries to analyze text.
680 | If you’re feeling ambitious, make this program so that it can deal with any text. Make this poetic operation so it can work with any text that you feed it.
681 |
682 |
683 |
--------------------------------------------------------------------------------
/class-notes/class-3.md:
--------------------------------------------------------------------------------
1 | # 10/02 - Dictionaries, scraping the web
2 |
3 |
4 |
5 | # Dictionaries
6 |
7 | List = collection of items ordered numerically
8 | Dictionary = no order, the items are indexed by another variable (usually a String)
9 |
10 |
11 |
12 | On the terminal
13 | Make a new file and open it
14 |
15 |
16 | $ touch dicts.py
17 | $ open dicts.py
18 |
19 |
20 | **Dictionaries are Key and Value pairs**
21 | They’re used to represent structures of data
22 |
23 | In python, you define dictionaries with curly brackets { }
24 |
25 |
26 | person = { } # empty dictionary
27 |
28 | person = { "first_name": "Karl", "last_name": "Marx", "age": 235 }
29 |
30 | # An easier way to look at it:
31 | person = {
32 |     "first_name": "Karl",
33 |     "last_name": "Marx",
34 |     "age": 235
35 | }
36 |
37 |
38 | “first_name” is the **Key**, “Karl” is the **value**
39 |
40 | the values can be of any type: int, float, boolean, Strings, or even other dictionaries
41 |
42 | **Dictionaries can contain any type, including dictionaries and lists**
43 |
44 |
45 | person = {
46 | "first_name": "Karl",
47 | "last_name": "Marx",
48 | "age": 235,
49 | "pet": {
50 | "name": "Proleterry",
51 | "species": "parrot",
52 | "age": 12
53 | },
54 | "favorite_books": ["Ethics", "Twilight"]
55 | }
56 |
57 |
58 |
59 | You’ll want to do things with values in the dictionary
60 |
61 |
62 | ## Getting values
63 |
64 | **You can get a value from a dictionary using brackets and accessing the key**
65 | The key has to be exactly the name of the key, e.g. first_name
66 | If it doesn’t exist, an error halts the program
67 |
68 |
69 | # 1. access the value using brackets by referencing the key
70 | print( person["first_name!"] ) # Raises a KeyError: there is no key named first_name!
71 |
72 | print( person["first_name"] ) #Outputs: Karl
73 |
74 |
75 | **A safer way is to use the get method,**
76 | Returns None without an error if the key isn’t present
77 |
78 |
79 | name = person.get("first_name")
80 |
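get() also accepts a second argument, a default value to return when the key is missing. A minimal sketch, using a smaller version of the person dictionary; the "(none)" placeholder is just an example value.

```python
person = {"first_name": "Karl", "last_name": "Marx"}

# Without a default, a missing key gives None
middle = person.get("middle_name")

# With a default, a missing key gives the value we chose
middle_or_default = person.get("middle_name", "(none)")

# The in keyword checks whether a key exists at all
has_middle_name = "middle_name" in person

print(middle, middle_or_default, has_middle_name)
```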
81 |
82 | Sometimes dictionaries will have nested values, like a list and dictionaries, so you’ll **iterate** through the values
83 |
84 |
85 | for book in person["favorite_books"]:
86 |     print(book)
87 |
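Nested values are reached by chaining brackets, one level at a time. The sketch below reuses the person dictionary from above; the "employer" key is hypothetical and is only there to show what happens when a nested key is missing.

```python
person = {
    "first_name": "Karl",
    "pet": {"name": "Proleterry", "species": "parrot", "age": 12},
    "favorite_books": ["Ethics", "Twilight"],
}

# First get the inner dictionary, then the key inside it
pet_name = person["pet"]["name"]

# A list inside a dictionary is indexed like any other list
first_book = person["favorite_books"][0]

# Using an empty dictionary as the default keeps the chain from failing
employer_name = person.get("employer", {}).get("name")

print(pet_name, first_book, employer_name)
```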
88 |
89 |
90 | ## You can iterate through a dictionary
91 |
92 | and get all its properties
93 |
94 |
95 | for key in person:
96 |     print(key) # prints all the keys
97 |     print(person[key]) # prints all the values
98 |
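The items() method gives the key and the value together on each pass of the loop, which saves the extra lookup. A short sketch with a reduced person dictionary; the list named pairs is only there to collect the results.

```python
person = {"first_name": "Karl", "last_name": "Marx", "age": 235}

pairs = []
for key, value in person.items():
    print(key, "=", value)
    pairs.append((key, value))

# keys() and values() give each half on its own
keys = list(person.keys())
values = list(person.values())
```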
99 |
100 |
101 | ## Adding and modifying the dictionary
102 |
103 | Accessing a key and modifying its value will override the value for that key:
104 |
105 |
106 | # replaces the value for first_name
107 | person["first_name"] = "Lenin"
108 |
109 |
110 | If the key doesn’t exist, you can create it and assign a value
111 |
112 |
113 | person["middle_name"] = "Terry"
114 | # now there's a new key middle_name with value Terry
115 |
116 |
117 |
118 |
119 |
120 | # Intro to HTML
121 |
122 | HTML is the markup language that the web is written in.
123 |
124 |
125 | ## Tags
126 |
127 | Works as a series of **tags**
128 | A tag looks like
129 |
130 | \
paragraph
137 | - \<b\> makes text bold
138 | - this text is normal and \<b\>this text is bold\</b\>
139 | - \<a\> makes a link
140 | - \<a href="…"\>go to google\</a\>
141 | - \ Hi I’m very important\ I am somewhat important\ I am also somewhat important\ a paragraph\ tags
215 | p a {
216 |
217 | }
218 |
219 | // style everything with a certain class name, preceded with a period
220 | .moderately-important {
221 |
222 | }
223 |
224 | // style an id, using #. e.g. style this \ logo text\ tags, but only if they're a certain class
230 | p a.moderately-important {
231 |
232 | }
233 |
234 | # Web scraping
235 |
236 |
237 | Open Chrome
238 |
239 |
240 | ## View source
241 |
242 | go to a website
243 | e.g. https://newyork.craigslist.org/d/antiques/search/ata
244 |
245 | Right click \> View Source
246 | to see the source code
247 |
248 |
249 |
250 | ## See source code for specific elements
251 |
252 | Right click \> Inspect
253 |
254 | Highlights the part of the website as you hover over the source code.
255 |
256 | 
257 |
258 |
259 |
260 |
261 |
262 | ## To scrape you want to figure out how to find a certain element
263 |
264 | We right click a header and inspect the structure of the page where that element is.
265 |
266 | We see that it’s a specific **class**, so we can find all elements of that class to see if that gives us all the headers.
267 |
268 | We right click a craigslist header and find:
269 |
270 | 
271 |
272 |
273 | we see that its class attribute says **result-title,** and that it’s inside an **\<a\>** tag
274 | so we’ll try to find **all the \<a\> tags with the result-title class, to find all the headers**
275 |
276 | ## Testing inside the browser
277 |
278 | You can quickly find elements inside the browser using the **Console.**
279 |
280 | you can use document.querySelectorAll(), which takes one argument: a CSS selector
281 |
282 |
283 |
284 | document.querySelectorAll("h2") // finds all the h2 tags
285 |
286 |
287 |
288 |
289 |
290 | ## Getting the CSS selector for an element automatically
291 |
292 | In the Elements panel:
293 | Right click \> Copy \> Copy Selector
294 | gets you the CSS selector
295 | **but this only helps sometimes**
296 |
297 | 
298 |
299 |
300 |
301 |
302 | # How do we translate this into Python and make it automatic?
303 |
304 | We’ll use a library called requests-html
305 |
306 |
307 | - Documentation, How Tos
308 | - https://html.python-requests.org/
309 |
310 |
311 | It’s a library and you can scrape HTML pages with it
312 |
313 | To scrape a page:
314 |
315 | - First you download the page, then you convert it into a python data structure you can manipulate
316 | - Getting the HTML involves downloading the page and getting all the text
317 | - The second part is called **parsing**, going through the text and getting data from it
318 |
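To make the parsing step concrete without touching the network, here is a rough sketch that uses Python's built-in html.parser module on a small hand-written page. This is not how requests-html works internally and it is not the approach used in the rest of these notes; the TitleCollector class, the sample HTML and its links are all made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collects the text of tags whose class list includes result-title."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the value can be None
        classes = (dict(attrs).get("class") or "").split()
        if "result-title" in classes:
            self.in_title = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the title we were reading
        self.in_title = False

    def handle_data(self, data):
        if self.in_title and data.strip():
            self.titles.append(data.strip())


# Stand-in for the text of a downloaded page
page = """
<ul>
  <li><a class="result-title hdrlnk" href="/post/1.html">Antique lamp</a></li>
  <li><a class="result-title hdrlnk" href="/post/2.html">Old radio</a></li>
  <li><a class="other" href="/about">About</a></li>
</ul>
"""

parser = TitleCollector()
parser.feed(page)
print(parser.titles)
```

A library like requests-html hides this bookkeeping behind a single find() call with a CSS selector, which is why we use it below.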
319 |
320 | ## Installing the library
321 |
322 | You can install libraries in self-contained environments, or globally.
323 |
324 | On the terminal:
325 |
326 | $ pip3 install requests-html
327 |
328 |
329 |
330 |
331 | ## Using it
332 |
333 | On a new python file
334 |
335 |
336 | #import the library
337 | from requests_html import HTMLSession
338 |
339 | # create a new session
340 | session = HTMLSession()
341 |
342 | # open a website
343 | r = session.get("https://newyork.craigslist.org/d/missed-connections/search/mis")
344 |
345 | print(r) # prints the response with its status code, e.g. <Response [200]> if the page opened fine
346 |
347 |
348 |
349 | **You can find items in the page using css selectors**
350 |
351 |
352 | titles = r.html.find(".result-title")
353 |
354 | for title in titles:
355 |     print(title) # prints out the entire tag
356 |
357 |     print(title.text) # prints out the text inside the tag
358 |
359 |
360 |
361 | **You can access attributes from the items in the page**
362 |
363 |
364 | for title in titles:
365 |     print(title.attrs["href"]) # gets the URL in the href attribute
366 |
367 |     print(title.attrs.get("href")) # a safer way, since some tags might not have the attribute
368 |
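On some sites the href is a relative path such as /post/1.html instead of a full URL. The standard library function urljoin can turn those into full URLs. The sketch below assumes a made-up list of href values (including a None for a tag without the attribute), so treat the paths as placeholders.

```python
from urllib.parse import urljoin

# The page the links were found on
base = "https://newyork.craigslist.org/d/missed-connections/search/mis"

# Hypothetical href values: one relative, one already complete, one missing
hrefs = [
    "/brk/mis/d/example/123.html",
    "https://newyork.craigslist.org/mnh/mis/d/example/456.html",
    None,
]

urls = []
for href in hrefs:
    if href is None:
        continue  # this tag had no href attribute
    # Relative paths are resolved against base; full URLs pass through unchanged
    urls.append(urljoin(base, href))

print(urls)
```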
369 |
370 | **We want to also get the description, so we can tell the computer to click on a link and get a part of that other page**
371 |
372 | **We also want to get a single element of a page**
373 |
374 |
375 |
376 | titles = r.html.find(".result-title")
377 | for title in titles:
378 |     url = title.attrs.get("href")
379 |     name = title.text
380 |
381 |     # open the URL we found
382 |     r = session.get(url)
383 |     # we found that the part of the article page we want has the id "postingbody"
384 |     # so we get the part of the page with the id (ids are prefaced by #)
385 |     content = r.html.find("#postingbody", first=True)
386 |
387 |     if content is not None: # only if we found something (find returns None otherwise)
388 |         print(content.text)
389 |
390 | # without the first=True argument we'd get a list, and content.text would throw an error
391 | content = r.html.find("#postingbody")
392 |
393 |
394 | **Errors from r.html.find()**
395 | You might get an error for a few reasons
396 |
397 | - It couldn’t find that element
398 |
399 | **Mitigate the requests to not get banned with the time module**
400 |
401 |
402 | # at the top of the page, import the time module
403 | import time
404 |
405 | # use the sleep method to stop the script
406 | for title in titles:
407 |     time.sleep(0.2) # stop the script for 0.2 seconds
408 |     #...
409 |
410 |
411 |
412 |
413 | ## Full script
414 | import time
415 | from requests_html import HTMLSession
416 |
417 | session = HTMLSession()
418 | r = session.get("https://newyork.craigslist.org/d/missed-connections/search/mis")
419 |
420 | titles = r.html.find(".result-title")
421 | for title in titles:
422 |     url = title.attrs.get("href")
423 |     name = title.text
424 |
425 |     r = session.get(url)
426 |     content = r.html.find("#postingbody", first=True)
427 |
428 |     if content is not None: # only if we found something
429 |         print(content.text)
430 |
431 |     time.sleep(0.2)
432 |
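One possible way to tidy this up is to put the scraping into a function that returns a list of dictionaries, so the results can be sorted, filtered or saved later instead of only printed. This is a sketch, not the script from class: the function name scrape_posts and the dictionary keys are made up, and it assumes the session object behaves like the HTMLSession used above.

```python
import time


def scrape_posts(session, listing_url, delay=0.2):
    """Return a list of dictionaries, one for each post that has a body."""
    listing = session.get(listing_url)
    posts = []

    for title in listing.html.find(".result-title"):
        url = title.attrs.get("href")
        if not url:
            continue  # skip titles that don't link anywhere

        page = session.get(url)
        content = page.html.find("#postingbody", first=True)

        if content is not None and content.text:
            posts.append({"title": title.text, "url": url, "body": content.text})

        time.sleep(delay)  # pause between requests

    return posts


# Usage, assuming requests-html is installed:
# from requests_html import HTMLSession
# posts = scrape_posts(HTMLSession(), "https://newyork.craigslist.org/d/missed-connections/search/mis")
# for post in posts:
#     print(post["title"])
```

Passing the session in as an argument also means the function can be tried out with a stand-in object before pointing it at a real site.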
433 |
434 |
435 |
436 | ## Other ways of parsing a page
437 |
438 | Instead, you can have a for loop that goes through the html object
441 |
442 | - It’ll use **intelligent pagination** where it automatically looks for “next” links, and gives you all the subsequent pages, e.g for search results
443 | - This is easier sometimes
444 |
445 | for html in r.html:
446 |     titles = html.find(".result-title")
447 |
448 |     for title in titles:
449 |         print(title)
450 |
451 |
452 |
453 |
454 | ## Getting multiple items
455 |
456 | In alibaba search results, we might want to get several elements of a post, instead of finding both elements separately we can get the whole post.
457 |
458 | alibaba.py
459 |
460 |
461 | from requests_html import HTMLSession
462 | session = HTMLSession()
463 |
464 | r = session.get("https://www.alibaba.com/trade/search?fsb=y&IndexArea=product_en&CatId=&SearchText=drugs")
465 |
466 |
467 |
468 | We inspect the title we want
469 |
470 |
471 | 
472 |
473 |
474 | we go up the hierarchy, hovering over the code and find what fills up the entire post
475 |
476 | we find \ makes a header
142 | - \
My Header\
143 | - \ to indicate which image
170 | - \
171 |
172 |
173 | ## Structure
174 |
175 | A web page looks like this
176 |
177 |
178 | \
179 | \
180 | \
Hello i am header\
184 |
185 | \