├── .gitattributes
├── .travis.yml
├── EXAMPLES.md
├── LICENSE
├── Makefile
├── README.md
├── go.mod
├── goreleaser.yml
├── main.go
└── pluck
    ├── plucker.go
    ├── plucker_test.go
    ├── striphtml
        ├── striphtml.go
        └── striphtml_test.go
    └── test
        ├── config.toml
        ├── config2.toml
        ├── food.toml
        ├── logo.png
        ├── main.py
        ├── song.html
        ├── song.toml
        ├── test.txt
        └── test2.txt


/.gitattributes:
--------------------------------------------------------------------------------
1 | vendor/* linguist-vendored
2 | pluck/test/* linguist-vendored
3 |


--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: go
2 |
3 | go:
4 |   - tip
5 |
6 | before_install: cd pluck


--------------------------------------------------------------------------------
/EXAMPLES.md:
--------------------------------------------------------------------------------
1 | # Examples
2 |
3 | ## Get headlines from news.google.com
4 |
5 | ```
6 | $ pluck -a 'role="heading"' -a '>' -d '<' -t -s -u 'https://news.google.com/news/?ned=us&hl=en'
7 | ```
8 |
9 | ## Get latest tweets from Donald Trump
10 |
11 | ```
12 | $ wget https://twitter.com/search\?f\=tweets\&vertical\=default\&q\=from%3ArealDonaldTrump\&src\=typd -O twitter.html
13 | $ pluck -a '
' -d '
' -l -1 -s -f twitter.html
14 | ```
15 |
16 | ## Read comments from Hacker News page
17 |
18 | ```
19 | $ pluck -s -t -a 'class="c00"' -a '>' -d '"]
35 | deactivator = "<"
36 | name = "critic_ratings"
37 |
38 | [[pluck]]
39 | activators = ["all-critics-numbers","Average Rating:",">"]
40 | deactivator = "/"
41 | name = "average_critic_rating"
42 |
43 | [[pluck]]
44 | activators = ["audience-score","Average Rating:",">"]
45 | deactivator = "/"
46 | name = "average_user_rating"
47 |
48 | [[pluck]]
49 | activators = ["User Ratings:",">"]
50 | deactivator = "<"
51 | name = "user_ratings"
52 | ```
53 |
54 | ```bash
55 | $ pluck -s -c rt.toml -u https://www.rottentomatoes.com/m/spider_man_homecoming/
56 | ```
57 |
58 | Returns:
59 |
60 | ```json
61 | {
62 |     "average_critic_rating": "7.6",
63 |     "average_user_rating": "4.3",
64 |     "critic_ratings": "276",
65 |     "name": "Spider-Man: Homecoming",
66 |     "user_ratings": "85,131"
67 | }
68 | ```
69 |


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2017 Zack
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |


--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | # Make a release with
2 | # make -j4 release
3 |
4 | VERSION=$(shell git describe)
5 | LDFLAGS=-ldflags "-s -w -X main.version=${VERSION}"
6 |
7 | .PHONY: build
8 | build:
9 | 	go build ${LDFLAGS} -o dist/pluck
10 |
11 | .PHONY: linuxarm
12 | linuxarm:
13 | 	env GOOS=linux GOARCH=arm go build ${LDFLAGS} -o dist/pluck_linux_arm
14 | 	# cd dist && upx --brute pluck_linux_arm
15 |
16 | .PHONY: linux64
17 | linux64:
18 | 	env GOOS=linux GOARCH=amd64 go build ${LDFLAGS} -o dist/pluck_linux_amd64
19 | 	cd dist && upx --brute pluck_linux_amd64
20 |
21 | .PHONY: windows
22 | windows:
23 | 	env GOOS=windows GOARCH=amd64 go build ${LDFLAGS} -o dist/pluck_windows_amd64.exe
24 | 	# cd dist && upx --brute pluck_windows_amd64.exe
25 |
26 | .PHONY: osx
27 | osx:
28 | 	env GOOS=darwin GOARCH=amd64 go build ${LDFLAGS} -o dist/pluck_osx_amd64
29 | 	# cd dist && upx --brute pluck_osx_amd64
30 |
31 | .PHONY: release
32 | release: osx windows linux64 linuxarm
33 |


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
10 |
11 | Pluck text in a fast and intuitive way. :rooster:
12 |
13 | *pluck* makes text extraction intuitive and [fast](https://github.com/schollz/pluck#current-benchmark). You can specify an extraction in nearly the same way you'd tell a person trying to extract the text by hand: "OK Bob, every time you find *X* and then *Y*, copy down everything you see until you encounter *Z*."
14 |
15 | In *pluck*, *X* and *Y* are called *activators* and *Z* is called the *deactivator*. The file/URL being plucked is parsed (or streamed) byte-by-byte into a finite state machine. Once all *activators* are found, the following bytes are saved to a buffer, which is added to a list of results once the *deactivator* is found. Multiple queries are extracted simultaneously and there is no requirement on the file format (e.g. XML/HTML), as long as it's text.
16 |
17 |
18 | # Why?
19 |
20 | *pluck* was made as a simple alternative to xpath and regexp. Through simple declarations, *pluck* allows complex procedures like [extracting text in nested HTML tags](https://github.com/schollz/pluck#use-config-file), or [extracting the content of an attribute of an HTML tag](https://github.com/schollz/pluck#basic-usage). *pluck* may not work in all scenarios, so do not consider it a replacement for xpath or regexp.
21 |
22 | ### Doesn't regex already do this?
23 |
24 | Yes, basically. Here is [a (simple) example](https://regex101.com/r/xt7fVr/1):
25 |
26 | ```
27 | (?:(?:X.*Y)|(?:Y.*X))(.*)(?:Z)
28 | ```
29 |
30 | This tries to match everything before a `Z` and after we've seen both `X` and `Y`, in any order. This is not a complete example, but it shows the similarity.
31 |
32 | The benefit with *pluck* is simplicity. You don't have to worry about escaping the right characters, nor do you need to know any regex syntax (which is not simple). Also, *pluck* is hard-coded for matching this specific kind of pattern simultaneously, so there is no cost for generating a new deterministic finite automaton from multiple regexes.
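To make the regex comparison concrete, here is a minimal Go sketch that runs the expression quoted above through the standard `regexp` package. The helper name `capture` and the sample input are illustrative assumptions; none of this comes from *pluck* itself.

```go
package main

import (
	"fmt"
	"regexp"
)

// pattern is the expression quoted above. X and Y stand in for the
// activators and Z for the deactivator; they are literal placeholders.
var pattern = regexp.MustCompile(`(?:(?:X.*Y)|(?:Y.*X))(.*)(?:Z)`)

// capture is a hypothetical helper (not part of pluck). It returns the text
// found after both X and Y have been seen and before Z, if there is a match.
func capture(s string) (string, bool) {
	m := pattern.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	got, ok := capture("X then Y: hello Z")
	fmt.Printf("%q %v\n", got, ok) // prints ": hello " true
}
```

Note that `.*` is greedy, so with several `Z` characters on a line the capture runs to the last one, and `.` does not match a newline by default. Handling those cases is part of the regex knowledge the paragraph above says you would otherwise need.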
33 |
34 | ### Doesn't cascadia already do this?
35 |
36 | Yes, there is already [a command-line tool](https://github.com/suntong/cascadia) to extract structured information from XML/HTML. There are many benefits to *cascadia*, chiefly that you can do a lot more complex things with structured data. If you don't have highly structured data, *pluck* is advantageous (it extracts from any file). Also, with *pluck* you don't need to learn CSS selectors.
37 |
38 | # Getting Started
39 |
40 | ## Install
41 |
42 | If you have Go 1.7+:
43 |
44 | ```
45 | go get github.com/schollz/pluck
46 | ```
47 |
48 | or just download from the [latest releases](https://github.com/schollz/pluck/releases/latest).
49 |
50 | ## Basic usage
51 |
52 | Let's say you want to find URLs in an HTML file.
53 |
54 | ```bash
55 | $ wget nytimes.com -O nytimes.html
56 | $ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -f nytimes.html
57 | {
58 |     "0": [
59 |         "https://static01.nyt.com/favicon.ico",
60 |         "https://static01.nyt.com/images/icons/ios-ipad-144x144.png",
61 |         "https://static01.nyt.com/images/icons/ios-iphone-114x144.png",
62 |         "https://static01.nyt.com/images/icons/ios-default-homescreen-57x57.png",
63 |         "https://www.nytimes.com",
64 |         "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml",
65 |         "http://mobile.nytimes.com",
66 |         "http://mobile.nytimes.com",
67 |         "https://typeface.nyt.com/css/zam5nzz.css",
68 |         "https://a1.nyt.com/assets/homepage/20170731-135831/css/homepage/styles.css"
69 |     ]
70 | }
71 | ```
72 |
73 | The `-a` flag specifies an *activator* and can be given multiple times. Once all *activators* are found, in order, the following bytes are captured. The `-d` flag specifies the *deactivator*: once it is found, capturing stops and the search starts over. The optional `-l` flag sets a limit; after that many results (`10` in this example), the search stops.
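The activator/deactivator behaviour described above can be sketched in a few lines of Go. This is a simplified illustration, not *pluck*'s actual implementation: the function name `pluckAll`, its signature, and the choice to treat a negative limit as "no limit" are assumptions, and it expects non-empty activators and deactivator.

```go
package main

import (
	"fmt"
	"strings"
)

// pluckAll is a hypothetical helper (not pluck's real API). It walks the text
// byte by byte, waits until every activator has been seen in order, then
// captures bytes until the deactivator appears. A negative limit is assumed
// to mean "no limit".
func pluckAll(text string, activators []string, deactivator string, limit int) []string {
	var results []string
	stage := 0 // number of activators matched so far
	mark := 0  // start of the bytes read since the last state change
	for i := 0; i < len(text); i++ {
		if limit >= 0 && len(results) >= limit {
			break
		}
		window := text[mark : i+1]
		if stage < len(activators) {
			// Still searching: advance when the next activator has just ended.
			if strings.HasSuffix(window, activators[stage]) {
				stage++
				mark = i + 1
			}
			continue
		}
		// Capturing: stop when the deactivator has just ended.
		if strings.HasSuffix(window, deactivator) {
			results = append(results, window[:len(window)-len(deactivator)])
			stage = 0 // reset and begin searching again
			mark = i + 1
		}
	}
	return results
}

func main() {
	html := `<a href="u1"> <a href="u2"> <a href="u3">`
	fmt.Println(pluckAll(html, []string{"<", "href", `"`}, `"`, 2)) // prints [u1 u2]
}
```

The call in `main` mirrors `pluck -a '<' -a 'href' -a '"' -d '"' -l 2` on a tiny input. Rescanning the window on every byte keeps the sketch short; a real streaming implementation would track match progress incrementally.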
74 |
75 |
76 | ## Advanced usage
77 |
78 | ### Parse URLs or Files
79 |
80 | Files can be parsed with `-f FILE`, and URLs can be parsed by using `-u URL` instead.
81 |
82 | ```bash
83 | $ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -u https://nytimes.com
84 | ```
85 |
86 | ### Use Config file
87 |
88 | You can also specify multiple things to pluck, simultaneously, by listing the *activators* and the *deactivator* in a TOML file. For example, let's say we want to parse the ingredients and the title of [a recipe](https://goo.gl/DHmqmv). Make a file `config.toml`:
89 |
90 | ```toml
91 | [[pluck]]
92 | name = "title"
93 | activators = ["