├── .DS_Store ├── .gitignore ├── LICENSE ├── README.md ├── cmd └── goGetJS │ ├── browser.go │ ├── helpers.go │ ├── main.go │ ├── parsers.go │ ├── requests.go │ ├── storage.go │ └── writers.go ├── demo.gif ├── go.mod ├── go.sum └── search.txt /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davemolk/goGetJS/9eae56cd3102aee415eb17a75c016f495102b733/.DS_Store -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | scriptSRC.txt 2 | todo.md 3 | data/ 4 | searchResults/ 5 | debug/ 6 | demo.gif 7 | # Binaries for programs and plugins 8 | *.exe 9 | *.exe~ 10 | *.dll 11 | *.so 12 | *.dylib 13 | 14 | # Test binary, built with `go test -c` 15 | *.test 16 | 17 | # Output of the go coverage tool, specifically when used with LiteIDE 18 | *.out 19 | 20 | # Dependency directories (remove the comment below to include it) 21 | # vendor/ 22 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2022 Dave Molk 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 
14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # goGetJS 2 | [![License](https://img.shields.io/badge/License-MIT-blue.svg)](http://opensource.org/licenses/MIT) 3 | [![Go Report Card](https://goreportcard.com/badge/github.com/davemolk/goGetJS)](https://goreportcard.com/report/github.com/davemolk/goGetJS) 4 | [![contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](https://github.com/davemolk/goGetJS/issues) 5 | 6 | goGetJS extracts, searches, and saves JavaScript files. Includes an optional chromium headless browser (playwright) for dealing with JavaScript-heavy sites. 7 | 8 | ![demo](demo.gif) 9 | 10 | ## Overview 11 | * goGetJS scrapes a given page for script tags, visits each src, and writes the contents to an individual file. 12 | * If a script tag doesn't include an src attribute, goGetJS scrapes everything between the script tags and writes the contents to an individual file. 13 | * All src are also saved to a text file. 14 | * goGetJS (optionally) uses playwright to handle JavaScript-heavy sites and retrieve async scripts. Use -b. 15 | * Add some extra waiting time with -ew to allow the network to settle and grab those longer loading async scripts. 
16 | * Use -term, -regex, and -terms, respectively, to scan each script for a specific word, with a regular expression, or with a list of words (input as a file). 17 | * goGetJS does not follow redirects by default, but this can be toggled with -redirect=true. 18 | 19 | ## Example Usages (use browser and search each script for a list of terms in search.txt) 20 | ``` 21 | go run ./cmd/goGetJS -u https://go.dev -b -terms search.txt 22 | ``` 23 | ``` 24 | echo https://go.dev | goGetJS -b -terms search.txt 25 | ``` 26 | 27 | ## Command-line Options 28 | ``` 29 | Usage of goGetJS: 30 | -b bool 31 | Use chromium headless browser (powered by playwright). Default is false. 32 | -bt int 33 | Timeout for headless browser. Default is 10000 ms. Must also activate browser via -b. 34 | -ew int 35 | Playwright considers a page loaded after the network has been idle for at least 500ms. Use this flag (in ms) to add time. 36 | -proxy string 37 | Proxy to use on requests. 38 | -redirect bool 39 | Allow redirects. Default is false. 40 | -regex string 41 | Parse each script for the supplied regular expression. Any matches will be saved and exported as a json file. 42 | -rt int 43 | Timeout for retries. Default is 1000ms. 44 | -t int 45 | Request timeout (in milliseconds). Default is 5000. 46 | -term string 47 | Parse each script for the supplied word. Any matches will be saved and exported as a json file. 48 | -terms string 49 | Name of .txt file containing a list of search terms (one per line). Any matches will be saved and exported as a json file. 50 | -u string 51 | URL to extract JS files from. 52 | ``` 53 | 54 | ## Installation 55 | First, you'll need to [install go](https://golang.org/doc/install). 56 | 57 | Then run this command to download + compile goGetJS: 58 | ``` 59 | go install github.com/davemolk/goGetJS/cmd/goGetJS@latest 60 | ``` 61 | 62 | ## Additional Notes 63 | * goGetJS names JavaScript files with ```fName := regexp.MustCompile(`[\w-&]+(\.js)?$`)```. 
Most scripts play nice, but those that don't are still saved. Each saved script has its full URL prepended to the file as a comment on the first line. 64 | * Occasionally, an src will link to an empty page. These are automatically retried (set a timeout for these retries with -rt). Typically, these pages are legitimately blank, causing the number of saved files printed to the terminal to be fewer than the number of processed files. Sometimes we're lucky though, and the successful retry will be searched and saved. 65 | 66 | ## Changelog 67 | * **2022-09-15** : Release 1.0. 68 | * **2022-08-26** : Add proxy, redirect, and rt flags. Refactor client creation. Improve error handling throughout. 69 | * **2022-08-20** : Move from %v to %w for handling errors with fmt.Errorf. Move everything to milliseconds. 70 | 71 | ## Support 72 | * Like goGetJS? Use it, star it, and share with your friends! 73 | - Let me know what you're up to so I can feature your work here. 74 | * Want to see a particular feature? Found a bug? Question about usage or documentation? 75 | - Please raise an issue. 76 | * Pull request? 77 | - Please discuss in an issue first. 78 | 79 | ## Built With 80 | * https://github.com/PuerkitoBio/goquery 81 | * https://github.com/playwright-community/playwright-go 82 | 83 | ## License 84 | * goGetJS is released under the MIT license. See [LICENSE](LICENSE) for details. -------------------------------------------------------------------------------- /cmd/goGetJS/browser.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "fmt" 5 | "io" 6 | "net/http" 7 | "strings" 8 | "time" 9 | 10 | "github.com/playwright-community/playwright-go" 11 | ) 12 | 13 | // browser uses a headless browser (chromium via playwright) to scrape a site, waiting until there are no 14 | // network connections for at least 500ms (unless a longer wait is requested with the extraWait flag). 15 | // browser returns an io.Reader and an error. 
16 | func (app *application) browser(url string, browserTimeout *float64, extraWait int, client *http.Client) (io.Reader, error) { 17 | fmt.Println("============================================================") 18 | app.infoLog.Println("initiating playwright browser...") 19 | 20 | pw, err := playwright.Run() 21 | if err != nil { 22 | return nil, fmt.Errorf("start playwright error: %w", err) 23 | } 24 | 25 | browser, err := pw.Chromium.Launch() 26 | if err != nil { 27 | return nil, fmt.Errorf("launch browser error: %w", err) 28 | } 29 | 30 | uAgent := app.randomUA() 31 | context, err := browser.NewContext(playwright.BrowserNewContextOptions{ 32 | UserAgent: playwright.String(uAgent), 33 | }) 34 | if err != nil { 35 | return nil, fmt.Errorf("create browser context error: %w", err) 36 | } 37 | 38 | page, err := context.NewPage() 39 | if err != nil { 40 | return nil, fmt.Errorf("create browser page error: %w", err) 41 | } 42 | 43 | _, err = page.Goto(url, playwright.PageGotoOptions{ 44 | Timeout: browserTimeout, 45 | WaitUntil: playwright.WaitUntilStateNetworkidle, 46 | }) 47 | if err != nil { 48 | return nil, fmt.Errorf("browser navigation error: %w", err) 49 | } 50 | 51 | if extraWait > 0 { 52 | time.Sleep(time.Duration(extraWait) * time.Millisecond) 53 | app.infoLog.Printf("slept for %d milliseconds\n", extraWait) 54 | } 55 | 56 | htmlDoc, err := page.Content() 57 | if err != nil { 58 | return nil, fmt.Errorf("playwright html extraction error: %w", err) 59 | } 60 | 61 | err = browser.Close() 62 | if err != nil { 63 | return nil, fmt.Errorf("close browser error: %w", err) 64 | } 65 | 66 | err = pw.Stop() 67 | if err != nil { 68 | return nil, fmt.Errorf("stop playwright error: %w", err) 69 | } 70 | 71 | app.infoLog.Println("browser finished") 72 | fmt.Println("============================================================") 73 | 74 | return strings.NewReader(htmlDoc), nil 75 | } 76 | -------------------------------------------------------------------------------- 
/cmd/goGetJS/helpers.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "bufio" 5 | "errors" 6 | "fmt" 7 | "net/url" 8 | "os" 9 | "regexp" 10 | ) 11 | 12 | // assertErrorToNil is a simple helper function for error handling. 13 | func (app *application) assertErrorToNil(err error) { 14 | if err != nil { 15 | app.errorLog.Fatal(err) 16 | } 17 | } 18 | 19 | // readInputFile converts the contents of an input text file to a string slice. 20 | func (app *application) readInputFile(n string) ([]string, error) { 21 | var lines []string 22 | f, err := os.Open(n) 23 | if err != nil { 24 | return lines, fmt.Errorf("open input file error: %w", err) 25 | } 26 | defer f.Close() 27 | scanner := bufio.NewScanner(f) 28 | for scanner.Scan() { 29 | lines = append(lines, scanner.Text()) 30 | } 31 | return lines, scanner.Err() 32 | } 33 | 34 | // getInput checks if the user has supplied a url via stdin. 35 | // If no url is found, goGetJS will exit (getInput is only called 36 | // in the event that no url has been supplied via flag.) 37 | func (app *application) getInput() error { 38 | stat, err := os.Stdin.Stat() 39 | if err != nil { 40 | return fmt.Errorf("stdin read error: %w", err) 41 | } 42 | 43 | if (stat.Mode() & os.ModeCharDevice) == 0 { 44 | s := bufio.NewScanner(os.Stdin) 45 | for s.Scan() { 46 | app.config.url = s.Text() 47 | } 48 | } 49 | if app.config.url == "" { 50 | return errors.New("must provide a url") 51 | } 52 | return nil 53 | } 54 | 55 | // getBaseURL takes the url from the user and returns the base url. 
56 | func (app *application) getBaseURL(myUrl string) (string, error) { 57 | u, err := url.Parse(myUrl) 58 | if err != nil { 59 | return "", fmt.Errorf("unable to parse url: %w", err) 60 | } 61 | u.Path = "" 62 | u.RawQuery = "" 63 | u.Fragment = "" 64 | return u.String(), nil 65 | } 66 | 67 | // getQuery checks whether or not the user has used a term flag, a 68 | // regex flag, or a terms flag. If any of these has been submitted, 69 | // the respective input is stored in the query field of the application struct. 70 | func (app *application) getQuery() { 71 | if len(app.config.regex) > 0 { 72 | re := regexp.MustCompile(app.config.regex) 73 | app.query = re 74 | } else if len(app.config.terms) > 0 { 75 | query, err := app.readInputFile(app.config.terms) 76 | if err != nil { 77 | app.errorLog.Fatal(err) 78 | } 79 | app.query = query 80 | } else { 81 | app.query = app.config.term 82 | } 83 | } 84 | -------------------------------------------------------------------------------- /cmd/goGetJS/main.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "flag" 5 | "fmt" 6 | "io" 7 | "log" 8 | "net/http" 9 | "os" 10 | "regexp" 11 | "time" 12 | 13 | "golang.org/x/sync/errgroup" 14 | ) 15 | 16 | type config struct { 17 | browserTimeout float64 18 | extraWait int 19 | proxy string 20 | redirect bool 21 | regex string 22 | retryTimeout int 23 | term string 24 | terms string 25 | timeout int 26 | url string 27 | useBrowser bool 28 | } 29 | 30 | type application struct { 31 | baseURL string 32 | client *http.Client 33 | config config 34 | errorLog *log.Logger 35 | infoLog *log.Logger 36 | query interface{} 37 | retryClient *http.Client 38 | searches *SearchMap 39 | } 40 | 41 | func main() { 42 | var cfg config 43 | 44 | flag.Float64Var(&cfg.browserTimeout, "bt", 10000, "browser timeout (in ms). default 10000") 45 | flag.IntVar(&cfg.extraWait, "ew", 0, "additional wait (in ms) when using a browser. 
default 0") 46 | flag.StringVar(&cfg.proxy, "proxy", "", "enter a proxy to use") 47 | flag.BoolVar(&cfg.redirect, "redirect", false, "allow redirects. default is false") 48 | flag.StringVar(&cfg.regex, "regex", "", "search JavaScript with a regular expression") 49 | flag.IntVar(&cfg.retryTimeout, "rt", 1000, "timeout for retries. default is 1000ms") 50 | flag.StringVar(&cfg.term, "term", "", "search JavaScript for a particular term") 51 | flag.StringVar(&cfg.terms, "terms", "", "upload a file containing a list of search terms") 52 | flag.IntVar(&cfg.timeout, "t", 5000, "timeout (in ms) for request. default 5000") 53 | flag.StringVar(&cfg.url, "u", "", "url for getting JavaScript") 54 | flag.BoolVar(&cfg.useBrowser, "b", false, "use playwright to handle JS-intensive sites. default false") 55 | 56 | flag.Parse() 57 | 58 | start := time.Now() 59 | 60 | errorLog := log.New(os.Stderr, "ERROR\t", log.Ltime|log.Lshortfile) 61 | infoLog := log.New(os.Stdout, "INFO\t", log.Ltime) 62 | searches := NewSearchMap() 63 | 64 | app := &application{ 65 | config: cfg, 66 | errorLog: errorLog, 67 | infoLog: infoLog, 68 | searches: searches, 69 | } 70 | 71 | app.getQuery() 72 | 73 | if app.config.url == "" { 74 | err := app.getInput() 75 | app.assertErrorToNil(err) 76 | } 77 | cfg.url = app.config.url // app.config is a copy of cfg; sync the url in case it arrived via stdin 78 | baseURL, err := app.getBaseURL(cfg.url) 79 | app.assertErrorToNil(err) 80 | app.baseURL = baseURL 81 | 82 | err = os.Mkdir("data", 0755) 83 | app.assertErrorToNil(err) 84 | 85 | if cfg.term != "" || cfg.terms != "" || cfg.regex != "" { 86 | err := os.Mkdir("searchResults", 0755) 87 | app.assertErrorToNil(err) 88 | } 89 | 90 | app.client = app.makeClient(cfg.timeout, cfg.proxy, cfg.redirect) 91 | app.retryClient = app.makeClient(cfg.retryTimeout, cfg.proxy, true) 92 | 93 | var reader io.Reader 94 | 95 | // get reader 96 | switch { 97 | case cfg.useBrowser: 98 | reader, err = app.browser(cfg.url, &cfg.browserTimeout, cfg.extraWait, app.client) 99 | app.assertErrorToNil(err) 100 | default: 101 | resp, err := 
app.makeRequest(cfg.url, app.client) 102 | app.assertErrorToNil(err) 103 | defer resp.Body.Close() 104 | reader = resp.Body 105 | } 106 | 107 | // parse for src, writing javascript files without src 108 | srcs, anonCount, err := app.parseDoc(reader, cfg.url, app.query) 109 | app.assertErrorToNil(err) 110 | 111 | // write src text file 112 | err = app.writeFile(srcs, "scriptSRC.txt") 113 | app.assertErrorToNil(err) 114 | 115 | // handling situations when src doesn't end with .js 116 | fName := regexp.MustCompile(`[\w-&]+(\.js)?$`) 117 | 118 | // extract, search, and write javascript files with src 119 | var g errgroup.Group 120 | for _, src := range srcs { 121 | src := src 122 | g.Go(func() error { 123 | err := app.getJS(app.client, src, app.query, fName) 124 | if err != nil { 125 | return err 126 | } 127 | return nil 128 | }) 129 | } 130 | 131 | counter := anonCount + len(srcs) 132 | 133 | if err := g.Wait(); err != nil { 134 | app.errorLog.Printf("extract/search/write error: %v", err) 135 | counter-- 136 | } 137 | 138 | // save search results (if applicable) 139 | if cfg.term != "" || cfg.terms != "" || cfg.regex != "" { 140 | err = app.writeSearchResults(app.searches.Searches) 141 | app.assertErrorToNil(err) 142 | } 143 | 144 | fmt.Println() 145 | fmt.Println("============================================================") 146 | app.infoLog.Printf("successfully processed %d scripts\n", counter) 147 | app.infoLog.Printf("took %f seconds\n", time.Since(start).Seconds()) 148 | fmt.Println("============================================================") 149 | } 150 | -------------------------------------------------------------------------------- /cmd/goGetJS/parsers.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "errors" 5 | "fmt" 6 | "io" 7 | "net/http" 8 | "os" 9 | "regexp" 10 | "strconv" 11 | "strings" 12 | "sync" 13 | 14 | "github.com/PuerkitoBio/goquery" 15 | ) 16 | 17 | // parseDoc 
searches a page for script tags, returning a string slice of all found src, the 18 | // number found, and any errors. When a script tag does not have an src attribute, parseDoc 19 | // writes the contents between the script tags as an anonymous javascript file. If no src are 20 | // found on the page, parseDoc writes the html to a file to aid in debugging. 21 | func (app *application) parseDoc(r io.Reader, url string, query interface{}) ([]string, int, error) { 22 | var srcs []string 23 | doc, err := goquery.NewDocumentFromReader(r) 24 | if err != nil { 25 | return srcs, 0, fmt.Errorf("goquery doc creation error for %v: %w", url, err) 26 | } 27 | 28 | anonCount := 0 29 | 30 | doc.Find("script").Each(func(i int, s *goquery.Selection) { 31 | // handling scripts with src 32 | if src, ok := s.Attr("src"); ok { 33 | src = strings.TrimSpace(src) 34 | switch { 35 | case strings.HasPrefix(src, "//"): 36 | full := fmt.Sprintf("http:%s", src) 37 | srcs = append(srcs, full) 38 | case strings.HasPrefix(src, "/"): 39 | full := app.baseURL + src 40 | srcs = append(srcs, full) 41 | default: 42 | srcs = append(srcs, src) 43 | } 44 | } else { 45 | // handling scripts without src 46 | script := strings.TrimSpace(s.Text()) 47 | 48 | // write scripts to file 49 | scriptByte := []byte(script) 50 | anonCount++ 51 | scriptName := fmt.Sprintf("anon%s.js", strconv.Itoa(anonCount)) 52 | app.searchScript(query, scriptName, script) 53 | if err := os.WriteFile("data/"+scriptName, scriptByte, 0644); err != nil { 54 | app.errorLog.Printf("could not write %q: %v\n", scriptName, err) 55 | anonCount-- 56 | } 57 | app.infoLog.Printf("writing: %v\n", scriptName) 58 | } 59 | }) 60 | 61 | if len(srcs) != 0 { 62 | return srcs, anonCount, nil 63 | } 64 | 65 | // if no src found, write the page to a file for debugging purposes 66 | html, err := doc.Html() 67 | if err != nil { 68 | return srcs, anonCount, fmt.Errorf("HTML extraction error for %v: %w", url, err) 69 | } 70 | err = app.writePage(html, url) 71 
| if err != nil { 72 | return srcs, anonCount, fmt.Errorf("HTML writing error for %v: %w", url, err) 73 | } 74 | 75 | return srcs, anonCount, errors.New("no src found") 76 | } 77 | 78 | // getJS takes in a url to a javascript file, extracts the contents, and writes them to an individual javascript file. 79 | func (app *application) getJS(client *http.Client, url string, query interface{}, r *regexp.Regexp) error { 80 | app.infoLog.Println("extracting from:", url) 81 | resp, err := app.makeRequest(url, client) 82 | if err != nil { 83 | return fmt.Errorf("request error for %v: %w", url, err) 84 | } 85 | 86 | defer resp.Body.Close() 87 | 88 | body, err := io.ReadAll(resp.Body) 89 | if err != nil { 90 | return fmt.Errorf("resp.Body read error for %v: %w", url, err) 91 | } 92 | 93 | // retry (uses short timeout and allows redirects) 94 | if len(body) == 0 { 95 | app.infoLog.Printf("retrying %v\n", url) 96 | app.quickRetry(url, query, r) 97 | } 98 | 99 | script := string(body) 100 | 101 | app.searchScript(query, url, script) 102 | 103 | if script != "" { 104 | err := app.writeScript(script, url, r) 105 | if err != nil { 106 | return fmt.Errorf("write error for %v: %w", url, err) 107 | } 108 | return nil 109 | } 110 | 111 | return nil 112 | } 113 | 114 | // searchScript takes a query, a url, and the script to be searched, and saves 115 | // any found terms (and the url they were found on) to the SearchMap for 116 | // later writing to a file 117 | func (app *application) searchScript(query interface{}, url, script string) { 118 | app.infoLog.Printf("searching: %v\n", url) 119 | switch q := query.(type) { 120 | case *regexp.Regexp: 121 | savedTerm := make(map[string]bool) 122 | if q.FindAllString(script, -1) != nil { 123 | for _, v := range q.FindAllString(script, -1) { 124 | if savedTerm[v] { 125 | continue 126 | } 127 | savedTerm[v] = true 128 | app.searches.Store(url, v) 129 | } 130 | } 131 | case string: 132 | if q != "" && strings.Contains(script, q) { 133 | 
app.searches.Store(url, q) 134 | } 135 | case []string: 136 | var wg sync.WaitGroup 137 | for _, term := range q { 138 | wg.Add(1) 139 | go func(t string) { 140 | defer wg.Done() 141 | if strings.Contains(script, t) { 142 | app.searches.Store(url, t) 143 | } 144 | }(term) 145 | } 146 | wg.Wait() 147 | default: 148 | app.errorLog.Println("malformed query, please try again") 149 | } 150 | } 151 | -------------------------------------------------------------------------------- /cmd/goGetJS/requests.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "fmt" 5 | "io" 6 | "math/rand" 7 | "net/http" 8 | "net/url" 9 | "regexp" 10 | "time" 11 | ) 12 | 13 | // makeClient takes in a flag-specified timeout and returns an *http.Client that has 14 | // been configured according to the timeout, proxy, and redirect flags. 15 | func (app *application) makeClient(timeout int, proxy string, redirect bool) *http.Client { 16 | switch { 17 | case proxy != "": 18 | parsed, err := url.Parse(proxy) 19 | if err != nil { 20 | app.errorLog.Fatalf("proxy error: %v", err) 21 | } 22 | 23 | tr := http.DefaultTransport.(*http.Transport).Clone() 24 | tr.Proxy = http.ProxyURL(parsed) 25 | 26 | return &http.Client{ 27 | CheckRedirect: app.allowRedirects(redirect), 28 | Timeout: time.Duration(timeout) * time.Millisecond, 29 | Transport: tr, 30 | } 31 | default: 32 | return &http.Client{ 33 | CheckRedirect: app.allowRedirects(redirect), 34 | Timeout: time.Duration(timeout) * time.Millisecond, 35 | } 36 | } 37 | } 38 | 39 | // allowRedirects checks the redirects flag. If the flag is true, allowRedirects 40 | // returns nil and redirects will be allowed. If the flag is false (the default value), 41 | // allowRedirects returns a function for the CheckRedirect field of the http.Client that 42 | // blocks redirects. 
43 | func (app *application) allowRedirects(redirect bool) func(req *http.Request, via []*http.Request) error { 44 | switch { 45 | case redirect: 46 | return nil 47 | default: 48 | return func(req *http.Request, via []*http.Request) error { 49 | app.errorLog.Printf("redirect to %s has been blocked\n", req.URL.String()) 50 | return http.ErrUseLastResponse 51 | } 52 | } 53 | } 54 | 55 | // makeRequest takes in a url and a client, forms a new GET request, sets a random 56 | // user agent, and then returns the response and any errors. 57 | func (app *application) makeRequest(url string, client *http.Client) (*http.Response, error) { 58 | req, err := http.NewRequest("GET", url, nil) 59 | if err != nil { 60 | return nil, fmt.Errorf("request error for %v: %w", url, err) 61 | } 62 | 63 | uAgent := app.randomUA() 64 | req.Header.Set("User-Agent", uAgent) 65 | 66 | resp, err := client.Do(req) 67 | if err != nil { 68 | return nil, fmt.Errorf("response error for %v: %w", url, err) 69 | } 70 | if resp.StatusCode != 200 { 71 | resp.Body.Close() 72 | return nil, fmt.Errorf("status code error for %v: %d", url, resp.StatusCode) 73 | } 74 | return resp, nil 75 | } 76 | 77 | // quickRetry uses a short timeout and allows redirects. It's called within getJS 78 | // to retry any src that link to a page without text. 
79 | func (app *application) quickRetry(url string, query interface{}, r *regexp.Regexp) { 80 | resp, err := app.makeRequest(url, app.retryClient) 81 | if err != nil { 82 | app.errorLog.Printf("retry request error for %v: %v\n", url, err) 83 | return 84 | } 85 | defer resp.Body.Close() 86 | 87 | b, err := io.ReadAll(resp.Body) 88 | if err != nil { 89 | app.errorLog.Printf("retry read error for %v: %v\n", url, err) 90 | return 91 | } 92 | script := string(b) 93 | 94 | if script != "" { 95 | app.infoLog.Printf("retry success: %v\n", url) 96 | app.searchScript(query, url, script) 97 | err := app.writeScript(script, url, r) 98 | if err != nil { 99 | app.errorLog.Printf("retry write error for %v: %v\n", url, err) 100 | return 101 | } 102 | } 103 | } 104 | 105 | // randomUA returns a user agent randomly drawn from six possibilities. 106 | func (app *application) randomUA() string { 107 | userAgents := app.getUA() 108 | r := rand.New(rand.NewSource(time.Now().UnixNano())) 109 | rando := r.Intn(len(userAgents)) 110 | 111 | return userAgents[rando] 112 | } 113 | 114 | // getUA returns a string slice of six user agents. 
115 | func (app *application) getUA() []string { 116 | return []string{ 117 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36", 118 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko)", 119 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 (KHTML, like Gecko) Version/9.1.2 Safari/601.7.7", 120 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36", 121 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36", 122 | "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:99.0) Gecko/20100101 Firefox/99.0", 123 | } 124 | } 125 | -------------------------------------------------------------------------------- /cmd/goGetJS/storage.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import "sync" 4 | 5 | // SearchMap is a Mutex-protected struct that stores search results 6 | // in the following format: script url: found search term(s) 7 | type SearchMap struct { 8 | mu sync.Mutex 9 | Searches map[string][]string 10 | } 11 | 12 | // NewSearchMap creates a new SearchMap and returns a pointer to it. 13 | func NewSearchMap() *SearchMap { 14 | return &SearchMap{ 15 | Searches: make(map[string][]string), 16 | } 17 | } 18 | 19 | // Store receives a script url and the term found on that page 20 | // and records it in the SearchMap. 
21 | func (s *SearchMap) Store(url, term string) { 22 | s.mu.Lock() 23 | s.Searches[url] = append(s.Searches[url], term) 24 | s.mu.Unlock() 25 | } 26 | -------------------------------------------------------------------------------- /cmd/goGetJS/writers.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "encoding/json" 5 | "fmt" 6 | "net/url" 7 | "os" 8 | "regexp" 9 | "strings" 10 | ) 11 | 12 | // writeScript writes a passed in string of javascript to an individual file. 13 | func (app *application) writeScript(script, url string, fileNamer *regexp.Regexp) error { 14 | fName := fileNamer.FindString(url) 15 | url = "// " + url + "\n" 16 | urlByte := []byte(url) 17 | scriptByte := []byte(script) 18 | data := append(urlByte, scriptByte...) 19 | if err := os.WriteFile("data/"+fName, data, 0644); err != nil { 20 | return fmt.Errorf("cannot write %q: %w", fName, err) 21 | } 22 | return nil 23 | } 24 | 25 | // writeFile takes a string slice of src and writes the contents to a text file. 26 | func (app *application) writeFile(scripts []string, fName string) error { 27 | f, err := os.Create(fName) 28 | if err != nil { 29 | return fmt.Errorf("file creation error: %w", err) 30 | } 31 | defer f.Close() 32 | for _, v := range scripts { 33 | if _, err := fmt.Fprintln(f, v); err != nil { 34 | return fmt.Errorf("file write error: %w", err) 35 | } 36 | } 37 | return nil 38 | } 39 | 40 | // writePage takes the html of a page (as a string) and the url and writes 41 | // the contents to a text file. 
42 | func (app *application) writePage(s, myURL string) error { 43 | var n string // for naming purposes 44 | 45 | myURL = strings.TrimSuffix(myURL, "/") 46 | u, err := url.Parse(myURL) 47 | if err != nil { 48 | return fmt.Errorf("parse error for %v: %w", myURL, err) 49 | } 50 | 51 | switch { 52 | case u.Path != "": 53 | re := regexp.MustCompile(`[-\w\?\=]+/?$`) 54 | n = re.FindString(u.Path) 55 | n = strings.TrimSuffix(n, "/") 56 | default: 57 | n = u.Host 58 | n = strings.ReplaceAll(n, ".", "") 59 | } 60 | 61 | err = os.Mkdir("debug", 0755) 62 | if err != nil { 63 | return fmt.Errorf("debug folder creation error: %w", err) 64 | } 65 | f, err := os.Create("debug/" + n + ".html") 66 | if err != nil { 67 | return fmt.Errorf("debug file creation error for %v: %w", myURL, err) 68 | } 69 | defer f.Close() 70 | _, err = f.WriteString(s) 71 | if err != nil { 72 | return fmt.Errorf("write file error for %v: %w", myURL, err) 73 | } 74 | err = f.Sync() 75 | if err != nil { 76 | return fmt.Errorf("sync error for %v: %w", myURL, err) 77 | } 78 | return nil 79 | } 80 | 81 | // writeSearchResults takes in the search results (the Searches map of a SearchMap) 82 | // and writes them as a json file in the searchResults directory, which main creates. 
83 | func (app *application) writeSearchResults(data map[string][]string) error { 84 | var n string 85 | 86 | switch app.query.(type) { 87 | case *regexp.Regexp: 88 | n = "regexResults.json" 89 | case string: 90 | n = "termResults.json" 91 | case []string: 92 | n = "termsResults.json" 93 | } 94 | 95 | sr, err := json.Marshal(data) 96 | if err != nil { 97 | return fmt.Errorf("search results marshal error: %w", err) 98 | } 99 | f, err := os.Create(fmt.Sprintf("searchResults/%v", n)) 100 | if err != nil { 101 | return fmt.Errorf("search results file error: %w", err) 102 | } 103 | defer f.Close() 104 | _, err = f.Write(sr) 105 | if err != nil { 106 | return fmt.Errorf("search results write error: %w", err) 107 | } 108 | err = f.Sync() 109 | if err != nil { 110 | return fmt.Errorf("search results sync error: %w", err) 111 | } 112 | return nil 113 | } 114 | -------------------------------------------------------------------------------- /demo.gif: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/davemolk/goGetJS/9eae56cd3102aee415eb17a75c016f495102b733/demo.gif -------------------------------------------------------------------------------- /go.mod: -------------------------------------------------------------------------------- 1 | module github.com/davemolk/goGetJS 2 | 3 | go 1.18 4 | 5 | require ( 6 | github.com/PuerkitoBio/goquery v1.8.0 7 | github.com/playwright-community/playwright-go v0.2000.1 8 | golang.org/x/sync v0.0.0-20220513210516-0976fa681c29 9 | ) 10 | 11 | require ( 12 | github.com/andybalholm/cascadia v1.3.1 // indirect 13 | github.com/danwakefield/fnmatch v0.0.0-20160403171240-cbb64ac3d964 // indirect 14 | github.com/go-stack/stack v1.8.1 // indirect 15 | golang.org/x/net v0.0.0-20210916014120-12bc252f5db8 // indirect 16 | gopkg.in/square/go-jose.v2 v2.6.0 // indirect 17 | gopkg.in/yaml.v3 v3.0.0 // indirect 18 | ) 19 | 
-------------------------------------------------------------------------------- /go.sum: -------------------------------------------------------------------------------- 1 | github.com/PuerkitoBio/goquery v1.8.0 h1:PJTF7AmFCFKk1N6V6jmKfrNH9tV5pNE6lZMkG0gta/U= 2 | github.com/PuerkitoBio/goquery v1.8.0/go.mod h1:ypIiRMtY7COPGk+I/YbZLbxsxn9g5ejnI2HSMtkjZvI= 3 | github.com/andybalholm/cascadia v1.3.1 h1:nhxRkql1kdYCc8Snf7D5/D3spOX+dBgjA6u8x004T2c= 4 | github.com/andybalholm/cascadia v1.3.1/go.mod h1:R4bJ1UQfqADjvDa4P6HZHLh/3OxWWEqc0Sk8XGwHqvA= 5 | github.com/danwakefield/fnmatch v0.0.0-20160403171240-cbb64ac3d964 h1:y5HC9v93H5EPKqaS1UYVg1uYah5Xf51mBfIoWehClUQ= 6 | github.com/danwakefield/fnmatch v0.0.0-20160403171240-cbb64ac3d964/go.mod h1:Xd9hchkHSWYkEqJwUGisez3G1QY8Ryz0sdWrLPMGjLk= 7 | github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= 8 | github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= 9 | github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= 10 | github.com/go-stack/stack v1.8.1 h1:ntEHSVwIt7PNXNpgPmVfMrNhLtgjlmnZha2kOpuRiDw= 11 | github.com/go-stack/stack v1.8.1/go.mod h1:dcoOX6HbPZSZptuspn9bctJ+N/CnF5gGygcUP3XYfe4= 12 | github.com/gorilla/websocket v1.4.2/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= 13 | github.com/h2non/filetype v1.1.1/go.mod h1:319b3zT68BvV+WRj7cwy856M2ehB3HqNOt6sy1HndBY= 14 | github.com/playwright-community/playwright-go v0.2000.1 h1:2JViSHpJQ/UL/PO1Gg6gXV5IcXAAsoBJ3KG9L3wKXto= 15 | github.com/playwright-community/playwright-go v0.2000.1/go.mod h1:1y9cM9b9dVHnuRWzED1KLM7FtbwTJC8ibDjI6MNqewU= 16 | github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= 17 | github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= 18 | github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= 19 | github.com/stretchr/testify v1.7.0 
h1:nwc3DEeHmmLAfoZucVR881uASk0Mfjw8xYJ99tb5CcY= 20 | github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= 21 | golang.org/x/net v0.0.0-20210916014120-12bc252f5db8 h1:/6y1LfuqNuQdHAm0jjtPtgRcxIxjVZgm5OTu8/QhZvk= 22 | golang.org/x/net v0.0.0-20210916014120-12bc252f5db8/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y= 23 | golang.org/x/sync v0.0.0-20220513210516-0976fa681c29 h1:w8s32wxx3sY+OjLlv9qltkLU5yvJzxjjgiHWLjdIcw4= 24 | golang.org/x/sync v0.0.0-20220513210516-0976fa681c29/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= 25 | golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= 26 | golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= 27 | golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= 28 | golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= 29 | golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= 30 | gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= 31 | gopkg.in/square/go-jose.v2 v2.6.0 h1:NGk74WTnPKBNUhNzQX7PYcTLUjoq7mzKk2OKbvwk2iI= 32 | gopkg.in/square/go-jose.v2 v2.6.0/go.mod h1:M9dMgbHiYLoDGQrXy7OpJDJWiKiU//h+vD76mk0e1AI= 33 | gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= 34 | gopkg.in/yaml.v3 v3.0.0 h1:hjy8E9ON/egN1tAYqKb61G10WtihqetD4sz2H+8nIeA= 35 | gopkg.in/yaml.v3 v3.0.0/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= 36 | -------------------------------------------------------------------------------- /search.txt: -------------------------------------------------------------------------------- 1 | func 2 | () => 3 | Math --------------------------------------------------------------------------------
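The three terms in search.txt above are matched against each script by plain substring containment (the `[]string` case of searchScript uses `strings.Contains`). Below is a minimal, sequential sketch of that matching. `matchTerms` is a hypothetical helper, not a function in goGetJS, and unlike the source it skips empty terms, since `strings.Contains(s, "")` is always true and a blank line in the terms file would otherwise match every script.

```go
package main

import (
	"fmt"
	"strings"
)

// matchTerms returns the terms that occur in script as plain substrings,
// in the order the terms were given. Empty terms are skipped (an assumption
// of this sketch; the source's []string case does not filter them).
func matchTerms(script string, terms []string) []string {
	var found []string
	for _, t := range terms {
		if t != "" && strings.Contains(script, t) {
			found = append(found, t)
		}
	}
	return found
}

func main() {
	// The same three terms as search.txt.
	terms := []string{"func", "() =>", "Math"}
	script := "const biggest = () => Math.max(1, 2);"
	fmt.Println(matchTerms(script, terms)) // [() => Math]
}
```

Matching is case-sensitive and literal, so `func` does not match `Function` and `() =>` must appear with exactly that spacing.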