├── .gitignore ├── .travis.yml ├── CONTRIBUTING.md ├── LICENSE ├── README.md ├── go.mod ├── sitemap.go ├── sitemap_impl.go ├── sitemap_test.go ├── sitemap_types.go └── testdata ├── sitemap-index.golden ├── sitemap-index.xml ├── sitemap.golden └── sitemap.xml /.gitignore: -------------------------------------------------------------------------------- 1 | # Binaries for programs and plugins 2 | *.exe 3 | *.exe~ 4 | *.dll 5 | *.so 6 | *.dylib 7 | 8 | # Test binary, build with `go test -c` 9 | *.test 10 | 11 | # Output of the go coverage tool, specifically when used with LiteIDE 12 | *.out 13 | -------------------------------------------------------------------------------- /.travis.yml: -------------------------------------------------------------------------------- 1 | language: go 2 | 3 | before_install: 4 | - go get -t -v ./... 5 | 6 | script: 7 | - go test -race -coverprofile=coverage.txt -covermode=atomic -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing 2 | 3 | This is an open source project. We appreciate your help! 4 | 5 | ## Filing issues 6 | 7 | Please check the existing issues before filing a new one. 8 | 9 | When [filing an issue](https://github.com/oxffaa/gopher-parse-sitemap/issues/new), make sure to answer these four questions: 10 | 11 | 1. What version of Go (`go version`) are you using? 12 | 2. What code did you run? 13 | 3. What did you expect to see? 14 | 4. What did you see instead? 15 | 16 | ## Contributing code 17 | 18 | Let us know if you are interested in working on an issue by leaving a comment on the issue in GitHub. This helps avoid multiple people unknowingly working on the same issue. 19 | 20 | Please read the [Contribution Guidelines](https://golang.org/doc/contribute.html) before sending patches. 21 | 22 | In general, follow the ["fork-and-pull" Git workflow](https://github.com/susam/gitpr): 23 | 24 | 1. 
Fork the repository to your own GitHub account 25 | 2. Clone the project to your machine 26 | 3. Create a branch locally with a succinct but descriptive name 27 | 4. Commit changes to the branch 28 | 5. Follow any formatting and testing guidelines specific to this repo 29 | 6. Push changes to your fork 30 | 7. Open a PR in our repository and follow the PR template so that we can efficiently review the changes. 31 | 32 | Consult [GitHub Help] for more information on using pull requests. 33 | 34 | [GitHub Help]: https://help.github.com/articles/about-pull-requests/ 35 | 36 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2019 Akulyakov Artem (akulyakov.artem@gmail.com, https://github.com/oxffaa) 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # gopher-parse-sitemap 2 | 3 | [![Build Status](https://travis-ci.org/oxffaa/gopher-parse-sitemap.svg?branch=master)](https://travis-ci.org/oxffaa/gopher-parse-sitemap) 4 | 5 | A highly efficient Go library for parsing large sitemaps without high memory usage. The parser is written in Go and has no external dependencies. See https://www.sitemaps.org/ for more information about the sitemap format. 6 | 7 | ## Why yet another sitemaps parsing library? 8 | 9 | From time to time you need to parse really huge sitemaps. If you just unmarshal the whole file into a slice of structures, memory usage is high and the application can crash with an out-of-memory (OOM) error. 10 | 11 | 12 | The solution is to handle sitemap entries on the fly. That is, read one entry, consume it, and repeat while there are unhandled items in the sitemap. 13 | 14 | ```golang 15 | err := sitemap.ParseFromFile("./testdata/sitemap.xml", func(e sitemap.Entry) error { 16 | _, printErr := fmt.Println(e.GetLocation()); return printErr 17 | }) 18 | ``` 19 | 20 | ### I only need to parse small and medium-sized sitemaps. Should I use this library? 21 | 22 | Yes. Of course, you can just load a sitemap into memory. 23 | 24 | ```golang 25 | result := make([]string, 0) 26 | err := sitemap.ParseIndexFromFile("./testdata/sitemap-index.xml", func(e sitemap.IndexEntry) error { 27 | result = append(result, e.GetLocation()) 28 | return nil 29 | }) 30 | ``` 31 | 32 | But if you are pretty sure that you don't need to handle large sitemaps, it may be better to choose a library with a simpler and more suitable API. In that case, you can try projects like https://github.com/yterajima/go-sitemap, https://github.com/snabb/sitemap, and https://github.com/decaseal/go-sitemap-parser. 
33 | 34 | ## Install 35 | 36 | Installation is pretty easy, just do: 37 | 38 | ```bash 39 | go get -u github.com/oxffaa/gopher-parse-sitemap 40 | ``` 41 | 42 | After that import it: 43 | ```golang 44 | import "github.com/oxffaa/gopher-parse-sitemap" 45 | ``` 46 | 47 | Well done, you can start to create something awesome. 48 | 49 | ## Documentation 50 | 51 | Please, see [here](https://godoc.org/github.com/oxffaa/gopher-parse-sitemap) for documentation. 52 | -------------------------------------------------------------------------------- /go.mod: -------------------------------------------------------------------------------- 1 | module github.com/oxffaa/gopher-parse-sitemap 2 | 3 | go 1.13 4 | -------------------------------------------------------------------------------- /sitemap.go: -------------------------------------------------------------------------------- 1 | // Package sitemap provides primitives for high effective parsing of huge 2 | // sitemap files. 3 | package sitemap 4 | 5 | import ( 6 | "encoding/xml" 7 | "io" 8 | "net/http" 9 | "os" 10 | "time" 11 | ) 12 | 13 | // Frequency is a type alias for change frequency. 14 | type Frequency = string 15 | 16 | // Change frequency constants set describes how frequently a page is changed. 17 | const ( 18 | Always Frequency = "always" // A page is changed always 19 | Hourly Frequency = "hourly" // A page is changed every hour 20 | Daily Frequency = "daily" // A page is changed every day 21 | Weekly Frequency = "weekly" // A page is changed every week 22 | Monthly Frequency = "monthly" // A page is changed every month 23 | Yearly Frequency = "yearly" // A page is changed every year 24 | Never Frequency = "never" // A page is changed never 25 | ) 26 | 27 | // Entry is an interface describes an element \ an URL in the sitemap file. 28 | // Keep in mind. It is implemented by a totally immutable entity so you should 29 | // minimize calls count because it can produce additional memory allocations. 
30 | // 31 | // GetLocation returns URL of the page. 32 | // GetLocation must return a non-nil and not empty string value. 33 | // 34 | // GetLastModified parses and returns date and time of last modification of the page. 35 | // GetLastModified can return nil or a valid time.Time instance. 36 | // Be careful. Each call return new time.Time instance. 37 | // 38 | // GetChangeFrequency returns string value indicates how frequent the page is changed. 39 | // GetChangeFrequency returns non-nil string value. See Frequency consts set. 40 | // 41 | // GetPriority return priority of the page. 42 | // The valid value is between 0.0 and 1.0, the default value is 0.5. 43 | // 44 | // You shouldn't implement this interface in your types. 45 | type Entry interface { 46 | GetLocation() string 47 | GetLastModified() *time.Time 48 | GetChangeFrequency() Frequency 49 | GetPriority() float32 50 | } 51 | 52 | // IndexEntry is an interface describes an element \ an URL in a sitemap index file. 53 | // Keep in mind. It is implemented by a totally immutable entity so you should 54 | // minimize calls count because it can produce additional memory allocations. 55 | // 56 | // GetLocation returns URL of a sitemap file. 57 | // GetLocation must return a non-nil and not empty string value. 58 | // 59 | // GetLastModified parses and returns date and time of last modification of sitemap. 60 | // GetLastModified can return nil or a valid time.Time instance. 61 | // Be careful. Each call return new time.Time instance. 62 | // 63 | // You shouldn't implement this interface in your types. 64 | type IndexEntry interface { 65 | GetLocation() string 66 | GetLastModified() *time.Time 67 | } 68 | 69 | // EntryConsumer is a type represents consumer of parsed sitemaps entries 70 | type EntryConsumer func(Entry) error 71 | 72 | // Parse parses data which provides by the reader and for each sitemap 73 | // entry calls the consumer's function. 
74 | func Parse(reader io.Reader, consumer EntryConsumer) error { 75 | return parseLoop(reader, func(d *xml.Decoder, se *xml.StartElement) error { 76 | return entryParser(d, se, consumer) 77 | }) 78 | } 79 | 80 | // ParseFromFile reads sitemap from a file, parses it and for each sitemap 81 | // entry calls the consumer's function. 82 | func ParseFromFile(sitemapPath string, consumer EntryConsumer) error { 83 | sitemapFile, err := os.OpenFile(sitemapPath, os.O_RDONLY, os.ModeExclusive) 84 | if err != nil { 85 | return err 86 | } 87 | defer sitemapFile.Close() 88 | 89 | return Parse(sitemapFile, consumer) 90 | } 91 | 92 | // ParseFromSite downloads sitemap from a site, parses it and for each sitemap 93 | // entry calls the consumer's function. 94 | func ParseFromSite(url string, consumer EntryConsumer) error { 95 | res, err := http.Get(url) 96 | if err != nil { 97 | return err 98 | } 99 | defer res.Body.Close() 100 | 101 | return Parse(res.Body, consumer) 102 | } 103 | 104 | // IndexEntryConsumer is a type represents consumer of parsed sitemaps indexes entries 105 | type IndexEntryConsumer func(IndexEntry) error 106 | 107 | // ParseIndex parses data which provides by the reader and for each sitemap index 108 | // entry calls the consumer's function. 109 | func ParseIndex(reader io.Reader, consumer IndexEntryConsumer) error { 110 | return parseLoop(reader, func(d *xml.Decoder, se *xml.StartElement) error { 111 | return indexEntryParser(d, se, consumer) 112 | }) 113 | } 114 | 115 | // ParseIndexFromFile reads sitemap index from a file, parses it and for each sitemap 116 | // index entry calls the consumer's function. 
117 | func ParseIndexFromFile(sitemapPath string, consumer IndexEntryConsumer) error { 118 | sitemapFile, err := os.OpenFile(sitemapPath, os.O_RDONLY, os.ModeExclusive) 119 | if err != nil { 120 | return err 121 | } 122 | defer sitemapFile.Close() 123 | 124 | return ParseIndex(sitemapFile, consumer) 125 | } 126 | 127 | // ParseIndexFromSite downloads sitemap index from a site, parses it and for each sitemap 128 | // index entry calls the consumer's function. 129 | func ParseIndexFromSite(sitemapURL string, consumer IndexEntryConsumer) error { 130 | res, err := http.Get(sitemapURL) 131 | if err != nil { 132 | return err 133 | } 134 | defer res.Body.Close() 135 | 136 | return ParseIndex(res.Body, consumer) 137 | } 138 | -------------------------------------------------------------------------------- /sitemap_impl.go: -------------------------------------------------------------------------------- 1 | package sitemap 2 | 3 | import ( 4 | "encoding/xml" 5 | "io" 6 | ) 7 | 8 | func entryParser(decoder *xml.Decoder, se *xml.StartElement, consume EntryConsumer) error { 9 | if se.Name.Local == "url" { 10 | entry := newSitemapEntry() 11 | 12 | decodeError := decoder.DecodeElement(entry, se) 13 | if decodeError != nil { 14 | return decodeError 15 | } 16 | 17 | consumerError := consume(entry) 18 | if consumerError != nil { 19 | return consumerError 20 | } 21 | } 22 | 23 | return nil 24 | } 25 | 26 | func indexEntryParser(decoder *xml.Decoder, se *xml.StartElement, consume IndexEntryConsumer) error { 27 | if se.Name.Local == "sitemap" { 28 | entry := new(sitemapIndexEntry) 29 | 30 | decodeError := decoder.DecodeElement(entry, se) 31 | if decodeError != nil { 32 | return decodeError 33 | } 34 | 35 | consumerError := consume(entry) 36 | if consumerError != nil { 37 | return consumerError 38 | } 39 | } 40 | 41 | return nil 42 | } 43 | 44 | type elementParser func(*xml.Decoder, *xml.StartElement) error 45 | 46 | func parseLoop(reader io.Reader, parser elementParser) error { 47 | 
decoder := xml.NewDecoder(reader) 48 | 49 | for { 50 | t, tokenError := decoder.Token() 51 | 52 | if tokenError == io.EOF { 53 | break 54 | } else if tokenError != nil { 55 | return tokenError 56 | } 57 | 58 | se, ok := t.(xml.StartElement) 59 | if !ok { 60 | continue 61 | } 62 | 63 | parserError := parser(decoder, &se) 64 | if parserError != nil { 65 | return parserError 66 | } 67 | } 68 | 69 | return nil 70 | } 71 | -------------------------------------------------------------------------------- /sitemap_test.go: -------------------------------------------------------------------------------- 1 | package sitemap 2 | 3 | import ( 4 | "errors" 5 | "fmt" 6 | "io/ioutil" 7 | "strings" 8 | "testing" 9 | "time" 10 | ) 11 | 12 | /* 13 | * Examples 14 | */ 15 | func ExampleParseFromFile() { 16 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error { 17 | fmt.Println(e.GetLocation()) 18 | return nil 19 | }) 20 | if err != nil { 21 | panic(err) 22 | } 23 | } 24 | 25 | func ExampleParseIndexFromFile() { 26 | result := make([]string, 0, 0) 27 | err := ParseIndexFromFile("./testdata/sitemap-index.xml", func(e IndexEntry) error { 28 | result = append(result, e.GetLocation()) 29 | return nil 30 | }) 31 | if err != nil { 32 | panic(err) 33 | } 34 | } 35 | 36 | /* 37 | * Public API tests 38 | */ 39 | func TestParseSitemap(t *testing.T) { 40 | var ( 41 | counter int 42 | sb strings.Builder 43 | ) 44 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error { 45 | counter++ 46 | 47 | fmt.Fprintln(&sb, e.GetLocation()) 48 | lastmod := e.GetLastModified() 49 | if lastmod != nil { 50 | fmt.Fprintln(&sb, lastmod.Format(time.RFC3339)) 51 | } 52 | fmt.Fprintln(&sb, e.GetChangeFrequency()) 53 | fmt.Fprintln(&sb, e.GetPriority()) 54 | 55 | return nil 56 | }) 57 | 58 | if err != nil { 59 | t.Errorf("Parsing failed with error %s", err) 60 | } 61 | 62 | if counter != 4 { 63 | t.Errorf("Expected 4 elements, but given only %d", counter) 64 | } 65 | 66 | expected, err := 
ioutil.ReadFile("./testdata/sitemap.golden") 67 | if err != nil { 68 | t.Errorf("Can't read golden file due to %s", err) 69 | } 70 | 71 | if sb.String() != string(expected) { 72 | t.Error("Unexpected result") 73 | } 74 | } 75 | 76 | func TestParseSitemap_BreakingOnError(t *testing.T) { 77 | var counter = 0 78 | breakErr := errors.New("break error") 79 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error { 80 | counter++ 81 | return breakErr 82 | }) 83 | 84 | if counter != 1 { 85 | t.Error("Error didn't break parsing") 86 | } 87 | 88 | if breakErr != err { 89 | t.Error("If consumer failed, ParseSitemap should return consumer error") 90 | } 91 | } 92 | 93 | func TestParseSitemapIndex(t *testing.T) { 94 | var ( 95 | counter int 96 | sb strings.Builder 97 | ) 98 | err := ParseIndexFromFile("./testdata/sitemap-index.xml", func(e IndexEntry) error { 99 | counter++ 100 | 101 | fmt.Fprintln(&sb, e.GetLocation()) 102 | lastmod := e.GetLastModified() 103 | if lastmod != nil { 104 | fmt.Fprintln(&sb, lastmod.Format(time.RFC3339)) 105 | } 106 | 107 | return nil 108 | }) 109 | 110 | if err != nil { 111 | t.Errorf("Parsing failed with error %s", err) 112 | } 113 | 114 | if counter != 3 { 115 | t.Errorf("Expected 3 elements, but given only %d", counter) 116 | } 117 | 118 | expected, err := ioutil.ReadFile("./testdata/sitemap-index.golden") 119 | if err != nil { 120 | t.Errorf("Can't read golden file due to %s", err) 121 | } 122 | 123 | if sb.String() != string(expected) { 124 | t.Error("Unexpected result") 125 | } 126 | } 127 | 128 | /* 129 | * Private API tests 130 | */ 131 | 132 | func TestParseShortDateTime(t *testing.T) { 133 | res := parseDateTime("2015-05-07") 134 | if res == nil { 135 | t.Error("Date time wasn't parsed") 136 | return 137 | } 138 | if res.Year() != 2015 || res.Month() != 05 || res.Day() != 07 { 139 | t.Errorf("Date was parsed wrong %s", res.Format(time.RFC3339)) 140 | } 141 | } 142 | 
-------------------------------------------------------------------------------- /sitemap_types.go: -------------------------------------------------------------------------------- 1 | package sitemap 2 | 3 | import "time" 4 | 5 | type sitemapEntry struct { 6 | Location string `xml:"loc"` 7 | LastModified string `xml:"lastmod,omitempty"` 8 | ParsedLastModified *time.Time 9 | ChangeFrequency Frequency `xml:"changefreq,omitempty"` 10 | Priority float32 `xml:"priority,omitempty"` 11 | } 12 | 13 | func newSitemapEntry() *sitemapEntry { 14 | return &sitemapEntry{ChangeFrequency: Always, Priority: 0.5} 15 | } 16 | 17 | func (e *sitemapEntry) GetLocation() string { 18 | return e.Location 19 | } 20 | 21 | func (e *sitemapEntry) GetLastModified() *time.Time { 22 | if e.ParsedLastModified == nil && e.LastModified != "" { 23 | e.ParsedLastModified = parseDateTime(e.LastModified) 24 | } 25 | return e.ParsedLastModified 26 | } 27 | 28 | func (e *sitemapEntry) GetChangeFrequency() Frequency { 29 | return e.ChangeFrequency 30 | } 31 | 32 | func (e *sitemapEntry) GetPriority() float32 { 33 | return e.Priority 34 | } 35 | 36 | type sitemapIndexEntry struct { 37 | Location string `xml:"loc"` 38 | LastModified string `xml:"lastmod,omitempty"` 39 | ParsedLastModified *time.Time 40 | } 41 | 42 | func newSitemapIndexEntry() *sitemapIndexEntry { 43 | return &sitemapIndexEntry{} 44 | } 45 | 46 | func (e *sitemapIndexEntry) GetLocation() string { 47 | return e.Location 48 | } 49 | 50 | func (e *sitemapIndexEntry) GetLastModified() *time.Time { 51 | if e.ParsedLastModified == nil && e.LastModified != "" { 52 | e.ParsedLastModified = parseDateTime(e.LastModified) 53 | } 54 | return e.ParsedLastModified 55 | } 56 | 57 | func parseDateTime(value string) *time.Time { 58 | if value == "" { 59 | return nil 60 | } 61 | 62 | t, err := time.Parse(time.RFC3339, value) 63 | if err != nil { 64 | // second chance 65 | // try parse as short format 66 | t, err = time.Parse("2006-01-02", value) 67 | if err 
!= nil { 68 | return nil 69 | } 70 | } 71 | 72 | return &t 73 | } 74 | -------------------------------------------------------------------------------- /testdata/sitemap-index.golden: -------------------------------------------------------------------------------- 1 | http://www.example.com/sitemap1.xml.gz 2 | 2004-10-01T18:23:17Z 3 | http://www.example.com/sitemap2.xml.gz 4 | 2005-01-01T00:00:00Z 5 | http://www.example.com/sitemap3.xml.gz 6 | -------------------------------------------------------------------------------- /testdata/sitemap-index.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | http://www.example.com/sitemap1.xml.gz 5 | 2004-10-01T18:23:17+00:00 6 | 7 | 8 | http://www.example.com/sitemap2.xml.gz 9 | 2005-01-01 10 | 11 | 12 | http://www.example.com/sitemap3.xml.gz 13 | 14 | -------------------------------------------------------------------------------- /testdata/sitemap.golden: -------------------------------------------------------------------------------- 1 | http://HOST/ 2 | always 3 | 0.5 4 | http://HOST/tools/ 5 | 2015-05-07T19:13:09+09:00 6 | always 7 | 0.5 8 | http://HOST/contribution-to-oss/ 9 | 2015-05-07T00:00:00Z 10 | monthly 11 | 0.5 12 | http://HOST/page-1/ 13 | 2015-05-07T19:13:09+09:00 14 | monthly 15 | 0.9 16 | -------------------------------------------------------------------------------- /testdata/sitemap.xml: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | http://HOST/ 5 | 6 | 7 | http://HOST/tools/ 8 | 2015-05-07T19:13:09+09:00 9 | 10 | 11 | http://HOST/contribution-to-oss/ 12 | 2015-05-07 13 | monthly 14 | 15 | 16 | http://HOST/page-1/ 17 | 2015-05-07T19:13:09+09:00 18 | monthly 19 | 0.9 20 | 21 | --------------------------------------------------------------------------------