├── .gitignore
├── .travis.yml
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── go.mod
├── sitemap.go
├── sitemap_impl.go
├── sitemap_test.go
├── sitemap_types.go
└── testdata
    ├── sitemap-index.golden
    ├── sitemap-index.xml
    ├── sitemap.golden
    └── sitemap.xml
/.gitignore:
--------------------------------------------------------------------------------
1 | # Binaries for programs and plugins
2 | *.exe
3 | *.exe~
4 | *.dll
5 | *.so
6 | *.dylib
7 |
8 | # Test binary, build with `go test -c`
9 | *.test
10 |
11 | # Output of the go coverage tool, specifically when used with LiteIDE
12 | *.out
13 |
--------------------------------------------------------------------------------
/.travis.yml:
--------------------------------------------------------------------------------
1 | language: go
2 |
3 | before_install:
4 | - go get -t -v ./...
5 |
6 | script:
7 | - go test -race -coverprofile=coverage.txt -covermode=atomic
--------------------------------------------------------------------------------
/CONTRIBUTING.md:
--------------------------------------------------------------------------------
1 | # Contributing
2 |
3 | This is an open source project. We appreciate your help!
4 |
5 | ## Filing issues
6 |
7 | Please check the existing issues before filing a new one.
8 |
9 | When [filing an issue](https://github.com/oxffaa/gopher-parse-sitemap/issues/new), make sure to answer these questions:
10 |
11 | 1. What version of Go (`go version`) are you using?
12 | 2. What code did you run?
13 | 3. What did you expect to see?
14 | 4. What did you see instead?
15 |
16 | ## Contributing code
17 |
18 | Let us know if you are interested in working on an issue by leaving a comment on the issue in GitHub. This helps avoid multiple people unknowingly working on the same issue.
19 |
20 | Please read the [Contribution Guidelines](https://golang.org/doc/contribute.html) before sending patches.
21 |
22 | In general, follow the ["fork-and-pull" Git workflow](https://github.com/susam/gitpr):
23 |
24 | 1. Fork the repository to your own GitHub account
25 | 2. Clone the project to your machine
26 | 3. Create a branch locally with a succinct but descriptive name
27 | 4. Commit changes to the branch
28 | 5. Follow any formatting and testing guidelines specific to this repo
29 | 6. Push changes to your fork
30 | 7. Open a PR in our repository and follow the PR template so that we can efficiently review the changes.
31 |
32 | Consult [GitHub Help] for more information on using pull requests.
33 |
34 | [GitHub Help]: https://help.github.com/articles/about-pull-requests/
35 |
36 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2019 Akulyakov Artem (akulyakov.artem@gmail.com, https://github.com/oxffaa)
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # gopher-parse-sitemap
2 |
3 | [Build Status](https://travis-ci.org/oxffaa/gopher-parse-sitemap)
4 |
5 | A highly efficient Go library for parsing large sitemaps without high memory usage. The sitemap parser is written in Go without external dependencies. See https://www.sitemaps.org/ for more information about the sitemap format.
6 |
7 | ## Why yet another sitemaps parsing library?
8 |
9 | From time to time, you need to parse really huge sitemaps. If you just unmarshal the whole file into a slice of structs, memory usage grows with the file size and the application can crash with an out-of-memory (OOM) error.
10 |
11 | The solution is to handle sitemap entries on the fly: read one entry, consume it, and repeat while there are unhandled items in the sitemap.
12 |
13 | ```golang
14 | err := sitemap.ParseFromFile("./testdata/sitemap.xml", func(e sitemap.Entry) error {
15 | fmt.Println(e.GetLocation())
16 | return nil
17 | })
18 | ```
19 |
20 | ### I need to parse only small and medium-sized sitemaps. Should I use this library?
21 |
22 | Yes. Of course, you can just load a sitemap into memory.
23 |
24 | ```golang
25 | var result []string
26 | err := sitemap.ParseIndexFromFile("./testdata/sitemap-index.xml", func(e sitemap.IndexEntry) error {
27 | result = append(result, e.GetLocation())
28 | return nil
29 | })
30 | ```
31 |
32 | But if you are pretty sure that you don't need to handle big-sized sitemaps, it may be better to choose a library with a simpler and more suitable API. In that case, you can try projects like https://github.com/yterajima/go-sitemap, https://github.com/snabb/sitemap, and https://github.com/decaseal/go-sitemap-parser.
33 |
34 | ## Install
35 |
36 | Installation is pretty easy, just do:
37 |
38 | ```bash
39 | go get -u github.com/oxffaa/gopher-parse-sitemap
40 | ```
41 |
42 | After that, import it (the package name is `sitemap`):
43 | ```golang
44 | import sitemap "github.com/oxffaa/gopher-parse-sitemap"
45 | ```
46 |
47 | Well done, you can start to create something awesome.
48 |
49 | ## Documentation
50 |
51 | See the [package documentation](https://godoc.org/github.com/oxffaa/gopher-parse-sitemap) for details.
52 |
--------------------------------------------------------------------------------
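The on-the-fly approach the README describes can be sketched with only the standard library. This is a minimal illustration, not the package's own code: `urlEntry`, `errStop`, `streamLocations`, and `firstLocations` are hypothetical names, and the sample document is made up for the demo. It decodes one `<url>` element at a time and lets the consumer stop parsing early by returning an error.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"strings"
)

// urlEntry holds only the <loc> child of a <url> element (illustrative type).
type urlEntry struct {
	Location string `xml:"loc"`
}

// errStop is a hypothetical sentinel a consumer returns to end parsing early.
var errStop = errors.New("stop parsing")

// streamLocations decodes one <url> element at a time and passes its location
// to consume, so only a single entry is held in memory at once.
func streamLocations(r io.Reader, consume func(loc string) error) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "url" {
			continue // skip everything that is not the start of a <url> element
		}
		var e urlEntry
		if err := dec.DecodeElement(&e, &start); err != nil {
			return err
		}
		if err := consume(e.Location); err != nil {
			return err // a consumer error stops the loop immediately
		}
	}
}

// firstLocations collects at most limit locations (limit >= 1) from xmlText.
func firstLocations(xmlText string, limit int) ([]string, error) {
	var locs []string
	err := streamLocations(strings.NewReader(xmlText), func(loc string) error {
		locs = append(locs, loc)
		if len(locs) >= limit {
			return errStop
		}
		return nil
	})
	if errors.Is(err, errStop) {
		err = nil // stopping early is expected here, not a failure
	}
	return locs, err
}

// sample is a made-up sitemap used only for this demo.
const sample = `<urlset>
  <url><loc>http://HOST/</loc></url>
  <url><loc>http://HOST/tools/</loc></url>
  <url><loc>http://HOST/page-1/</loc></url>
</urlset>`

func main() {
	locs, err := firstLocations(sample, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(locs)
}
```

Because the reader is consumed token by token, memory use in this sketch depends on the size of a single entry rather than on the size of the whole file.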
/go.mod:
--------------------------------------------------------------------------------
1 | module github.com/oxffaa/gopher-parse-sitemap
2 |
3 | go 1.13
4 |
--------------------------------------------------------------------------------
/sitemap.go:
--------------------------------------------------------------------------------
1 | // Package sitemap provides primitives for highly efficient parsing of huge
2 | // sitemap files.
3 | package sitemap
4 |
5 | import (
6 | "encoding/xml"
7 | "io"
8 | "net/http"
9 | "os"
10 | "time"
11 | )
12 |
13 | // Frequency is a type alias for change frequency.
14 | type Frequency = string
15 |
16 | // The change frequency constants describe how frequently a page is changed.
17 | const (
18 | Always Frequency = "always" // A page is changed always
19 | Hourly Frequency = "hourly" // A page is changed every hour
20 | Daily Frequency = "daily" // A page is changed every day
21 | Weekly Frequency = "weekly" // A page is changed every week
22 | Monthly Frequency = "monthly" // A page is changed every month
23 | Yearly Frequency = "yearly" // A page is changed every year
24 | Never Frequency = "never" // A page is changed never
25 | )
26 |
27 | // Entry is an interface that describes an element (a URL) in the sitemap file.
28 | // Keep in mind that it is implemented by an immutable entity, so you should
29 | // minimize the number of calls because each call can produce additional memory allocations.
30 | //
31 | // GetLocation returns the URL of the page.
32 | // GetLocation must return a non-empty string value.
33 | //
34 | // GetLastModified parses and returns the date and time of the last modification of the page.
35 | // GetLastModified can return nil or a valid time.Time instance.
36 | // Be careful: each call may return a new time.Time instance.
37 | //
38 | // GetChangeFrequency returns a string value that indicates how frequently the page is changed.
39 | // GetChangeFrequency returns a string value. See the Frequency constants.
40 | //
41 | // GetPriority returns the priority of the page.
42 | // The valid value is between 0.0 and 1.0, the default value is 0.5.
43 | //
44 | // You shouldn't implement this interface in your types.
45 | type Entry interface {
46 | GetLocation() string
47 | GetLastModified() *time.Time
48 | GetChangeFrequency() Frequency
49 | GetPriority() float32
50 | }
51 |
52 | // IndexEntry is an interface that describes an element (a URL) in a sitemap index file.
53 | // Keep in mind that it is implemented by an immutable entity, so you should
54 | // minimize the number of calls because each call can produce additional memory allocations.
55 | //
56 | // GetLocation returns the URL of a sitemap file.
57 | // GetLocation must return a non-empty string value.
58 | //
59 | // GetLastModified parses and returns the date and time of the last modification of the sitemap.
60 | // GetLastModified can return nil or a valid time.Time instance.
61 | // Be careful: each call may return a new time.Time instance.
62 | //
63 | // You shouldn't implement this interface in your types.
64 | type IndexEntry interface {
65 | GetLocation() string
66 | GetLastModified() *time.Time
67 | }
68 |
69 | // EntryConsumer is a type that represents a consumer of parsed sitemap entries.
70 | type EntryConsumer func(Entry) error
71 |
72 | // Parse parses data provided by the reader and calls the consumer's
73 | // function for each sitemap entry.
74 | func Parse(reader io.Reader, consumer EntryConsumer) error {
75 | return parseLoop(reader, func(d *xml.Decoder, se *xml.StartElement) error {
76 | return entryParser(d, se, consumer)
77 | })
78 | }
79 |
80 | // ParseFromFile reads sitemap from a file, parses it and for each sitemap
81 | // entry calls the consumer's function.
82 | func ParseFromFile(sitemapPath string, consumer EntryConsumer) error {
83 | sitemapFile, err := os.Open(sitemapPath)
84 | if err != nil {
85 | return err
86 | }
87 | defer sitemapFile.Close()
88 |
89 | return Parse(sitemapFile, consumer)
90 | }
91 |
92 | // ParseFromSite downloads sitemap from a site, parses it and for each sitemap
93 | // entry calls the consumer's function.
94 | func ParseFromSite(url string, consumer EntryConsumer) error {
95 | res, err := http.Get(url)
96 | if err != nil {
97 | return err
98 | }
99 | defer res.Body.Close()
100 |
101 | return Parse(res.Body, consumer)
102 | }
103 |
104 | // IndexEntryConsumer is a type that represents a consumer of parsed sitemap index entries.
105 | type IndexEntryConsumer func(IndexEntry) error
106 |
107 | // ParseIndex parses data provided by the reader and calls the consumer's
108 | // function for each sitemap index entry.
109 | func ParseIndex(reader io.Reader, consumer IndexEntryConsumer) error {
110 | return parseLoop(reader, func(d *xml.Decoder, se *xml.StartElement) error {
111 | return indexEntryParser(d, se, consumer)
112 | })
113 | }
114 |
115 | // ParseIndexFromFile reads sitemap index from a file, parses it and for each sitemap
116 | // index entry calls the consumer's function.
117 | func ParseIndexFromFile(sitemapPath string, consumer IndexEntryConsumer) error {
118 | sitemapFile, err := os.Open(sitemapPath)
119 | if err != nil {
120 | return err
121 | }
122 | defer sitemapFile.Close()
123 |
124 | return ParseIndex(sitemapFile, consumer)
125 | }
126 |
127 | // ParseIndexFromSite downloads sitemap index from a site, parses it and for each sitemap
128 | // index entry calls the consumer's function.
129 | func ParseIndexFromSite(sitemapURL string, consumer IndexEntryConsumer) error {
130 | res, err := http.Get(sitemapURL)
131 | if err != nil {
132 | return err
133 | }
134 | defer res.Body.Close()
135 |
136 | return ParseIndex(res.Body, consumer)
137 | }
138 |
--------------------------------------------------------------------------------
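The doc comments above state that a priority is valid between 0.0 and 1.0 and that change frequencies come from a fixed set of constants, while the parsing functions pass decoded values to the consumer as they are. A consumer that wants to reject out-of-range values could run a check like the one below. This is a sketch under assumptions: `validFrequencies` and `checkEntry` are hypothetical helpers and are not part of the package.

```go
package main

import (
	"errors"
	"fmt"
)

// validFrequencies lists the change-frequency values named by the constants
// in sitemap.go (hypothetical lookup table, not part of the package).
var validFrequencies = map[string]bool{
	"always": true, "hourly": true, "daily": true, "weekly": true,
	"monthly": true, "yearly": true, "never": true,
}

// checkEntry is a hypothetical consumer-side check of the values an entry
// reports: a non-empty location, a known frequency, and a priority in [0, 1].
func checkEntry(loc, changeFreq string, priority float32) error {
	if loc == "" {
		return errors.New("empty location")
	}
	if !validFrequencies[changeFreq] {
		return fmt.Errorf("unknown change frequency %q", changeFreq)
	}
	if priority < 0.0 || priority > 1.0 {
		return fmt.Errorf("priority %v is outside [0.0, 1.0]", priority)
	}
	return nil
}

func main() {
	fmt.Println(checkEntry("http://HOST/page-1/", "monthly", 0.9))
	fmt.Println(checkEntry("http://HOST/page-1/", "sometimes", 0.5))
	fmt.Println(checkEntry("http://HOST/page-1/", "daily", 1.5))
}
```

Inside a consumer, the arguments would come from `GetLocation`, `GetChangeFrequency`, and `GetPriority`; returning the error from the consumer stops parsing.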
/sitemap_impl.go:
--------------------------------------------------------------------------------
1 | package sitemap
2 |
3 | import (
4 | "encoding/xml"
5 | "io"
6 | )
7 |
8 | func entryParser(decoder *xml.Decoder, se *xml.StartElement, consume EntryConsumer) error {
9 | if se.Name.Local == "url" {
10 | entry := newSitemapEntry()
11 |
12 | decodeError := decoder.DecodeElement(entry, se)
13 | if decodeError != nil {
14 | return decodeError
15 | }
16 |
17 | consumerError := consume(entry)
18 | if consumerError != nil {
19 | return consumerError
20 | }
21 | }
22 |
23 | return nil
24 | }
25 |
26 | func indexEntryParser(decoder *xml.Decoder, se *xml.StartElement, consume IndexEntryConsumer) error {
27 | if se.Name.Local == "sitemap" {
28 | entry := newSitemapIndexEntry()
29 |
30 | decodeError := decoder.DecodeElement(entry, se)
31 | if decodeError != nil {
32 | return decodeError
33 | }
34 |
35 | consumerError := consume(entry)
36 | if consumerError != nil {
37 | return consumerError
38 | }
39 | }
40 |
41 | return nil
42 | }
43 |
44 | type elementParser func(*xml.Decoder, *xml.StartElement) error
45 |
46 | func parseLoop(reader io.Reader, parser elementParser) error {
47 | decoder := xml.NewDecoder(reader)
48 |
49 | for {
50 | t, tokenError := decoder.Token()
51 |
52 | if tokenError == io.EOF {
53 | break
54 | } else if tokenError != nil {
55 | return tokenError
56 | }
57 |
58 | se, ok := t.(xml.StartElement)
59 | if !ok {
60 | continue
61 | }
62 |
63 | parserError := parser(decoder, &se)
64 | if parserError != nil {
65 | return parserError
66 | }
67 | }
68 |
69 | return nil
70 | }
71 |
--------------------------------------------------------------------------------
/sitemap_test.go:
--------------------------------------------------------------------------------
1 | package sitemap
2 |
3 | import (
4 | "errors"
5 | "fmt"
6 | "io/ioutil"
7 | "strings"
8 | "testing"
9 | "time"
10 | )
11 |
12 | /*
13 | * Examples
14 | */
15 | func ExampleParseFromFile() {
16 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error {
17 | fmt.Println(e.GetLocation())
18 | return nil
19 | })
20 | if err != nil {
21 | panic(err)
22 | }
23 | }
24 |
25 | func ExampleParseIndexFromFile() {
26 | result := make([]string, 0, 0)
27 | err := ParseIndexFromFile("./testdata/sitemap-index.xml", func(e IndexEntry) error {
28 | result = append(result, e.GetLocation())
29 | return nil
30 | })
31 | if err != nil {
32 | panic(err)
33 | }
34 | }
35 |
36 | /*
37 | * Public API tests
38 | */
39 | func TestParseSitemap(t *testing.T) {
40 | var (
41 | counter int
42 | sb strings.Builder
43 | )
44 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error {
45 | counter++
46 |
47 | fmt.Fprintln(&sb, e.GetLocation())
48 | lastmod := e.GetLastModified()
49 | if lastmod != nil {
50 | fmt.Fprintln(&sb, lastmod.Format(time.RFC3339))
51 | }
52 | fmt.Fprintln(&sb, e.GetChangeFrequency())
53 | fmt.Fprintln(&sb, e.GetPriority())
54 |
55 | return nil
56 | })
57 |
58 | if err != nil {
59 | t.Errorf("Parsing failed with error %s", err)
60 | }
61 |
62 | if counter != 4 {
63 | t.Errorf("Expected 4 elements, but got %d", counter)
64 | }
65 |
66 | expected, err := ioutil.ReadFile("./testdata/sitemap.golden")
67 | if err != nil {
68 | t.Errorf("Can't read golden file due to %s", err)
69 | }
70 |
71 | if sb.String() != string(expected) {
72 | t.Error("Unexpected result")
73 | }
74 | }
75 |
76 | func TestParseSitemap_BreakingOnError(t *testing.T) {
77 | var counter = 0
78 | breakErr := errors.New("break error")
79 | err := ParseFromFile("./testdata/sitemap.xml", func(e Entry) error {
80 | counter++
81 | return breakErr
82 | })
83 |
84 | if counter != 1 {
85 | t.Error("Error didn't break parsing")
86 | }
87 |
88 | if breakErr != err {
89 | t.Error("If consumer failed, ParseFromFile should return consumer error")
90 | }
91 | }
92 |
93 | func TestParseSitemapIndex(t *testing.T) {
94 | var (
95 | counter int
96 | sb strings.Builder
97 | )
98 | err := ParseIndexFromFile("./testdata/sitemap-index.xml", func(e IndexEntry) error {
99 | counter++
100 |
101 | fmt.Fprintln(&sb, e.GetLocation())
102 | lastmod := e.GetLastModified()
103 | if lastmod != nil {
104 | fmt.Fprintln(&sb, lastmod.Format(time.RFC3339))
105 | }
106 |
107 | return nil
108 | })
109 |
110 | if err != nil {
111 | t.Errorf("Parsing failed with error %s", err)
112 | }
113 |
114 | if counter != 3 {
115 | t.Errorf("Expected 3 elements, but got %d", counter)
116 | }
117 |
118 | expected, err := ioutil.ReadFile("./testdata/sitemap-index.golden")
119 | if err != nil {
120 | t.Errorf("Can't read golden file due to %s", err)
121 | }
122 |
123 | if sb.String() != string(expected) {
124 | t.Error("Unexpected result")
125 | }
126 | }
127 |
128 | /*
129 | * Private API tests
130 | */
131 |
132 | func TestParseShortDateTime(t *testing.T) {
133 | res := parseDateTime("2015-05-07")
134 | if res == nil {
135 | t.Error("Date time wasn't parsed")
136 | return
137 | }
138 | if res.Year() != 2015 || res.Month() != time.May || res.Day() != 7 {
139 | t.Errorf("Date was parsed incorrectly: %s", res.Format(time.RFC3339))
140 | }
141 | }
142 |
--------------------------------------------------------------------------------
/sitemap_types.go:
--------------------------------------------------------------------------------
1 | package sitemap
2 |
3 | import "time"
4 |
5 | type sitemapEntry struct {
6 | Location string `xml:"loc"`
7 | LastModified string `xml:"lastmod,omitempty"`
8 | ParsedLastModified *time.Time `xml:"-"` // lazily filled cache, not read from XML
9 | ChangeFrequency Frequency `xml:"changefreq,omitempty"`
10 | Priority float32 `xml:"priority,omitempty"`
11 | }
12 |
13 | func newSitemapEntry() *sitemapEntry {
14 | return &sitemapEntry{ChangeFrequency: Always, Priority: 0.5}
15 | }
16 |
17 | func (e *sitemapEntry) GetLocation() string {
18 | return e.Location
19 | }
20 |
21 | func (e *sitemapEntry) GetLastModified() *time.Time {
22 | if e.ParsedLastModified == nil && e.LastModified != "" {
23 | e.ParsedLastModified = parseDateTime(e.LastModified)
24 | }
25 | return e.ParsedLastModified
26 | }
27 |
28 | func (e *sitemapEntry) GetChangeFrequency() Frequency {
29 | return e.ChangeFrequency
30 | }
31 |
32 | func (e *sitemapEntry) GetPriority() float32 {
33 | return e.Priority
34 | }
35 |
36 | type sitemapIndexEntry struct {
37 | Location string `xml:"loc"`
38 | LastModified string `xml:"lastmod,omitempty"`
39 | ParsedLastModified *time.Time `xml:"-"` // lazily filled cache, not read from XML
40 | }
41 |
42 | func newSitemapIndexEntry() *sitemapIndexEntry {
43 | return &sitemapIndexEntry{}
44 | }
45 |
46 | func (e *sitemapIndexEntry) GetLocation() string {
47 | return e.Location
48 | }
49 |
50 | func (e *sitemapIndexEntry) GetLastModified() *time.Time {
51 | if e.ParsedLastModified == nil && e.LastModified != "" {
52 | e.ParsedLastModified = parseDateTime(e.LastModified)
53 | }
54 | return e.ParsedLastModified
55 | }
56 |
57 | func parseDateTime(value string) *time.Time {
58 | if value == "" {
59 | return nil
60 | }
61 |
62 | t, err := time.Parse(time.RFC3339, value)
63 | if err != nil {
64 | // second chance
65 | // try parse as short format
66 | t, err = time.Parse("2006-01-02", value)
67 | if err != nil {
68 | return nil
69 | }
70 | }
71 |
72 | return &t
73 | }
74 |
--------------------------------------------------------------------------------
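The two-step date handling in `parseDateTime` above (RFC 3339 first, then the date-only layout) can be tried in isolation. The sketch below is a standalone variant, not the package's function: `parseLastMod` is a hypothetical name, and it reports failure with a boolean instead of returning a nil pointer.

```go
package main

import (
	"fmt"
	"time"
)

// parseLastMod is a hypothetical standalone variant of the fallback above:
// it tries the full RFC 3339 layout first, then the date-only layout.
func parseLastMod(value string) (time.Time, bool) {
	for _, layout := range []string{time.RFC3339, "2006-01-02"} {
		if t, err := time.Parse(layout, value); err == nil {
			return t, true
		}
	}
	return time.Time{}, false
}

func main() {
	// Sample values in both supported forms, plus one that matches neither.
	for _, v := range []string{"2015-05-07T19:13:09+09:00", "2015-05-07", "not a date"} {
		t, ok := parseLastMod(v)
		if !ok {
			fmt.Printf("%q: not a supported date\n", v)
			continue
		}
		fmt.Printf("%q: %s\n", v, t.Format(time.RFC3339))
	}
}
```

A date-only value has no zone information, so `time.Parse` interprets it as midnight UTC, which matches the `2015-05-07T00:00:00Z` line in `testdata/sitemap.golden`.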
/testdata/sitemap-index.golden:
--------------------------------------------------------------------------------
1 | http://www.example.com/sitemap1.xml.gz
2 | 2004-10-01T18:23:17Z
3 | http://www.example.com/sitemap2.xml.gz
4 | 2005-01-01T00:00:00Z
5 | http://www.example.com/sitemap3.xml.gz
6 |
--------------------------------------------------------------------------------
/testdata/sitemap-index.xml:
--------------------------------------------------------------------------------
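1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
3 | <sitemap>
4 | <loc>http://www.example.com/sitemap1.xml.gz</loc>
5 | <lastmod>2004-10-01T18:23:17+00:00</lastmod>
6 | </sitemap>
7 | <sitemap>
8 | <loc>http://www.example.com/sitemap2.xml.gz</loc>
9 | <lastmod>2005-01-01</lastmod>
10 | </sitemap>
11 | <sitemap>
12 | <loc>http://www.example.com/sitemap3.xml.gz</loc>
13 | </sitemap>
14 | </sitemapindex>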
--------------------------------------------------------------------------------
/testdata/sitemap.golden:
--------------------------------------------------------------------------------
1 | http://HOST/
2 | always
3 | 0.5
4 | http://HOST/tools/
5 | 2015-05-07T19:13:09+09:00
6 | always
7 | 0.5
8 | http://HOST/contribution-to-oss/
9 | 2015-05-07T00:00:00Z
10 | monthly
11 | 0.5
12 | http://HOST/page-1/
13 | 2015-05-07T19:13:09+09:00
14 | monthly
15 | 0.9
16 |
--------------------------------------------------------------------------------
/testdata/sitemap.xml:
--------------------------------------------------------------------------------
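1 | <?xml version="1.0" encoding="UTF-8"?>
2 | <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
3 | <url>
4 | <loc>http://HOST/</loc>
5 | </url>
6 | <url>
7 | <loc>http://HOST/tools/</loc>
8 | <lastmod>2015-05-07T19:13:09+09:00</lastmod>
9 | </url>
10 | <url>
11 | <loc>http://HOST/contribution-to-oss/</loc>
12 | <lastmod>2015-05-07</lastmod>
13 | <changefreq>monthly</changefreq>
14 | </url>
15 | <url>
16 | <loc>http://HOST/page-1/</loc>
17 | <lastmod>2015-05-07T19:13:09+09:00</lastmod>
18 | <changefreq>monthly</changefreq>
19 | <priority>0.9</priority>
20 | </url>
21 | </urlset>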
--------------------------------------------------------------------------------