├── LICENSE ├── README.md ├── common.go ├── example └── main.go ├── go.mod ├── go.sum ├── sd.go ├── sd_benchmark_test.go └── sd_test.go /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 Lars Karlslund 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # stringdedup - in memory string deduplication for Golang 2 | 3 | Easy-peasy string deduplication to Golang. You can implement this in literally 2 minutes. But you might not want to - please read all of this. 
4 | 5 | [![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT) 6 | [![Go Report Card](https://goreportcard.com/badge/github.com/lkarlslund/stringdedup)](https://goreportcard.com/report/github.com/lkarlslund/stringdedup) 7 | 8 | ## How 9 | 10 | Instantiate a deduplication object by providing a function that takes a []byte slice and returns a hash. Because stringdedup uses generics, the type of the returned value is up to you. For small numbers of strings a uint32 is fine, but you can use a uint64 or even a [4]uint64 holding a sha256 value. Choose something that is fast ;) 11 | 12 | ``` 13 | dedup := stringdedup.New(func(in []byte) uint32 { 14 | return xxhash.Checksum32(in) 15 | }) 16 | ``` 17 | Every time you encounter a string you want deduplicated, just wrap it in a deduplication call: 18 | 19 | ``` 20 | deduppedstring := dedup.S(inputstring) 21 | ``` 22 | 23 | You can also ingest []byte and get a deduplicated string back. This saves an allocation per call, see the more detailed example below: 24 | 25 | ``` 26 | inputdata := []byte{0x01, 0x02, 0x03, 0x04} 27 | deduppedstring := dedup.BS(inputdata) 28 | ``` 29 | 30 | ## Why? 31 | 32 | In a scenario where you read a lot of data containing repeated freeform strings, unless you do something special, you're wasting a lot of memory. A very simplistic example could be that you are indexing a lot of files - see the example folder in the package. 33 | 34 | I use it in two different projects, and one of them gets a deduplication ratio of 1:5, saving a massive amount of memory. 35 | 36 | The example included shows that things are not black and white. You might gain something by using this package, and you might not. It really depends on what you are doing, and also how you are doing it. 37 | 38 | ## How do strings work, and why do I care? Isn't Go already clever when it comes to strings? 39 | 40 | Yes, actually Go is quite clever about strings.
Or at least as clever as technically possible, without it costing way too much CPU during normal scenarios. 41 | 42 | Internally in Golang a string is defined as: 43 | 44 | ``` 45 | type string struct { // no, you can't do this in reality - see reflect.StringHeader, or unsafe.StringData in newer Go versions 46 | p pointer // the bytes the string consists of, Golang internally uses "pointer" as the type, this is not a type reachable by mortals 47 | len int // length of the string in bytes 48 | } 49 | ``` 50 | So the string variable itself takes up 16 bytes on 64-bit platforms (8 bytes on 32-bit), plus the space the backing data uses, plus some heap management overhead. This package tries to cut down on the duplicate backing data when your program introduces the same contents several times. 51 | 52 | For the explanations below, assume this is defined: 53 | ``` 54 | var a, b string 55 | a = "ho ho ho" 56 | ``` 57 | ### The rules of Go strings 58 | - A string is a fixed length variable (!) with a pointer to variable length backing data (contents of the string) 59 | - Strings are immutable (you cannot change a string's backing data): `a = "no more" // pointer and length are changed in a, the old backing data is not modified` 60 | - Assigning a string to another string does not allocate new backing data: `b = a // pointer and length are changed in b (no heap allocation)` 61 | - You can also cut up a string in smaller pieces without allocating new backing data (re-slicing): `b = a[0:5] // pointer and length are changed in b (no heap allocation)` 62 | - Constant strings (hardcoded into your program) are still normal strings, the pointer just points into your program's read-only data section instead of the heap 63 | 64 | Every time you read data from somewhere external, run a string through a function (uppercase, lowercase etc), or convert from []byte to string, you allocate new backing data on the heap. 65 | 66 | ## Okay, let's dedup!
67 | 68 | Get the package: 69 | ``` 70 | go get github.com/lkarlslund/stringdedup 71 | ``` 72 | 73 | ### Using stringdedup 74 | - When you dedup a string that is not already known, it is *always* copied into a new heap allocation. 75 | - If you have []byte, you can dedup it to a string in one call (reader.Read(b) -> dedup.BS(b)) 76 | 77 | ``` 78 | dedup := stringdedup.New(func(in []byte) uint32 { 79 | return xxhash.Checksum32(in) 80 | }) 81 | dedupedstring := dedup.S(somestring) // input string, get deduplicated string back 82 | ``` 83 | 84 | That's it! You're now guaranteed that this string only exists once in your program, as long as all other strings with the same contents go through the same dedup object. 85 | 86 | If you're repeatedly reading from the same []byte buffer, you can save an allocation per call this way: 87 | ``` 88 | dedup := stringdedup.New(func(in []byte) uint32 { 89 | return xxhash.Checksum32(in) 90 | }) 91 | buffer := make([]byte, 16384) 92 | var mystrings []string 93 | var err error 94 | for err == nil { 95 | _, err = myreader.Read(buffer) 96 | // do some processing, oh you found something you want to save at buffer[42:103] 97 | mystrings = append(mystrings, dedup.BS(buffer[42:103])) // BS = input []byte, get deduplicated string back 98 | } 99 | ``` 100 | If you know that you're not going to dedup any of the existing strings in memory again, you can call: 101 | ``` 102 | dedup.Flush() 103 | ``` 104 | This frees the hashing indexes held by the dedup object. It doesn't mean you cannot dedup again, it just means that stringdedup forgets about the strings that are already in memory. 105 | 106 | ## Caveats (there are some, and you better read this) 107 | 108 | This package uses some tricks that *may* break at any time if the Golang developers choose to implement something differently. Namely, it relies on these particularities: 109 | 110 | - Weak references by using a map of uintptr's 111 | - Strings are removed from the deduplication map by using the SetFinalizer method.
That means you can't use SetFinalizer on the strings that you put into or get back from the package. Golang really doesn't want you to use SetFinalizer, they see it as a horrible kludge, but I've found no other way of doing weak references with cleanup 112 | - The strings are hash indexed via the hash function you supply (a 32-bit XXHASH in the examples). This is not a crypto safe hashing algorithm, but we're not doing this to prevent malicious collisions. This is about statistics, and I'd guess that you would have to store more than 400 million strings before you start to run into problems. Strings are validated before they're returned, so you will never get invalid data back; a hash collision just means that string is handed back without being deduplicated. You can skip this validation by setting DontValidateResults if you're feeling really lucky. 113 | - You can choose to purge the deduplication index by calling Flush() to free memory. New deduplicated strings start over, so now you might get duplicate strings anyway. Again, this is for specific scenarios. 114 | 115 | This requires Go 1.21 or newer (see go.mod) on x86 / x64. Please let me know your experiences. 116 | 117 | Twitter: @lkarlslund -------------------------------------------------------------------------------- /common.go: -------------------------------------------------------------------------------- 1 | package stringdedup 2 | 3 | import ( 4 | "runtime" 5 | "sync" 6 | "unsafe" 7 | ) 8 | 9 | var lock sync.RWMutex 10 | 11 | type weakdata struct { 12 | data uintptr 13 | length int 14 | } 15 | 16 | func (wd weakdata) Uintptr() uintptr { 17 | return wd.data 18 | } 19 | 20 | func (wd weakdata) Pointer() *byte { 21 | return (*byte)(unsafe.Pointer(wd.data)) 22 | } 23 | 24 | func weakString(in string) weakdata { 25 | ws := weakdata{ 26 | data: uintptr(unsafe.Pointer(unsafe.StringData(in))), 27 | length: len(in), 28 | } 29 | return ws 30 | } 31 | 32 | func weakBytes(in []byte) weakdata { 33 | ws := weakdata{ 34 | data: uintptr(unsafe.Pointer(&in[0])), 35 | length: len(in), 36 | } 37 | return ws 38 | } 39 | 40 | func (wd weakdata) String() string { 41 | return
unsafe.String((*byte)(unsafe.Pointer(wd.data)), wd.length) 42 | } 43 | 44 | func (wd weakdata) Bytes() []byte { 45 | return unsafe.Slice((*byte)(unsafe.Pointer(wd.data)), wd.length) 46 | } 47 | 48 | func castStringToBytes(in string) []byte { 49 | return unsafe.Slice(unsafe.StringData(in), len(in)) 50 | } 51 | 52 | func castBytesToString(in []byte) string { 53 | out := unsafe.String(unsafe.SliceData(in), len(in)) // SliceData avoids the index-out-of-range panic that &in[0] causes on an empty slice 54 | runtime.KeepAlive(in) 55 | return out 56 | } 57 | 58 | // ValidateResults ensures that no collisions in returned strings are possible. This is enabled by default, but you can speed things up by setting this to false 59 | var ValidateResults = true 60 | 61 | // YesIKnowThisCouldGoHorriblyWrong requires you to read the source code to understand what it does. This is intentional, as usage is only for very specific and careful scenarios 62 | var YesIKnowThisCouldGoHorriblyWrong = false 63 | -------------------------------------------------------------------------------- /example/main.go: -------------------------------------------------------------------------------- 1 | package main 2 | 3 | import ( 4 | "fmt" 5 | "os" 6 | "path/filepath" 7 | "runtime" 8 | "time" 9 | "unsafe" 10 | 11 | "github.com/OneOfOne/xxhash" 12 | "github.com/lkarlslund/stringdedup" 13 | ) 14 | 15 | type fileinfo struct { 16 | folder, basename, extension string 17 | } 18 | 19 | var files, files2 []fileinfo 20 | 21 | func main() { 22 | fmt.Println("String deduplication demonstration") 23 | fmt.Println("---") 24 | 25 | d := stringdedup.New(func(in []byte) uint32 { 26 | return xxhash.Checksum32(in) 27 | }) 28 | 29 | var memstats runtime.MemStats 30 | 31 | runtime.ReadMemStats(&memstats) 32 | fmt.Printf("Initial memory usage at start of program: %v objects, consuming %v bytes\n", memstats.HeapObjects, memstats.HeapInuse) 33 | fmt.Println("---") 34 | 35 | searchDir := "/usr" 36 | if runtime.GOOS == "windows" { 37 | searchDir = "c:/windows" 38 | } 39 | 40 | fmt.Printf("Scanning and indexing files in %v -
hang on ...\n", searchDir) 41 | 42 | filepath.Walk(searchDir, func(path string, f os.FileInfo, err error) error { 43 | if err == nil && !f.IsDir() { // f can be nil when err is set (e.g. permission denied), so check err first 44 | folder := filepath.Dir(path) 45 | extension := filepath.Ext(path) 46 | basename := filepath.Base(path) 47 | basename = basename[:len(basename)-len(extension)] 48 | files = append(files, fileinfo{ 49 | folder: folder, 50 | basename: basename, 51 | extension: extension, 52 | }) 53 | } 54 | return nil 55 | }) 56 | 57 | fmt.Println("Scanning done!") 58 | fmt.Println("---") 59 | 60 | runtime.GC() // Let garbage collector run, and see memory usage 61 | time.Sleep(time.Millisecond * 100) // Settle down 62 | runtime.ReadMemStats(&memstats) 63 | fmt.Printf("Memory usage for %v fileinfo: %v objects, consuming %v bytes\n", len(files), memstats.HeapObjects, memstats.HeapInuse) 64 | 65 | undedupbytes := memstats.HeapInuse 66 | 67 | fmt.Printf("Slice reference costs %v x %v bytes - a total of %v bytes\n", len(files), unsafe.Sizeof(fileinfo{}), len(files)*int(unsafe.Sizeof(fileinfo{}))) 68 | 69 | checksum := xxhash.New64() 70 | for _, fi := range files { 71 | checksum.Write([]byte(fi.folder + fi.basename + fi.extension)) 72 | } 73 | csum := checksum.Sum64() 74 | fmt.Printf("Validation checksum on non deduped files is %x\n", csum) 75 | fmt.Println("---") 76 | 77 | // NON DEDUPLICATED STATISTICS END 78 | 79 | // A new batch of fileinfo 80 | files2 = make([]fileinfo, len(files), cap(files)) 81 | 82 | // Let's try that again with deduplication 83 | for i, fi := range files { 84 | files2[i] = fileinfo{ 85 | folder: d.S(fi.folder), 86 | basename: d.S(fi.basename), 87 | extension: d.S(fi.extension), 88 | } 89 | } 90 | 91 | runtime.ReadMemStats(&memstats) 92 | fmt.Println("Both a deduplicated and non-deduplicated slice is now in memory") 93 | fmt.Printf("Double allocated memory usage for %v fileinfo: %v objects, consuming %v bytes\n", len(files2), memstats.HeapObjects, memstats.HeapInuse) 94 | 95 | // Let garbage collector run, and see memory usage 96 |
runtime.KeepAlive(files) 97 | files = nil 98 | runtime.GC() 99 | time.Sleep(time.Millisecond * 1000) 100 | 101 | runtime.ReadMemStats(&memstats) 102 | fmt.Println("---") 103 | fmt.Printf("Dedup memory usage for %v fileinfo: %v objects, consuming %v bytes\n", len(files2), memstats.HeapObjects, memstats.HeapInuse) 104 | 105 | dedupbytes := memstats.HeapInuse 106 | fmt.Printf("Reduction in memory usage: %.2f\n", float32(dedupbytes)/float32(undedupbytes)) 107 | 108 | // Drop indexes and let's see 109 | d.Flush() 110 | runtime.GC() 111 | time.Sleep(time.Millisecond * 1000) 112 | 113 | runtime.ReadMemStats(&memstats) 114 | fmt.Println("---") 115 | fmt.Printf("Flushed index memory usage: %v object, consuming %v bytes\n", memstats.HeapObjects, memstats.HeapInuse) 116 | fmt.Printf("Reduction in memory usage (after dropping indexes): %.2f\n", float32(memstats.HeapInuse)/float32(undedupbytes)) 117 | 118 | // Validate that deduped files are the same as non deduped files 119 | checksum = xxhash.New64() 120 | for _, fi := range files2 { 121 | checksum.Write([]byte(fi.folder + fi.basename + fi.extension)) 122 | } 123 | fmt.Println("---") 124 | csum2 := checksum.Sum64() 125 | fmt.Printf("Validation on dedup strings checksum is %x\n", csum2) 126 | checksum = nil 127 | 128 | if csum != csum2 { 129 | fmt.Println("!!! VALIDATION FAILED. 
DEDUPED STRINGS ARE NOT THE SAME AS NON DEDUPED STRINGS !!!") 130 | } 131 | 132 | var bytes int 133 | for _, file := range files2 { 134 | bytes += len(file.basename) + len(file.extension) + len(file.folder) 135 | } 136 | 137 | // Let garbage collector run, and see memory usage 138 | // Clean up stuff left by finalizers 139 | files2 = nil 140 | runtime.GC() 141 | time.Sleep(time.Millisecond * 100) 142 | runtime.GC() 143 | 144 | runtime.ReadMemStats(&memstats) 145 | fmt.Println("---") 146 | fmt.Printf("Cleared memory usage: %v object, consuming %v bytes\n", memstats.HeapObjects, memstats.HeapInuse) 147 | } 148 | 149 | // func printmemstats("") 150 | -------------------------------------------------------------------------------- /go.mod: -------------------------------------------------------------------------------- 1 | module github.com/lkarlslund/stringdedup 2 | 3 | go 1.21 4 | 5 | require github.com/OneOfOne/xxhash v1.2.8 6 | 7 | require github.com/SaveTheRbtz/generic-sync-map-go v0.0.0-20230201052002-6c5833b989be 8 | 9 | require go4.org/unsafe/assume-no-moving-gc v0.0.0-20230525183740-e7c30c78aeb2 10 | -------------------------------------------------------------------------------- /go.sum: -------------------------------------------------------------------------------- 1 | github.com/OneOfOne/xxhash v1.2.8 h1:31czK/TI9sNkxIKfaUfGlU47BAxQ0ztGgd9vPyqimf8= 2 | github.com/OneOfOne/xxhash v1.2.8/go.mod h1:eZbhyaAYD41SGSSsnmcpxVoRiQ/MPUTjUdIIOT9Um7Q= 3 | github.com/SaveTheRbtz/generic-sync-map-go v0.0.0-20220414055132-a37292614db8 h1:Xa6tp8DPDhdV+k23uiTC/GrAYOe4IdyJVKtob4KW3GA= 4 | github.com/SaveTheRbtz/generic-sync-map-go v0.0.0-20220414055132-a37292614db8/go.mod h1:ihkm1viTbO/LOsgdGoFPBSvzqvx7ibvkMzYp3CgtHik= 5 | github.com/SaveTheRbtz/generic-sync-map-go v0.0.0-20230201052002-6c5833b989be h1:ZUMGZpetBeapAS/oOlffnBL6aSG6WwXSWfNXeadAzXE= 6 | github.com/SaveTheRbtz/generic-sync-map-go v0.0.0-20230201052002-6c5833b989be/go.mod 
h1:ihkm1viTbO/LOsgdGoFPBSvzqvx7ibvkMzYp3CgtHik= 7 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20211027215541-db492cf91b37 h1:Tx9kY6yUkLge/pFG7IEMwDZy6CS2ajFc9TvQdPCW0uA= 8 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20211027215541-db492cf91b37/go.mod h1:FftLjUGFEDu5k8lt0ddY+HcrH/qU/0qk+H8j9/nTl3E= 9 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20220617031537-928513b29760 h1:FyBZqvoA/jbNzuAWLQE2kG820zMAkcilx6BMjGbL/E4= 10 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20220617031537-928513b29760/go.mod h1:FftLjUGFEDu5k8lt0ddY+HcrH/qU/0qk+H8j9/nTl3E= 11 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20230525183740-e7c30c78aeb2 h1:WJhcL4p+YeDxmZWg141nRm7XC8IDmhz7lk5GpadO1Sg= 12 | go4.org/unsafe/assume-no-moving-gc v0.0.0-20230525183740-e7c30c78aeb2/go.mod h1:FftLjUGFEDu5k8lt0ddY+HcrH/qU/0qk+H8j9/nTl3E= 13 | -------------------------------------------------------------------------------- /sd.go: -------------------------------------------------------------------------------- 1 | package stringdedup 2 | 3 | import ( 4 | "runtime" 5 | "sync" 6 | "sync/atomic" 7 | "time" 8 | "unsafe" 9 | 10 | gsync "github.com/SaveTheRbtz/generic-sync-map-go" 11 | _ "go4.org/unsafe/assume-no-moving-gc" 12 | ) 13 | 14 | func New[hashtype comparable](hashfunc func(in []byte) hashtype) *stringDedup[hashtype] { 15 | var sd stringDedup[hashtype] 16 | sd.removefromthismap = generateFinalizerFunc(&sd) 17 | sd.hashfunc = hashfunc 18 | return &sd 19 | } 20 | 21 | type stringDedup[hashtype comparable] struct { 22 | stats Statistics // Statistics moved to front to ensure 64-bit alignment even on 32-bit platforms (uses atomic to update) 23 | 24 | pointermap gsync.MapOf[uintptr, hashtype] 25 | hashmap gsync.MapOf[hashtype, weakdata] // key is hash, value is weakdata entry containing pointer to start of string or byte slice *header* and length 26 | 27 | // Let dedup object keep some strings 'alive' for a period of time 28 | KeepAlive time.Duration 29 | 30 | keepAliveSchedLock sync.Mutex 31 | keepalivemap 
gsync.MapOf[string, time.Time] 32 | keepaliveFlusher *time.Timer 33 | keepaliveitems, keepaliveitemsremoved int64 34 | 35 | hashfunc func([]byte) hashtype 36 | 37 | removefromthismap finalizerFunc 38 | 39 | flushing bool 40 | 41 | // DontValidateResults skips collisions check in returned strings 42 | DontValidateResults bool // Disable at your own peril, hash collisions will give you wrong strings back 43 | } 44 | 45 | type Statistics struct { 46 | ItemsAdded, 47 | BytesInMemory, 48 | ItemsSaved, 49 | BytesSaved, 50 | ItemsRemoved, 51 | Collisions, 52 | FirstCollisionDetected, 53 | KeepAliveItemsAdded, 54 | KeepAliveItemsRemoved int64 55 | } 56 | 57 | // Size returns the number of deduplicated strings currently being tracked in memory 58 | func (sd *stringDedup[hashtype]) Size() int64 { 59 | return atomic.LoadInt64(&sd.stats.ItemsAdded) - atomic.LoadInt64(&sd.stats.ItemsRemoved) 60 | } 61 | 62 | func (sd *stringDedup[hashtype]) Statistics() Statistics { 63 | // Not thread safe 64 | return sd.stats 65 | } 66 | 67 | // Flush clears all state information about deduplication 68 | func (sd *stringDedup[hashtype]) Flush() { 69 | // Clear our data 70 | sd.flushing = true 71 | 72 | sd.pointermap.Range(func(pointer uintptr, hash hashtype) bool { 73 | // Don't finalize, we don't care about it any more 74 | runtime.SetFinalizer((*byte)(unsafe.Pointer(pointer)), nil) 75 | 76 | sd.pointermap.Delete(pointer) 77 | sd.hashmap.Delete(hash) 78 | 79 | atomic.AddInt64(&sd.stats.ItemsRemoved, 1) 80 | return true 81 | }) 82 | 83 | // Get rid of any keepalives 84 | sd.keepalivemap.Range(func(s string, t time.Time) bool { 85 | sd.keepalivemap.Delete(s) 86 | atomic.AddInt64(&sd.keepaliveitemsremoved, 1) 87 | return true 88 | }) 89 | 90 | sd.flushing = false 91 | } 92 | 93 | // BS takes a slice of bytes, and returns a copy of it as a deduplicated string 94 | func (sd *stringDedup[hashtype]) BS(in []byte) string { 95 | str := castBytesToString(in) // NoCopy 96 | return sd.S(str) 97 | } 98 | 
99 | func (sd *stringDedup[hashtype]) S(in string) string { 100 | if len(in) == 0 { 101 | // Nothing to see here, move along now 102 | return in 103 | } 104 | 105 | hash := sd.hashfunc(castStringToBytes(in)) 106 | 107 | ws, loaded := sd.hashmap.Load(hash) 108 | 109 | if loaded { 110 | atomic.AddInt64(&sd.stats.ItemsSaved, 1) 111 | atomic.AddInt64(&sd.stats.BytesSaved, int64(ws.length)) 112 | out := ws.String() 113 | if !sd.DontValidateResults && out != in { 114 | atomic.CompareAndSwapInt64(&sd.stats.FirstCollisionDetected, 0, sd.Size()) 115 | atomic.AddInt64(&sd.stats.Collisions, 1) 116 | return string(castStringToBytes(in)) // Collision: return a private copy, as in may alias the caller's []byte buffer when called via BS 117 | } 118 | return out 119 | } 120 | 121 | // We might receive a static non-dynamically allocated string, so we need to make a copy 122 | // Can we detect this somehow and avoid it? 123 | buf := make([]byte, len(in)) 124 | copy(buf, in) 125 | str := castBytesToString(buf) 126 | ws = weakString(str) 127 | 128 | sd.hashmap.Store(hash, ws) 129 | sd.pointermap.Store(ws.data, hash) 130 | 131 | // We need to keep the string alive 132 | if sd.KeepAlive > 0 { 133 | sd.keepalivemap.Store(str, time.Now().Add(sd.KeepAlive)) 134 | atomic.AddInt64(&sd.keepaliveitems, 1) 135 | // Naughty checking without locking 136 | if sd.keepaliveFlusher == nil { 137 | sd.keepAliveSchedLock.Lock() 138 | if sd.keepaliveFlusher == nil { 139 | sd.keepaliveFlusher = time.AfterFunc(sd.KeepAlive/5, sd.flushKeepAlive) 140 | } 141 | sd.keepAliveSchedLock.Unlock() 142 | } 143 | } 144 | 145 | atomic.AddInt64(&sd.stats.ItemsAdded, 1) 146 | atomic.AddInt64(&sd.stats.BytesInMemory, int64(ws.length)) 147 | 148 | runtime.SetFinalizer((*byte)(unsafe.Pointer(&buf[0])), sd.removefromthismap) 149 | runtime.KeepAlive(str) 150 | return str 151 | } 152 | 153 | func (sd *stringDedup[hashtype]) flushKeepAlive() { 154 | var items int 155 | now := time.Now() 156 | sd.keepalivemap.Range(func(key string, value time.Time) bool { 157 | if now.After(value) { 158 | sd.keepalivemap.Delete(key) 159 |
atomic.AddInt64(&sd.keepaliveitemsremoved, 1) 160 | } else { 161 | items++ 162 | } 163 | return true 164 | }) 165 | 166 | // Reschedule ourselves if needed 167 | sd.keepAliveSchedLock.Lock() 168 | if items > 0 { 169 | sd.keepaliveFlusher = time.AfterFunc(sd.KeepAlive/5, sd.flushKeepAlive) 170 | } else { 171 | sd.keepaliveFlusher = nil 172 | } 173 | sd.keepAliveSchedLock.Unlock() 174 | } 175 | 176 | type finalizerFunc func(*byte) 177 | 178 | func generateFinalizerFunc[hashtype comparable](sd *stringDedup[hashtype]) finalizerFunc { 179 | return func(in *byte) { 180 | if sd.flushing { 181 | return // We're flushing, don't bother 182 | } 183 | 184 | pointer := uintptr(unsafe.Pointer(in)) 185 | hash, found := sd.pointermap.Load(pointer) 186 | if !found { 187 | panic("dedup map is missing string to remove") 188 | } 189 | sd.pointermap.Delete(pointer) 190 | sd.hashmap.Delete(hash) 191 | atomic.AddInt64(&sd.stats.ItemsRemoved, 1) 192 | } 193 | } 194 | -------------------------------------------------------------------------------- /sd_benchmark_test.go: -------------------------------------------------------------------------------- 1 | package stringdedup 2 | 3 | import ( 4 | "math/rand" 5 | "runtime" 6 | "testing" 7 | "time" 8 | 9 | "github.com/OneOfOne/xxhash" 10 | ) 11 | 12 | // const letterBytes = "abcdef" 13 | 14 | const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" 15 | const ( 16 | letterIdxBits = 6 // 6 bits to represent a letter index 17 | letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits 18 | letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits 19 | ) 20 | 21 | var src = rand.NewSource(time.Now().UnixNano()) 22 | 23 | func RandomBytes(b []byte) { 24 | // A src.Int63() generates 63 random bits, enough for letterIdxMax characters! 25 | for i, cache, remain := len(b)-1, src.Int63(), letterIdxMax; i >= 0; { 26 | if remain == 0 { 27 | cache, remain = src.Int63(), letterIdxMax 28 | } 29 | if idx := int(cache & letterIdxMask); idx < len(letterBytes) { 30 | b[i] = letterBytes[idx] 31 | i-- 32 | } 33 | cache >>= letterIdxBits 34 | remain-- 35 | } 36 | } 37 | 38 | func generatestrings(totalstrings, stringlength int) []string { 39 | if totalstrings < 1 { 40 | totalstrings = 1 41 | } 42 | generated := make([]string, totalstrings) 43 | b := make([]byte, stringlength) 44 | for i := 0; i <
len(generated); i++ { 45 | RandomBytes(b) 46 | generated[i] = string(b) 47 | } 48 | return generated 49 | } 50 | 51 | var bs = make([]byte, 12) 52 | 53 | func BenchmarkGoRandom(b *testing.B) { 54 | var s = make([]string, b.N) 55 | for n := 0; n < b.N; n++ { 56 | RandomBytes(bs) 57 | s[n] = string(bs) 58 | } 59 | } 60 | 61 | func BenchmarkNSDRandom(b *testing.B) { 62 | sd := New(func(in []byte) uint32 { 63 | return xxhash.Checksum32(in) 64 | }) 65 | var s = make([]string, b.N) 66 | for n := 0; n < b.N; n++ { 67 | RandomBytes(bs) 68 | s[n] = sd.BS(bs) 69 | } 70 | runtime.KeepAlive(s) 71 | } 72 | 73 | func BenchmarkNSDRandomNoValidate(b *testing.B) { 74 | sd := New(func(in []byte) uint32 { 75 | return xxhash.Checksum32(in) 76 | }) 77 | sd.DontValidateResults = true 78 | var s = make([]string, b.N) 79 | for n := 0; n < b.N; n++ { 80 | RandomBytes(bs) 81 | s[n] = sd.BS(bs) 82 | } 83 | runtime.KeepAlive(s) 84 | } 85 | 86 | func BenchmarkNSD64Random(b *testing.B) { 87 | sd := New(func(in []byte) uint64 { 88 | return xxhash.Checksum64(in) 89 | }) 90 | var s = make([]string, b.N) 91 | for n := 0; n < b.N; n++ { 92 | RandomBytes(bs) 93 | s[n] = sd.BS(bs) 94 | } 95 | runtime.KeepAlive(s) 96 | } 97 | 98 | func BenchmarkNSD64RandomNoValidate(b *testing.B) { 99 | sd := New(func(in []byte) uint64 { 100 | return xxhash.Checksum64(in) 101 | }) 102 | sd.DontValidateResults = true 103 | var s = make([]string, b.N) 104 | for n := 0; n < b.N; n++ { 105 | RandomBytes(bs) 106 | s[n] = sd.BS(bs) 107 | } 108 | runtime.KeepAlive(s) 109 | } 110 | 111 | var somestring = "SomeStaticString" 112 | 113 | func BenchmarkGoPrecalculated(b *testing.B) { 114 | b.StopTimer() 115 | generated := generatestrings(b.N/10, 5) 116 | b.StartTimer() 117 | var s = make([]string, b.N) 118 | for n := 0; n < b.N; n++ { 119 | s[n] = generated[n%len(generated)] 120 | } 121 | runtime.KeepAlive(s) 122 | } 123 | 124 | func BenchmarkNSDPrecalculated(b *testing.B) { 125 | b.StopTimer() 126 | generated := 
generatestrings(b.N/10, 5) 127 | b.StartTimer() 128 | sd := New(func(in []byte) uint32 { 129 | return xxhash.Checksum32(in) 130 | }) 131 | var s = make([]string, b.N) 132 | for n := 0; n < b.N; n++ { 133 | s[n] = sd.S(generated[n%len(generated)]) 134 | } 135 | runtime.KeepAlive(s) 136 | } 137 | 138 | func BenchmarkNSDPrecalculatedNoValidate(b *testing.B) { 139 | b.StopTimer() 140 | generated := generatestrings(b.N/10, 5) 141 | b.StartTimer() 142 | sd := New(func(in []byte) uint32 { 143 | return xxhash.Checksum32(in) 144 | }) 145 | sd.DontValidateResults = true 146 | var s = make([]string, b.N) 147 | for n := 0; n < b.N; n++ { 148 | s[n] = sd.S(generated[n%len(generated)]) 149 | } 150 | runtime.KeepAlive(s) 151 | } 152 | 153 | func BenchmarkNSD64Precalculated(b *testing.B) { 154 | b.StopTimer() 155 | generated := generatestrings(b.N/10, 5) 156 | b.StartTimer() 157 | sd := New(func(in []byte) uint64 { 158 | return xxhash.Checksum64(in) 159 | }) 160 | var s = make([]string, b.N) 161 | for n := 0; n < b.N; n++ { 162 | s[n] = sd.S(generated[n%len(generated)]) 163 | } 164 | runtime.KeepAlive(s) 165 | } 166 | 167 | func BenchmarkNSD64PrecalculatedNoValidate(b *testing.B) { 168 | b.StopTimer() 169 | generated := generatestrings(b.N/10, 5) 170 | b.StartTimer() 171 | sd := New(func(in []byte) uint64 { 172 | return xxhash.Checksum64(in) 173 | }) 174 | sd.DontValidateResults = true 175 | var s = make([]string, b.N) 176 | for n := 0; n < b.N; n++ { 177 | s[n] = sd.S(generated[n%len(generated)]) 178 | } 179 | runtime.KeepAlive(s) 180 | } 181 | -------------------------------------------------------------------------------- /sd_test.go: -------------------------------------------------------------------------------- 1 | package stringdedup 2 | 3 | import ( 4 | "runtime" 5 | "testing" 6 | "time" 7 | 8 | "github.com/OneOfOne/xxhash" 9 | ) 10 | 11 | func TestBlankString(t *testing.T) { 12 | NS32 := New(func(in []byte) uint32 { 13 | ns32 := xxhash.New32() 14 | ns32.Write(in) 15 | 
return ns32.Sum32() 16 | }) 17 | if NS32.S("") != "" { 18 | t.Error("Blank string should return blank string (new 32-bit hash)") 19 | } 20 | NS64 := New(func(in []byte) uint64 { 21 | ns64 := xxhash.New64() 22 | ns64.Write(in) 23 | return ns64.Sum64() 24 | }) 25 | if NS64.S("") != "" { 26 | t.Error("Blank string should return blank string (new 64-bit hash)") 27 | } 28 | } 29 | 30 | func TestGC(t *testing.T) { 31 | ns := New(func(in []byte) uint32 { 32 | return xxhash.Checksum32(in) 33 | }) 34 | s := make([]string, 100000) 35 | for n := 0; n < len(s); n++ { 36 | RandomBytes(bs) 37 | s[n] = ns.BS(bs) 38 | if n%1000 == 0 { 39 | runtime.GC() 40 | } 41 | } 42 | lock.RLock() 43 | runtime.GC() 44 | time.Sleep(time.Millisecond * 100) // Let finalizers run 45 | t.Log("Items in cache:", ns.Size()) 46 | if ns.Size() == 0 { 47 | t.Fatal("Deduplication map is empty") 48 | } 49 | lock.RUnlock() 50 | s = make([]string, 0) // Clear our references 51 | runtime.KeepAlive(s) // oh shut up Go Vet 52 | runtime.GC() // Clean up 53 | time.Sleep(time.Millisecond * 100) // Let finalizers run 54 | runtime.GC() // Clean up 55 | lock.RLock() 56 | t.Log("Items in cache:", ns.Size()) 57 | if ns.Size() != 0 { 58 | t.Fatal("Deduplication map is not empty") 59 | } 60 | lock.RUnlock() 61 | } 62 | 63 | func TestNewGC(t *testing.T) { 64 | d := New(func(in []byte) uint32 { 65 | return xxhash.Checksum32(in) 66 | }) 67 | // d.KeepAlive = time.Millisecond * 500 68 | 69 | totalcount := 100000 70 | 71 | // Insert stuff 72 | o := make([]string, totalcount) 73 | s := make([]string, totalcount) 74 | for n := 0; n < len(s); n++ { 75 | RandomBytes(bs) 76 | o[n] = string(bs) 77 | s[n] = d.BS(bs) 78 | if n%1000 == 0 { 79 | runtime.GC() 80 | } 81 | } 82 | // Try to get GC to remove them from dedup object 83 | runtime.GC() 84 | time.Sleep(time.Millisecond * 500) // Let finalizers run 85 | 86 | items := d.Size() 87 | t.Log("Items in cache (expecting full):", items) 88 | if items < int64(totalcount/100*95) { 89 | 
t.Errorf("Deduplication map is not full - %v", items) 90 | } 91 | 92 | time.Sleep(time.Millisecond * 2000) // KeepAlive dies after 2 seconds, but map shouldn't be empty yet 93 | items = d.Size() 94 | t.Log("Items in cache (still expecting full):", items) 95 | if items < int64(totalcount/100*95) { 96 | t.Errorf("Deduplication map is not full still - %v", items) 97 | } 98 | 99 | // Clear references 100 | for n := 0; n < len(s); n++ { 101 | if o[n] != s[n] { 102 | t.Errorf("%v != %v", o[n], s[n]) 103 | } 104 | } 105 | runtime.KeepAlive(s) // Ensure runtime doesn't GC the dedup table, only needed if you don't do the above check 106 | 107 | s = make([]string, 0) // Clear our references 108 | runtime.GC() // Clean up 109 | time.Sleep(time.Millisecond * 1000) // Let finalizers run 110 | runtime.KeepAlive(s) 111 | 112 | items = d.Size() 113 | t.Log("Items in cache (expecting empty):", items) 114 | // if items > int64(totalcount/50) { 115 | // t.Errorf("Deduplication map is not empty - %v", d.Size()) 116 | // } 117 | 118 | stats := d.Statistics() 119 | t.Logf("Items added: %v", stats.ItemsAdded) 120 | t.Logf("Bytes in memory: %v", stats.BytesInMemory) 121 | t.Logf("Items saved: %v", stats.ItemsSaved) 122 | t.Logf("Bytes saved: %v", stats.BytesSaved) 123 | t.Logf("Items removed: %v", stats.ItemsRemoved) 124 | t.Logf("Collisions: %v - first at %v", stats.Collisions, stats.FirstCollisionDetected) 125 | t.Logf("Keepalive items added: %v - removed: %v", stats.KeepAliveItemsAdded, stats.KeepAliveItemsRemoved) 126 | 127 | t.Logf("timer: %v", d.keepaliveFlusher) 128 | } 129 | --------------------------------------------------------------------------------