├── NOTES.txt
├── README.md
└── parse_discogs_dump.py


/NOTES.txt:
--------------------------------------------------------------------------------
The older dump files were formatted inconveniently. Some were ISO-8859-1
encoded, rather than UTF-8, but they did not have an encoding declaration
at the top of the file. Sometimes they would contain characters invalid
in XML (e.g. control characters).

Worse, each file had multiple elements at the root level, rather than
having a single element wrapped around everything. This means the file was
what the XML spec calls a parsed general entity: an XML fragment which
could only be referenced from another document; it couldn't be parsed as a
well-formed XML document, itself, unless we somehow wrapped it in an
enclosing 'document element', also sometimes called a 'root element'.

As of 2017, it seems these issues have all been resolved. Nevertheless,
I originally wrote the script to handle the old rootless format, so I am
leaving that code in to demonstrate the technique for wrapping a stream
in a 'dummy' element.

There are different options for doing the wrapping:

A. Create a new file with '<dummy>' + the decompressed dump file + '</dummy>'.
The output could be gzipped. This is feasible but isn't very elegant. We'd
also probably want to make sure it hasn't already been done, which could get
tricky.

B. Decompress the dump file and reference it as an external entity from
within a separate XML document that looks like the following:

<!DOCTYPE dummy [
<!ENTITY data SYSTEM "...">
]>
<dummy>&data;</dummy>

This is risky because the decompressed data exceeds 2 GB (a common upper limit
when reading files), and because a non-validating XML parser isn't required to
read external entities at all.

C. Wrap the decompressed stream in an object which adds '<dummy>' & '</dummy>'
to the stream.
This is probably the ideal solution, and is what I implemented,
with help from Jeremy Kloth.

Another issue is that cElementTree builds a tree the whole time it is parsing.
You can call the clear() method on elements to release memory, but this only
removes the element's attributes and fully parsed content. It doesn't remove
the element itself; an empty element remains attached to its parent.

As explained at , you can call
clear() on the root element (which takes some extra effort to obtain), but
this only clears the fully-read children; it doesn't help if the elements you
need to clear are not fully read yet, as happens when we use the dummy wrapper.
Sample code for lxml's ElementTree to deal with this problem can be found at
, but it relies
on lxml's extensions to the API; you can't use it with regular ElementTree.

As I posted at , the alternative is to not use
clear() at all, but rather to keep a stack of elements seen, based on start
tags. When pushing an element of particular interest onto the stack, increment
a counter. At each end tag, pop the current element off the stack, and if the
end tag is for the element of interest, decrement the counter. Now the last
element in the stack is the parent of the current element. If the counter is
zero, you know you're not in an element of interest, so it is safe to call
parent.remove(elem) to clear and discard the current element.

This technique thus allows a tree to be built for each element of interest,
such as a 'release' element. Immediately before discarding it, you can use the
usual ElementTree API on it, such as elem.findall('.//track/title').

- Mike J. Brown

--------------------------------------------------------------------------------


/README.md:
--------------------------------------------------------------------------------
# Discogs-dump-parser
## A lightweight parser of Discogs data dumps

This software is released under a Creative Commons "CC0" Public Domain Dedication; see https://creativecommons.org/publicdomain/zero/1.0/ for more info.

Discogs artist, label, release, and master release data is publicly available in huge XML files produced monthly and made available at https://data.discogs.com/ for anyone to download.

This Python script parses the Discogs release data using the ultrafast [cElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html) API which comes with Python 2.5 and up. The script automatically handles compressed and uncompressed data, and the old style of dump which had no root element.

### Demo

To try it out, get one of the release data dump files and just run the script, passing the dump file path as the only argument. For example, if the script and the dump file named discogs_20191101_releases.xml.gz are in the same directory:

    python parse_discogs_dump.py discogs_20191101_releases.xml.gz

By default, a dot will be printed to the screen for every 1000 'release' elements read. At the end it tells you how much time it took. If you interrupt it, it tells you the last release ID it saw. The parsed data is mostly ignored; the idea is just to successfully read the XML, building a temporary tree for each 'release' element.

As of 2018, on my 3.1 GHz Intel Core i5-2400 system (using only 1 of 4 cores), it takes 61 minutes to plow through the 6.0 GB gzipped release data XML and print the dots, yet it only ever uses about 17 MB of memory. It could run faster if a temporary tree were not built and discarded for each 'release', but I feel it is a better benchmark this way.

If you uncomment one line of code near the end, then instead of a dot, you can get a complete XML fragment for every 1000th release element read.

### Customization

There is no need to modify parse_discogs_dump.py directly. You can write your own code to handle each 'release' element, which will be an instance of [`ElementTree.Element`](https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects). Your code just needs to do the following:

1. Import `ElementProcessor` and `process_dump_file()` from parse_discogs_dump.py.
2. Define a subclass of `ElementProcessor` with, at a minimum, a `process()` method to handle each 'release' element (which will be an instance of `ElementTree.Element`) in whatever way you want.
3. Pass the dump file path and an instance of your subclass to `process_dump_file()`.

For example: [find_invalid_release_dates.py](https://pastebin.com/Acutu7xE) is a script which does exactly those things. It can be run like this:

    python find_invalid_release_dates.py discogs_20191101_releases.xml.gz > report.txt

Every time it finds a non-empty release date which does not match the patterns `####` or `####-##-##` with a non-zero month value, it will print a dot to the screen, and the output file report.txt will get a line like this:

    https://www.discogs.com/release/41748 - release date is "?"

### Error handling

Some Discogs dump files contain errors in the XML. If you get an error message about the XML not being well-formed, you will have to fix the dump file.
For example, you might need to remove the control characters which are forbidden in XML:

    gzcat discogs_20080309_releases.xml.gz | tr -d '\1\2\3\4\5\6\7\10\13\14\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37\177\200\201\202\203\204\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237' | gzip -9 > discogs_20080309_releases.fixed.xml.gz

### Contact

I am user [mjb](https://www.discogs.com/user/mjb) on Discogs. Feel free to contact me there via private message, or in the API forum.

--------------------------------------------------------------------------------


/parse_discogs_dump.py:
--------------------------------------------------------------------------------
"""
parse_discogs_dump.py
A lightweight parser of Discogs data dumps

Version: 2018-09-30
Original author: Mike J. Brown
License: CC0
Requires: Python 2.6 or higher (for print_function and the io module)
"""

# for Python 2 compatibility
from __future__ import print_function

import gzip
from io import BytesIO
from sys import stdout, stderr, exc_info

try:
    from xml.etree import cElementTree as ET
except ImportError:
    from xml.etree import ElementTree as ET
    print('cElementTree unavailable; using regular ElementTree instead. Expect slowness.', file=stderr)


class GeneralEntityStreamWrapper(object):
    """
    A wrapper for an XML general parsed entity (an XML fragment which may or
    may not have a root/document element).

    Initialize it with the entity stream. It will act as if the stream is wrapped
    in an element named 'dummy'. It only supports read() and close() operations.
    """
    _streams = None
    _current_stream = None

    def _prepare_next_stream(self):
        # pop() takes from the end of the list, so the streams are stored
        # in the reverse of the order in which they are read
        self._current_stream = self._streams.pop()
        self._read = self._current_stream.read
        self._close = self._current_stream.close

    def __init__(self, file_stream):
        self._file_stream = file_stream
        self._streams = [BytesIO(b'</dummy>'), file_stream, BytesIO(b'<dummy>')]
        self._prepare_next_stream()

    def read(self, size=-1):
        if self._current_stream:
            data = self._read(size)
            if data:
                return data
            else:
                # current stream is exhausted; move on to the next one, if any
                try:
                    self._prepare_next_stream()
                except IndexError:
                    return b''
                return self._read(size)
        else:
            return b''

    def close(self):
        # by the time close() is called, the current stream is usually the
        # in-memory end tag, so close the wrapped file stream explicitly too
        self._close()
        self._file_stream.close()


def get_dump_file_stream(filepath):
    """
    Open the dump file, decompressing it on the fly if necessary.
    """
    if filepath.endswith('.xml'):
        return open(filepath, 'rb')
    elif filepath.endswith('.xml.gz'):
        return gzip.open(filepath)
    else:
        raise ValueError('Unknown extension on dump file path ' + filepath)


def read_via_etree(stream, element_processor):
    """
    Parse a release dump XML stream incrementally, fully removing elements
    when they are completely read, unless they are descendants of an element
    of interest. Each element of interest, once completely read, is passed to
    the process() method of the given ElementProcessor (or subclass) instance.
    """
    element_stack = []
    interesting_element_name = element_processor.interesting_element_name or 'release'
    interesting_element_depth = 0
    try:
        context = ET.iterparse(stream, events=('start', 'end'))
        for event, elem in context:
            if event == 'start':
                element_stack.append(elem)
                if elem.tag == interesting_element_name:
                    interesting_element_depth += 1
            elif event == 'end':
                element_stack.pop()
                if elem.tag == interesting_element_name:
                    interesting_element_depth -= 1
                    element_processor.process(elem)
                # the top of the stack is now the parent of elem
                if element_stack and not interesting_element_depth:
                    element_stack[-1].remove(elem)
        del context
    except:
        # deliberately catches everything, including KeyboardInterrupt
        if hasattr(element_processor, 'handle_interruption'):
            element_processor.handle_interruption(exc_info()[0])
        else:
            print('\nInterrupted.', file=stderr)
        raise


class ElementProcessor:
    """
    An object which processes ElementTree elements. Examples of subclasses follow.
    """
    def __init__(self):
        self.counter = 0
        self.interesting_element_name = ''  # subclasses should define this

    def process(self, elem):
        """
        Do something with a parsed element.
        This example just increments a counter of elements processed.
        Subclasses should provide their own version of this method to do more.
        """
        self.counter += 1

    def handle_interruption(self, e):
        """
        If parsing is interrupted, this method will be called to handle it.
        This example prints the element count.
        """
        print('\nInterrupted after %d %ss.' % (self.counter, self.interesting_element_name), file=stderr)
        raise


class ReleaseElementCounter(ElementProcessor):
    """
    An example of an object which processes release elements:
    Print a dot for every nth release (default n=1000).
    If interrupted, print the parsed element count and last processed element ID.
    """
    def __init__(self, n=1000):
        self.counter = 0
        self.interval = n
        self.item_id = None
        self.interesting_element_name = 'release'

    def process(self, elem):
        self.counter += 1
        self.item_id = elem.get('id')
        if self.counter % self.interval == 0:
            print('.', end='', file=stderr)
            stderr.flush()

    def handle_interruption(self, e):
        print('\nInterrupted after %d %ss. Last %s id parsed: %s' % (self.counter, self.interesting_element_name, self.interesting_element_name, self.item_id), file=stderr)
        raise


class ReleaseElementSerializer(ElementProcessor):
    """
    Another example of an object which processes elements:
    Write an XML fragment to stdout for every nth release (default n=1000).
    If interrupted, do whatever the base class does.
    """
    def __init__(self, n=1000):
        self.counter = 0
        self.interval = n
        self.interesting_element_name = 'release'

    def process(self, elem):
        self.counter += 1
        if self.counter % self.interval == 0:
            tree = ET.ElementTree(elem)
            # the output is encoded bytes; on Python 3, sys.stdout only accepts
            # text, so write to its underlying binary buffer when there is one
            out = getattr(stdout, 'buffer', stdout)
            tree.write(out, encoding='windows-1252')
            out.flush()
            del tree


def process_dump_file(dump_filepath, element_processor):
    """
    Given an XML dump file path (as a string), convert it to a stream and
    pass it, along with an ElementProcessor (or subclass) instance, to
    read_via_etree().
    """
    dump_file_stream = GeneralEntityStreamWrapper(get_dump_file_stream(dump_filepath))
    read_via_etree(dump_file_stream, element_processor)
    dump_file_stream.close()


# when run from the command line, do this stuff
if __name__ == "__main__":
    from sys import argv
    from time import time
    if len(argv) < 2:
        raise RuntimeError("A dump file path must be provided as the first argument.")
    # for this demo, process the XML by printing a dot for every nth release element.
    processor = ReleaseElementCounter()
    # uncomment the following if you want an XML fragment instead of a dot
    #processor = ReleaseElementSerializer()
    print('reading file:', file=stderr)
    stderr.flush()
    starttime = time()
    process_dump_file(argv[1], processor)
    endtime = time()
    print(' (total time: ', endtime - starttime, 's)', sep='', file=stderr)

--------------------------------------------------------------------------------
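
The two techniques described in NOTES.txt (wrapping a rootless stream in a 'dummy' element, and pruning the tree with a stack of open elements instead of clear()) can be tried on a small scale without a real dump file. The following is only a minimal, self-contained sketch and is not part of the repository: `DummyRootWrapper`, `collect_titles()` and the sample XML are hypothetical names and made-up data, and the element layout (`release`/`tracklist`/`track`/`title`) is assumed from the `elem.findall('.//track/title')` example in the notes.

```python
from io import BytesIO
from xml.etree import ElementTree as ET


class DummyRootWrapper(object):
    """Hypothetical stand-in for GeneralEntityStreamWrapper: serves
    b'<dummy>', then the wrapped fragment, then b'</dummy>'."""

    def __init__(self, fragment_stream):
        self._parts = [BytesIO(b'<dummy>'), fragment_stream, BytesIO(b'</dummy>')]

    def read(self, size=-1):
        # Skip exhausted parts until one yields data or none remain.
        while self._parts:
            data = self._parts[0].read(size)
            if data:
                return data
            self._parts.pop(0)
        return b''


def collect_titles(stream, tag='release'):
    """Return the track titles of each `tag` element, discarding each
    element from its parent as soon as it has been handled."""
    stack, depth, results = [], 0, []
    for event, elem in ET.iterparse(stream, events=('start', 'end')):
        if event == 'start':
            stack.append(elem)
            if elem.tag == tag:
                depth += 1
        else:
            stack.pop()  # the top of the stack is now elem's parent
            if elem.tag == tag:
                depth -= 1
                results.append([t.text for t in elem.findall('.//track/title')])
            if stack and not depth:
                stack[-1].remove(elem)
    return results


# Made-up sample: two root-level elements, as in the old rootless dumps.
SAMPLE = (b'<release id="1"><tracklist><track><title>A</title></track>'
          b'<track><title>B</title></track></tracklist></release>'
          b'<release id="2"><tracklist><track><title>C</title></track>'
          b'</tracklist></release>')

titles = collect_titles(DummyRootWrapper(BytesIO(SAMPLE)))
print(titles)  # → [['A', 'B'], ['C']]
```

Unlike the wrapper in parse_discogs_dump.py, this sketch reads its parts in list order and loops past empty ones, which is a design choice made here for brevity rather than a description of the original class.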