├── NOTES.txt
├── README.md
└── parse_discogs_dump.py


/NOTES.txt:
--------------------------------------------------------------------------------
 1 | The older dump files were formatted inconveniently. Some were ISO-8859-1
 2 | encoded, rather than UTF-8, but they did not have an encoding declaration
 3 | at the top of the file. Sometimes they would contain characters invalid
 4 | in XML (e.g. control characters).
 5 | 
 6 | Worse, each file had multiple elements at the root level, rather than
 7 | having a single element wrapped around everything. This means the file was
 8 | what the XML spec calls a parsed general entity: an XML fragment which
 9 | could only be referenced from another document; it couldn't be parsed as a
10 | well-formed XML document, itself, unless we somehow wrapped it in an
11 | enclosing 'document element', also sometimes called a 'root element'.
12 | 
13 | As of 2017, it seems these issues have all been resolved. Nevertheless,
14 | I originally wrote the script to handle the old rootless format, so I am
15 | leaving that code in to demonstrate the technique for wrapping a stream
16 | in a 'dummy' element.
17 | 
18 | There are different options for doing the wrapping:
19 | 
20 | A. Create a new file with '<dummy>' + the decompressed dump file + '</dummy>'.
21 | The output could be gzipped. This is feasible but isn't very elegant. We'd
22 | also probably want to make sure it hasn't already been done, which could get
23 | tricky.
24 | 
25 | B. Decompress the dump file and reference it as an external entity from
26 | within a separate XML document that looks like the following:
27 | 
28 | <?xml version="1.0"?>
29 | <!DOCTYPE dummy [<!ENTITY data SYSTEM "discogs_foo_releases.xml">]>
30 | <dummy>&data;</dummy>
31 | 
32 | This is risky because the decompressed data exceeds 2 GB (a common upper limit
33 | when reading files), and because a non-validating XML parser isn't required to
34 | read external entities at all.
35 | 
36 | C. Wrap the decompressed stream in an object which adds '<dummy>' & '</dummy>'
37 | to the stream. This is probably the ideal solution, and is what I implemented,
38 | with help from Jeremy Kloth.
39 | 
40 | Another issue is that cElementTree builds a tree the whole time it is parsing.
41 | You can call the clear() method on elements to release memory, but this only
42 | removes the element's attributes and fully parsed content. It doesn't remove
43 | the element itself; an empty element remains attached to its parent.
44 | 
45 | As explained at <http://effbot.org/elementtree/iterparse.htm>, you can call
46 | clear() on the root element (which takes some extra effort to obtain), but
47 | this only clears the fully-read children; it doesn't help if the elements you
48 | need to clear are not fully read yet, as happens when we use the dummy wrapper.
49 | Sample code for lxml's ElementTree to deal with this problem can be found at
50 | <https://www.ibm.com/developerworks/xml/library/x-hiperfparse/>, but it relies
51 | on lxml's extensions to the API; you can't use it with regular ElementTree.
52 | 
53 | As I posted at <https://stackoverflow.com/a/44509632/1362109>, you can avoid
54 | clear() altogether and instead keep a stack of elements seen, based on start
55 | tags. When pushing an element of particular interest onto the stack, increment
56 | a counter. At each end tag, pop the current element off the stack, and if the
57 | end tag is for the element of interest, decrement the counter. Now the last
58 | element in the stack is the parent of the current element. If the counter is
59 | zero, you know you're not in an element of interest, so it is safe to call
60 | parent.remove(elem) to clear and discard the current element.
61 | 
62 | This technique thus allows a tree to be built for each element of interest,
63 | such as a 'release' element. Immediately before discarding it, you can use the
64 | usual ElementTree API on it, such as elem.findall('.//track/title').
65 | 
66 | - Mike J. Brown <mike -at- skew.org>
67 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Discogs-dump-parser
 2 | ## A lightweight parser of Discogs data dumps
 3 | 
 4 | This software is released under a Creative Commons "CC0" Public Domain Dedication; see https://creativecommons.org/publicdomain/zero/1.0/ for more info.
 5 | 
 6 | Discogs artist, label, release, and master release data is publicly available in huge XML files produced monthly and made available at https://data.discogs.com/ for anyone to download.
 7 | 
 8 | This Python script parses the Discogs release data using the ultrafast [cElementTree](https://docs.python.org/2/library/xml.etree.elementtree.html) API which comes with Python 2.5 and up. The script automatically handles compressed and uncompressed data, and the old style of dump which had no root element.
 9 | 
10 | ### Demo
11 | 
12 | To try it out, get one of the release data dump files and just run the script, passing the dump file path as the only argument. For example, if the script and the dump file named discogs_20191101_releases.xml.gz are in the same directory:
13 | 
14 |     python parse_discogs_dump.py discogs_20191101_releases.xml.gz
15 | 
16 | By default, a dot will be printed to the screen for every 1000 'release' elements read. At the end it tells you how much time it took. If you interrupt it, it tells you the last release ID it saw. The parsed data is mostly ignored; the idea is just to successfully read the XML, building a temporary tree for each 'release' element.
17 | 
18 | As of 2018, on my 3.1 GHz Intel Core i5-2400 system (using only 1 of 4 cores), it takes 61 minutes to plow through the 6.0 GB gzipped release data XML and print the dots, yet it only ever uses about 17 MB of memory. It could run faster if a temporary tree were not built and discarded for each 'release', but I feel it is a better benchmark this way.
19 | 
20 | If you uncomment one line of code near the end, then instead of a dot, you can get a complete XML fragment for every 1000th release element read.
21 | 
22 | ### Customization
23 | 
24 | There is no need to modify parse_discogs_dump.py directly. You can write your own code to handle each 'release' element, which will be an instance of [`ElementTree.Element`](https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects). Your code just needs to do the following:
25 | 
26 | 1. Import `ElementProcessor` and `process_dump_file()` from parse_discogs_dump.py.
27 | 2. Define a subclass of `ElementProcessor` with, at a minimum, a `process()` method to handle each 'release' element in whatever way you want. If the subclass defines its own `__init__()`, it should also set `interesting_element_name` (for example, to `'release'`), because the parser reads that attribute.
28 | 3. Pass the dump file path and an instance of your subclass to `process_dump_file()`.
29 | 
30 | For example: [find_invalid_release_dates.py](https://pastebin.com/Acutu7xE) is a script which does exactly those things. It can be run like this:
31 | 
32 |     python find_invalid_release_dates.py discogs_20191101_releases.xml.gz > report.txt
33 | 
34 | Every time it finds a non-empty release date which does not match the patterns `####` or `####-##-##` with a non-zero month value, it will print a dot to the screen, and the output file report.txt will get a line like this:
35 | 
36 |     https://www.discogs.com/release/41748 - release date is "?"
37 | 
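The three customization steps, applied to a date check like the one just described, might look like the sketch below. It is not the code of find_invalid_release_dates.py: the class name `ReleaseDateChecker`, the regular expression, and the assumption that the date is the text of a `released` child element are all illustrative guesses. The import is guarded only so that the class can be tried on a hand-built element when parse_discogs_dump.py is not importable.

```python
import re
import sys

try:
    from parse_discogs_dump import ElementProcessor, process_dump_file
except ImportError:
    # Fallback so the class below can still be tried on a hand-built
    # element when parse_discogs_dump.py is not on the import path.
    ElementProcessor, process_dump_file = object, None

# Assumed reading of the rule: "YYYY", or "YYYY-MM-DD" with MM other than "00".
VALID_DATE = re.compile(r'(?:[0-9]{4}|[0-9]{4}-(?!00)[0-9]{2}-[0-9]{2})\Z')


class ReleaseDateChecker(ElementProcessor):
    """Hypothetical subclass: report release dates that match neither pattern."""

    def __init__(self):
        self.counter = 0
        self.interesting_element_name = 'release'  # read by the parser
        self.suspects = []

    def process(self, elem):
        self.counter += 1
        # Assumption: the date is the text of a 'released' child element.
        date = elem.findtext('released')
        if date and VALID_DATE.match(date) is None:
            self.suspects.append((elem.get('id'), date))
            print('https://www.discogs.com/release/%s - release date is "%s"'
                  % (elem.get('id'), date))


if __name__ == '__main__' and process_dump_file and len(sys.argv) > 1:
    process_dump_file(sys.argv[1], ReleaseDateChecker())
```

Saved next to parse_discogs_dump.py, a script like this would be run the same way as the demo, with the dump file path as its only argument.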
38 | ### Error handling
39 | 
40 | Some Discogs dump files contain errors in the XML. If you get an error message about the XML not being well-formed, you will have to fix the dump file. For example, you might need to remove the control characters which are forbidden in XML:
41 | 
42 |     gzcat discogs_20080309_releases.xml.gz | tr -d '\1\2\3\4\5\6\7\10\13\14\16\17\20\21\22\23\24\25\26\27\30\31\32\33\34\35\36\37\177\200\201\202\203\204\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237' | gzip -9 > discogs_20080309_releases.fixed.xml.gz
43 | 
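Where `gzcat` and `tr` are not available, the same byte filter could be sketched in Python. The function name `strip_forbidden_bytes` and the chunk size are made up for illustration; the deleted bytes are the ones listed in the `tr` command above, written in decimal. Like `tr`, this works on raw bytes, so it only suits single-byte encodings such as ISO-8859-1; in a UTF-8 file, bytes 128-159 also occur inside multi-byte characters.

```python
import gzip
import sys

# Bytes deleted by the tr command, in decimal:
# 1-8, 11-12, 14-31, 127-132 and 134-159.
FORBIDDEN = bytes(bytearray(
    list(range(1, 9)) + [11, 12] + list(range(14, 32))
    + list(range(127, 133)) + list(range(134, 160))
))


def strip_forbidden_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy a gzipped dump to a new gzipped file, dropping FORBIDDEN bytes."""
    src = gzip.open(src_path, 'rb')
    dst = gzip.open(dst_path, 'wb', compresslevel=9)
    try:
        # Read fixed-size chunks so the whole dump never sits in memory.
        for chunk in iter(lambda: src.read(chunk_size), b''):
            dst.write(chunk.translate(None, FORBIDDEN))
    finally:
        dst.close()
        src.close()


if __name__ == '__main__' and len(sys.argv) == 3:
    strip_forbidden_bytes(sys.argv[1], sys.argv[2])
```

Because the filter is applied byte by byte, chunk boundaries do not affect the result.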
44 | ### Contact
45 | 
46 | I am user [mjb](https://www.discogs.com/user/mjb) on Discogs. Feel free to contact me there via private message, or in the API forum.
47 | 


--------------------------------------------------------------------------------
/parse_discogs_dump.py:
--------------------------------------------------------------------------------
  1 | """
  2 |   parse_discogs_dump.py
  3 |   A lightweight parser of Discogs data dumps
  4 |   
  5 |   Version: 2018-09-30
  6 |   Original author: Mike J. Brown <mike@skew.org>
  7 |   License: CC0 <creativecommons.org/publicdomain/zero/1.0/>
  8 |   Requires: Python 2.6 or higher
  9 | """
 10 | 
 11 | # for Python 2 compatibility
 12 | from __future__ import print_function
 13 | 
 14 | import gzip
 15 | from io import BytesIO
 16 | from sys import stdout, stderr, exc_info
 17 | 
 18 | try:
 19 | 	from xml.etree import cElementTree as ET
 20 | except ImportError:
 21 | 	from xml.etree import ElementTree as ET
 22 | 	print('cElementTree unavailable; using regular ElementTree instead. Expect slowness.', file=stderr)
 23 | 
 24 | 
 25 | class GeneralEntityStreamWrapper(object):
 26 | 	"""
 27 | 	A wrapper for an XML general parsed entity (an XML fragment which may or
 28 | 	may not have a root/document element).
 29 | 	
 30 | 	Initialize it with the entity stream. It will act as if the stream is wrapped
 31 | 	in an element named 'dummy'. It only supports read() and close() operations.
 32 | 	"""
 33 | 	_streams = None
 34 | 	_current_stream = None
 35 | 
 36 | 	def _prepare_next_stream(self):
 37 | 		self._current_stream = self._streams.pop()
 38 | 		self._read = self._current_stream.read
 39 | 		self._close = self._current_stream.close
 40 | 
 41 | 	def __init__(self, file_stream):
 42 | 		self._streams = [BytesIO(b'</dummy>'), file_stream, BytesIO(b'<dummy>')]
 43 | 		self._prepare_next_stream()
 44 | 
 45 | 	def read(self, size=-1):
 46 | 		if self._current_stream:
 47 | 			data = self._read(size)
 48 | 			if data:
 49 | 				return data
 50 | 			else:
 51 | 				try:
 52 | 					self._prepare_next_stream()
 53 | 				except IndexError:
 54 | 					return b''  # all streams exhausted; return bytes like the wrapped streams do
 55 | 				return self._read(size)
 56 | 		else:
 57 | 			return b''
 58 | 
 59 | 	def close(self):
 60 | 		self._close()
 61 | 
 62 | 
 63 | def get_dump_file_stream(filepath):
 64 | 	"""
 65 | 	Open the dump file, decompressing it on the fly if necessary.
 66 | 	"""
 67 | 	if filepath.endswith('.xml'):
 68 | 		return open(filepath, 'rb')
 69 | 	elif filepath.endswith('.xml.gz'):
 70 | 		return gzip.open(filepath)
 71 | 	else:
 72 | 		raise ValueError('Unknown extension on dump file path ' + filepath)
 73 | 
 74 | 
 75 | def read_via_etree(stream, element_processor):
 76 | 	"""
 77 | 	Parse a release dump XML stream incrementally, fully removing elements
 78 | 	when they are completely read, unless they are descendants of an element
 79 | 	with the given name, in which case they are processed by the process()
 80 | 	method of the given ElementProcessor (or subclass) instance.
 81 | 	"""
 82 | 	element_stack = []
 83 | 	interesting_element_name = element_processor.interesting_element_name or 'release'
 84 | 	interesting_element_depth = 0
 85 | 	item_id = None
 86 | 	try:
 87 | 		context = ET.iterparse(stream, events=('start', 'end'))
 88 | 		for event, elem in context:
 89 | 			if event == 'start':
 90 | 				element_stack.append(elem)
 91 | 				if elem.tag == interesting_element_name:
 92 | 					interesting_element_depth += 1
 93 | 			elif event == 'end':
 94 | 				element_stack.pop()
 95 | 				if elem.tag == interesting_element_name:
 96 | 					interesting_element_depth -= 1
 97 | 					element_processor.process(elem)
 98 | 				if element_stack and not interesting_element_depth:
 99 | 					element_stack[-1].remove(elem)
100 | 		del context
101 | 	except:
102 | 		if hasattr(element_processor, 'handle_interruption'):
103 | 			element_processor.handle_interruption(exc_info()[0])
104 | 		else:
105 | 			print('\nInterrupted.', file=stderr)
106 | 			raise
107 | 
108 | 
109 | class ElementProcessor:
110 | 	"""
111 | 	An object which processes ElementTree elements. Examples of subclasses follow.
112 | 	"""
113 | 	def __init__(self):
114 | 		self.counter = 0
115 | 		self.interesting_element_name = '' # subclasses should define this
116 | 
117 | 	def process(self, elem):
118 | 		"""
119 | 		Do something with a parsed element.
120 | 		This example just increments a counter of elements processed.
121 | 		Subclasses should provide their own version of this method to do more.
122 | 		"""
123 | 		self.counter += 1
124 | 
125 | 	def handle_interruption(self, e):
126 | 		"""
127 | 		If parsing is interrupted, this method will be called to handle it.
128 | 		This example prints the element count.
129 | 		"""
130 | 		print('\nInterrupted after %d %ss.' % (self.counter, self.interesting_element_name), file=stderr)
131 | 		raise
132 | 
133 | 
134 | class ReleaseElementCounter(ElementProcessor):
135 | 	"""
136 | 	An example of an object which processes release elements:
137 | 	Print a dot for every nth release (default n=1000).
138 | 	If interrupted, print the parsed element count and last processed element ID.
139 | 	"""
140 | 	def __init__(self, n=1000):
141 | 		self.counter = 0
142 | 		self.interval = n
143 | 		self.item_id = None
144 | 		self.interesting_element_name = 'release'
145 | 
146 | 	def process(self, elem):
147 | 		self.counter += 1
148 | 		self.item_id = elem.get('id')
149 | 		if self.counter % self.interval == 0:
150 | 			print('.', end='', file=stderr)
151 | 			stderr.flush()
152 | 
153 | 	def handle_interruption(self, e):
154 | 		print('\nInterrupted after %d %ss. Last %s id parsed: %s' % (self.counter, self.interesting_element_name, self.interesting_element_name, self.item_id), file=stderr)
155 | 		raise
156 | 
157 | 
158 | class ReleaseElementSerializer(ElementProcessor):
159 | 	"""
160 | 	Another example of an object which processes elements:
161 | 	Write an XML fragment to stdout for every nth release (default n=1000).
162 | 	If interrupted, do whatever the base class does.
163 | 	"""
164 | 	def __init__(self, n=1000):
165 | 		self.counter = 0
166 | 		self.interval = n
167 | 		self.interesting_element_name = 'release'
168 | 
169 | 	def process(self, elem):
170 | 		self.counter += 1
171 | 		if self.counter % self.interval == 0:
172 | 			tree = ET.ElementTree(elem)
173 | 			tree.write(stdout, encoding='windows-1252')
174 | 			stdout.flush()
175 | 			del tree
176 | 
177 | 
178 | def process_dump_file(dump_filepath, element_processor):
179 | 	"""
180 | 	Given an XML dump file path (as a string), convert it to a stream and
181 | 	pass it, along with an ElementProcessor (or subclass) instance, to
182 | 	read_via_etree().
183 | 	"""
184 | 	dump_file_stream = GeneralEntityStreamWrapper(get_dump_file_stream(dump_filepath))
185 | 	read_via_etree(dump_file_stream, element_processor)
186 | 	dump_file_stream.close()
187 | 
188 | 
189 | # when run from the command line, do this stuff
190 | if __name__ == "__main__":
191 | 	from sys import argv
192 | 	from time import time
193 | 	if len(argv) < 2:
194 | 		raise RuntimeError("A dump file path must be provided as the first argument.")
195 | 	# for this demo, process the XML by printing a dot for every nth release element.
196 | 	processor = ReleaseElementCounter()
197 | 	# uncomment the following if you want an XML fragment instead of a dot
198 | 	#processor = ReleaseElementSerializer()
199 | 	print('reading file:', file=stderr)
200 | 	stderr.flush()
201 | 	starttime = time()
202 | 	process_dump_file(argv[1], processor)
203 | 	endtime = time()
204 | 	print(' (total time: ', endtime - starttime, 's)', sep='', file=stderr)
205 | 


--------------------------------------------------------------------------------