├── .gitignore ├── LICENSE.md ├── README.md ├── untl_breaker.py └── dc_breaker.py /.gitignore: -------------------------------------------------------------------------------- 1 | *.py[cod] 2 | 3 | # C extensions 4 | *.so 5 | 6 | # Packages 7 | *.egg 8 | *.egg-info 9 | dist 10 | build 11 | eggs 12 | parts 13 | bin 14 | var 15 | sdist 16 | develop-eggs 17 | .installed.cfg 18 | lib 19 | lib64 20 | 21 | # Installer logs 22 | pip-log.txt 23 | 24 | # Unit test / coverage reports 25 | .coverage 26 | .tox 27 | nosetests.xml 28 | 29 | # Translations 30 | *.mo 31 | 32 | # Mr Developer 33 | .mr.developer.cfg 34 | .project 35 | .pydevproject 36 | -------------------------------------------------------------------------------- /LICENSE.md: -------------------------------------------------------------------------------- 1 | The MIT License (MIT) 2 | 3 | Copyright (c) 2013 Mark Phillips 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy of 6 | this software and associated documentation files (the "Software"), to deal in 7 | the Software without restriction, including without limitation the rights to 8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of 9 | the Software, and to permit persons to whom the Software is furnished to do so, 10 | subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS 17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR 18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER 19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN 20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
21 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | metadata breakers 2 | ================= 3 | 4 | Python scripts for "breaking" or atomizing OAI-PMH repositories into simpler text formats. 5 | 6 | These scripts were designed to use the output from [pyoaiharvester](https://github.com/vphill/pyoaiharvester). 7 | 8 | Basic Usage 9 | ----------- 10 | 11 | Using pyoaiharvester, grab some records you are interested in working with. 12 | 13 | ```bash 14 | python3 pyoaiharvest.py -l https://texashistory.unt.edu/explore/collections/ACUC/oai/ -o acuc.dc.xml 15 | ``` 16 | 17 | This will result in a repository xml file called `acuc.dc.xml` for the [ACUC collection](https://texashistory.unt.edu/explore/collections/ACUC/) on The Portal to Texas History. 18 | 19 | Next you can start to work with the _metadata breakers_. 20 | 21 | ```bash 22 | python3 dc_breaker.py ../pyoaiharvester/acuc.dc.xml 23 | 24 | 25 | {http://purl.org/dc/elements/1.1/}title: |=========================| 191/191 | 100.00% 26 | {http://purl.org/dc/elements/1.1/}creator: |=========================| 191/191 | 100.00% 27 | {http://purl.org/dc/elements/1.1/}contributor: | | 3/191 | 1.57% 28 | {http://purl.org/dc/elements/1.1/}publisher: |=========================| 191/191 | 100.00% 29 | {http://purl.org/dc/elements/1.1/}date: |=========================| 191/191 | 100.00% 30 | {http://purl.org/dc/elements/1.1/}language: |=========================| 191/191 | 100.00% 31 | {http://purl.org/dc/elements/1.1/}description: |=========================| 191/191 | 100.00% 32 | {http://purl.org/dc/elements/1.1/}subject: |=========================| 191/191 | 100.00% 33 | {http://purl.org/dc/elements/1.1/}coverage: |=========================| 191/191 | 100.00% 34 | {http://purl.org/dc/elements/1.1/}rights: |= | 10/191 | 5.24% 35 | {http://purl.org/dc/elements/1.1/}type: |=========================| 
191/191 | 100.00% 36 | {http://purl.org/dc/elements/1.1/}format: |=========================| 191/191 | 100.00% 37 | {http://purl.org/dc/elements/1.1/}identifier: |=========================| 191/191 | 100.00% 38 | 39 | 40 | dc_completeness 73.79 41 | collection_completeness 100.00 42 | wwww_completeness 100.00 43 | average_completeness 91.26 44 | ``` 45 | 46 | You can designate a specific Dublin Core field to list those elements only. 47 | 48 | ```bash 49 | python3 dc_breaker.py ../pyoaiharvester/acuc.dc.xml -e title 50 | 51 | Catalog of Abilene Christian College, 1906-1907 52 | The Childers Classical Institute, Abilene, Texas, Catalog 1906-1907 53 | Announcements 1907-1908 54 | Catalog of Abilene Christian College, 1910-1911 55 | Fifth Annual Catalogue, Abilene Christian College, Abilene, Texas, 1910-1911 56 | Announcement 1910-1911 57 | Catalog of Abilene Christian College, 1912-1913 58 | Seventh Annual Announcement, Abilene Christian College, Abilene, Texas, 1912-1913 59 | Announcement 1912-1913 60 | Catalog of Abilene Christian College, 1913-1914 61 | ``` 62 | 63 | You can prepend the identifier for the record to the line with the `-i` flag. 
64 | 65 | ```bash 66 | 67 | python3 dc_breaker.py ../pyoaiharvester/acuc.dc.xml -e title -i | head 68 | info:ark/67531/metapth45902 Catalog of Abilene Christian College, 1906-1907 69 | info:ark/67531/metapth45902 The Childers Classical Institute, Abilene, Texas, Catalog 1906-1907 70 | info:ark/67531/metapth45902 Announcements 1907-1908 71 | info:ark/67531/metapth45910 Catalog of Abilene Christian College, 1910-1911 72 | info:ark/67531/metapth45910 Fifth Annual Catalogue, Abilene Christian College, Abilene, Texas, 1910-1911 73 | info:ark/67531/metapth45910 Announcement 1910-1911 74 | info:ark/67531/metapth45909 Catalog of Abilene Christian College, 1912-1913 75 | info:ark/67531/metapth45909 Seventh Annual Announcement, Abilene Christian College, Abilene, Texas, 1912-1913 76 | info:ark/67531/metapth45909 Announcement 1912-1913 77 | info:ark/67531/metapth45908 Catalog of Abilene Christian College, 1913-1914 78 | ``` 79 | 80 | More examples and a full explanation of how you might use this tool as part of metadata analysis can be found in the article [Metadata Analysis at the Command-Line](https://journal.code4lib.org/articles/7818). 81 | -------------------------------------------------------------------------------- /untl_breaker.py: -------------------------------------------------------------------------------- 1 | """untl_breaker script for processing OAI-PMH 2.0 Repository XML Files""" 2 | 3 | import argparse 4 | import sys 5 | from xml.etree import ElementTree 6 | 7 | 8 | UNTL_NAMESPACE = "{http://digital2.library.unt.edu/untl/}" 9 | UNTL_NSMAP = {"untl": UNTL_NAMESPACE} 10 | 11 | NAME_FIELDS = ["creator", "contributor", "publisher"] 12 | 13 | 14 | class Record: 15 | """Base class for a UNTL metadata record in an OAI-PMH 16 | Repository file.""" 17 | 18 | def __init__(self, elem, options): 19 | self.elem = elem 20 | self.options = options 21 | 22 | def get_meta_id(self): 23 | """Returns record ARK identifier.""" 24 | metas = self.elem[1][0].findall(UNTL_NAMESPACE 
+ "meta") 25 | for meta in metas: 26 | if meta.get("qualifier") == "ark": 27 | meta_id = meta.text 28 | break 29 | 30 | return meta_id 31 | 32 | def get_record_status(self): 33 | """Returns record status which is either active or deleted""" 34 | return self.elem.find("header").get("status", "active") 35 | 36 | def get_elements(self): 37 | """Yields designated element instances from record.""" 38 | elements = self.elem[1][0].findall(UNTL_NAMESPACE + self.options.element) 39 | for element in elements: 40 | if element is not None: 41 | element_dict = {} 42 | # Name fields have an additional nesting we need to deal with. 43 | if self.options.element in NAME_FIELDS: 44 | name = element.findtext(UNTL_NAMESPACE + "name", "").strip() 45 | element_dict["value"] = name 46 | else: 47 | element_dict["value"] = (element.text or "").strip() 48 | element_dict["value"] = element_dict["value"].replace("\t", " ") 49 | element_dict["value"] = element_dict["value"].replace("\n", " ") 50 | element_dict["qualifier"] = element.get("qualifier", 'None') 51 | # If "value" is empty we want to skip the element. 52 | if not element_dict["value"]: 53 | continue 54 | # If we have asked for only a specific qualifier, yield only that. 55 | if self.options.qualifier: 56 | if self.options.qualifier == element_dict['qualifier']: 57 | yield element_dict 58 | # We didn't ask for a specific qualifier so yield all of them. 
59 | else: 60 | yield element_dict 61 | 62 | def get_all_data(self): 63 | """Yields a (tag, qualifier, value) tuple for each metadata element that has a value""" 64 | for element in self.elem[1][0]: 65 | text = '' 66 | if element.tag.replace(UNTL_NAMESPACE, '') in NAME_FIELDS: 67 | text = element.findtext(UNTL_NAMESPACE + "name", "").strip() 68 | else: 69 | text = (element.text or "").strip() 70 | if text: 71 | value = text.replace("\t", " ") 72 | value = value.replace("\n", " ") 73 | qualifier = element.get("qualifier", None) 74 | tag = element.tag 75 | yield (tag, qualifier, value) 76 | 77 | def has_element(self): 78 | """Returns True or False if a record has value in a selected metadata element""" 79 | has_elements = self.elem[1][0].findall(UNTL_NAMESPACE + self.options.element) 80 | for element in has_elements: 81 | if element.text: 82 | return True 83 | return False 84 | 85 | 86 | def main(): 87 | """Main file handling and option handling""" 88 | parser = argparse.ArgumentParser() 89 | parser.add_argument("-e", "--element", dest="element", default=None, 90 | help="element to print to screen", metavar="subject, creator") 91 | parser.add_argument("-q", "--qualifier", dest="qualifier", default=None, 92 | help="qualifier to limit to", metavar="KWD, officialtitle") 93 | parser.add_argument("-i", "--id", action="store_true", dest="id", default=False, 94 | help="prepend meta_id to line") 95 | parser.add_argument("-a", "--add-qualifier", action="store_true", dest="add_qualifier", 96 | help="prepend qualifier to line", default=False) 97 | parser.add_argument("-p", "--present", action="store_true", dest="present", default=False, 98 | help="print if there is value of defined element in record") 99 | parser.add_argument("-d", "--dump", action="store_true", dest="dump", default=False, 100 | help="Dump all record data to a tab delimited format") 101 | parser.add_argument("filename", type=str, 102 | help="OAI-PMH UNTL Repository File") 103 | 104 | args = parser.parse_args() 105 | 106 | if (args.element is None and 
args.dump is False): 107 | parser.print_help() 108 | sys.exit(1) 109 | 110 | for _event, elem in ElementTree.iterparse(args.filename): 111 | if elem.tag == "record": 112 | record = Record(elem, args) 113 | meta_id = record.get_meta_id() 114 | 115 | if args.dump is True and record.get_record_status() == "active": 116 | for field_data in record.get_all_data(): 117 | print(f"{meta_id}\t{field_data[0]}\t{field_data[1]}\t{field_data[2]}") 118 | elem.clear() 119 | continue 120 | 121 | # Present Section 122 | if args.present is True and record.get_record_status() == "active": 123 | print(f"{meta_id}\t{record.has_element()}") 124 | elem.clear() 125 | continue 126 | 127 | if record.get_elements(): 128 | for i in record.get_elements(): 129 | if args.id and args.add_qualifier: 130 | print(f"{meta_id}\t{i['qualifier']}\t{i['value']}") 131 | elif args.add_qualifier: 132 | print(f"{i['qualifier']}\t{i['value']}") 133 | elif args.id and args.add_qualifier is False: 134 | print(f"{meta_id}\t{i['value']}") 135 | else: 136 | print(i["value"]) 137 | elem.clear() 138 | 139 | if __name__ == "__main__": 140 | main() 141 | -------------------------------------------------------------------------------- /dc_breaker.py: -------------------------------------------------------------------------------- 1 | 2 | """dc_breaker script for processing OAI-PMH 2.0 Repository XML Files""" 3 | 4 | import argparse 5 | from xml.etree import ElementTree 6 | 7 | 8 | OAI_NAMESPACE = "{http://www.openarchives.org/OAI/2.0/oai_dc/}" 9 | DC_NAMESPACE = "{http://purl.org/dc/elements/1.1/}" 10 | 11 | METADATA_FIELD_ORDER = ["{http://purl.org/dc/elements/1.1/}title", 12 | "{http://purl.org/dc/elements/1.1/}creator", 13 | "{http://purl.org/dc/elements/1.1/}contributor", 14 | "{http://purl.org/dc/elements/1.1/}publisher", 15 | "{http://purl.org/dc/elements/1.1/}date", 16 | "{http://purl.org/dc/elements/1.1/}language", 17 | "{http://purl.org/dc/elements/1.1/}description", 18 | 
"{http://purl.org/dc/elements/1.1/}subject", 19 | "{http://purl.org/dc/elements/1.1/}coverage", 20 | "{http://purl.org/dc/elements/1.1/}source", 21 | "{http://purl.org/dc/elements/1.1/}relation", 22 | "{http://purl.org/dc/elements/1.1/}rights", 23 | "{http://purl.org/dc/elements/1.1/}type", 24 | "{http://purl.org/dc/elements/1.1/}format", 25 | "{http://purl.org/dc/elements/1.1/}identifier"] 26 | 27 | 28 | class RepoInvestigatorException(Exception): 29 | """This is our base exception for this script""" 30 | def __init__(self, value): 31 | self.value = value 32 | 33 | def __str__(self): 34 | return f"{self.value}" 35 | 36 | 37 | class Record: 38 | """Base class for a Dublin Core metadata record in an OAI-PMH 39 | Repository file.""" 40 | 41 | def __init__(self, elem, options): 42 | self.elem = elem 43 | self.options = options 44 | 45 | def get_record_id(self): 46 | """Returns record identifier or raises error if identifier is not present.""" 47 | try: 48 | record_id = self.elem.find("header/identifier").text 49 | return record_id 50 | except AttributeError as err: 51 | raise RepoInvestigatorException( 52 | "Record does not have a valid Record Identifier") from err 53 | 54 | def get_record_status(self): 55 | """Returns record status which is either active or deleted""" 56 | return self.elem.find("header").get("status", "active") 57 | 58 | def get_elements(self): 59 | """Returns a list of values for the selected metadata element""" 60 | out = [] 61 | elements = self.elem[1][0].findall(DC_NAMESPACE + self.options.element) 62 | for element in elements: 63 | if element.text: 64 | out.append(element.text.strip()) 65 | return out 66 | 67 | def get_all_data(self): 68 | """Returns a list of all metadata elements and values""" 69 | out = [] 70 | for i in self.elem[1][0]: 71 | if i.text: 72 | out.append((i.tag, i.text.strip().replace("\n", " "))) 73 | return out 74 | 75 | def get_stats(self): 76 | """Calculates counts for elements in record""" 77 | stats = {} 78 | for element 
in self.elem[1][0]: 79 | stats.setdefault(element.tag, 0) 80 | stats[element.tag] += 1 81 | return stats 82 | 83 | def has_element(self): 84 | """Returns True or False if a record has value in a selected metadata element""" 85 | has_elements = self.elem[1][0].findall(DC_NAMESPACE + self.options.element) 86 | for element in has_elements: 87 | if element.text: 88 | return True 89 | return False 90 | 91 | 92 | def collect_stats(stats_aggregate, stats): 93 | """Collect stats from entire repository""" 94 | # increment the record counter 95 | stats_aggregate["record_count"] += 1 96 | 97 | for field in stats: 98 | 99 | # get the total number of times a field occurs 100 | stats_aggregate["field_info"].setdefault(field, {"field_count": 0}) 101 | stats_aggregate["field_info"][field]["field_count"] += 1 102 | 103 | # get average of all fields 104 | stats_aggregate["field_info"][field].setdefault("field_count_total", 0) 105 | stats_aggregate["field_info"][field]["field_count_total"] += stats[field] 106 | 107 | 108 | def create_stats_averages(stats_aggregate): 109 | """Create repository averages for stats collected""" 110 | for field in stats_aggregate["field_info"]: 111 | field_count = stats_aggregate["field_info"][field]["field_count"] 112 | field_count_total = stats_aggregate["field_info"][field]["field_count_total"] 113 | 114 | field_count_total_avg = (float(field_count_total) / float(stats_aggregate["record_count"])) 115 | stats_aggregate["field_info"][field]["field_count_total_average"] = field_count_total_avg 116 | 117 | field_count_elem_avg = (float(field_count_total) / float(field_count)) 118 | stats_aggregate["field_info"][field]["field_count_element_average"] = field_count_elem_avg 119 | 120 | return stats_aggregate 121 | 122 | 123 | def calc_completeness(stats_averages): 124 | """Calculate completeness values for repository records""" 125 | completeness = {} 126 | record_count = stats_averages["record_count"] 127 | completeness_total = 0 128 | wwww_total = 0 129 | 
collection_total = 0 130 | collection_field_to_count = 0 131 | 132 | wwww = [ 133 | "{http://purl.org/dc/elements/1.1/}creator", # who 134 | "{http://purl.org/dc/elements/1.1/}title", # what 135 | "{http://purl.org/dc/elements/1.1/}identifier", # where 136 | "{http://purl.org/dc/elements/1.1/}date" # when 137 | ] 138 | 139 | for element in sorted(stats_averages["field_info"]): 140 | elem_comp_perc = 0 141 | elem_comp_perc = ((stats_averages["field_info"][element]["field_count"] / 142 | float(record_count)) * 100) 143 | completeness_total += elem_comp_perc 144 | 145 | # gather collection completeness 146 | if elem_comp_perc > 10: 147 | collection_total += elem_comp_perc 148 | collection_field_to_count += 1 149 | # gather wwww completeness 150 | if element in wwww: 151 | wwww_total += elem_comp_perc 152 | 153 | completeness["dc_completeness"] = completeness_total / float(15) 154 | completeness["collection_completeness"] = collection_total / float(collection_field_to_count) 155 | completeness["wwww_completeness"] = wwww_total / float(len(wwww)) 156 | completeness["average_completeness"] = ((completeness["dc_completeness"] + 157 | completeness["collection_completeness"] + 158 | completeness["wwww_completeness"]) / float(3)) 159 | return completeness 160 | 161 | 162 | def pretty_print_stats(stats_averages): 163 | """Generates a pretty table with results""" 164 | record_count = stats_averages["record_count"] 165 | # get header length 166 | element_length = 0 167 | for element in stats_averages["field_info"]: 168 | if element_length < len(element): 169 | element_length = len(element) 170 | 171 | print("\n") 172 | for element in METADATA_FIELD_ORDER: 173 | if stats_averages["field_info"].get(element): 174 | field_count = stats_averages["field_info"][element]["field_count"] 175 | perc = (field_count / float(record_count)) * 100 176 | perc_print = "=" * (int(perc) // 4) 177 | column_one = " " * (element_length - len(element)) + element 178 | field_count = 
stats_averages["field_info"][element]["field_count"] 179 | print(f"{column_one}: |{perc_print:25}| {field_count:6}/{record_count} | {perc:>6.2f}%") 180 | 181 | print("\n") 182 | completeness = calc_completeness(stats_averages) 183 | for comp_type in ["dc_completeness", 184 | "collection_completeness", 185 | "wwww_completeness", 186 | "average_completeness"]: 187 | print(f"{comp_type:>23} {completeness[comp_type]:10.2f}") 188 | 189 | 190 | def dump_record_values(record_id, record): 191 | """Iterates through all values in a record and yields a formatted string for each""" 192 | if record.get_record_status() == "active": 193 | record_fields = record.get_all_data() 194 | for field_data in record_fields: 195 | field_name = field_data[0] 196 | field_value = field_data[1].replace("\t", " ") 197 | yield f"{record_id}\t{field_name}\t{field_value}" 198 | 199 | 200 | def main(): 201 | """Main file handling and option handling""" 202 | stats_aggregate = { 203 | "record_count": 0, 204 | "field_info": {} 205 | } 206 | 207 | element_choices = ["title", "creator", "contributor", "publisher", "date", "language", 208 | "description", "subject", "coverage", "source", "relation", 209 | "rights", "type", "format", "identifier"] 210 | parser = argparse.ArgumentParser() 211 | parser.add_argument("-e", "--element", dest="element", default=None, 212 | help="element to print to screen", choices=element_choices) 213 | parser.add_argument("-i", "--id", action="store_true", dest="id", default=False, 214 | help="prepend meta_id to line") 215 | parser.add_argument("-s", "--stats", action="store_true", dest="stats", default=False, 216 | help="only print stats for repository") 217 | parser.add_argument("-p", "--present", action="store_true", dest="present", default=False, 218 | help="print if there is value of defined element in record") 219 | parser.add_argument("-d", "--dump", action="store_true", dest="dump", default=False, 220 | help="Dump all record data to a tab delimited format") 221 | 
parser.add_argument("filename", type=str, 222 | help="OAI-PMH Repository File") 223 | 224 | args = parser.parse_args() 225 | 226 | if args.element is None: 227 | args.stats = True 228 | 229 | record_count = 0 230 | for _event, elem in ElementTree.iterparse(args.filename): 231 | if elem.tag == "record": 232 | record = Record(elem, args) 233 | record_id = record.get_record_id() 234 | 235 | if args.dump is True: 236 | # Dumps all values for each record. 237 | for value in dump_record_values(record_id, record): 238 | print(value) 239 | elem.clear() 240 | continue # Skip stats building 241 | 242 | if args.stats is False and args.element and record.get_record_status() == "active": 243 | for i in record.get_elements(): 244 | out = [i] 245 | if args.id: 246 | out.insert(0, record_id) 247 | if args.present: 248 | out = [record_id, str(record.has_element())] 249 | print('\t'.join(out)) 250 | continue # Skip stats building 251 | 252 | if args.stats is True and record.get_record_status() == "active": 253 | if (record_count % 1000) == 0 and record_count != 0: 254 | print(f"{record_count} records processed") 255 | 256 | collect_stats(stats_aggregate, record.get_stats()) 257 | record_count += 1 258 | elem.clear() 259 | 260 | if args.stats is True and args.dump is False and args.element is None: 261 | stats_averages = create_stats_averages(stats_aggregate) 262 | pretty_print_stats(stats_averages) 263 | 264 | if __name__ == "__main__": 265 | main() 266 | --------------------------------------------------------------------------------
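The completeness figures in the README's stats example can be reproduced by hand from the logic in `calc_completeness`. The following is a minimal sketch, not part of the repository: the per-field counts are copied from the README's ACUC output, the field names are shortened to bare element names rather than the namespaced tags the script uses, and the helper name `completeness_sketch` and the zero-field guard are assumptions added for illustration.

```python
# Record counts per field, copied from the README's ACUC example (191 records).
RECORD_COUNT = 191
FIELD_COUNTS = {
    "title": 191, "creator": 191, "contributor": 3, "publisher": 191,
    "date": 191, "language": 191, "description": 191, "subject": 191,
    "coverage": 191, "rights": 10, "type": 191, "format": 191,
    "identifier": 191,
}

# The "who, what, where, when" fields used by dc_breaker.py.
WWWW = ("creator", "title", "identifier", "date")
# dc_breaker.py always divides by the 15 Dublin Core elements.
DC_ELEMENT_TOTAL = 15


def completeness_sketch(field_counts, record_count):
    """Hypothetical helper mirroring calc_completeness on plain field names."""
    # Percentage of records that contain each field at least once.
    percents = {name: count / record_count * 100
                for name, count in field_counts.items()}

    # Fields missing from the repository contribute 0 to the sum.
    dc = sum(percents.values()) / DC_ELEMENT_TOTAL

    # Only fields present in more than 10% of records count as "in use".
    in_use = [p for p in percents.values() if p > 10]
    # The guard against an empty list is an addition in this sketch.
    collection = sum(in_use) / len(in_use) if in_use else 0.0

    wwww = sum(percents.get(name, 0.0) for name in WWWW) / len(WWWW)

    return {
        "dc_completeness": dc,
        "collection_completeness": collection,
        "wwww_completeness": wwww,
        "average_completeness": (dc + collection + wwww) / 3,
    }


if __name__ == "__main__":
    for name, value in completeness_sketch(FIELD_COUNTS, RECORD_COUNT).items():
        print(f"{name:>23} {value:10.2f}")
```

With the README's counts this gives 73.79, 100.00, 100.00 and 91.26, matching the example output. `dc_completeness` stays below 100 because `source` and `relation` never occur and `contributor` and `rights` are rare, while `collection_completeness` ignores every field used in 10% or fewer of the records.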