├── .gitignore ├── README.md ├── external_id_search └── script.py ├── m0_preprocessing └── convert_sid_to_jpg.py ├── m1_geotiff └── convert_image_to_geotiff.py ├── m2_detection_recognition └── crop_img.py ├── m3_image_geojson └── stitch_output.py ├── m4_geocoordinate_converter └── convert_geojson_to_geocoord.py ├── m5_entity_linker └── entity_linker.py ├── m6_post_ocr └── lexical_search.py ├── m_sanborn ├── s1_geocoding.py ├── s2_clustering.py └── s3_gen_geojson.py ├── metadata ├── davidrumsey │ ├── davidrumsey.py │ └── davidrumsey_metadata.csv └── sanborn.py ├── model_card_template ├── pipe_run.sh ├── pipe_run_img.sh ├── requirements.txt ├── run.py ├── run_img.py ├── run_leeje.py ├── run_only_eval.py ├── run_sanborn.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | data0/ 3 | data1/ 4 | rumsey_output/ 5 | .idea/ 6 | .env 7 | MrSID* 8 | __pycache__ 9 | debug/ 10 | .ipynb_checkpoints/ 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Table of Contents 5 | - [Dataset Card](#dataset-card) 6 | - [Dataset Description](#dataset-description) 7 | - [Dataset Download Link](#dataset-download-link) 8 | - [Dataset Languages](#dataset-languages) 9 | - [Dataset Structure](#dataset-structure) 10 | - [Data Fields](#data-fields) 11 | - [Model Card](#model-card) 12 | - [Model Description](#model-description) 13 | - [Model Summary](#model-summary) 14 | - [Model Tags](#model-tags) 15 | - [Model Input and Output](#model-input-and-output) 16 | - [Additional Information](#additional-information) 17 | - [Licensing Information](#licensing-information) 18 | - [Contributions](#contributions) 19 | 20 | 21 | # Dataset Card 22 | 23 | ## Dataset Description 24 | 25 | Map text recognized from the georeferenced Rumsey historical map collection. 26 | 27 | ### Dataset Download Link 28 | 29 | - **Original Map Images:** https://www.davidrumsey.com/ 30 | - **Processed Output:** https://s3.msi.umn.edu/rumsey_output/geojson_testr_syn_54119.zip 31 | 32 | ### Dataset Languages 33 | 34 | English 35 | 36 | ### Language Creators: 37 | 38 | Machine-generated 39 | 40 | ## Dataset Structure 41 | 42 | ### Data Fields 43 | 44 | 45 | 46 | ### Output File Name 47 | 48 | Output geojson file is named after the external ID of origina map image. 49 | 50 | 51 | 52 | 53 | 54 | # Model Card 55 | 56 | ## Model Description 57 | 58 | A **fully automatic** pipeline to process a large amount of scanned historical map images. **Outputs** include the recognized text labels, label bounding polygons, labels after post-OCR correction and geo-entity identifier in OSM database. 59 | 60 | ### Model Summary 61 | 62 | - **Orange boxes:** Modules in the pipeline 63 | - **Blue boxes:** Inputs of the modules 64 | - **Green boxes:** Outputs of the modules 65 | 66 | image 67 | 68 | ### Model Details 69 | - **ImageCropping** module divides huge map images (>10K pixels) to smaller image patches (1K pixels) so that TextSpotter could process. 70 | 71 | - **PatchTextSpotter** uses a state-of-the-art network architecture [TESTR](https://github.com/mlpc-ucsd/TESTR) for detecting and recognizing text labels on image patches. Due to the lack of annotated samples for training, we create a set of synthetic maps to mimic the text styles (e.g., font, spacing, orientation) in the real historical maps. 
We place the location names from OpenStreetMap on a map by considering the shape of the location geometry and merge the text with various background styles extracted from the Rumsey collection maps. We train the model with these unlimited synthetic maps and apply the model to the historical maps. 72 | 73 | - **PatchtoMapMerging** is the module to merge the patch-level spotting results into map-level. 74 | 75 | - **GeocoordinateConverter** converts the text label bounding polygons from image coordinates system to geocoordinates system. Note: polygons in both coordinate systems are saved in the output. 76 | 77 | - **PostOCR** helps to verify the output and correct misspelled words from PatchTextSpotter using the OpenStreetMap dictionary. PostOCR module finds words' candidates using [fuzzy query function](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html) from elasticsearch, which contains the place name attribute from the Openstreetmap dictionary. Once PostOCR module identifies words' candidates, the module picks one candidate by the word popularity from the dictionary. 78 | 79 | - **EntityLinker** links each map text to the candidate geo-entities in the OpenStreetMap. The entity linking retrieves the candidates that satisfy two criteria: 1) the recognized text from the text spotter contains the geo-entities' name 2) the geocoordinates of detected bounding polygons intersect with the geo-entities' geometry. (Geo-coordinates are obtained from GeocoordConverter) 80 | 81 | 82 | ### How To Use 83 | All the modules can be launched from `run.py`. All the outputs will be saved in the `expt_name` subfolder in `output_folder` specified in the input arguments. 84 | 85 | ``` 86 | usage: run.py [-h] [--map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR] [--text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR] 87 | [--sample_map_csv_path SAMPLE_MAP_CSV_PATH] [--output_folder OUTPUT_FOLDER] [--expt_name EXPT_NAME] [--module_get_dimension] 88 | [--module_gen_geotiff] [--module_cropping] [--module_text_spotting] [--module_img_geojson] [--module_geocoord_geojson] [--module_entity_linking] 89 | [--module_post_ocr] [--spotter_model {abcnet,testr}] [--spotter_config SPOTTER_CONFIG] [--spotter_expt_name SPOTTER_EXPT_NAME] [--print_command] 90 | 91 | optional arguments: 92 | -h, --help show this help message and exit 93 | --map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR 94 | --text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR 95 | --sample_map_csv_path SAMPLE_MAP_CSV_PATH 96 | --output_folder OUTPUT_FOLDER 97 | --expt_name EXPT_NAME 98 | --module_get_dimension 99 | --module_gen_geotiff 100 | --module_cropping 101 | --module_text_spotting 102 | --module_img_geojson 103 | --module_geocoord_geojson 104 | --module_entity_linking 105 | --module_post_ocr 106 | --spotter_model {abcnet,testr} 107 | Select text spotting model option from ["abcnet","testr"] 108 | --spotter_config SPOTTER_CONFIG 109 | Path to the config file for text spotting model 110 | --spotter_expt_name SPOTTER_EXPT_NAME 111 | Name of spotter experiment, if empty using config file name 112 | --print_command 113 | ``` 114 | 115 | ### Model Tags 116 | - Text spotting 117 | - Entity Linking 118 | - Historical maps 119 | 120 | 121 | # Additional Information 122 | 123 | ### Licensing Information 124 | 125 | MIT License 126 | 127 | ### Contribution and Acknowledgement 128 | 129 | Thanks to [@zekun-li](https://zekun-li.github.io/),[@Jina-Kim](https://github.com/Jina-Kim), [@MinNamgung](https://github.com/MinNamgung) and 
[@linyijun](https://github.com/linyijun) for adding this dataset and models. 130 | 131 | Thanks to [TESTR](https://github.com/mlpc-ucsd/TESTR) for an open-source text spotting model. 132 | -------------------------------------------------------------------------------- /external_id_search/script.py: -------------------------------------------------------------------------------- 1 | from elasticsearch_dsl import Search, Q 2 | from elasticsearch import Elasticsearch, helpers 3 | from elasticsearch import RequestsHttpConnection 4 | import argparse 5 | import os 6 | import glob 7 | import json 8 | import nltk 9 | import logging 10 | from dotenv import load_dotenv 11 | 12 | import pandas as pd 13 | import numpy as np 14 | import logging 15 | import re 16 | import warnings 17 | warnings.filterwarnings("ignore") 18 | 19 | 20 | 21 | def db_connect(): 22 | """Elasticsearch Connection on Sansa""" 23 | load_dotenv() 24 | 25 | DB_HOST = os.getenv("DB_HOST") 26 | USER_NAME = os.getenv("DB_USERNAME") 27 | PASSWORD = os.getenv("DB_PASSWORD") 28 | 29 | es = Elasticsearch([DB_HOST], connection_class=RequestsHttpConnection, http_auth=(USER_NAME, PASSWORD), verify_certs=False) 30 | return es 31 | 32 | 33 | def query(target): 34 | es = db_connect() 35 | inputs = target.upper() 36 | query = {"query": {"match": {"text": f"{inputs}"}}} 37 | test = es.search(index="meta", body=query, size=10000)["hits"]["hits"] 38 | 39 | id_list = [] 40 | if len(test) != 0 : 41 | for i in range(len(test)): 42 | map_id = test[i]['_source']['external_id'] 43 | id_list.append(map_id) 44 | 45 | 46 | result = sorted(list(set(id_list))) 47 | return result 48 | 49 | 50 | def main(args): 51 | keyword = args.target 52 | metadata_path = args.metadata 53 | meta_df = pd.read_csv(metadata_path) 54 | meta_df['tmp'] = meta_df['image_no'].str.split(".").str[0] 55 | 56 | results = query(keyword) 57 | # print(f' "{keyword}" exist in: {results}') 58 | 59 | tmp_df = meta_df[meta_df.tmp.isin(results)] 60 | 61 | print(f'"{keyword}" exist in:') 62 | for index, row in tmp_df.iterrows(): 63 | print(f'{row.tmp} \t {row.title}') 64 | 65 | 66 | if __name__ == '__main__': 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--target', type=str, default='east', help='') 69 | parser.add_argument('--metadata', type=str, default='/home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv', help='') 70 | 71 | args = parser.parse_args() 72 | print(args) 73 | 74 | main(args) 75 | -------------------------------------------------------------------------------- /m0_preprocessing/convert_sid_to_jpg.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import time 4 | import multiprocessing 5 | 6 | sid_dir = '/data/rumsey-sid' 7 | sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 8 | num_process = 20 9 | if_print_command = True 10 | 11 | sid_list = glob.glob(os.path.join(sid_dir, '*/*.sid')) 12 | 13 | def execute_command(command, if_print_command): 14 | t1 = time.time() 15 | 16 | if if_print_command: 17 | print(command) 18 | os.system(command) 19 | 20 | t2 = time.time() 21 | time_usage = t2 - t1 22 | return time_usage 23 | 24 | 25 | def conversion(img_path): 26 | mrsiddecode_executable="/home/zekun/dr_maps/mapkurator-system/m1_geotiff/MrSID_DSDK-9.5.4.4709-rhel6.x86-64.gcc531/Raster_DSDK/bin/mrsiddecode" 27 | map_name = os.path.basename(img_path)[:-4] 28 | 29 | redirected_path = os.path.join(sid_to_jpg_dir, map_name + '.jpg') 30 | 31 | run_sid_to_jpg_command = mrsiddecode_executable + ' 
-quiet -i '+ img_path + ' -o '+redirected_path 32 | time_usage = execute_command(run_sid_to_jpg_command, if_print_command) 33 | 34 | 35 | 36 | if __name__ == "__main__": 37 | pool = multiprocessing.Pool(num_process) 38 | start_time = time.perf_counter() 39 | processes = [pool.apply_async(conversion, args=(sid_path,)) for sid_path in sid_list] 40 | result = [p.get() for p in processes] 41 | finish_time = time.perf_counter() 42 | print(f"Program finished in {finish_time-start_time} seconds") 43 | 44 | -------------------------------------------------------------------------------- /m1_geotiff/convert_image_to_geotiff.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import pandas as pd 4 | import ast 5 | import argparse 6 | import logging 7 | import pdb 8 | 9 | logging.basicConfig(level=logging.INFO) 10 | 11 | def func_file_to_fullpath_dict(file_path_list): 12 | 13 | file_fullpath_dict = dict() 14 | for file_path in file_path_list: 15 | file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path 16 | 17 | return file_fullpath_dict 18 | 19 | def main(args): 20 | 21 | jp2_root_dir = args.jp2_root_dir 22 | sid_root_dir = args.sid_root_dir 23 | additional_root_dir = args.additional_root_dir 24 | out_geotiff_dir = args.out_geotiff_dir 25 | 26 | sample_map_path = args.sample_map_path 27 | external_id_key = args.external_id_key 28 | 29 | jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2')) 30 | sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly 31 | add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*')) 32 | 33 | jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list) 34 | sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list) 35 | add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list) 36 | 37 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 38 | 39 | 40 | for index, record in sample_map_df.iterrows(): 41 | external_id = record.external_id 42 | transform_method = record.transformation_method 43 | gcps = record.gcps 44 | filename_without_extension = external_id.strip("'").replace('.','') 45 | 46 | full_path = '' 47 | if filename_without_extension in jp2_file_fullpath_dict: 48 | full_path = jp2_file_fullpath_dict[filename_without_extension] 49 | elif filename_without_extension in sid_file_fullpath_dict: 50 | full_path = sid_file_fullpath_dict[filename_without_extension] 51 | elif filename_without_extension in add_file_fullpath_dict: 52 | full_path = add_file_fullpath_dict[filename_without_extension] 53 | else: 54 | print('image with external_id not found in image_dir:', external_id) 55 | continue 56 | assert (len(full_path)!=0) 57 | 58 | gcps = ast.literal_eval(gcps) 59 | 60 | gcp_str = '' 61 | for gcp in gcps: 62 | lng, lat = gcp['location'] 63 | x, y = gcp['pixel'] 64 | gcp_str += '-gcp '+str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' ' 65 | 66 | # gdal_translate to add GCP to raw image 67 | gdal_command = 'gdal_translate -of Gtiff '+gcp_str + full_path + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' 68 | print(gdal_command) 69 | os.system(gdal_command) 70 | 71 | 72 | assert transform_method in ['affine','polynomial','tps'] 73 | 74 | # reprojection with gdal_warp 75 | if transform_method == 'affine': 76 | # first order 77 | 78 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 1 -of GTiff ' + 
os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 79 | 80 | elif transform_method == 'polynomial': 81 | # second order 82 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 2 -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 83 | 84 | elif transform_method == 'tps': 85 | # Thin plate spline #debug/11558008.geotiff #10057000.geotiff 86 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -tps -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 87 | 88 | else: 89 | raise NotImplementedError 90 | print(warp_command) 91 | os.system(warp_command) 92 | # remove temporary tiff file 93 | # os.system('rm ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff') 94 | 95 | 96 | logging.info('Done generating geotiff for %s', external_id) 97 | 98 | 99 | if __name__ == '__main__': 100 | 101 | parser = argparse.ArgumentParser() 102 | parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/', 103 | help='image dir of jp2 files.') 104 | parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/', 105 | help='image dir of sid files.') 106 | parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/', 107 | help='image dir of additional luna files.') 108 | parser.add_argument('--out_geotiff_dir', type=str, default='data/geotiff/', 109 | help='output dir for geotiff') 110 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 111 | help='path to sample map csv, which contains gcps info') 112 | parser.add_argument('--external_id_key', type=str, default='external_id', 113 | help='key string for external id, could be external_id or ListNo') 114 | 115 | args = parser.parse_args() 116 | print(args) 117 | 118 | 119 | main(args) 120 | -------------------------------------------------------------------------------- /m2_detection_recognition/crop_img.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from PIL import Image, ImageFile 4 | import numpy as np 5 | import argparse 6 | import logging 7 | 8 | logging.basicConfig(level=logging.INFO) 9 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 10 | 11 | #add this one line and import ImageFile above 12 | ImageFile.LOAD_TRUNCATED_IMAGES = True 13 | 14 | def main(args): 15 | 16 | img_path = args.img_path 17 | output_dir = args.output_dir 18 | 19 | map_name = os.path.basename(img_path).split('.')[0] # get the map name without extension 20 | output_dir = os.path.join(output_dir, map_name) 21 | 22 | if not os.path.isdir(output_dir): 23 | os.makedirs(output_dir) 24 | 25 | map_img = Image.open(img_path) 26 | width, height = map_img.size 27 | 28 | #print(width, height) 29 | 30 | shift_size = 1000 31 | 32 | # pad the image to the size divisible by shift-size 33 | num_tiles_w = int(np.ceil(1. * width / shift_size)) 34 | num_tiles_h = int(np.ceil(1. 
* height / shift_size)) 35 | enlarged_width = int(shift_size * num_tiles_w) 36 | enlarged_height = int(shift_size * num_tiles_h) 37 | 38 | enlarged_map = Image.new(mode="RGB", size=(enlarged_width, enlarged_height)) 39 | # paste map_imge to enlarged_map 40 | enlarged_map.paste(map_img) 41 | 42 | for idx in range(0, num_tiles_h): 43 | for jdx in range(0, num_tiles_w): 44 | img_clip = enlarged_map.crop((jdx * shift_size, idx * shift_size,(jdx + 1) * shift_size, (idx + 1) * shift_size, )) 45 | 46 | out_path = os.path.join(output_dir, 'h' + str(idx) + '_w' + str(jdx) + '.jpg') 47 | img_clip.save(out_path) 48 | 49 | logging.info('Done cropping %s' %img_path ) 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | parser = argparse.ArgumentParser() 55 | parser.add_argument('--img_path', type=str, default='../data/100_maps/8628000.jp2', 56 | help='path to image file.') 57 | parser.add_argument('--output_dir', type=str, default='../data/100_maps_crop/', 58 | help='path to output dir') 59 | 60 | args = parser.parse_args() 61 | print(args) 62 | 63 | 64 | # if not os.path.isdir(args.output_dir): 65 | # os.makedirs(args.output_dir) 66 | # print('created dir',args.output_dir) 67 | 68 | main(args) 69 | -------------------------------------------------------------------------------- /m3_image_geojson/stitch_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | from geojson import Polygon, Feature, FeatureCollection, dump 7 | import logging 8 | import pdb 9 | 10 | logging.basicConfig(level=logging.INFO) 11 | pd.options.mode.chained_assignment = None 12 | 13 | def concatenate_and_convert_to_geojson(args): 14 | map_subdir = args.input_dir 15 | output_geojson = args.output_geojson 16 | shift_size = args.shift_size 17 | eval_bool = args.eval_only 18 | 19 | file_list = glob.glob(map_subdir + '/*.json') 20 | file_list = sorted(file_list) 21 | if len(file_list) == 0: 22 | logging.warning('No files found for %s' % map_subdir) 23 | 24 | map_data = [] 25 | for file_path in file_list: 26 | patch_index_h, patch_index_w = os.path.basename(file_path).split('.')[0].split('_') 27 | patch_index_h = int(patch_index_h[1:]) 28 | patch_index_w = int(patch_index_w[1:]) 29 | try: 30 | df = pd.read_json(file_path) 31 | except pd.errors.EmptyDataError: 32 | logging.warning('%s is empty. Skipping.' % file_path) 33 | 34 | 35 | for index, line_data in df.iterrows(): 36 | df['polygon_x'][index] = np.array(df['polygon_x'][index]) + shift_size * patch_index_w 37 | df['polygon_y'][index] = np.array(df['polygon_y'][index]) + shift_size * patch_index_h 38 | map_data.append(df) 39 | 40 | map_df = pd.concat(map_data) 41 | 42 | features = [] 43 | for index, line_data in map_df.iterrows(): 44 | polygon_x, polygon_y = list(line_data['polygon_x']), list(line_data['polygon_y']) 45 | 46 | if eval_bool == False: 47 | # y is kept to be positive. 
Needs to be negative for QGIS visualization 48 | polygon = Polygon([[[x,-y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], -polygon_y[0]]]]) 49 | else: 50 | polygon = Polygon([[[x,y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], polygon_y[0]]]]) 51 | 52 | text = line_data['text'] 53 | score = line_data['score'] 54 | features.append(Feature(geometry = polygon, properties={"text": text, "score": score} )) 55 | 56 | feature_collection = FeatureCollection(features) 57 | # with open(os.path.join(output_dir, map_subdir +'.geojson'), 'w') as f: 58 | # dump(feature_collection, f) 59 | with open(output_geojson, 'w') as f: 60 | dump(feature_collection, f) 61 | 62 | logging.info('Done generating geojson (img coord) for %s', map_subdir) 63 | 64 | 65 | if __name__ == '__main__': 66 | 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--input_dir', type=str, default='data/100_maps_crop_abc/0063014', 69 | help='path to input json path.') 70 | 71 | parser.add_argument('--output_geojson', type=str, default='data/100_maps_geojson_abc/0063014.geojson', 72 | help='path to output geojson path') 73 | 74 | parser.add_argument('--shift_size', type=int, default = 1000, 75 | help='image patch size and shift size.') 76 | 77 | # This can not be of string type. Otherwise it will be interpreted to True all the time. 78 | parser.add_argument('--eval_only', default = False, action='store_true', 79 | help='keep positive coordinate') 80 | 81 | args = parser.parse_args() 82 | print(args) 83 | 84 | concatenate_and_convert_to_geojson(args) 85 | 86 | 87 | 88 | 89 | 90 | -------------------------------------------------------------------------------- /m4_geocoordinate_converter/convert_geojson_to_geocoord.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import logging 4 | import ast 5 | 6 | import pandas as pd 7 | import numpy as np 8 | import geojson 9 | 10 | logging.basicConfig(level=logging.INFO) 11 | 12 | 13 | def main(args): 14 | geojson_file = args.in_geojson_file 15 | output_dir = args.out_geojson_dir 16 | 17 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str}) 18 | sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '', regex=True) 19 | geojson_filename_id = geojson_file.split(".")[0].split("/")[-1] 20 | 21 | row = sample_map_df[sample_map_df['external_id'] == geojson_filename_id] 22 | if not row.empty: 23 | gcps = ast.literal_eval(row.iloc[0]['gcps']) 24 | gcp_str = '' 25 | for gcp in gcps: 26 | lng, lat = gcp['location'] 27 | x, y = gcp['pixel'] 28 | gcp_str += '-gcp ' + str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' ' 29 | 30 | transform_method = row.iloc[0]['transformation_method'] 31 | assert transform_method in ['affine', 'polynomial', 'tps'] 32 | 33 | output = '"' + output_dir + geojson_filename_id + '.geojson"' 34 | input = '"' + geojson_file + '"' 35 | 36 | if transform_method == 'affine': 37 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 1 ' + gcp_str 38 | 39 | elif transform_method == 'polynomial': 40 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 2 ' + gcp_str 41 | 42 | elif transform_method == 'tps': 43 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -tps ' + gcp_str 44 | 45 | else: 46 | raise NotImplementedError 47 | 48 | ret_value = os.system(gecoord_convert_command) 49 | if ret_value != 0: 50 | logging.info('Failed 
generating geocoord geojson for %s', geojson_file) 51 | else: 52 | with open(geojson_file) as img_geojson, open(output_dir + geojson_filename_id + '.geojson', 53 | 'r+') as geocoord_geojson: 54 | img_data = geojson.load(img_geojson) 55 | geocoord_data = geojson.load(geocoord_geojson) 56 | for img_feature, geocoord_feature in zip(img_data['features'], geocoord_data['features']): 57 | geocoord_feature['properties']['img_coordinates'] = np.array(img_feature['geometry']['coordinates'], 58 | dtype=np.int32).reshape(-1, 2).tolist() 59 | 60 | with open(output_dir + geojson_filename_id + '.geojson', 'w') as geocoord_geojson: 61 | geojson.dump(geocoord_data, geocoord_geojson) 62 | 63 | logging.info('Done generating geocoord geojson for %s', geojson_file) 64 | 65 | 66 | if __name__ == '__main__': 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 69 | help='path to sample map csv, which contains gcps info') 70 | parser.add_argument('--in_geojson_file', type=str, 71 | help='input geojson file; results of M2') 72 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/', 73 | help='output dir for converted geojson files') 74 | 75 | args = parser.parse_args() 76 | 77 | main(args) -------------------------------------------------------------------------------- /m5_entity_linker/entity_linker.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import ast 4 | from dotenv import load_dotenv 5 | 6 | import pandas as pd 7 | import numpy as np 8 | 9 | import geojson 10 | 11 | import sqlalchemy 12 | from sqlalchemy import create_engine 13 | 14 | import geocoder 15 | from shapely.geometry import Polygon 16 | 17 | import re 18 | 19 | load_dotenv() 20 | 21 | DB_HOST = os.getenv("DB_HOST") 22 | DB_USERNAME = os.getenv("DB_USERNAME") 23 | DB_PASSWORD = os.getenv("DB_PASSWORD") 24 | DB_NAME = os.getenv("DB_NAME") 25 | 26 | connection_string = f'postgresql://postgres:{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}:5432/{DB_NAME}' 27 | 28 | 29 | def main(args): 30 | 31 | # check if first pair of gcps is in midwest-US 32 | regex = re.compile('[^a-zA-Z]') 33 | conn = create_engine(connection_string, echo=False) 34 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str}) 35 | sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '') 36 | midwest = ["Illinois", "Missouri", "Kansas", "Iowa", "South Dakota", "Indiana", "Ohio", "Wisconsin", "Minnesota", "Michigan"] 37 | 38 | geojson_files = os.listdir(args.in_geojson_dir) 39 | for i, geojson_file in enumerate(geojson_files): 40 | row = sample_map_df[sample_map_df['external_id']==geojson_file.split(".")[0]] 41 | gcps = ast.literal_eval(row.iloc[0]['gcps']) 42 | geocode = geocoder.osm(gcps[0]['location'][::-1], method='reverse') 43 | 44 | if geocode.state in midwest: 45 | with open(args.in_geojson_dir+geojson_file) as f: 46 | data = geojson.load(f) 47 | for feature_data in data['features']: 48 | pts = np.array(feature_data['geometry']['coordinates']).reshape(-1, 2) 49 | map_polygon = Polygon(pts) 50 | map_text = str(feature_data['properties']['text']).lower() 51 | map_text = regex.sub(' ', map_text) # remove all non-alphabetic characters 52 | 53 | query = f"""SELECT p.ogc_fid 54 | FROM polygon_features p 55 | WHERE LOWER(p.name) LIKE '%%{map_text}%%' 56 | AND 
ST_INTERSECTS(ST_TRANSFORM(ST_SetSRID(ST_MakeValid('{map_polygon}'::geometry), 4326)::geometry, 4326), p.wkb_geometry); 57 | """ 58 | 59 | try: 60 | intersect_df = pd.read_sql(query, con=conn) 61 | except sqlalchemy.exc.InternalError: 62 | continue 63 | 64 | if not intersect_df.empty: 65 | feature_data['properties']['osm_ogc_fid'] = intersect_df['ogc_fid'].values.tolist() 66 | # else: 67 | # feature_data['properties']['osm_ogc_fid'] = [] 68 | 69 | with open(args.out_geojson_dir+geojson_file, 'w') as output_geojson: 70 | geojson.dump(data, output_geojson) 71 | 72 | 73 | if __name__ == '__main__': 74 | parser = argparse.ArgumentParser() 75 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 76 | help='path to sample map csv, which contains gcps info') 77 | parser.add_argument('--in_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/', 78 | help='input dir for results of M2') 79 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_linked/', 80 | help='output dir for converted geojson files') 81 | 82 | args = parser.parse_args() 83 | 84 | main(args) 85 | -------------------------------------------------------------------------------- /m6_post_ocr/lexical_search.py: -------------------------------------------------------------------------------- 1 | #-*-coding utf-8-*- 2 | import logging 3 | import requests 4 | import json 5 | import argparse 6 | import http.client as http_client 7 | import nltk 8 | import re 9 | import glob 10 | import os 11 | 12 | # set the debug level 13 | http_client.HTTPConnection.debuglevel = 1 14 | logging.basicConfig(level=logging.INFO) 15 | warnings.filterwarnings("ignore") 16 | 17 | headers = { 18 | 'Content-Type': 'application/json', 19 | } 20 | 21 | def query(args): 22 | """ Query candidates and save them as 'postocr_label' """ 23 | 24 | input_dir = args.in_geojson_dir 25 | output_geojson = args.out_geojson_dir 26 | 27 | map_name_output = input_dir.split('/')[-1] 28 | 29 | with open(input_dir) as json_file: 30 | json_df = json.load(json_file) 31 | 32 | if json_df != {}: 33 | query_result = [] 34 | for i in range(len(json_df["features"])): 35 | target_text = json_df['features'][i]["properties"]["text"] 36 | target_pts = json_df['features'][i]["geometry"]["coordinates"] 37 | 38 | clean_txt = [] 39 | if type(target_text) == str: 40 | for t in range(len(target_text)): 41 | txt = target_text[t] 42 | if txt.isalpha(): 43 | clean_txt.append(txt) 44 | 45 | temp_label = ''.join([str(item) for item in clean_txt]) 46 | if len(temp_label) != 0: 47 | target_text = temp_label 48 | 49 | process = re.findall('[A-Z][^A-Z]*', target_text) 50 | if all(c.isupper() for c in process) or len(process) == 1: 51 | 52 | if type(target_text) == str and any(c.isalpha() for c in target_text): 53 | # edist 0 54 | inputs = target_text.lower() 55 | q1 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "0"}}}}' 56 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 57 | data=q1.encode("utf-8"), \ 58 | headers = headers) 59 | resp_json = json.loads(resp.text) 60 | test = resp_json["hits"]["hits"] 61 | 62 | edist = [] 63 | edist_update = [] 64 | 65 | edd_min_find = 0 66 | min_candidates = False 67 | 68 | if test != 'NaN': 69 | for tt in range(len(test)): 70 | if 'name' in test[tt]['_source']: 71 | candidate = test[tt]['_source']['name'] 72 | edist.append(candidate) 73 | 74 | for e in range(len(edist)): 75 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 76 | 77 | 
if edd == 0: 78 | edist_update.append(edist[e]) 79 | min_candidates = edist[e] 80 | edd_min_find = 1 81 | 82 | # edd 1 83 | if edd_min_find != 1: 84 | # edist 1 85 | q2 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "1"}}}}' 86 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 87 | data=q2.encode("utf-8"), \ 88 | headers = headers) 89 | resp_json = json.loads(resp.text) 90 | test = resp_json["hits"]["hits"] 91 | 92 | edist = [] 93 | edist_count = [] 94 | edist_update = [] 95 | edist_count_update = [] 96 | 97 | if test != 'NaN': 98 | for tt in range(len(test)): 99 | if 'name' in test[tt]['_source']: 100 | candidate = test[tt]['_source']['message'] 101 | cand = candidate.split(',')[0] 102 | count = candidate.split(',')[1] 103 | edist.append(cand) 104 | edist_count.append(count) 105 | 106 | for e in range(len(edist)): 107 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 108 | 109 | if edd == 1: 110 | edist_update.append(edist[e]) 111 | edist_count_update.append(edist_count[e]) 112 | 113 | if len(edist_update) != 0: 114 | index = edist_count_update.index(max(edist_count_update)) 115 | min_candidates = edist_update[index] 116 | edd_min_find = 1 117 | 118 | # edd 2 119 | if edd_min_find != 1: 120 | # edist 2 121 | q3 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "2"}}}}' 122 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 123 | data=q3.encode("utf-8"), \ 124 | headers = headers) 125 | resp_json = json.loads(resp.text) 126 | test = resp_json["hits"]["hits"] 127 | 128 | edist = [] 129 | edist_count = [] 130 | edist_update = [] 131 | edist_count_update = [] 132 | 133 | if test != 'NaN': 134 | for tt in range(len(test)): 135 | if 'name' in test[tt]['_source']: 136 | candidate = test[tt]['_source']['message'] 137 | cand = candidate.split(',')[0] 138 | count = candidate.split(',')[1] 139 | edist.append(cand) 140 | edist_count.append(count) 141 | 142 | for e in range(len(edist)): 143 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 144 | 145 | if edd == 2: 146 | edist_update.append(edist[e]) 147 | edist_count_update.append(edist_count[e]) 148 | 149 | if len(edist_update) != 0: 150 | index = edist_count_update.index(max(edist_count_update)) 151 | min_candidates = edist_update[index] 152 | edd_min_find = 1 153 | 154 | if edd_min_find != 1: 155 | min_candidates = False 156 | 157 | 158 | if min_candidates != False: 159 | json_df['features'][i]["properties"]["postocr_label"] = str(min_candidates) 160 | else: 161 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 162 | 163 | else: # added 164 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 165 | 166 | else: 167 | # only numeric pred_text 168 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 169 | 170 | else: 171 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 172 | 173 | # Save 174 | with open(output_geojson, 'w') as json_file: 175 | json.dump(json_df, json_file, ensure_ascii=False) 176 | 177 | logging.info('Done generating post-OCR geojson for %s', map_name_output) 178 | 179 | 180 | def main(args): 181 | query(args) 182 | 183 | 184 | if __name__ == '__main__': 185 | 186 | 187 | parser = argparse.ArgumentParser() 188 | parser.add_argument('--in_geojson_dir', type=str, default='/data2/rumsey_output/test2/', 189 | help='input dir for post-OCR module (= the output of M4) /crop_MN/output_stitch/') 190 | parser.add_argument('--out_geojson_dir', 
type=str, default='/data2/rumsey_output/out/', 191 | help='post-OCR result') 192 | 193 | args = parser.parse_args() 194 | print(args) 195 | 196 | # if not os.path.isdir(args.out_geojson_dir): 197 | # os.makedirs(args.out_geojson_dir) 198 | # print('created dir',args.out_geojson_dir) 199 | 200 | main(args) -------------------------------------------------------------------------------- /m_sanborn/s1_geocoding.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import geojson 4 | import geocoder 5 | import json 6 | import time 7 | import pdb 8 | 9 | 10 | def arcgic_geocoding(place_name, maxRows = 5): 11 | try: 12 | response = geocoder.arcgis(place_name,maxRows=maxRows) 13 | return response.json 14 | except exception as e: 15 | print(e) 16 | return -1 17 | 18 | 19 | def google_geocoding(place_name, api_key = None, maxRows = 5): 20 | try: 21 | response = geocoder.google(place_name, key=api_key, maxRows = maxRows) 22 | return response.json 23 | except exception as e: 24 | print(e) 25 | return -1 26 | 27 | def osm_geocoding(place_name, maxRows = 5): 28 | try: 29 | response = geocoder.osm(place_name, maxRows = maxRows) 30 | return response.json 31 | except exception as e: 32 | print(e) 33 | return -1 34 | 35 | 36 | def geonames_geocoding(place_name, user_name = None, maxRows = 5): 37 | try: 38 | response = geocoder.geonames(place_name, key = user_name, maxRows=maxRows) 39 | # hourly limit of 1000 credits 40 | time.sleep(4) 41 | return response.json 42 | except exception as e: 43 | print(e) 44 | return -1 45 | 46 | 47 | def geocoding(args): 48 | output_folder = args.output_folder 49 | input_map_geojson_path = args.input_map_geojson_path 50 | api_key = args.api_key 51 | user_name = args.user_name 52 | geocoder_option = args.geocoder_option 53 | max_results = args.max_results 54 | suffix = args.suffix 55 | 56 | with open(input_map_geojson_path, 'r') as f: 57 | data = geojson.load(f) 58 | 59 | map_name = os.path.basename(input_map_geojson_path).split('.')[0] 60 | output_folder = os.path.join(output_folder, geocoder_option) 61 | 62 | if not os.path.isdir(output_folder): 63 | os.makedirs(output_folder) 64 | 65 | output_path = os.path.join(output_folder, map_name) + '.json' 66 | 67 | with open(output_path, 'w') as f: 68 | pass # flush output file 69 | 70 | features = data['features'] 71 | for feature in features: # iterate through all the detected text labels 72 | geometry = feature['geometry'] 73 | text = feature['properties']['text'] 74 | score = feature['properties']['score'] 75 | 76 | # suffix = ', Los Angeles' 77 | text = str(text) + suffix 78 | 79 | print(text) 80 | 81 | if geocoder_option == 'arcgis': 82 | results = arcgic_geocoding(text, maxRows = max_results) 83 | elif geocoder_option == 'google': 84 | results = google_geocoding(text, api_key = api_key, maxRows = max_results) 85 | elif geocoder_option == 'geonames': 86 | results = geonames_geocoding(text, user_name = user_name, maxRows = max_results) 87 | elif geocoder_option == 'osm': 88 | results = osm_geocoding(text, maxRows = max_results) 89 | else: 90 | raise NotImplementedError 91 | 92 | if results == -1: 93 | # geocoder can not find match 94 | pass 95 | else: 96 | # save results 97 | with open(output_path, 'a') as f: 98 | json.dump({'text':text, 'score':score, 'geometry': geometry, 'geocoding':results}, f) 99 | f.write('\n') 100 | 101 | # pdb.set_trace() 102 | 103 | 104 | def main(): 105 | parser = argparse.ArgumentParser() 106 | 107 | 
parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/') 108 | parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson') 109 | parser.add_argument('--api_key', type=str, default=None, help='Specify API key if needed') 110 | parser.add_argument('--user_name', type=str, default=None, help='Specify user name if needed') 111 | 112 | parser.add_argument('--suffix', type=str, default=None, help='placename suffix (e.g. city name)') 113 | 114 | parser.add_argument('--max_results', type=int, default=5, help='max number of results returend by geocoder') 115 | 116 | parser.add_argument('--geocoder_option', type=str, default='arcgis', 117 | choices=['arcgis', 'google','geonames','osm'], 118 | help='Select text spotting model option from ["arcgis","google","geonames","osm"]') # select text spotting model 119 | 120 | 121 | args = parser.parse_args() 122 | print('\n') 123 | print(args) 124 | print('\n') 125 | 126 | if not os.path.isdir(args.output_folder): 127 | os.makedirs(args.output_folder) 128 | 129 | geocoding(args) 130 | 131 | 132 | if __name__ == '__main__': 133 | 134 | main() 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /m_sanborn/s2_clustering.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import argparse 4 | from sklearn.cluster import DBSCAN 5 | from matplotlib import pyplot as plt 6 | import geopandas as gpd 7 | import pandas as pd 8 | from bs4 import BeautifulSoup 9 | from mpl_toolkits.basemap import Basemap 10 | from pyproj import Proj, transform 11 | 12 | from shapely.geometry import Point 13 | from shapely.geometry.polygon import Polygon 14 | import numpy as np 15 | from shapely.geometry import MultiPoint 16 | from geopy.distance import great_circle 17 | 18 | 19 | county_index_dict = {'Cuyahoga County (OH)': 193, 20 | 'Fulton County (GA)': 73, 21 | 'Kern County (CA)': 2872, 22 | 'Lancaster County (NE)': 1629, 23 | 'Los Angeles County (CA)': 44, 24 | 'Mexico': -1, 25 | 'Nevada County (CA)': 46, 26 | 'New Orleans (LA)': -1, 27 | 'Pima County (AZ)': 2797, 28 | 'Placer County (CA)': 1273, 29 | 'Providence County (RI)\xa0': 1124, 30 | 'Saint Louis (MO)': -1, 31 | 'San Francisco County (CA)': 1261, 32 | 'San Joaquin County (CA)': 1213, 33 | 'Santa Clara (CA)': 48, 34 | 'Santa Cruz (CA)': 2386, 35 | 'Suffolk County (MA)': 272, 36 | 'Tulsa County (OK)': 526, 37 | 'Washington County (AK)': -1, 38 | 'Washington DC': -1} 39 | 40 | def get_centermost_point(cluster): 41 | centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y) 42 | centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m) 43 | return tuple(centermost_point) 44 | 45 | def clustering_func(lat_list, lng_list): 46 | X = [[a,b] for a,b in zip(lat_list, lng_list)] 47 | coords = np.array(X) 48 | 49 | # https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/ 50 | kms_per_radian = 6371.0088 51 | epsilon = 1.5 / kms_per_radian 52 | db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords)) 53 | cluster_labels = db.labels_ 54 | num_clusters = len(set(cluster_labels)) 55 | clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)]) 56 | 57 | centermost_points = 
get_centermost_point(clusters[0]) 58 | return centermost_points 59 | 60 | def plot_points(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None): 61 | 62 | plt.figure(figsize=(10,6)) 63 | plt.title(title) 64 | 65 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5) 66 | if pred_lat is not None and pred_lng is not None: 67 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red') 68 | 69 | if target_lat_list is not None and target_lng_list is not None: 70 | plt.scatter(target_lng_list, target_lat_list, 10, c = 'blue') 71 | plt.show() 72 | 73 | def plot_points_basemap(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None): 74 | 75 | plt.figure(figsize=(10,6)) 76 | plt.title(title) 77 | 78 | if len(lat_list) >0 and len(lng_list) > 0: 79 | anchor_lat, anchor_lng = lat_list[0], lng_list[0] 80 | elif target_lat_list is not None: 81 | anchor_lat, anchor_lng = target_lat_list[0], target_lng_list[0] 82 | else: 83 | anchor_lat, anchor_lng = 45, -100 84 | 85 | m = Basemap(projection='lcc', resolution=None, 86 | width=8E4, height=8E4, 87 | lat_0=anchor_lat, lon_0=anchor_lng) 88 | m.etopo(scale=0.5, alpha=0.5) 89 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True) 90 | # m.arcgisimage(service='ESRI_Imagery_World_2D',scale=0.5, alpha=0.5) 91 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True) 92 | 93 | lng_list, lat_list = m(lng_list, lat_list) # transform coordinates 94 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5) 95 | 96 | 97 | if target_lat_list is not None and target_lng_list is not None: 98 | target_lng_list, target_lat_list = m(target_lng_list, target_lat_list) 99 | plt.scatter(target_lng_list, target_lat_list, marker='o', c = 'blue',edgecolor='blue') 100 | 101 | if pred_lat is not None and pred_lng is not None: 102 | pred_lng, pred_lat = m(pred_lng, pred_lat) 103 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red', edgecolor='black') 104 | 105 | plt.show() 106 | 107 | def plotting_func(loc_sanborn_dir, pred_dict, lat_lng_dict, dataset_name, geocoding_name): 108 | 109 | for map_name, pred in pred_dict.items(): 110 | 111 | title = dataset_name + '-' + geocoding_name + '-' + map_name 112 | lat_list = lat_lng_dict[map_name]['lat_list'] 113 | lng_list = lat_lng_dict[map_name]['lng_list'] 114 | 115 | if dataset_name == 'LoC_sanborn': 116 | xml_path = os.path.join(loc_sanborn_dir,map_name + '.tif.aux.xml') 117 | try: 118 | with open(xml_path) as fp: 119 | soup = BeautifulSoup(fp) 120 | 121 | target_gcp_list = soup.findAll("metadata")[1].targetgcps.findAll("double") 122 | except Exception as e: 123 | print(xml_path) 124 | continue 125 | 126 | xy_list = [] 127 | for target_gcp in target_gcp_list: 128 | xy_list.append(float(target_gcp.contents[0])) 129 | 130 | x_list = xy_list[0::2] 131 | y_list = xy_list[1::2] 132 | 133 | lng2_list, lat2_list = [],[] 134 | for x1,y1 in zip(x_list, y_list): 135 | x2,y2 = transform(inProj,outProj,x1,y1) 136 | #print (x2,y2) 137 | lng2_list.append(x2) 138 | lat2_list.append(y2) 139 | 140 | plot_points(lat_list, lng_list, lat2_list, lng2_list, pred_lat = pred[0], pred_lng = pred[1], title=title) 141 | else: 142 | plot_points(lat_list, lng_list,pred_lat = pred[0], pred_lng = pred[1], title=title) 143 | 144 | 145 | def clustering(args): 146 | dataset_name = args.dataset_name 147 | geocoding_name = args.geocoding_name 148 | remove_duplicate_location = 
args.remove_duplicate_location 149 | visualize = args.visualize 150 | 151 | sanborn_output_dir = '/data2/sanborn_maps_output' 152 | 153 | input_dir=os.path.join(sanborn_output_dir, dataset_name, 'geocoding_suffix_testr', geocoding_name) 154 | if remove_duplicate_location: 155 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr_removeduplicate', geocoding_name) 156 | else: 157 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr', geocoding_name) 158 | 159 | county_boundary_path = '/home/zekun/Sanborn/cb_2018_us_county_500k/cb_2018_us_county_500k.shp' 160 | 161 | if not os.path.isdir(output_dir): 162 | os.makedirs(output_dir) 163 | 164 | inProj = Proj(init='epsg:3857') 165 | outProj = Proj(init='epsg:4326') 166 | 167 | county_boundary_df = gpd.read_file(county_boundary_path) 168 | 169 | if dataset_name == 'LoC_sanborn': 170 | loc_sanborn_dir = '/data2/sanborn_maps/Sanborn100_Georef/' # for comparing with GT 171 | metadata_tsv_path = '/home/zekun/Sanborn/Sheet_List.tsv' 172 | meta_df = pd.read_csv(metadata_tsv_path, sep='\t') 173 | 174 | file_list = os.listdir(input_dir) 175 | 176 | pred_dict = dict() 177 | lat_lng_dict = dict() 178 | for file_path in file_list: 179 | 180 | map_name = os.path.basename(file_path).split('.')[0] 181 | if dataset_name == 'LoC_sanborn': 182 | county_name = meta_df[meta_df['filename'] == map_name]['County'].values[0] 183 | elif dataset_name == 'LA_sanborn' or 'two_more': 184 | county_name = 'Los Angeles County (CA)' 185 | else: 186 | raise NotImplementedError 187 | 188 | index = county_index_dict[county_name] 189 | if index >= 0: 190 | poly_geometry = county_boundary_df.iloc[index].geometry 191 | 192 | with open(os.path.join(input_dir,file_path), 'r') as f: 193 | data = f.readlines() 194 | 195 | lat_list = [] 196 | lng_list = [] 197 | for line in data: 198 | 199 | line_dict = json.loads(line) 200 | geocoding_dict = line_dict['geocoding'] 201 | text = line_dict['text'] 202 | score = line_dict['score'] 203 | geometry = line_dict['geometry'] 204 | 205 | if geocoding_dict is None: 206 | continue # if no geolocation returned by geocoder, then skip 207 | 208 | if 'lat' not in geocoding_dict or 'lng' not in geocoding_dict: 209 | #print(geocoding_dict) 210 | continue 211 | 212 | lat = float(geocoding_dict['lat']) 213 | lng = float(geocoding_dict['lng']) 214 | 215 | point = Point(lng, lat) 216 | 217 | if index >= 0: 218 | if point.within(poly_geometry): # geocoding point within county boundary 219 | lat_list.append(lat) 220 | lng_list.append(lng) 221 | else: 222 | pass 223 | else: # cluster based on all results 224 | lat_list.append(lat) 225 | lng_list.append(lng) 226 | 227 | if remove_duplicate_location: 228 | lat_list = list(set(lat_list)) 229 | lng_list = list(set(lng_list)) 230 | 231 | if len(lat_list) >0 and len(lng_list) > 0: 232 | pred = clustering_func(lat_list, lng_list) 233 | # print(pred) 234 | else: 235 | print('No data to cluster') 236 | 237 | print(map_name, pred) 238 | pred_dict[map_name] = pred 239 | lat_lng_dict[map_name]={'lat_list':lat_list, 'lng_list':lng_list} 240 | 241 | if visualize: 242 | plotting_func(loc_sanborn_dir = loc_sanborn_dir, pred_dict = pred_dict, lat_lng_dict = lat_lng_dict, 243 | dataset_name = dataset_name, geocoding_name = geocoding_name) 244 | 245 | with open(os.path.join(output_dir, 'pred_center.json'),'w') as f: 246 | json.dump(pred_dict, f) 247 | 248 | 249 | def main(): 250 | parser = argparse.ArgumentParser() 251 | 252 | parser.add_argument('--dataset_name', type=str, 
default=None, 253 | choices=['LA_sanborn', 'LoC_sanborn',], 254 | help='dataset name, same as expt_name') 255 | parser.add_argument('--geocoding_name', type=str, default=None, 256 | choices=['google','arcgis','geonames','osm'], 257 | help='geocoder name') 258 | parser.add_argument('--visualize', default = False, action = 'store_true') # Enable this when in notebook 259 | parser.add_argument('--remove_duplicate_location', default=False, action='store_true') # whether remove duplicate geolocations for clustering 260 | 261 | # parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/') 262 | # parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson') 263 | 264 | 265 | args = parser.parse_args() 266 | print('\n') 267 | print(args) 268 | print('\n') 269 | 270 | clustering(args) 271 | 272 | 273 | if __name__ == '__main__': 274 | 275 | main() 276 | -------------------------------------------------------------------------------- /m_sanborn/s3_gen_geojson.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/m_sanborn/s3_gen_geojson.py -------------------------------------------------------------------------------- /metadata/davidrumsey/davidrumsey.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import pandas as pd 3 | 4 | 5 | class DavidRumsey: 6 | 7 | csv_filename="davidrumsey_metadata.csv" 8 | df = pd.read_csv(csv_filename) 9 | 10 | def __init__(self, api_key): 11 | self.api_key = api_key 12 | self.headers = { 13 | 'Authorization': self.api_key, 14 | 'Content-Type': 'application/json', 15 | 'charset': 'utf-8' 16 | } 17 | 18 | def get_ground_control_points(self, external_id): 19 | """ 20 | Get ground control points of map image via Oldmapsonline API. 21 | Args: 22 | external_id: str 23 | Returns: 24 | transform_method: str 25 | Transformation method 26 | e.g., "affine", "polynomial", "tps" 27 | gcps: list 28 | All pairs of ground control points 29 | e.g., [{'location': [-118.269356489, 34.063140276], 'pixel': [5629, 5064]},{'location': , 'pixel': }, ... ] 30 | """ 31 | 32 | # 404 ERROR on many maps 33 | # 1. GET /maps/external/{external_id} 34 | # baseurl = "https://api.oldmapsonline.org/1.0/maps/external/" + external_id 35 | # res = requests.get(baseurl, headers=self.headers) 36 | map_id = self.df[self.df['external_id']==external_id]['id'] 37 | 38 | # 2. 
GET /maps/{id}/georeferences 39 | baseurl = "https://api.oldmapsonline.org/1.0/maps/" + map_id + "/georeferences" 40 | res = requests.get(baseurl, headers=self.headers) 41 | 42 | try: 43 | res.raise_for_status() 44 | except requests.exceptions.HTTPError as e: 45 | print(e) 46 | return None 47 | 48 | data = res.json() 49 | if not data['items']: 50 | return None 51 | else: 52 | transform_method = data['items'][0]['transformation_method'] 53 | gcps = data['items'][0]['gcps'] 54 | return transform_method, gcps 55 | -------------------------------------------------------------------------------- /metadata/sanborn.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/metadata/sanborn.py -------------------------------------------------------------------------------- /model_card_template: -------------------------------------------------------------------------------- 1 | --- 2 | license: cc-by-nc-2.0 3 | language: 4 | - en 5 | tags: 6 | - text spotting 7 | - scene text detection 8 | - maps 9 | - cultural heritage 10 | --- 11 | # Model Card for Model ID 12 | 13 | 14 | 15 | 16 | ## Model Details 17 | 18 | ### Model Description 19 | 20 | 21 | 22 | 23 | 24 | - **Developed by:** Knowledge Computing Lab, University of Minnesota: Leeje Jang, Jina Kim, Zekun Li, Yijun Lin, Min Namgung, Yao-Yi Chiang 25 | - **Shared by:** Machines Reading Maps 26 | - **Model type:** text spotter 27 | - **Language(s):** English 28 | - **License:** CC-BY-NC 2.0 29 | 30 | ### Model Sources [optional] 31 | 32 | 33 | 34 | - **Repository:** https://github.com/knowledge-computing/mapkurator-spotter 35 | - **Paper [optional]:** [More Information Needed] 36 | - **Documentation:** https://knowledge-computing.github.io/mapkurator-doc/#/ 37 | 38 | ## Uses 39 | 40 | 41 | 42 | ### Direct Use 43 | 44 | 45 | 46 | The model detects and recognizes text on images. It was trained specifically to identify text on a wide range of historical maps with many styles printed between ca. 1500-2000 provided by the David Rumsey Map Collection. 47 | This version of the model was trained with an English language model. 48 | 49 | 50 | ### Downstream Use 51 | 52 | 53 | Using this model for new experiments will require attention to the style and language of text on images, including (possibly) the creation of new, synthetic or other training data. 54 | 55 | 56 | ### Out-of-Scope Use 57 | 58 | 59 | 60 | 61 | ## Bias, Risks, and Limitations 62 | 63 | 64 | This model will struggle to return high quality results for maps with complex fonts, low contrast images, complex background colors and textures, and non-English language words. 65 | 66 | [More Information Needed] 67 | 68 | ### Recommendations 69 | 70 | 71 | 72 | Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. 73 | 74 | ## How to Get Started with the Model 75 | 76 | Please refer to the mapKurator documentation for details: https://knowledge-computing.github.io/mapkurator-doc/#/ 77 | 78 | ## Training Details 79 | 80 | ### Training Data 81 | 82 | 83 | 84 | Synthetic training datasets: 85 | 1. SynthText: 40k text-free background images from COCO and use them to generate synthetic text images (see the left image). Code: https://github.com/ankush-me/SynthText; Dataset: TBD. 86 | 2. 
SynMap: "patches" of synthetic maps that mimic the text (e.g., font, spacing, orientation) and background styles in the real historical maps (see the right image). Code: TBD; Dataset: TBD. 87 | 88 | 89 | ## Citation [optional] 90 | 91 | 92 | 93 | **BibTeX:** 94 | 95 | [More Information Needed] 96 | 97 | **APA:** 98 | 99 | [More Information Needed] 100 | 101 | 102 | 103 | ## Model Card Authors 104 | 105 | Yijun Lin, Katherine McDonough, Valeria Vitale 106 | 107 | ## Model Card Contact 108 | 109 | Yijun Lin, lin00786 at umn.edu 110 | 111 | 112 | -------------------------------------------------------------------------------- /pipe_run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_get_dimension --module_cropping 4 | 5 | python run.py --sample_map_csv_path //home/maplord/maplist_csv/luna_omo_metadata_2508.csv --expt_name Rerun_2_2508 --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn 6 | 7 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_img_geojson 8 | 9 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_geocoord_geojson 10 | 11 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_post_ocr 12 | -------------------------------------------------------------------------------- /pipe_run_img.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_get_dimension --module_cropping 4 | 5 | python run_img.py --sample_map_csv_path /data2/rumsey_output/sample_sb/data/ --expt_name sample_sb_opt --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn 6 | 7 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_img_geojson 8 | 9 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_geocoord_geojson 10 | 11 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_post_ocr 12 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/requirements.txt -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | 
14 | 15 | 16 | logging.basicConfig(level=logging.INFO) 17 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 18 | 19 | # def execute_command(command, if_print_command): 20 | # t1 = time.time() 21 | 22 | # if if_print_command: 23 | # print(command) 24 | # os.system(command) 25 | 26 | # t2 = time.time() 27 | # time_usage = t2 - t1 28 | # return time_usage 29 | 30 | def execute_command(command, if_print_command): 31 | t1 = time.time() 32 | 33 | if if_print_command: 34 | print(command) 35 | 36 | try: 37 | subprocess.run(command, shell=True, check=True, capture_output=True) #stderr=subprocess.STDOUT) 38 | t2 = time.time() 39 | time_usage = t2 - t1 40 | return {'time_usage':time_usage} 41 | except subprocess.CalledProcessError as err: 42 | error = err.stderr.decode('utf8') 43 | # format error message to one line 44 | error = error.replace('\n','\t') 45 | error = error.replace(',',';') 46 | return {'error': error} 47 | 48 | 49 | def get_img_dimension(img_path): 50 | map_img = Image.open(img_path) 51 | width, height = map_img.size 52 | 53 | return width, height 54 | 55 | 56 | def run_pipeline(args): 57 | # ------------------------- Pass arguments ----------------------------------------- 58 | map_kurator_system_dir = args.map_kurator_system_dir 59 | text_spotting_model_dir = args.text_spotting_model_dir 60 | sample_map_path = args.sample_map_csv_path 61 | expt_name = args.expt_name 62 | output_folder = args.output_folder 63 | 64 | module_get_dimension = args.module_get_dimension 65 | module_gen_geotiff = args.module_gen_geotiff 66 | module_cropping = args.module_cropping 67 | module_text_spotting = args.module_text_spotting 68 | module_img_geojson = args.module_img_geojson 69 | module_geocoord_geojson = args.module_geocoord_geojson 70 | module_entity_linking = args.module_entity_linking 71 | module_post_ocr = args.module_post_ocr 72 | 73 | spotter_model = args.spotter_model 74 | spotter_config = args.spotter_config 75 | spotter_expt_name = args.spotter_expt_name 76 | gpu_id = args.gpu_id 77 | 78 | if_print_command = args.print_command 79 | 80 | 81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 82 | 83 | # ------------------------- Read sample map list and prepare output dir ---------------- 84 | input_csv_path = sample_map_path 85 | if input_csv_path[-4:] == '.csv': 86 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 87 | elif input_csv_path[-4:] == '.tsv': 88 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 89 | else: 90 | raise NotImplementedError 91 | 92 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 93 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 94 | 95 | # initialize error reason dict 96 | error_reason_dict = dict() 97 | for ex_id in unmatched_external_id_list: 98 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 99 | 100 | # initialize time_usage_dict (must be defined: the modules below record per-map timings in it) 101 | time_usage_dict = dict() 102 | for ex_id in sample_map_df['external_id']: 103 | time_usage_dict[ex_id] = {} 104 | 105 | expt_out_dir = os.path.join(output_folder, expt_name) 106 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 107 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 108 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 109 | stitch_output_dir = 
os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 110 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 111 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 112 | 113 | if not os.path.isdir(expt_out_dir): 114 | os.makedirs(expt_out_dir) 115 | 116 | # ------------------------ Get image dimension ------------------------------ 117 | if module_get_dimension: 118 | for index, record in sample_map_df.iterrows(): 119 | external_id = record.external_id 120 | # pdb.set_trace() 121 | if external_id not in external_id_to_img_path_dict: 122 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 123 | continue 124 | 125 | img_path = external_id_to_img_path_dict[external_id] 126 | map_name = os.path.basename(img_path).split('.')[0] 127 | 128 | try: 129 | width, height = get_img_dimension(img_path) 130 | except Exception as e: 131 | error_reason_dict[external_id] = {'img_path':img_path, 'error': str(e) } 132 | continue # skip this map: width/height are undefined when the image cannot be opened 133 | time_usage_dict[external_id]['img_w'] = width 134 | time_usage_dict[external_id]['img_h'] = height 135 | 136 | 137 | # ------------------------- Generate geotiff ------------------------------ 138 | time_start = time.time() 139 | if module_gen_geotiff: 140 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 141 | 142 | if not os.path.isdir(geotiff_output_dir): 143 | os.makedirs(geotiff_output_dir) 144 | 145 | # use converted jpg folder instead of original sid folder 146 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 147 | exe_ret = execute_command(run_geotiff_command, if_print_command) 148 | if 'error' in exe_ret: 149 | error = exe_ret['error'] 150 | elif 'time_usage' in exe_ret: 151 | time_usage = exe_ret['time_usage'] 152 | 153 | logging.info('GeoTIFF generation for the whole map list took %.1f seconds', time_usage) # batch-level step, so no per-map external_id applies here 154 | 155 | 156 | time_geotiff = time.time() 157 | 158 | 159 | # ------------------------- Image cropping ------------------------------ 160 | if module_cropping: 161 | for index, record in sample_map_df.iterrows(): 162 | external_id = record.external_id 163 | 164 | if external_id not in external_id_to_img_path_dict: 165 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 166 | continue 167 | 168 | img_path = external_id_to_img_path_dict[external_id] 169 | map_name = os.path.basename(img_path).split('.')[0] 170 | 171 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 172 | if not os.path.isdir(cropping_output_dir): 173 | os.makedirs(cropping_output_dir) 174 | 175 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 176 | 177 | exe_ret = execute_command(run_crop_command, if_print_command) 178 | 179 | if 'error' in exe_ret: 180 | error = exe_ret['error'] 181 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 182 | elif 'time_usage' in exe_ret: 183 | time_usage = exe_ret['time_usage'] 184 | time_usage_dict[external_id]['cropping'] = time_usage 185 | else: 186 | raise NotImplementedError 187 | 188 | 189 | time_cropping = time.time() 190 | 191 | # ------------------------- Text Spotting (patch level) ------------------------------ 192 | if module_text_spotting: 193 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 
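Every module in this script shells out through `execute_command()`, whose contract is: a dict with `time_usage` on success, or a dict carrying a one-line `error` string on failure. A self-contained sketch of that contract, runnable on its own (the two demo commands assume a POSIX shell):

```python
import subprocess
import time

def execute_command(command, if_print_command=False):
    """Run a shell command; return {'time_usage': s} on success, {'error': msg} on failure."""
    t1 = time.time()
    if if_print_command:
        print(command)
    try:
        subprocess.run(command, shell=True, check=True, capture_output=True)
        return {'time_usage': time.time() - t1}
    except subprocess.CalledProcessError as err:
        # flatten stderr to one line so it fits a single cell of the error-reason CSV
        return {'error': err.stderr.decode('utf8').replace('\n', '\t').replace(',', ';')}

print(execute_command('true'))          # -> {'time_usage': ...}
print(execute_command('ls /no/such'))   # -> {'error': '...'}
```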
194 | os.chdir(text_spotting_model_dir) 195 | 196 | for index, record in sample_map_df.iterrows(): 197 | 198 | external_id = record.external_id 199 | if external_id not in external_id_to_img_path_dict: 200 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 201 | continue 202 | 203 | img_path = external_id_to_img_path_dict[external_id] 204 | map_name = os.path.basename(img_path).split('.')[0] 205 | 206 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 207 | if not os.path.isdir(map_spotting_output_dir): 208 | os.makedirs(map_spotting_output_dir) 209 | 210 | if spotter_model == 'abcnet': 211 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 212 | elif spotter_model == 'testr': 213 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 214 | elif spotter_model == 'spotter_v2': 215 | run_spotting_command = f'CUDA_VISIBLE_DEVICES={gpu_id} python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 216 | print(run_spotting_command) 217 | else: 218 | raise NotImplementedError 219 | 220 | run_spotting_command += ' 1> /dev/null' 221 | 222 | 223 | 224 | exe_ret = execute_command(run_spotting_command, if_print_command) 225 | if 'error' in exe_ret: 226 | error = exe_ret['error'] 227 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 228 | # elif 'time_usage' in exe_ret: 229 | # time_usage = exe_ret['time_usage'] 230 | # time_usage_dict[external_id]['spotting'] = time_usage 231 | # else: 232 | # raise NotImplementedError 233 | 234 | logging.info('Done text spotting for %s', map_name) 235 | time_text_spotting = time.time() 236 | 237 | 238 | # ------------------------- Image coord geojson (map level) ------------------------------ 239 | if module_img_geojson: 240 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 241 | 242 | if not os.path.isdir(stitch_output_dir): 243 | os.makedirs(stitch_output_dir) 244 | 245 | for index, record in sample_map_df.iterrows(): 246 | external_id = record.external_id 247 | if external_id not in external_id_to_img_path_dict: 248 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 249 | continue 250 | 251 | img_path = external_id_to_img_path_dict[external_id] 252 | map_name = os.path.basename(img_path).split('.')[0] 253 | 254 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 255 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 256 | 257 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 258 | 259 | exe_ret = execute_command(run_stitch_command, if_print_command) 260 | 261 | if 'error' in exe_ret: 262 | error = exe_ret['error'] 263 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 264 | elif 'time_usage' in exe_ret: 265 | time_usage = exe_ret['time_usage'] 266 | time_usage_dict[external_id]['stitch'] = time_usage 267 | else: 268 | raise NotImplementedError 269 | 270 | time_img_geojson = time.time() 271 | 272 | # ------------------------- post-OCR ------------------------------ 273 | if module_post_ocr: 274 | 
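The post-OCR block below delegates to `lexical_search.py`, which (per the README) retrieves spelling candidates with Elasticsearch's fuzzy query over OpenStreetMap place names. A hedged sketch of that kind of lookup, in the elasticsearch-py 7.x call style; the index name `osm_names`, the field `name`, and the local endpoint are assumptions for illustration, not the module's actual configuration:

```python
from elasticsearch import Elasticsearch

def fuzzy_candidates(es, word, index='osm_names', field='name', size=10):
    # 'AUTO' lets Elasticsearch pick the allowed edit distance from the word length
    query = {'query': {'fuzzy': {field: {'value': word, 'fuzziness': 'AUTO'}}}}
    res = es.search(index=index, body=query, size=size)
    return [hit['_source'][field] for hit in res['hits']['hits']]

es = Elasticsearch('http://localhost:9200')  # assumed local instance
print(fuzzy_candidates(es, 'Mineapolis'))    # e.g. -> ['Minneapolis', ...]
```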
275 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 276 | 277 | if not os.path.isdir(postocr_output_dir): 278 | os.makedirs(postocr_output_dir) 279 | 280 | for index, record in sample_map_df.iterrows(): 281 | 282 | external_id = record.external_id 283 | if external_id not in external_id_to_img_path_dict: 284 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 285 | continue 286 | 287 | img_path = external_id_to_img_path_dict[external_id] 288 | map_name = os.path.basename(img_path).split('.')[0] 289 | 290 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 291 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 292 | 293 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 294 | 295 | exe_ret = execute_command(run_postocr_command, if_print_command) 296 | 297 | if 'error' in exe_ret: 298 | error = exe_ret['error'] 299 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 300 | elif 'time_usage' in exe_ret: 301 | time_usage = exe_ret['time_usage'] 302 | time_usage_dict[external_id]['postocr'] = time_usage 303 | else: 304 | raise NotImplementedError 305 | 306 | time_post_ocr = time.time() 307 | 308 | 309 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 310 | if module_geocoord_geojson: 311 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 312 | 313 | if not os.path.isdir(geojson_output_dir): 314 | os.makedirs(geojson_output_dir) 315 | 316 | for index, record in sample_map_df.iterrows(): 317 | external_id = record.external_id 318 | if external_id not in external_id_to_img_path_dict: 319 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 320 | continue 321 | 322 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 323 | 324 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 325 | 326 | exe_ret = execute_command(run_converter_command, if_print_command) 327 | 328 | if 'error' in exe_ret: 329 | error = exe_ret['error'] 330 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 331 | elif 'time_usage' in exe_ret: 332 | time_usage = exe_ret['time_usage'] 333 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage 334 | else: 335 | raise NotImplementedError 336 | 337 | time_geocoord_geojson = time.time() 338 | 339 | # ------------------------- Link entities in OSM ------------------------------ 340 | if module_entity_linking: 341 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 342 | 343 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 344 | if not os.path.isdir(geojson_output_dir): 345 | os.makedirs(geojson_output_dir) 346 | 347 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 348 | execute_command(run_linker_command, if_print_command) 349 | 350 | time_entity_linking = time.time() 351 | 352 | 353 | # 
--------------------- Time usage logging -------------------------- 354 | # print('\n') 355 | # logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 356 | # logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 357 | # logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 358 | # logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 359 | # logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_img_geojson) 360 | # logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 361 | # logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 362 | 363 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 364 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 365 | 366 | # # check if exist time_usage log file 367 | # if os.path.isfile(time_usage_log_path): 368 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 369 | # # if exist duplicate columns, ret time usage values to the latest run 370 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns) 371 | 372 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 373 | 374 | # # make sure time_usage_expt_name.csv always have the latest time usage 375 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 376 | # m_time = os.path.getmtime(time_usage_log_path) 377 | # dt_m = datetime.datetime.fromtimestamp(m_time) 378 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S") 379 | 380 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 381 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 382 | # execute_command(run_command, if_print_command) 383 | 384 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 385 | 386 | # --------------------- Error logging -------------------------- 387 | print('\n') 388 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 389 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 390 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 391 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 392 | 393 | 394 | def main(): 395 | parser = argparse.ArgumentParser() 396 | 397 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 398 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 399 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 400 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output 401 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix 402 | 403 | parser.add_argument('--module_get_dimension', default=False, action='store_true') 404 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true') 405 | parser.add_argument('--module_cropping', default=False, action='store_true') 406 | parser.add_argument('--module_text_spotting', default=False, action='store_true') 407 | parser.add_argument('--module_img_geojson', default=False, 
action='store_true') 408 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') 409 | parser.add_argument('--module_entity_linking', default=False, action='store_true') 410 | parser.add_argument('--module_post_ocr', default=False, action='store_true') 411 | 412 | parser.add_argument('--spotter_model', type=str, default='spotter_v2', choices=['abcnet', 'testr', 'spotter_v2'], 413 | help='Select text spotting model option from ["abcnet", "testr", "spotter_v2"]') # select text spotting model 414 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml', 415 | help='Path to the config file for text spotting model') 416 | parser.add_argument('--spotter_expt_name', type=str, default='exp', 417 | help='Name of spotter experiment, if empty using config file name') 418 | # python run.py --text_spotting_model_dir /home/maplord/rumsey/testr_v2/TESTR/ 419 | # --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_splits/luna_omo_metadata_56628_20220724.csv 420 | # --expt_name 57k_maps_r2 --module_text_spotting 421 | # --spotter_model spotter_v2 --spotter_config /home/maplord/rumsey/testr_v2/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap 422 | 423 | parser.add_argument('--print_command', default=False, action='store_true') 424 | parser.add_argument('--gpu_id', type=int, default=0) 425 | 426 | 427 | args = parser.parse_args() 428 | print('\n') 429 | print(args) 430 | print('\n') 431 | 432 | run_pipeline(args) 433 | 434 | 435 | 436 | if __name__ == '__main__': 437 | 438 | main() 439 | 440 | 441 | -------------------------------------------------------------------------------- /run_img.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | 14 | 15 | # This script handles the case where the input is a folder of images (rather than a metadata CSV).
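Unlike run.py, this script does not read a metadata CSV; it builds its work list directly from the folder contents, storing each image path in the `external_id` column. A compact sketch of that step; the extension filter is an added assumption (the script itself takes every file in the folder):

```python
import os
import pandas as pd

input_img_path = '/data2/rumsey_output/sample_sb/data/'  # placeholder folder
exts = ('.jpg', '.jpeg', '.png', '.tif', '.tiff')
rows = [{'external_id': os.path.join(input_img_path, f)}
        for f in sorted(os.listdir(input_img_path)) if f.lower().endswith(exts)]
sample_map_df = pd.DataFrame(rows, columns=['external_id'])
print(sample_map_df.head())
```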
16 | # tested image: /home/maplord/rumsey/mapkurator-system/data/100_maps_crop/crop_leeje_2/test_run_img/ 17 | logging.basicConfig(level=logging.INFO) 18 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 19 | 20 | # def execute_command(command, if_print_command): 21 | # t1 = time.time() 22 | 23 | # if if_print_command: 24 | # print(command) 25 | # os.system(command) 26 | 27 | # t2 = time.time() 28 | # time_usage = t2 - t1 29 | # return time_usage 30 | 31 | def execute_command(command, if_print_command): 32 | t1 = time.time() 33 | 34 | if if_print_command: 35 | print(command) 36 | 37 | try: 38 | subprocess.run(command, shell=True, check=True, capture_output=True) #stderr=subprocess.STDOUT) 39 | t2 = time.time() 40 | time_usage = t2 - t1 41 | return {'time_usage':time_usage} 42 | except subprocess.CalledProcessError as err: 43 | error = err.stderr.decode('utf8') 44 | # format error message to one line 45 | error = error.replace('\n','\t') 46 | error = error.replace(',',';') 47 | return {'error': error} 48 | 49 | 50 | def get_img_dimension(img_path): 51 | map_img = Image.open(img_path) 52 | width, height = map_img.size 53 | 54 | return width, height 55 | 56 | 57 | def run_pipeline(args): 58 | # ------------------------- Pass arguments ----------------------------------------- 59 | map_kurator_system_dir = args.map_kurator_system_dir 60 | text_spotting_model_dir = args.text_spotting_model_dir 61 | sample_map_path = args.sample_map_csv_path 62 | expt_name = args.expt_name 63 | output_folder = args.output_folder 64 | 65 | module_get_dimension = args.module_get_dimension 66 | module_gen_geotiff = args.module_gen_geotiff 67 | module_cropping = args.module_cropping 68 | module_text_spotting = args.module_text_spotting 69 | module_img_geojson = args.module_img_geojson 70 | module_geocoord_geojson = args.module_geocoord_geojson 71 | module_entity_linking = args.module_entity_linking 72 | module_post_ocr = args.module_post_ocr 73 | 74 | spotter_model = args.spotter_model 75 | spotter_config = args.spotter_config 76 | spotter_expt_name = args.spotter_expt_name 77 | 78 | if_print_command = args.print_command 79 | 80 | 81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 82 | 83 | # ------------------------- Read sample map list and prepare output dir ---------------- 84 | 85 | 86 | # if input_csv_path[-4:] == '.csv': 87 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 88 | # elif input_csv_path[-4:] == '.tsv': 89 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 90 | # else: 91 | # raise NotImplementedError 92 | 93 | input_img_path = sample_map_path 94 | sample_map_df = pd.DataFrame(columns = ["external_id"]) 95 | for images in os.listdir(input_img_path): 96 | tmp_path = {"external_id": os.path.join(input_img_path, images)} 97 | sample_map_df = pd.concat([sample_map_df, pd.DataFrame([tmp_path])], ignore_index=True) # DataFrame.append was removed in pandas 2.0 98 | 99 | # ------------------------- Read image and prepare output dir ---------------- 100 | 101 | 102 | # # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 103 | # external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 104 | 105 | # # initialize error reason dict 106 | # error_reason_dict = dict() 107 | # for ex_id in unmatched_external_id_list: 108 | # error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 109 | 110 | # initialize time_usage_dict 111 | # time_usage_dict = dict() 112 | # for 
ex_id in sample_map_df['external_id']: 113 | # time_usage_dict[ex_id] = {} 114 | 115 | expt_out_dir = os.path.join(output_folder, expt_name) 116 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 117 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 118 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 119 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 120 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 121 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 122 | 123 | 124 | 125 | if not os.path.isdir(expt_out_dir): 126 | os.makedirs(expt_out_dir) 127 | 128 | # ------------------------ Get image dimension ------------------------------ 129 | if module_get_dimension: 130 | for index, record in sample_map_df.iterrows(): 131 | external_id = record.external_id 132 | # pdb.set_trace() 133 | # if external_id not in external_id_to_img_path_dict: 134 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 135 | # continue 136 | 137 | img_path = sample_map_df['external_id'].iloc[index] 138 | # print("img_path",img_path) 139 | map_name = os.path.basename(img_path).split('.')[0] 140 | # print("map_name",map_name) 141 | width, height = get_img_dimension(img_path) 142 | 143 | 144 | # time_usage_dict[external_id]['img_w'] = width 145 | # time_usage_dict[external_id]['img_h'] = height 146 | 147 | 148 | # ------------------------- Generate geotiff ------------------------------ 149 | time_start = time.time() 150 | if module_gen_geotiff: 151 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 152 | 153 | if not os.path.isdir(geotiff_output_dir): 154 | os.makedirs(geotiff_output_dir) 155 | 156 | # use converted jpg folder instead of original sid folder 157 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_img_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 158 | exe_ret = execute_command(run_geotiff_command, if_print_command) 159 | if 'error' in exe_ret: 160 | error = exe_ret['error'] 161 | elif 'time_usage' in exe_ret: 162 | time_usage = exe_ret['time_usage'] 163 | 164 | # time_usage_dict[external_id]['geotiff'] = time_usage 165 | 166 | 167 | time_geotiff = time.time() 168 | 169 | 170 | # ------------------------- Image cropping ------------------------------ 171 | if module_cropping: 172 | for index, record in sample_map_df.iterrows(): 173 | external_id = record.external_id 174 | 175 | # if external_id not in external_id_to_img_path_dict: 176 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 177 | # continue 178 | 179 | img_path = sample_map_df['external_id'].iloc[index] 180 | map_name = os.path.basename(img_path).split('.')[0] 181 | 182 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 183 | if not os.path.isdir(cropping_output_dir): 184 | os.makedirs(cropping_output_dir) 185 | 186 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 187 | 188 | exe_ret = execute_command(run_crop_command, if_print_command) 189 | 190 | # if 'error' in exe_ret: 191 | # error = exe_ret['error'] 192 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 193 | # if 'time_usage' in 
exe_ret: 194 | # time_usage = exe_ret['time_usage'] 195 | # time_usage_dict[external_id]['cropping'] = time_usage 196 | # else: 197 | # raise NotImplementedError 198 | 199 | 200 | time_cropping = time.time() 201 | 202 | # ------------------------- Text Spotting (patch level) ------------------------------ 203 | if module_text_spotting: 204 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 205 | os.chdir(text_spotting_model_dir) 206 | 207 | for index, record in sample_map_df.iterrows(): 208 | 209 | external_id = record.external_id 210 | # if external_id not in external_id_to_img_path_dict: 211 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 212 | # continue 213 | 214 | img_path = sample_map_df['external_id'].iloc[index] 215 | map_name = os.path.basename(img_path).split('.')[0] 216 | 217 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 218 | if not os.path.isdir(map_spotting_output_dir): 219 | os.makedirs(map_spotting_output_dir) 220 | 221 | print(os.path.join(cropping_output_dir,map_name)) 222 | if spotter_model == 'abcnet': 223 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 224 | elif spotter_model == 'testr': 225 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.TRANSFORMER.INFERENCE_TH_TEST 0.3' 226 | # print(run_spotting_command) 227 | else: 228 | raise NotImplementedError 229 | 230 | run_spotting_command += ' 1> /dev/null' 231 | 232 | exe_ret = execute_command(run_spotting_command, if_print_command) 233 | 234 | # if 'error' in exe_ret: 235 | # error = exe_ret['error'] 236 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 237 | # elif 'time_usage' in exe_ret: 238 | # time_usage = exe_ret['time_usage'] 239 | # time_usage_dict[external_id]['spotting'] = time_usage 240 | # else: 241 | # raise NotImplementedError 242 | 243 | logging.info('Done text spotting for %s', map_name) 244 | time_text_spotting = time.time() 245 | 246 | 247 | # ------------------------- Image coord geojson (map level) ------------------------------ 248 | if module_img_geojson: 249 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 250 | 251 | if not os.path.isdir(stitch_output_dir): 252 | os.makedirs(stitch_output_dir) 253 | 254 | for index, record in sample_map_df.iterrows(): 255 | external_id = record.external_id 256 | # if external_id not in external_id_to_img_path_dict: 257 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 258 | # continue 259 | 260 | img_path = sample_map_df['external_id'].iloc[index] 261 | map_name = os.path.basename(img_path).split('.')[0] 262 | 263 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 264 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 265 | 266 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 267 | 268 | 269 | exe_ret = execute_command(run_stitch_command, if_print_command) 270 | 271 | # if 'error' in exe_ret: 272 | # error = exe_ret['error'] 273 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 274 | # elif 'time_usage' in exe_ret: 275 | # 
time_usage = exe_ret['time_usage'] 276 | # time_usage_dict[external_id]['stitch'] = time_usage 277 | # else: 278 | # raise NotImplementedError 279 | 280 | time_img_geojson = time.time() 281 | 282 | # ------------------------- post-OCR ------------------------------ 283 | if module_post_ocr: 284 | 285 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 286 | 287 | if not os.path.isdir(postocr_output_dir): 288 | os.makedirs(postocr_output_dir) 289 | 290 | for index, record in sample_map_df.iterrows(): 291 | 292 | external_id = record.external_id 293 | # if external_id not in external_id_to_img_path_dict: 294 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 295 | # continue 296 | 297 | img_path = sample_map_df['external_id'].iloc[index] 298 | map_name = os.path.basename(img_path).split('.')[0] 299 | 300 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 301 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 302 | print('input_geojson_file',input_geojson_file) 303 | print('geojson_postocr_output_file',geojson_postocr_output_file) 304 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 305 | 306 | exe_ret = execute_command(run_postocr_command, if_print_command) 307 | print('exe_ret',exe_ret) 308 | # if 'error' in exe_ret: 309 | # error = exe_ret['error'] 310 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 311 | # elif 'time_usage' in exe_ret: 312 | # time_usage = exe_ret['time_usage'] 313 | # time_usage_dict[external_id]['postocr'] = time_usage 314 | # else: 315 | # raise NotImplementedError 316 | 317 | time_post_ocr = time.time() 318 | 319 | 320 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 321 | if module_geocoord_geojson: 322 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 323 | 324 | if not os.path.isdir(geojson_output_dir): 325 | os.makedirs(geojson_output_dir) 326 | 327 | for index, record in sample_map_df.iterrows(): 328 | external_id = record.external_id 329 | # if external_id not in external_id_to_img_path_dict: 330 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 331 | # continue 332 | 333 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 334 | 335 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_img_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 336 | 337 | exe_ret = execute_command(run_converter_command, if_print_command) 338 | 339 | # if 'error' in exe_ret: 340 | # error = exe_ret['error'] 341 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 342 | # elif 'time_usage' in exe_ret: 343 | # time_usage = exe_ret['time_usage'] 344 | # time_usage_dict[external_id]['geocoord_geojson'] = time_usage 345 | # else: 346 | # raise NotImplementedError 347 | 348 | time_geocoord_geojson = time.time() 349 | 350 | # ------------------------- Link entities in OSM ------------------------------ 351 | if module_entity_linking: 352 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 353 | 354 | geojson_linked_output_dir = 
os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 355 | if not os.path.isdir(geojson_output_dir): 356 | os.makedirs(geojson_output_dir) 357 | 358 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_img_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 359 | execute_command(run_linker_command, if_print_command) 360 | 361 | time_entity_linking = time.time() 362 | 363 | 364 | # --------------------- Time usage logging -------------------------- 365 | print('\n') 366 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 367 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 368 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 369 | logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 370 | logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_img_geojson) 371 | logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 372 | logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 373 | 374 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 375 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 376 | 377 | # check if exist time_usage log file 378 | # if os.path.isfile(time_usage_log_path): 379 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 380 | # # if exist duplicate columns, ret time usage values to the latest run 381 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns) 382 | 383 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 384 | 385 | # # make sure time_usage_expt_name.csv always have the latest time usage 386 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 387 | # m_time = os.path.getmtime(time_usage_log_path) 388 | # dt_m = datetime.datetime.fromtimestamp(m_time) 389 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S") 390 | 391 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 392 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 393 | # execute_command(run_command, if_print_command) 394 | 395 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 396 | 397 | # --------------------- Error logging -------------------------- 398 | # print('\n') 399 | # current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 400 | # error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 401 | # error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 402 | # error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 403 | 404 | 405 | def main(): 406 | parser = argparse.ArgumentParser() 407 | 408 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 409 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 410 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 411 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output 412 | 
parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix 413 | 414 | parser.add_argument('--module_get_dimension', default=False, action='store_true') 415 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true') 416 | parser.add_argument('--module_cropping', default=False, action='store_true') 417 | parser.add_argument('--module_text_spotting', default=False, action='store_true') 418 | parser.add_argument('--module_img_geojson', default=False, action='store_true') 419 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') 420 | parser.add_argument('--module_entity_linking', default=False, action='store_true') 421 | parser.add_argument('--module_post_ocr', default=False, action='store_true') 422 | 423 | 424 | parser.add_argument('--spotter_model', type=str, default='testr', choices=['abcnet', 'testr'], 425 | help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model 426 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml', 427 | help='Path to the config file for text spotting model') 428 | parser.add_argument('--spotter_expt_name', type=str, default='testr_syn', 429 | help='Name of spotter experiment, if empty using config file name') 430 | # python run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv --expt_name 57k_maps --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap 431 | 432 | parser.add_argument('--print_command', default=False, action='store_true') 433 | 434 | 435 | args = parser.parse_args() 436 | print('\n') 437 | print(args) 438 | print('\n') 439 | 440 | run_pipeline(args) 441 | 442 | 443 | 444 | if __name__ == '__main__': 445 | 446 | main() 447 | 448 | 449 | -------------------------------------------------------------------------------- /run_leeje.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | import subprocess 14 | 15 | ##input file : tiff and csv 16 | # read csv 17 | # change to format (tiff) 18 | # change the path 19 | # execute all module 20 | 21 | logging.basicConfig(level=logging.INFO) 22 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 23 | 24 | # def execute_command(command, if_print_command): 25 | # t1 = time.time() 26 | 27 | # if if_print_command: 28 | # print(command) 29 | # os.system(command) 30 | 31 | # t2 = time.time() 32 | # time_usage = t2 - t1 33 | # return time_usage 34 | 35 | def execute_command(command, if_print_command): 36 | t1 = time.time() 37 | 38 | if if_print_command: 39 | print(command) 40 | 41 | try: 42 | subprocess.run(command, shell=True,check=True, capture_output = True) #stderr=subprocess.STDOUT) 43 | t2 = time.time() 44 | time_usage = t2 - t1 45 | return {'time_usage':time_usage} 46 | except subprocess.CalledProcessError as err: 47 | error = err.stderr.decode('utf8') 48 | # format error message to one line 49 | error = error.replace('\n','\t') 50 | error = error.replace(',',';') 51 | return {'error': error} 52 | 53 | 54 | def get_img_dimension(img_path): 55 | 
map_img = Image.open(img_path) 56 | width, height = map_img.size 57 | 58 | return width, height 59 | 60 | 61 | def run_pipeline(args): 62 | # ------------------------- Pass arguments ----------------------------------------- 63 | map_kurator_system_dir = args.map_kurator_system_dir 64 | text_spotting_model_dir = args.text_spotting_model_dir 65 | sample_map_path = args.sample_map_csv_path 66 | expt_name = args.expt_name 67 | output_folder = args.output_folder 68 | 69 | module_get_dimension = args.module_get_dimension 70 | module_gen_geotiff = args.module_gen_geotiff 71 | module_cropping = args.module_cropping 72 | module_text_spotting = args.module_text_spotting 73 | module_img_geojson = args.module_img_geojson 74 | module_geocoord_geojson = args.module_geocoord_geojson 75 | module_entity_linking = args.module_entity_linking 76 | module_post_ocr = args.module_post_ocr 77 | 78 | spotter_model = args.spotter_model 79 | spotter_config = args.spotter_config 80 | spotter_expt_name = args.spotter_expt_name 81 | 82 | if_print_command = args.print_command 83 | 84 | 85 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 86 | 87 | # ------------------------- Read sample map list and prepare output dir ---------------- 88 | input_csv_path = sample_map_path 89 | if input_csv_path[-4:] == '.csv': 90 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 91 | elif input_csv_path[-4:] == '.tsv': 92 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 93 | else: 94 | raise NotImplementedError 95 | 96 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 97 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 98 | 99 | # initialize error reason dict 100 | error_reason_dict = dict() 101 | for ex_id in unmatched_external_id_list: 102 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 103 | 104 | # initialize time_usage_dict 105 | time_usage_dict = dict() 106 | for ex_id in sample_map_df['external_id']: 107 | time_usage_dict[ex_id] = {} 108 | 109 | expt_out_dir = os.path.join(output_folder, expt_name) 110 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 111 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 112 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 113 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 114 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 115 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 116 | 117 | 118 | 119 | if not os.path.isdir(expt_out_dir): 120 | os.makedirs(expt_out_dir) 121 | 122 | # ------------------------ Get image dimension ------------------------------ 123 | if module_get_dimension: 124 | for index, record in sample_map_df.iterrows(): 125 | external_id = record.external_id 126 | # pdb.set_trace() 127 | if external_id not in external_id_to_img_path_dict: 128 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 129 | continue 130 | 131 | img_path = external_id_to_img_path_dict[external_id] 132 | map_name = os.path.basename(img_path).split('.')[0] 133 | 134 | try: 135 | width, height = get_img_dimension(img_path) 136 | except Exception as e: 137 | error_reason_dict[external_id] 
= {'img_path':img_path, 'error': str(e) } 138 | continue # skip this map: width/height are undefined when the image cannot be opened 139 | time_usage_dict[external_id]['img_w'] = width 140 | time_usage_dict[external_id]['img_h'] = height 141 | 142 | 143 | # ------------------------- Generate geotiff ------------------------------ 144 | time_start = time.time() 145 | if module_gen_geotiff: 146 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 147 | 148 | if not os.path.isdir(geotiff_output_dir): 149 | os.makedirs(geotiff_output_dir) 150 | 151 | # use converted jpg folder instead of original sid folder 152 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 153 | exe_ret = execute_command(run_geotiff_command, if_print_command) 154 | if 'error' in exe_ret: 155 | error = exe_ret['error'] 156 | elif 'time_usage' in exe_ret: 157 | time_usage = exe_ret['time_usage'] 158 | 159 | logging.info('GeoTIFF generation for the whole map list took %.1f seconds', time_usage) # batch-level step, so no per-map external_id applies here 160 | 161 | 162 | time_geotiff = time.time() 163 | 164 | 165 | # ------------------------- Image cropping ------------------------------ 166 | if module_cropping: 167 | for index, record in sample_map_df.iterrows(): 168 | external_id = record.external_id 169 | 170 | if external_id not in external_id_to_img_path_dict: 171 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 172 | continue 173 | 174 | img_path = external_id_to_img_path_dict[external_id] 175 | map_name = os.path.basename(img_path).split('.')[0] 176 | 177 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 178 | if not os.path.isdir(cropping_output_dir): 179 | os.makedirs(cropping_output_dir) 180 | 181 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 182 | 183 | exe_ret = execute_command(run_crop_command, if_print_command) 184 | 185 | if 'error' in exe_ret: 186 | error = exe_ret['error'] 187 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 188 | elif 'time_usage' in exe_ret: 189 | time_usage = exe_ret['time_usage'] 190 | time_usage_dict[external_id]['cropping'] = time_usage 191 | else: 192 | raise NotImplementedError 193 | 194 | 195 | time_cropping = time.time() 196 | 197 | # ------------------------- Text Spotting (patch level) ------------------------------ 198 | if module_text_spotting: 199 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 
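The spotter invoked below emits patch-level results only; `stitch_output.py` (module m3) later translates them back into full-map pixel coordinates by adding each patch's offset within the original image. A toy illustration of that translation step; the offsets here are made-up numbers, and the real offset bookkeeping lives in `crop_img.py` / `stitch_output.py`:

```python
def shift_polygon(polygon, x_off, y_off):
    """Translate a patch-level polygon [[x, y], ...] into map-level pixel coordinates."""
    return [[x + x_off, y + y_off] for x, y in polygon]

patch_poly = [[10, 12], [40, 12], [40, 30], [10, 30]]     # detection inside one 1K patch
print(shift_polygon(patch_poly, x_off=1000, y_off=2000))  # same shape, map-level pixels
```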
200 | os.chdir(text_spotting_model_dir) 201 | 202 | for index, record in sample_map_df.iterrows(): 203 | 204 | external_id = record.external_id 205 | if external_id not in external_id_to_img_path_dict: 206 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 207 | continue 208 | 209 | img_path = external_id_to_img_path_dict[external_id] 210 | map_name = os.path.basename(img_path).split('.')[0] 211 | 212 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 213 | if not os.path.isdir(map_spotting_output_dir): 214 | os.makedirs(map_spotting_output_dir) 215 | 216 | if spotter_model == 'abcnet': 217 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 218 | elif spotter_model == 'testr': 219 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 220 | else: 221 | raise NotImplementedError 222 | 223 | run_spotting_command += ' 1> /dev/null' 224 | 225 | exe_ret = execute_command(run_spotting_command, if_print_command) 226 | 227 | if 'error' in exe_ret: 228 | error = exe_ret['error'] 229 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 230 | elif 'time_usage' in exe_ret: 231 | time_usage = exe_ret['time_usage'] 232 | time_usage_dict[external_id]['spotting'] = time_usage 233 | else: 234 | raise NotImplementedError 235 | 236 | logging.info('Done text spotting for %s', map_name) 237 | time_text_spotting = time.time() 238 | 239 | 240 | # ------------------------- Image coord geojson (map level) ------------------------------ 241 | if module_img_geojson: 242 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 243 | 244 | if not os.path.isdir(stitch_output_dir): 245 | os.makedirs(stitch_output_dir) 246 | 247 | for index, record in sample_map_df.iterrows(): 248 | external_id = record.external_id 249 | if external_id not in external_id_to_img_path_dict: 250 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 251 | continue 252 | 253 | img_path = external_id_to_img_path_dict[external_id] 254 | map_name = os.path.basename(img_path).split('.')[0] 255 | 256 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 257 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 258 | 259 | run_stitch_command = 'python stitch_output.py --eval_only --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 260 | 261 | exe_ret = execute_command(run_stitch_command, if_print_command) 262 | 263 | if 'error' in exe_ret: 264 | error = exe_ret['error'] 265 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 266 | elif 'time_usage' in exe_ret: 267 | time_usage = exe_ret['time_usage'] 268 | time_usage_dict[external_id]['stitch'] = time_usage 269 | else: 270 | raise NotImplementedError 271 | 272 | time_img_geojson = time.time() 273 | 274 | # ------------------------- post-OCR ------------------------------ 275 | if module_post_ocr: 276 | 277 | # Check if the geojson has been recorded 278 | geojson_postocr_output_dir_check = os.path.join(output_folder, '57k_maps', 'postocr/testr_syn') 279 | file_list = glob.glob(geojson_postocr_output_dir_check + '/*.geojson') 280 | file_list = sorted(file_list) 281 | 282 | existed = [] 283 | for 
file in file_list: 284 | name = file.split('/')[-1].split('.')[0] 285 | existed.append(name) 286 | ##### 287 | 288 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 289 | 290 | sample_map_df2 = sample_map_df.copy() 291 | sample_map_df2['external_id_process'] = sample_map_df2['external_id'] 292 | sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.strip("'") 293 | sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.replace('.', '', regex=False) 294 | sample_map_df2 = sample_map_df2[~sample_map_df2['external_id_process'].isin(existed)] 295 | 296 | print(len(sample_map_df2)) 297 | print(len(existed)) 298 | 299 | ##### 300 | for index, record in sample_map_df2.iterrows(): 301 | 302 | external_id = record.external_id 303 | if external_id not in external_id_to_img_path_dict: 304 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 305 | continue 306 | 307 | img_path = external_id_to_img_path_dict[external_id] 308 | map_name = os.path.basename(img_path).split('.')[0] 309 | 310 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 311 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 312 | 313 | if os.path.isfile(input_geojson_file): 314 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 315 | exe_ret = execute_command(run_postocr_command, if_print_command) 316 | 317 | if 'error' in exe_ret: 318 | error = exe_ret['error'] 319 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 320 | elif 'time_usage' in exe_ret: 321 | time_usage = exe_ret['time_usage'] 322 | time_usage_dict[external_id]['postocr'] = time_usage 323 | else: 324 | raise NotImplementedError 325 | 326 | else: 327 | continue 328 | 329 | time_post_ocr = time.time() 330 | 331 | 332 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 333 | if module_geocoord_geojson: 334 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 335 | 336 | if not os.path.isdir(geojson_output_dir): 337 | os.makedirs(geojson_output_dir) 338 | 339 | for index, record in sample_map_df.iterrows(): 340 | external_id = record.external_id 341 | if external_id not in external_id_to_img_path_dict: 342 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 343 | continue 344 | 345 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 346 | 347 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 348 | 349 | exe_ret = execute_command(run_converter_command, if_print_command) 350 | 351 | if 'error' in exe_ret: 352 | error = exe_ret['error'] 353 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 354 | elif 'time_usage' in exe_ret: 355 | time_usage = exe_ret['time_usage'] 356 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage 357 | else: 358 | raise NotImplementedError 359 | 360 | time_geocoord_geojson = time.time() 361 | 362 | # ------------------------- Link entities in OSM ------------------------------ 363 | if module_entity_linking: 364 | 
os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 365 | 366 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 367 | if not os.path.isdir(geojson_output_dir): 368 | os.makedirs(geojson_output_dir) 369 | 370 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 371 | execute_command(run_linker_command, if_print_command) 372 | 373 | time_entity_linking = time.time() 374 | 375 | 376 | # --------------------- Time usage logging -------------------------- 377 | print('\n') 378 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 379 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 380 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 381 | logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 382 | logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_post_ocr) 383 | logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 384 | logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 385 | 386 | time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 387 | time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 388 | 389 | # check if a time_usage log file already exists 390 | if os.path.isfile(time_usage_log_path): 391 | existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 392 | # if duplicate columns exist, keep the time usage values from the latest run 393 | cols_to_use = existing_df.columns.difference(time_usage_df.columns) 394 | 395 | time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 396 | 397 | # make sure time_usage.csv always has the latest time usage 398 | # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 399 | m_time = os.path.getmtime(time_usage_log_path) 400 | dt_m = datetime.datetime.fromtimestamp(m_time) 401 | timestr = dt_m.strftime("%Y%m%d-%H%M%S") 402 | 403 | deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 404 | run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 405 | execute_command(run_command, if_print_command) 406 | 407 | time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 408 | 409 | # --------------------- Error logging -------------------------- 410 | print('\n') 411 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 412 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 413 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 414 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 415 | 416 | 417 | def main(): 418 | parser = argparse.ArgumentParser() 419 | 420 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 421 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 422 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 423 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') 
--------------------------------------------------------------------------------
/run_only_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 |
4 | map_kurator_system_dir = '/home/zekun/dr_maps/mapkurator-system/'
5 | text_spotting_model_dir = '/home/zekun/antique_names/model/AdelaiDet/'
6 | sample_map_path = 'm1_geotiff/data/sample_US_jp2_100_maps.csv'
7 |
8 | # # run module1 to generate geotiff
9 | # os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
10 | # input_csv = os.path.join(map_kurator_system_dir ,sample_map_path)
11 | geotiff_output_dir = os.path.join(map_kurator_system_dir ,'m1_geotiff/data/geotiff') # needed by the cropping step below even when module1 is skipped
12 | if not os.path.isdir(geotiff_output_dir):
13 |     os.makedirs(geotiff_output_dir)
14 |
15 | # run_geotiff_command = 'python convert_image_to_geotiff.py --sample_map_path '+ input_csv +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
16 | # print(run_geotiff_command)
17 | # #os.system(run_geotiff_command)
18 |
19 |
20 | # run module2: image cropping
21 |
22 | geotiff_path_list = glob.glob(os.path.join(geotiff_output_dir, '*.geotiff'))
23 | assert(len(geotiff_path_list) != 0)
24 |
25 | for geotiff_path in geotiff_path_list:
26 |     os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
27 |     map_name = os.path.basename(geotiff_path).split('.')[0]
28 |
29 |     cropping_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop/')
30 |     if not os.path.isdir(cropping_output_dir):
31 |         os.makedirs(cropping_output_dir)
32 |     run_crop_command = 'python crop_img.py --img_path '+geotiff_path + ' --output_dir '+ cropping_output_dir
33 |     print(run_crop_command)
34 |     os.system(run_crop_command)
35 |
36 |     # run module2: text spotting
37 |     os.chdir(text_spotting_model_dir)
38 |     map_name = os.path.basename(geotiff_path).split('.')[0]
39 |
40 |     spotting_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/',map_name)
41 |     if not os.path.isdir(spotting_output_dir):
42 |         os.makedirs(spotting_output_dir)
43 |
44 |     run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input '+ map_kurator_system_dir+'/m2_detection_recognition/data/100_maps_crop/'+map_name+' --output '+ spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
45 |     run_spotting_command += ' 1> /dev/null'
46 |     print(run_spotting_command)
47 |     os.system(run_spotting_command)
48 |
49 |     #break
50 |
51 |
52 | # run module3: geojson stitching
53 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) # stitch_output.py lives in m3_image_geojson
54 |
55 |
56 | stitch_input_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/')
57 | stitch_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_geojson_abc/')
58 | if not os.path.isdir(stitch_output_dir):
59 |     os.makedirs(stitch_output_dir)
60 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_dir ' + stitch_output_dir
61 | print(run_stitch_command)
62 | os.system(run_stitch_command)
63 |
64 |
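For orientation, this eval script reads and writes everything under m2_detection_recognition/data/ (the stitch script itself sits in m3_image_geojson). The layout it assumes is roughly the following; map names are illustrative:

```
m2_detection_recognition/data/
├── 100_maps_crop/            # image patches written by crop_img.py, one subfolder per map
│   └── <map_name>/
├── 100_maps_crop_outabc/     # per-patch ABCNet spotting results from demo/demo.py
│   └── <map_name>/
└── 100_maps_geojson_abc/     # stitched map-level geojson from stitch_output.py
```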
--------------------------------------------------------------------------------
/run_sanborn.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import argparse
4 | import time
5 | import logging
6 | import pandas as pd
7 | import pdb
8 | import datetime
9 | from PIL import Image
10 | from utils import get_img_path_from_external_id
11 |
12 | logging.basicConfig(level=logging.INFO)
13 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
14 |
15 | '''
16 | This Sanborn processing pipeline shares some common modules with the DR processing pipeline, including cropping and text spotting.
17 | The unique modules are geocoding, clustering, and output geojson generation.
18 | The GeoTiff conversion, image dimension retrieval, image-to-geo coordinate conversion, and entity linking modules are removed.
19 | Time usage analysis and error reason logging are removed.
20 | '''
21 |
22 | def execute_command(command, if_print_command):
23 |     t1 = time.time()
24 |
25 |     if if_print_command:
26 |         print(command)
27 |     os.system(command)
28 |
29 |     t2 = time.time()
30 |     time_usage = t2 - t1
31 |     return time_usage
32 |
33 | def get_img_dimension(img_path):
34 |     map_img = Image.open(img_path)
35 |     width, height = map_img.size
36 |
37 |     return width, height
38 |
39 |
40 | def run_pipeline(args):
41 |     # ------------------------- Pass arguments -----------------------------------------
42 |     map_kurator_system_dir = args.map_kurator_system_dir
43 |     text_spotting_model_dir = args.text_spotting_model_dir
44 |     # sample_map_path = args.sample_map_csv_path
45 |     expt_name = args.expt_name
46 |     output_folder = args.output_folder
47 |     input_map_dir = args.input_map_dir
48 |
49 |     module_get_dimension = args.module_get_dimension
50 |     # module_gen_geotiff = args.module_gen_geotiff
51 |     module_cropping = args.module_cropping
52 |     module_text_spotting = args.module_text_spotting
53 |     module_img_geojson = args.module_img_geojson
54 |     # module_geocoord_geojson = args.module_geocoord_geojson
55 |     # module_entity_linking = args.module_entity_linking
56 |     module_geocoding = args.module_geocoding
57 |     module_clustering = args.module_clustering
58 |
59 |     spotter_option = args.spotter_option
60 |     geocoder_option = args.geocoder_option
61 |     api_key = args.api_key
62 |     user_name = args.user_name
63 |
64 |     metadata_tsv_path = args.metadata_tsv_path
65 |
66 |     if_print_command = args.print_command
67 |
68 |     sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
69 |
70 |     file_list = os.listdir(input_map_dir)
71 |
72 |     file_list = [f for f in file_list if os.path.basename(f).split('.')[-1] in ['sid','jp2','png','jpg','jpeg','tiff','tif','geotiff']]
73 |
74 |     print(len(file_list))
75 |
76 |
77 |
78 |     # pdb.set_trace()
79 |     # ------------------------- Read sample map list and prepare output dir ----------------
80 |
81 |     cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
82 |     spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_option)
83 |     stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_option)
84 |     # geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_option + '/')
85 |     geocoding_output_dir = os.path.join(output_folder, expt_name, 'geocoding_suffix_' + spotter_option)
86 |     clustering_output_dir = os.path.join(output_folder, expt_name, 'cluster_' + spotter_option + '/')
87 |
88 |
89 |     # ------------------------- Image cropping ------------------------------
90 |     if module_cropping:
91 |         # for index, record in sample_map_df.iterrows():
92 |         for file_path in file_list:
93 |             img_path = os.path.join(input_map_dir, file_path)
94 |             print(img_path)
95 |             # external_id = record.external_id
96 |             # img_path = external_id_to_img_path_dict[external_id]
97 |
98 |             map_name = os.path.basename(img_path).split('.')[0]
99 |
100 |
101 |             os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
102 |             if not os.path.isdir(cropping_output_dir):
103 |                 os.makedirs(cropping_output_dir)
104 |             run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
105 |
106 |             time_usage = execute_command(run_crop_command, if_print_command)
107 |             # time_usage_dict[external_id]['cropping'] = time_usage
108 |
109 |     time_cropping = time.time()
110 |
111 |     # # ------------------------- Text Spotting (patch level) ------------------------------
112 |     if module_text_spotting:
113 |         os.chdir(text_spotting_model_dir)
114 |
115 |         # for index, record in sample_map_df.iterrows():
116 |         for file_path in file_list:
117 |             map_name = os.path.basename(file_path).split('.')[0]
118 |
119 |             map_spotting_output_dir = os.path.join(spotting_output_dir,map_name)
120 |             if not os.path.isdir(map_spotting_output_dir):
121 |                 os.makedirs(map_spotting_output_dir)
122 |
123 |             if spotter_option == 'abcnet':
124 |                 run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+ map_spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
125 |             elif spotter_option == 'testr':
126 |                 run_spotting_command = 'python demo/demo.py --output_json --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+map_spotting_output_dir +' --opts MODEL.WEIGHTS icdar15_testr_R_50_polygon.pth'
127 |             else:
128 |                 raise NotImplementedError
129 |
130 |             run_spotting_command += ' 1> /dev/null'
131 |
132 |             time_usage = execute_command(run_spotting_command, if_print_command)
133 |
134 |             logging.info('Done text spotting for %s', map_name)
135 |
136 |     time_text_spotting = time.time()
137 |
138 |
139 |     # # ------------------------- Image coord geojson (map level) ------------------------------
140 |     if module_img_geojson:
141 |         os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
142 |         if not os.path.isdir(stitch_output_dir):
143 |             os.makedirs(stitch_output_dir)
144 |
145 |         for file_path in file_list:
146 |             map_name = os.path.basename(file_path).split('.')[0]
147 |
148 |             stitch_input_dir = os.path.join(spotting_output_dir, map_name)
149 |             output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
150 |
151 |             run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
152 |             time_usage = execute_command(run_stitch_command, if_print_command)
153 |             # time_usage_dict[external_id]['imgcoord_geojson'] = time_usage
154 |
155 |     time_img_geojson = time.time()
156 |
157 |     # # ------------------------- Geocoding ------------------------------
158 |     if module_geocoding:
159 |         os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
160 |
161 |         if metadata_tsv_path is not None:
162 |             map_df = pd.read_csv(metadata_tsv_path, sep='\t')
163 |
164 |         if not os.path.isdir(geocoding_output_dir):
165 |             os.makedirs(geocoding_output_dir)
166 |
167 |         for file_path in file_list:
168 |             map_name = os.path.basename(file_path).split('.')[0]
169 |             if metadata_tsv_path is not None:
170 |                 suffix = map_df[map_df['filename'] == map_name]['City'].values[0] # LoC sanborn
171 |                 suffix = ', ' + suffix
172 |             else:
173 |                 suffix = ', Los Angeles' # LA sanborn
174 |
175 |             run_geocoding_command = 'python3 s1_geocoding.py --input_map_geojson_path='+ os.path.join(stitch_output_dir,map_name + '.geojson') + ' --output_folder=' + geocoding_output_dir + \
176 |                 ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
177 |
178 |             time_usage = execute_command(run_geocoding_command, if_print_command)
179 |
180 |             # break
181 |
182 |             logging.info('Done geocoding for %s', map_name)
183 |
184 |
185 |     time_geocoding = time.time()
186 |
187 |
188 |     if module_clustering:
189 |         os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
190 |
191 |         if not os.path.isdir(clustering_output_dir):
192 |             os.makedirs(clustering_output_dir)
193 |
194 |         # for file_path in file_list:
195 |         #     map_name = os.path.basename(file_path).split('.')[0]
196 |
197 |         #     run_clustering_command = 'python3 s2_clustering.py --dataset_name='+ expt_name + ' --output_folder=' + geocoding_output_dir + \
198 |         #         ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
199 |
200 |         #     time_usage = execute_command(run_clustering_command, if_print_command)
201 |
202 |
203 |         #     logging.info('Done clustering for %s', map_name)
204 |
205 |
206 | def main():
207 |     parser = argparse.ArgumentParser()
208 |
209 |     parser.add_argument('--map_kurator_system_dir', type=str, default='/home/zekun/dr_maps/mapkurator-system/')
210 |     parser.add_argument('--text_spotting_model_dir', type=str, default='/home/zekun/antique_names/model/AdelaiDet/')
211 |
212 |     parser.add_argument('--input_map_dir', type=str, default='/data2/mrm_sanborn_maps/LA_sanborn')
213 |     parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output')
214 |     parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
215 |
216 |     parser.add_argument('--module_get_dimension', default=False, action='store_true')
217 |     # parser.add_argument('--module_gen_geotiff', default=False, action='store_true') # only supports dr maps
218 |     parser.add_argument('--module_cropping', default=False, action='store_true')
219 |     parser.add_argument('--module_text_spotting', default=False, action='store_true')
220 |     parser.add_argument('--module_img_geojson', default=False, action='store_true')
221 |     parser.add_argument('--module_geocoding', default=False, action='store_true') # only supports sanborn
222 |     # parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') # only supports dr maps
223 |     # parser.add_argument('--module_entity_linking', default=False, action='store_true') # only supports dr maps
224 |     parser.add_argument('--module_clustering', default=False, action='store_true') # only supports sanborn
225 |
226 |     parser.add_argument('--print_command', default=False, action='store_true')
227 |
228 |     parser.add_argument('--spotter_option', type=str, default='testr',
229 |                         choices=['abcnet', 'testr'],
230 |                         help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model
231 |
232 |     parser.add_argument('--geocoder_option', type=str, default='arcgis',
233 |                         choices=['arcgis', 'google','geonames','osm'],
234 |                         help='Select geocoder option from ["arcgis","google","geonames","osm"]') # select geocoder
235 |
236 |     # params for geocoder:
237 |     parser.add_argument('--api_key', type=str, default=None, help='api_key for the geocoder; can be None if not running the geocoding module')
238 |     parser.add_argument('--user_name', type=str, default=None, help='user_name for the geocoder; can be None if not running the geocoding module')
239 |     parser.add_argument('--metadata_tsv_path', type=str, default=None) # '/home/zekun/Sanborn/Sheet_List.tsv'
240 |
241 |
242 |     args = parser.parse_args()
243 |     print('\n')
244 |     print(args)
245 |     print('\n')
246 |
247 |     run_pipeline(args)
248 |
249 |
250 | if __name__ == '__main__':
251 |
252 |     main()
253 |
254 |
255 |
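A sample end-to-end invocation of the Sanborn pipeline (directories, experiment name, API key, and user name below are placeholders; the flags are the ones defined in main() above):

```
python run_sanborn.py \
    --map_kurator_system_dir /home/zekun/dr_maps/mapkurator-system/ \
    --text_spotting_model_dir /home/zekun/antique_names/model/AdelaiDet/ \
    --input_map_dir /data2/mrm_sanborn_maps/LA_sanborn \
    --output_folder /data2/rumsey_output \
    --expt_name la_sanborn_test \
    --module_cropping --module_text_spotting --module_img_geojson --module_geocoding \
    --spotter_option testr \
    --geocoder_option arcgis \
    --api_key <API_KEY> --user_name <USER_NAME> \
    --print_command
```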
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import pandas as pd
4 | import ast
5 | import argparse
6 | import logging
7 | import pdb
8 |
9 | logging.basicConfig(level=logging.INFO)
10 |
11 | def func_file_to_fullpath_dict(file_path_list):
12 |
13 |     file_fullpath_dict = dict()
14 |     for file_path in file_path_list:
15 |         file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path
16 |
17 |     return file_fullpath_dict
18 |
19 | def get_img_path_from_external_id(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
20 |     # returns (1) a dict with external_id as key and full image path as value, and (2) a list of external_ids for which no image path could be found
21 |
22 |     jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
23 |     sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg'))
24 |     add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
25 |
26 |     jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
27 |     sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
28 |     add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
29 |
30 |     sample_map_df = pd.read_csv(sample_map_path, dtype={external_id_key:str})
31 |
32 |     external_id_to_img_path_dict = {}
33 |
34 |     unmatched_external_id_list = []
35 |
36 |     for index, record in sample_map_df.iterrows():
37 |         external_id = record[external_id_key] # honor the configurable id column (e.g. external_id or ListNo)
38 |         filename_without_extension = external_id.strip("'").replace('.','')
39 |
40 |         full_path = ''
41 |         if filename_without_extension in jp2_file_fullpath_dict:
42 |             full_path = jp2_file_fullpath_dict[filename_without_extension]
43 |         elif filename_without_extension in sid_file_fullpath_dict:
44 |             full_path = sid_file_fullpath_dict[filename_without_extension]
45 |         elif filename_without_extension in add_file_fullpath_dict:
46 |             full_path = add_file_fullpath_dict[filename_without_extension]
47 |         else:
48 |             # print('image with external_id not found in image_dir:', external_id)
49 |             unmatched_external_id_list.append(external_id)
50 |             continue
51 |         assert (len(full_path)!=0)
52 |
53 |         external_id_to_img_path_dict[external_id] = full_path
54 |
55 |     return external_id_to_img_path_dict, unmatched_external_id_list
56 |
57 | def get_img_path_from_external_id_and_image_no(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
58 |     # returns (1) a dict with external_id as key and full image path as value, and (2) a list of external_ids for which no image path could be found
59 |
60 |     jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
61 |     sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly
62 |     add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
63 |
64 |     jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
65 |     sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
66 |     add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
67 |
68 |     sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
69 |
70 |     external_id_to_img_path_dict = {}
71 |
72 |     unmatched_external_id_list = []
73 |     for index, record in sample_map_df.iterrows():
74 |         external_id = record.external_id
75 |         image_no = record.image_no
76 |         # filename_without_extension = external_id.strip("'").replace('.','')
77 |         filename_without_extension = image_no.strip("'").split('.')[0]
78 |
79 |         full_path = ''
80 |         if filename_without_extension in jp2_file_fullpath_dict:
81 |             full_path = jp2_file_fullpath_dict[filename_without_extension]
82 |         elif filename_without_extension in sid_file_fullpath_dict:
83 |             full_path = sid_file_fullpath_dict[filename_without_extension]
84 |         elif filename_without_extension in add_file_fullpath_dict:
85 |             full_path = add_file_fullpath_dict[filename_without_extension]
86 |         else:
87 |             print('image with external_id not found in image_dir:', external_id)
88 |             unmatched_external_id_list.append(external_id)
89 |             continue
90 |         assert (len(full_path)!=0)
91 |
92 |         external_id_to_img_path_dict[external_id] = full_path
93 |
94 |     return external_id_to_img_path_dict, unmatched_external_id_list
95 |
96 |
97 | if __name__ == '__main__':
98 |
99 |     parser = argparse.ArgumentParser()
100 |     parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/',
101 |                         help='image dir of jp2 files.')
102 |     parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/',
103 |                         help='image dir of sid files.')
104 |     parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/',
105 |                         help='image dir of additional luna files.')
106 |     parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
107 |                         help='path to sample map csv, which contains gcps info')
108 |     parser.add_argument('--external_id_key', type=str, default='external_id',
109 |                         help='key string for external id, could be external_id or ListNo')
110 |
111 |     args = parser.parse_args()
112 |     print(args)
113 |
114 |     # get_img_path_from_external_id(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
115 |     #                               sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
116 |
117 |     get_img_path_from_external_id_and_image_no(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
118 |                                                sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
119 |
--------------------------------------------------------------------------------
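To make the filename matching in utils.py concrete, a small sketch of the two ID normalizations used above (the sample values are made up):

```
# get_img_path_from_external_id: quote-stripped external_id with dots removed
external_id = "'5242.000'"
print(external_id.strip("'").replace('.', ''))  # -> 5242000

# get_img_path_from_external_id_and_image_no: quote-stripped image_no up to the first dot
image_no = "'5242.001'"
print(image_no.strip("'").split('.')[0])        # -> 5242
```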