├── .gitignore
├── README.md
├── external_id_search
│   └── script.py
├── m0_preprocessing
│   └── convert_sid_to_jpg.py
├── m1_geotiff
│   └── convert_image_to_geotiff.py
├── m2_detection_recognition
│   └── crop_img.py
├── m3_image_geojson
│   └── stitch_output.py
├── m4_geocoordinate_converter
│   └── convert_geojson_to_geocoord.py
├── m5_entity_linker
│   └── entity_linker.py
├── m6_post_ocr
│   └── lexical_search.py
├── m_sanborn
│   ├── s1_geocoding.py
│   ├── s2_clustering.py
│   └── s3_gen_geojson.py
├── metadata
│   ├── davidrumsey
│   │   ├── davidrumsey.py
│   │   └── davidrumsey_metadata.csv
│   └── sanborn.py
├── model_card_template
├── pipe_run.sh
├── pipe_run_img.sh
├── requirements.txt
├── run.py
├── run_img.py
├── run_leeje.py
├── run_only_eval.py
├── run_sanborn.py
└── utils.py
/.gitignore:
--------------------------------------------------------------------------------
1 | data/
2 | data0/
3 | data1/
4 | rumsey_output/
5 | .idea/
6 | .env
7 | MrSID*
8 | __pycache__
9 | debug/
10 | .ipynb_checkpoints/
11 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 |
2 | ---
3 |
4 | # Table of Contents
5 | - [Dataset Card](#dataset-card)
6 | - [Dataset Description](#dataset-description)
7 | - [Dataset Download Link](#dataset-download-link)
8 | - [Dataset Languages](#dataset-languages)
9 | - [Dataset Structure](#dataset-structure)
10 | - [Data Fields](#data-fields)
11 | - [Model Card](#model-card)
12 | - [Model Description](#model-description)
13 | - [Model Summary](#model-summary)
14 | - [Model Tags](#model-tags)
15 | - [Model Input and Output](#model-input-and-output)
16 | - [Additional Information](#additional-information)
17 | - [Licensing Information](#licensing-information)
18 | - [Contributions](#contributions)
19 |
20 |
21 | # Dataset Card
22 |
23 | ## Dataset Description
24 |
25 | Map text recognized from the georeferenced Rumsey historical map collection.
26 |
27 | ### Dataset Download Link
28 |
29 | - **Original Map Images:** https://www.davidrumsey.com/
30 | - **Processed Output:** https://s3.msi.umn.edu/rumsey_output/geojson_testr_syn_54119.zip
31 |
32 | ### Dataset Languages
33 |
34 | English
35 |
36 | ### Language Creators
37 |
38 | Machine-generated
39 |
40 | ## Dataset Structure
41 |
42 | ### Data Fields
43 |
44 |
45 |
46 | ### Output File Name
47 |
48 | The output GeoJSON file is named after the external ID of the original map image.
49 |
50 |
51 |
52 |
53 |
54 | # Model Card
55 |
56 | ## Model Description
57 |
58 | A **fully automatic** pipeline for processing large numbers of scanned historical map images. **Outputs** include the recognized text labels, label bounding polygons, labels after post-OCR correction, and geo-entity identifiers in the OSM database.
59 |
60 | ### Model Summary
61 |
62 | - **Orange boxes:** Modules in the pipeline
63 | - **Blue boxes:** Inputs of the modules
64 | - **Green boxes:** Outputs of the modules
65 |
66 |
67 |
68 | ### Model Details
69 | - **ImageCropping** divides huge map images (>10K pixels) into smaller image patches (1K pixels) that the text spotter can process.
70 |
71 | - **PatchTextSpotter** uses a state-of-the-art network architecture [TESTR](https://github.com/mlpc-ucsd/TESTR) for detecting and recognizing text labels on image patches. Due to the lack of annotated samples for training, we create a set of synthetic maps to mimic the text styles (e.g., font, spacing, orientation) in the real historical maps. We place the location names from OpenStreetMap on a map by considering the shape of the location geometry and merge the text with various background styles extracted from the Rumsey collection maps. We train the model with these unlimited synthetic maps and apply the model to the historical maps.
72 |
73 | - **PatchtoMapMerging** merges the patch-level spotting results into map-level results.
74 |
75 | - **GeocoordinateConverter** converts the text label bounding polygons from the image coordinate system to the geographic coordinate system. Note: polygons in both coordinate systems are saved in the output.
76 |
77 | - **PostOCR** verifies the output of PatchTextSpotter and corrects misspelled words using the OpenStreetMap dictionary. The module finds word candidates with the [fuzzy query function](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html) of Elasticsearch, which indexes the place-name attribute from the OpenStreetMap dictionary, and then picks one candidate by word popularity in the dictionary (a sketch of this query appears after this list).
78 |
79 | - **EntityLinker** links each map text to candidate geo-entities in OpenStreetMap. The entity linking retrieves candidates that satisfy two criteria: 1) the recognized text from the text spotter contains the geo-entity's name, and 2) the geocoordinates of the detected bounding polygon intersect the geo-entity's geometry. (Geocoordinates are obtained from GeocoordinateConverter.)
80 |
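The PostOCR lookup can be reproduced in a few lines of Python. Below is a minimal sketch, assuming a local Elasticsearch instance with the OSM vocabulary indexed as `osm-voca` (the setup used by `m6_post_ocr/lexical_search.py`); the function name is illustrative:

```
import json
import requests

def fuzzy_candidates(word, fuzziness=1):
    """Return OSM place-name candidates within the given edit distance of `word`."""
    query = {"query": {"fuzzy": {"name": {"value": word.lower(), "fuzziness": str(fuzziness)}}}}
    resp = requests.get('http://localhost:9200/osm-voca/_search',
                        data=json.dumps(query),
                        headers={'Content-Type': 'application/json'})
    hits = resp.json()["hits"]["hits"]
    return [hit['_source']['name'] for hit in hits if 'name' in hit['_source']]

# e.g., fuzzy_candidates('minneapols') may return ['Minneapolis', ...] (illustrative)
```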
81 |
82 | ### How To Use
83 | All the modules can be launched from `run.py`. All outputs are saved in the `expt_name` subfolder of the `output_folder` specified in the input arguments.
84 |
85 | ```
86 | usage: run.py [-h] [--map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR] [--text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR]
87 | [--sample_map_csv_path SAMPLE_MAP_CSV_PATH] [--output_folder OUTPUT_FOLDER] [--expt_name EXPT_NAME] [--module_get_dimension]
88 | [--module_gen_geotiff] [--module_cropping] [--module_text_spotting] [--module_img_geojson] [--module_geocoord_geojson] [--module_entity_linking]
89 | [--module_post_ocr] [--spotter_model {abcnet,testr}] [--spotter_config SPOTTER_CONFIG] [--spotter_expt_name SPOTTER_EXPT_NAME] [--print_command]
90 |
91 | optional arguments:
92 | -h, --help show this help message and exit
93 | --map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR
94 | --text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR
95 | --sample_map_csv_path SAMPLE_MAP_CSV_PATH
96 | --output_folder OUTPUT_FOLDER
97 | --expt_name EXPT_NAME
98 | --module_get_dimension
99 | --module_gen_geotiff
100 | --module_cropping
101 | --module_text_spotting
102 | --module_img_geojson
103 | --module_geocoord_geojson
104 | --module_entity_linking
105 | --module_post_ocr
106 | --spotter_model {abcnet,testr}
107 | Select text spotting model option from ["abcnet","testr"]
108 | --spotter_config SPOTTER_CONFIG
109 | Path to the config file for text spotting model
110 | --spotter_expt_name SPOTTER_EXPT_NAME
111 | Name of spotter experiment, if empty using config file name
112 | --print_command
113 | ```
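
For example, the text-spotting module can be run as follows (adapted from `pipe_run.sh`; the paths are environment-specific):

```
python3 run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_2508.csv \
    --expt_name Rerun_2_2508 --output_folder /data2/rumsey_output/ \
    --module_text_spotting --spotter_model testr \
    --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml \
    --spotter_expt_name testr_syn
```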
114 |
115 | ### Model Tags
116 | - Text spotting
117 | - Entity Linking
118 | - Historical maps
119 |
120 |
121 | # Additional Information
122 |
123 | ### Licensing Information
124 |
125 | MIT License
126 |
127 | ### Contribution and Acknowledgement
128 |
129 | Thanks to [@zekun-li](https://zekun-li.github.io/), [@Jina-Kim](https://github.com/Jina-Kim), [@MinNamgung](https://github.com/MinNamgung) and [@linyijun](https://github.com/linyijun) for adding this dataset and models.
130 |
131 | Thanks to [TESTR](https://github.com/mlpc-ucsd/TESTR) for an open-source text spotting model.
132 |
--------------------------------------------------------------------------------
/external_id_search/script.py:
--------------------------------------------------------------------------------
1 | from elasticsearch_dsl import Search, Q
2 | from elasticsearch import Elasticsearch, helpers
3 | from elasticsearch import RequestsHttpConnection
4 | import argparse
5 | import os
6 | import glob
7 | import json
8 | import nltk
9 | import logging
10 | from dotenv import load_dotenv
11 |
12 | import pandas as pd
13 | import numpy as np
14 |
15 | import re
16 | import warnings
17 | warnings.filterwarnings("ignore")
18 |
19 |
20 |
21 | def db_connect():
22 | """Elasticsearch Connection on Sansa"""
23 | load_dotenv()
24 |
25 | DB_HOST = os.getenv("DB_HOST")
26 | USER_NAME = os.getenv("DB_USERNAME")
27 | PASSWORD = os.getenv("DB_PASSWORD")
28 |
29 | es = Elasticsearch([DB_HOST], connection_class=RequestsHttpConnection, http_auth=(USER_NAME, PASSWORD), verify_certs=False)
30 | return es
31 |
32 |
33 | def query(target):
34 | es = db_connect()
35 | inputs = target.upper()
36 | query = {"query": {"match": {"text": f"{inputs}"}}}
37 | test = es.search(index="meta", body=query, size=10000)["hits"]["hits"]
38 |
39 | id_list = []
40 |     if len(test) != 0:
41 | for i in range(len(test)):
42 | map_id = test[i]['_source']['external_id']
43 | id_list.append(map_id)
44 |
45 |
46 | result = sorted(list(set(id_list)))
47 | return result
48 |
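# e.g., query('east') returns the sorted, de-duplicated external IDs of all maps
# whose indexed metadata text mentions "EAST" (illustrative; depends on the index)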
49 |
50 | def main(args):
51 | keyword = args.target
52 | metadata_path = args.metadata
53 | meta_df = pd.read_csv(metadata_path)
54 | meta_df['tmp'] = meta_df['image_no'].str.split(".").str[0]
55 |
56 | results = query(keyword)
57 | # print(f' "{keyword}" exist in: {results}')
58 |
59 | tmp_df = meta_df[meta_df.tmp.isin(results)]
60 |
61 |     print(f'"{keyword}" exists in:')
62 | for index, row in tmp_df.iterrows():
63 | print(f'{row.tmp} \t {row.title}')
64 |
65 |
66 | if __name__ == '__main__':
67 | parser = argparse.ArgumentParser()
68 | parser.add_argument('--target', type=str, default='east', help='')
69 | parser.add_argument('--metadata', type=str, default='/home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv', help='')
70 |
71 | args = parser.parse_args()
72 | print(args)
73 |
74 | main(args)
75 |
--------------------------------------------------------------------------------
/m0_preprocessing/convert_sid_to_jpg.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import time
4 | import multiprocessing
5 |
6 | sid_dir = '/data/rumsey-sid'
7 | sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
8 | num_process = 20
9 | if_print_command = True
10 |
11 | sid_list = glob.glob(os.path.join(sid_dir, '*/*.sid'))
12 |
13 | def execute_command(command, if_print_command):
14 | t1 = time.time()
15 |
16 | if if_print_command:
17 | print(command)
18 | os.system(command)
19 |
20 | t2 = time.time()
21 | time_usage = t2 - t1
22 | return time_usage
23 |
24 |
25 | def conversion(img_path):
26 | mrsiddecode_executable="/home/zekun/dr_maps/mapkurator-system/m1_geotiff/MrSID_DSDK-9.5.4.4709-rhel6.x86-64.gcc531/Raster_DSDK/bin/mrsiddecode"
27 | map_name = os.path.basename(img_path)[:-4]
28 |
29 | redirected_path = os.path.join(sid_to_jpg_dir, map_name + '.jpg')
30 |
31 | run_sid_to_jpg_command = mrsiddecode_executable + ' -quiet -i '+ img_path + ' -o '+redirected_path
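    # e.g. (illustrative): mrsiddecode -quiet -i /data/rumsey-sid/<subdir>/<map>.sid -o /data2/rumsey_sid_to_jpg/<map>.jpg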
32 | time_usage = execute_command(run_sid_to_jpg_command, if_print_command)
33 |
34 |
35 |
36 | if __name__ == "__main__":
37 | pool = multiprocessing.Pool(num_process)
38 | start_time = time.perf_counter()
39 | processes = [pool.apply_async(conversion, args=(sid_path,)) for sid_path in sid_list]
40 | result = [p.get() for p in processes]
41 | finish_time = time.perf_counter()
42 | print(f"Program finished in {finish_time-start_time} seconds")
43 |
44 |
--------------------------------------------------------------------------------
/m1_geotiff/convert_image_to_geotiff.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import pandas as pd
4 | import ast
5 | import argparse
6 | import logging
7 | import pdb
8 |
9 | logging.basicConfig(level=logging.INFO)
10 |
11 | def func_file_to_fullpath_dict(file_path_list):
12 |
13 | file_fullpath_dict = dict()
14 | for file_path in file_path_list:
15 | file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path
16 |
17 | return file_fullpath_dict
18 |
19 | def main(args):
20 |
21 | jp2_root_dir = args.jp2_root_dir
22 | sid_root_dir = args.sid_root_dir
23 | additional_root_dir = args.additional_root_dir
24 | out_geotiff_dir = args.out_geotiff_dir
25 |
26 | sample_map_path = args.sample_map_path
27 | external_id_key = args.external_id_key
28 |
29 | jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
30 | sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly
31 | add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
32 |
33 | jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
34 | sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
35 | add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
36 |
37 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
38 |
39 |
40 | for index, record in sample_map_df.iterrows():
41 | external_id = record.external_id
42 | transform_method = record.transformation_method
43 | gcps = record.gcps
44 | filename_without_extension = external_id.strip("'").replace('.','')
45 |
46 | full_path = ''
47 | if filename_without_extension in jp2_file_fullpath_dict:
48 | full_path = jp2_file_fullpath_dict[filename_without_extension]
49 | elif filename_without_extension in sid_file_fullpath_dict:
50 | full_path = sid_file_fullpath_dict[filename_without_extension]
51 | elif filename_without_extension in add_file_fullpath_dict:
52 | full_path = add_file_fullpath_dict[filename_without_extension]
53 | else:
54 | print('image with external_id not found in image_dir:', external_id)
55 | continue
56 | assert (len(full_path)!=0)
57 |
58 | gcps = ast.literal_eval(gcps)
59 |
60 | gcp_str = ''
61 | for gcp in gcps:
62 | lng, lat = gcp['location']
63 | x, y = gcp['pixel']
64 | gcp_str += '-gcp '+str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' '
65 |
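        # e.g., a GCP {'location': [-118.269356489, 34.063140276], 'pixel': [5629, 5064]}
        # contributes '-gcp 5629 5064 -118.269356489 34.063140276 ' to gcp_str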
66 | # gdal_translate to add GCP to raw image
67 | gdal_command = 'gdal_translate -of Gtiff '+gcp_str + full_path + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff'
68 | print(gdal_command)
69 | os.system(gdal_command)
70 |
71 |
72 | assert transform_method in ['affine','polynomial','tps']
73 |
74 | # reprojection with gdal_warp
75 | if transform_method == 'affine':
76 | # first order
77 |
78 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 1 -of GTiff ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff'
79 |
80 | elif transform_method == 'polynomial':
81 | # second order
82 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 2 -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff'
83 |
84 | elif transform_method == 'tps':
85 | # Thin plate spline #debug/11558008.geotiff #10057000.geotiff
86 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -tps -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff'
87 |
88 | else:
89 | raise NotImplementedError
90 | print(warp_command)
91 | os.system(warp_command)
92 | # remove temporary tiff file
93 | # os.system('rm ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff')
94 |
95 |
96 | logging.info('Done generating geotiff for %s', external_id)
97 |
98 |
99 | if __name__ == '__main__':
100 |
101 | parser = argparse.ArgumentParser()
102 | parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/',
103 | help='image dir of jp2 files.')
104 | parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/',
105 | help='image dir of sid files.')
106 | parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/',
107 | help='image dir of additional luna files.')
108 | parser.add_argument('--out_geotiff_dir', type=str, default='data/geotiff/',
109 | help='output dir for geotiff')
110 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
111 | help='path to sample map csv, which contains gcps info')
112 | parser.add_argument('--external_id_key', type=str, default='external_id',
113 | help='key string for external id, could be external_id or ListNo')
114 |
115 | args = parser.parse_args()
116 | print(args)
117 |
118 |
119 | main(args)
120 |
--------------------------------------------------------------------------------
/m2_detection_recognition/crop_img.py:
--------------------------------------------------------------------------------
1 | import sys
2 | import os
3 | from PIL import Image, ImageFile
4 | import numpy as np
5 | import argparse
6 | import logging
7 |
8 | logging.basicConfig(level=logging.INFO)
9 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
10 |
11 | # allow loading truncated image files (requires the ImageFile import above)
12 | ImageFile.LOAD_TRUNCATED_IMAGES = True
13 |
14 | def main(args):
15 |
16 | img_path = args.img_path
17 | output_dir = args.output_dir
18 |
19 | map_name = os.path.basename(img_path).split('.')[0] # get the map name without extension
20 | output_dir = os.path.join(output_dir, map_name)
21 |
22 | if not os.path.isdir(output_dir):
23 | os.makedirs(output_dir)
24 |
25 | map_img = Image.open(img_path)
26 | width, height = map_img.size
27 |
28 | #print(width, height)
29 |
30 | shift_size = 1000
31 |
32 | # pad the image to the size divisible by shift-size
33 | num_tiles_w = int(np.ceil(1. * width / shift_size))
34 | num_tiles_h = int(np.ceil(1. * height / shift_size))
35 | enlarged_width = int(shift_size * num_tiles_w)
36 | enlarged_height = int(shift_size * num_tiles_h)
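    # e.g., a 10500 x 8200 px map (illustrative) is padded to 11000 x 9000 px,
    # yielding 11 x 9 = 99 patches of 1000 x 1000 px each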
37 |
38 | enlarged_map = Image.new(mode="RGB", size=(enlarged_width, enlarged_height))
39 | # paste map_imge to enlarged_map
40 | enlarged_map.paste(map_img)
41 |
42 | for idx in range(0, num_tiles_h):
43 | for jdx in range(0, num_tiles_w):
44 | img_clip = enlarged_map.crop((jdx * shift_size, idx * shift_size,(jdx + 1) * shift_size, (idx + 1) * shift_size, ))
45 |
46 | out_path = os.path.join(output_dir, 'h' + str(idx) + '_w' + str(jdx) + '.jpg')
47 | img_clip.save(out_path)
48 |
49 | logging.info('Done cropping %s' %img_path )
50 |
51 |
52 | if __name__ == '__main__':
53 |
54 | parser = argparse.ArgumentParser()
55 | parser.add_argument('--img_path', type=str, default='../data/100_maps/8628000.jp2',
56 | help='path to image file.')
57 | parser.add_argument('--output_dir', type=str, default='../data/100_maps_crop/',
58 | help='path to output dir')
59 |
60 | args = parser.parse_args()
61 | print(args)
62 |
63 |
64 | # if not os.path.isdir(args.output_dir):
65 | # os.makedirs(args.output_dir)
66 | # print('created dir',args.output_dir)
67 |
68 | main(args)
69 |
--------------------------------------------------------------------------------
/m3_image_geojson/stitch_output.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import pandas as pd
4 | import numpy as np
5 | import argparse
6 | from geojson import Polygon, Feature, FeatureCollection, dump
7 | import logging
8 | import pdb
9 |
10 | logging.basicConfig(level=logging.INFO)
11 | pd.options.mode.chained_assignment = None
12 |
13 | def concatenate_and_convert_to_geojson(args):
14 | map_subdir = args.input_dir
15 | output_geojson = args.output_geojson
16 | shift_size = args.shift_size
17 | eval_bool = args.eval_only
18 |
19 | file_list = glob.glob(map_subdir + '/*.json')
20 | file_list = sorted(file_list)
21 | if len(file_list) == 0:
22 | logging.warning('No files found for %s' % map_subdir)
23 |
24 | map_data = []
25 | for file_path in file_list:
26 | patch_index_h, patch_index_w = os.path.basename(file_path).split('.')[0].split('_')
27 | patch_index_h = int(patch_index_h[1:])
28 | patch_index_w = int(patch_index_w[1:])
29 | try:
30 | df = pd.read_json(file_path)
31 | except pd.errors.EmptyDataError:
32 |             logging.warning('%s is empty. Skipping.' % file_path)
33 |             continue
34 |
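        # shift patch-local pixel coordinates to map-level coordinates, e.g. (illustrative)
        # patch 'h2_w3' with shift_size=1000 shifts x by 3*1000 and y by 2*1000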
35 | for index, line_data in df.iterrows():
36 | df['polygon_x'][index] = np.array(df['polygon_x'][index]) + shift_size * patch_index_w
37 | df['polygon_y'][index] = np.array(df['polygon_y'][index]) + shift_size * patch_index_h
38 | map_data.append(df)
39 |
40 | map_df = pd.concat(map_data)
41 |
42 | features = []
43 | for index, line_data in map_df.iterrows():
44 | polygon_x, polygon_y = list(line_data['polygon_x']), list(line_data['polygon_y'])
45 |
46 | if eval_bool == False:
47 | # y is kept to be positive. Needs to be negative for QGIS visualization
48 | polygon = Polygon([[[x,-y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], -polygon_y[0]]]])
49 | else:
50 | polygon = Polygon([[[x,y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], polygon_y[0]]]])
51 |
52 | text = line_data['text']
53 | score = line_data['score']
54 | features.append(Feature(geometry = polygon, properties={"text": text, "score": score} ))
55 |
56 | feature_collection = FeatureCollection(features)
57 | # with open(os.path.join(output_dir, map_subdir +'.geojson'), 'w') as f:
58 | # dump(feature_collection, f)
59 | with open(output_geojson, 'w') as f:
60 | dump(feature_collection, f)
61 |
62 | logging.info('Done generating geojson (img coord) for %s', map_subdir)
63 |
64 |
65 | if __name__ == '__main__':
66 |
67 | parser = argparse.ArgumentParser()
68 | parser.add_argument('--input_dir', type=str, default='data/100_maps_crop_abc/0063014',
69 | help='path to input json path.')
70 |
71 | parser.add_argument('--output_geojson', type=str, default='data/100_maps_geojson_abc/0063014.geojson',
72 | help='path to output geojson path')
73 |
74 | parser.add_argument('--shift_size', type=int, default = 1000,
75 | help='image patch size and shift size.')
76 |
77 |     # This must be a boolean flag, not a string argument; any non-empty string would evaluate as True.
78 | parser.add_argument('--eval_only', default = False, action='store_true',
79 | help='keep positive coordinate')
80 |
81 | args = parser.parse_args()
82 | print(args)
83 |
84 | concatenate_and_convert_to_geojson(args)
85 |
86 |
87 |
88 |
89 |
90 |
--------------------------------------------------------------------------------
/m4_geocoordinate_converter/convert_geojson_to_geocoord.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import logging
4 | import ast
5 |
6 | import pandas as pd
7 | import numpy as np
8 | import geojson
9 |
10 | logging.basicConfig(level=logging.INFO)
11 |
12 |
13 | def main(args):
14 | geojson_file = args.in_geojson_file
15 | output_dir = args.out_geojson_dir
16 |
17 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str})
18 |     sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '', regex=False)  # remove literal dots, not the regex wildcard
19 | geojson_filename_id = geojson_file.split(".")[0].split("/")[-1]
20 |
21 | row = sample_map_df[sample_map_df['external_id'] == geojson_filename_id]
22 | if not row.empty:
23 | gcps = ast.literal_eval(row.iloc[0]['gcps'])
24 | gcp_str = ''
25 | for gcp in gcps:
26 | lng, lat = gcp['location']
27 | x, y = gcp['pixel']
28 | gcp_str += '-gcp ' + str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' '
29 |
30 | transform_method = row.iloc[0]['transformation_method']
31 | assert transform_method in ['affine', 'polynomial', 'tps']
32 |
33 | output = '"' + output_dir + geojson_filename_id + '.geojson"'
34 | input = '"' + geojson_file + '"'
35 |
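        # each branch below assembles an ogr2ogr call, e.g. (illustrative GCP values):
        #   ogr2ogr -f "GeoJSON" "out.geojson" "in.geojson" -tps -gcp 5629 5064 -118.269356489 34.063140276 ...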
36 |         if transform_method == 'affine':
37 |             geocoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 1 ' + gcp_str
38 |
39 |         elif transform_method == 'polynomial':
40 |             geocoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 2 ' + gcp_str
41 |
42 |         elif transform_method == 'tps':
43 |             geocoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -tps ' + gcp_str
44 |
45 |         else:
46 |             raise NotImplementedError
47 |
48 |         ret_value = os.system(geocoord_convert_command)
49 | if ret_value != 0:
50 | logging.info('Failed generating geocoord geojson for %s', geojson_file)
51 | else:
52 | with open(geojson_file) as img_geojson, open(output_dir + geojson_filename_id + '.geojson',
53 | 'r+') as geocoord_geojson:
54 | img_data = geojson.load(img_geojson)
55 | geocoord_data = geojson.load(geocoord_geojson)
56 | for img_feature, geocoord_feature in zip(img_data['features'], geocoord_data['features']):
57 | geocoord_feature['properties']['img_coordinates'] = np.array(img_feature['geometry']['coordinates'],
58 | dtype=np.int32).reshape(-1, 2).tolist()
59 |
60 | with open(output_dir + geojson_filename_id + '.geojson', 'w') as geocoord_geojson:
61 | geojson.dump(geocoord_data, geocoord_geojson)
62 |
63 | logging.info('Done generating geocoord geojson for %s', geojson_file)
64 |
65 |
66 | if __name__ == '__main__':
67 | parser = argparse.ArgumentParser()
68 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
69 | help='path to sample map csv, which contains gcps info')
70 | parser.add_argument('--in_geojson_file', type=str,
71 | help='input geojson file; results of M2')
72 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/',
73 | help='output dir for converted geojson files')
74 |
75 | args = parser.parse_args()
76 |
77 | main(args)
--------------------------------------------------------------------------------
/m5_entity_linker/entity_linker.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import ast
4 | from dotenv import load_dotenv
5 |
6 | import pandas as pd
7 | import numpy as np
8 |
9 | import geojson
10 |
11 | import sqlalchemy
12 | from sqlalchemy import create_engine
13 |
14 | import geocoder
15 | from shapely.geometry import Polygon
16 |
17 | import re
18 |
19 | load_dotenv()
20 |
21 | DB_HOST = os.getenv("DB_HOST")
22 | DB_USERNAME = os.getenv("DB_USERNAME")
23 | DB_PASSWORD = os.getenv("DB_PASSWORD")
24 | DB_NAME = os.getenv("DB_NAME")
25 |
26 | connection_string = f'postgresql://{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}:5432/{DB_NAME}'
27 |
28 |
29 | def main(args):
30 |
31 | # check if first pair of gcps is in midwest-US
32 | regex = re.compile('[^a-zA-Z]')
33 | conn = create_engine(connection_string, echo=False)
34 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str})
35 |     sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '', regex=False)
36 | midwest = ["Illinois", "Missouri", "Kansas", "Iowa", "South Dakota", "Indiana", "Ohio", "Wisconsin", "Minnesota", "Michigan"]
37 |
38 | geojson_files = os.listdir(args.in_geojson_dir)
39 | for i, geojson_file in enumerate(geojson_files):
40 | row = sample_map_df[sample_map_df['external_id']==geojson_file.split(".")[0]]
41 | gcps = ast.literal_eval(row.iloc[0]['gcps'])
42 | geocode = geocoder.osm(gcps[0]['location'][::-1], method='reverse')
43 |
44 | if geocode.state in midwest:
45 | with open(args.in_geojson_dir+geojson_file) as f:
46 | data = geojson.load(f)
47 | for feature_data in data['features']:
48 | pts = np.array(feature_data['geometry']['coordinates']).reshape(-1, 2)
49 | map_polygon = Polygon(pts)
50 | map_text = str(feature_data['properties']['text']).lower()
51 | map_text = regex.sub(' ', map_text) # remove all non-alphabetic characters
52 |
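                # two linking criteria (cf. README): the OSM entity's name must contain
                # the recognized text, and its geometry must intersect the label polygon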
53 | query = f"""SELECT p.ogc_fid
54 | FROM polygon_features p
55 | WHERE LOWER(p.name) LIKE '%%{map_text}%%'
56 | AND ST_INTERSECTS(ST_TRANSFORM(ST_SetSRID(ST_MakeValid('{map_polygon}'::geometry), 4326)::geometry, 4326), p.wkb_geometry);
57 | """
58 |
59 | try:
60 | intersect_df = pd.read_sql(query, con=conn)
61 | except sqlalchemy.exc.InternalError:
62 | continue
63 |
64 | if not intersect_df.empty:
65 | feature_data['properties']['osm_ogc_fid'] = intersect_df['ogc_fid'].values.tolist()
66 | # else:
67 | # feature_data['properties']['osm_ogc_fid'] = []
68 |
69 | with open(args.out_geojson_dir+geojson_file, 'w') as output_geojson:
70 | geojson.dump(data, output_geojson)
71 |
72 |
73 | if __name__ == '__main__':
74 | parser = argparse.ArgumentParser()
75 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
76 | help='path to sample map csv, which contains gcps info')
77 | parser.add_argument('--in_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/',
78 | help='input dir for results of M2')
79 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_linked/',
80 | help='output dir for converted geojson files')
81 |
82 | args = parser.parse_args()
83 |
84 | main(args)
85 |
--------------------------------------------------------------------------------
/m6_post_ocr/lexical_search.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | import logging
3 | import requests
4 | import json
5 | import argparse
6 | import http.client as http_client
7 | import nltk
8 | import re
9 | import glob
10 | import os
11 |
12 | # set the debug level
13 | http_client.HTTPConnection.debuglevel = 1
14 | logging.basicConfig(level=logging.INFO)
15 | warnings.filterwarnings("ignore")
16 |
17 | headers = {
18 | 'Content-Type': 'application/json',
19 | }
20 |
21 | def query(args):
22 | """ Query candidates and save them as 'postocr_label' """
23 |
24 | input_dir = args.in_geojson_dir
25 | output_geojson = args.out_geojson_dir
26 |
27 | map_name_output = input_dir.split('/')[-1]
28 |
29 | with open(input_dir) as json_file:
30 | json_df = json.load(json_file)
31 |
32 | if json_df != {}:
33 | query_result = []
34 | for i in range(len(json_df["features"])):
35 | target_text = json_df['features'][i]["properties"]["text"]
36 | target_pts = json_df['features'][i]["geometry"]["coordinates"]
37 |
38 | clean_txt = []
39 | if type(target_text) == str:
40 | for t in range(len(target_text)):
41 | txt = target_text[t]
42 | if txt.isalpha():
43 | clean_txt.append(txt)
44 |
45 | temp_label = ''.join([str(item) for item in clean_txt])
46 | if len(temp_label) != 0:
47 | target_text = temp_label
48 |
49 | process = re.findall('[A-Z][^A-Z]*', target_text)
50 | if all(c.isupper() for c in process) or len(process) == 1:
51 |
52 | if type(target_text) == str and any(c.isalpha() for c in target_text):
53 | # edist 0
54 | inputs = target_text.lower()
55 | q1 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "0"}}}}'
56 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \
57 | data=q1.encode("utf-8"), \
58 | headers = headers)
59 | resp_json = json.loads(resp.text)
60 | test = resp_json["hits"]["hits"]
61 |
62 | edist = []
63 | edist_update = []
64 |
65 | edd_min_find = 0
66 | min_candidates = False
67 |
68 | if test != 'NaN':
69 | for tt in range(len(test)):
70 | if 'name' in test[tt]['_source']:
71 | candidate = test[tt]['_source']['name']
72 | edist.append(candidate)
73 |
74 | for e in range(len(edist)):
75 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper())
76 |
77 | if edd == 0:
78 | edist_update.append(edist[e])
79 | min_candidates = edist[e]
80 | edd_min_find = 1
81 |
82 | # edd 1
83 | if edd_min_find != 1:
84 | # edist 1
85 | q2 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "1"}}}}'
86 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \
87 | data=q2.encode("utf-8"), \
88 | headers = headers)
89 | resp_json = json.loads(resp.text)
90 | test = resp_json["hits"]["hits"]
91 |
92 | edist = []
93 | edist_count = []
94 | edist_update = []
95 | edist_count_update = []
96 |
97 | if test != 'NaN':
98 | for tt in range(len(test)):
99 | if 'name' in test[tt]['_source']:
100 | candidate = test[tt]['_source']['message']
101 | cand = candidate.split(',')[0]
102 | count = candidate.split(',')[1]
103 | edist.append(cand)
104 | edist_count.append(count)
105 |
106 | for e in range(len(edist)):
107 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper())
108 |
109 | if edd == 1:
110 | edist_update.append(edist[e])
111 | edist_count_update.append(edist_count[e])
112 |
113 | if len(edist_update) != 0:
114 | index = edist_count_update.index(max(edist_count_update))
115 | min_candidates = edist_update[index]
116 | edd_min_find = 1
117 |
118 | # edd 2
119 | if edd_min_find != 1:
120 | # edist 2
121 | q3 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "2"}}}}'
122 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \
123 | data=q3.encode("utf-8"), \
124 | headers = headers)
125 | resp_json = json.loads(resp.text)
126 | test = resp_json["hits"]["hits"]
127 |
128 | edist = []
129 | edist_count = []
130 | edist_update = []
131 | edist_count_update = []
132 |
133 | if test != 'NaN':
134 | for tt in range(len(test)):
135 | if 'name' in test[tt]['_source']:
136 | candidate = test[tt]['_source']['message']
137 | cand = candidate.split(',')[0]
138 | count = candidate.split(',')[1]
139 | edist.append(cand)
140 | edist_count.append(count)
141 |
142 | for e in range(len(edist)):
143 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper())
144 |
145 | if edd == 2:
146 | edist_update.append(edist[e])
147 | edist_count_update.append(edist_count[e])
148 |
149 | if len(edist_update) != 0:
150 | index = edist_count_update.index(max(edist_count_update))
151 | min_candidates = edist_update[index]
152 | edd_min_find = 1
153 |
154 | if edd_min_find != 1:
155 | min_candidates = False
156 |
157 |
158 | if min_candidates != False:
159 | json_df['features'][i]["properties"]["postocr_label"] = str(min_candidates)
160 | else:
161 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text)
162 |
163 |                     else:  # multiple mixed-case tokens: keep the original label
164 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text)
165 |
166 | else:
167 | # only numeric pred_text
168 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text)
169 |
170 | else:
171 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text)
172 |
173 | # Save
174 | with open(output_geojson, 'w') as json_file:
175 | json.dump(json_df, json_file, ensure_ascii=False)
176 |
177 | logging.info('Done generating post-OCR geojson for %s', map_name_output)
178 |
179 |
180 | def main(args):
181 | query(args)
182 |
183 |
184 | if __name__ == '__main__':
185 |
186 |
187 | parser = argparse.ArgumentParser()
188 | parser.add_argument('--in_geojson_dir', type=str, default='/data2/rumsey_output/test2/',
189 | help='input dir for post-OCR module (= the output of M4) /crop_MN/output_stitch/')
190 | parser.add_argument('--out_geojson_dir', type=str, default='/data2/rumsey_output/out/',
191 | help='post-OCR result')
192 |
193 | args = parser.parse_args()
194 | print(args)
195 |
196 | # if not os.path.isdir(args.out_geojson_dir):
197 | # os.makedirs(args.out_geojson_dir)
198 | # print('created dir',args.out_geojson_dir)
199 |
200 | main(args)
--------------------------------------------------------------------------------
/m_sanborn/s1_geocoding.py:
--------------------------------------------------------------------------------
1 | import os
2 | import argparse
3 | import geojson
4 | import geocoder
5 | import json
6 | import time
7 | import pdb
8 |
9 |
10 | def arcgis_geocoding(place_name, maxRows = 5):
11 | try:
12 | response = geocoder.arcgis(place_name,maxRows=maxRows)
13 | return response.json
14 |     except Exception as e:
15 | print(e)
16 | return -1
17 |
18 |
19 | def google_geocoding(place_name, api_key = None, maxRows = 5):
20 | try:
21 | response = geocoder.google(place_name, key=api_key, maxRows = maxRows)
22 | return response.json
23 |     except Exception as e:
24 | print(e)
25 | return -1
26 |
27 | def osm_geocoding(place_name, maxRows = 5):
28 | try:
29 | response = geocoder.osm(place_name, maxRows = maxRows)
30 | return response.json
31 |     except Exception as e:
32 | print(e)
33 | return -1
34 |
35 |
36 | def geonames_geocoding(place_name, user_name = None, maxRows = 5):
37 | try:
38 | response = geocoder.geonames(place_name, key = user_name, maxRows=maxRows)
39 | # hourly limit of 1000 credits
40 | time.sleep(4)
41 | return response.json
42 |     except Exception as e:
43 | print(e)
44 | return -1
45 |
46 |
47 | def geocoding(args):
48 | output_folder = args.output_folder
49 | input_map_geojson_path = args.input_map_geojson_path
50 | api_key = args.api_key
51 | user_name = args.user_name
52 | geocoder_option = args.geocoder_option
53 | max_results = args.max_results
54 | suffix = args.suffix
55 |
56 | with open(input_map_geojson_path, 'r') as f:
57 | data = geojson.load(f)
58 |
59 | map_name = os.path.basename(input_map_geojson_path).split('.')[0]
60 | output_folder = os.path.join(output_folder, geocoder_option)
61 |
62 | if not os.path.isdir(output_folder):
63 | os.makedirs(output_folder)
64 |
65 | output_path = os.path.join(output_folder, map_name) + '.json'
66 |
67 | with open(output_path, 'w') as f:
68 |         pass # truncate/create the output file before appending results
69 |
70 | features = data['features']
71 | for feature in features: # iterate through all the detected text labels
72 | geometry = feature['geometry']
73 | text = feature['properties']['text']
74 | score = feature['properties']['score']
75 |
76 | # suffix = ', Los Angeles'
77 | text = str(text) + suffix
78 |
79 | print(text)
80 |
81 | if geocoder_option == 'arcgis':
82 |             results = arcgis_geocoding(text, maxRows = max_results)
83 | elif geocoder_option == 'google':
84 | results = google_geocoding(text, api_key = api_key, maxRows = max_results)
85 | elif geocoder_option == 'geonames':
86 | results = geonames_geocoding(text, user_name = user_name, maxRows = max_results)
87 | elif geocoder_option == 'osm':
88 | results = osm_geocoding(text, maxRows = max_results)
89 | else:
90 | raise NotImplementedError
91 |
92 | if results == -1:
93 | # geocoder can not find match
94 | pass
95 | else:
96 | # save results
97 | with open(output_path, 'a') as f:
98 | json.dump({'text':text, 'score':score, 'geometry': geometry, 'geocoding':results}, f)
99 | f.write('\n')
100 |
101 | # pdb.set_trace()
102 |
103 |
104 | def main():
105 | parser = argparse.ArgumentParser()
106 |
107 | parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/')
108 | parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson')
109 | parser.add_argument('--api_key', type=str, default=None, help='Specify API key if needed')
110 | parser.add_argument('--user_name', type=str, default=None, help='Specify user name if needed')
111 |
112 | parser.add_argument('--suffix', type=str, default=None, help='placename suffix (e.g. city name)')
113 |
114 |     parser.add_argument('--max_results', type=int, default=5, help='max number of results returned by geocoder')
115 |
116 |     parser.add_argument('--geocoder_option', type=str, default='arcgis',
117 |                         choices=['arcgis', 'google','geonames','osm'],
118 |                         help='Select geocoder option from ["arcgis","google","geonames","osm"]')
119 |
120 |
121 | args = parser.parse_args()
122 | print('\n')
123 | print(args)
124 | print('\n')
125 |
126 | if not os.path.isdir(args.output_folder):
127 | os.makedirs(args.output_folder)
128 |
129 | geocoding(args)
130 |
131 |
132 | if __name__ == '__main__':
133 |
134 | main()
135 |
136 |
137 |
138 |
--------------------------------------------------------------------------------
/m_sanborn/s2_clustering.py:
--------------------------------------------------------------------------------
1 | import os
2 | import json
3 | import argparse
4 | from sklearn.cluster import DBSCAN
5 | from matplotlib import pyplot as plt
6 | import geopandas as gpd
7 | import pandas as pd
8 | from bs4 import BeautifulSoup
9 | from mpl_toolkits.basemap import Basemap
10 | from pyproj import Proj, transform
11 |
12 | from shapely.geometry import Point
13 | from shapely.geometry.polygon import Polygon
14 | import numpy as np
15 | from shapely.geometry import MultiPoint
16 | from geopy.distance import great_circle
17 |
18 |
19 | county_index_dict = {'Cuyahoga County (OH)': 193,
20 | 'Fulton County (GA)': 73,
21 | 'Kern County (CA)': 2872,
22 | 'Lancaster County (NE)': 1629,
23 | 'Los Angeles County (CA)': 44,
24 | 'Mexico': -1,
25 | 'Nevada County (CA)': 46,
26 | 'New Orleans (LA)': -1,
27 | 'Pima County (AZ)': 2797,
28 | 'Placer County (CA)': 1273,
29 | 'Providence County (RI)\xa0': 1124,
30 | 'Saint Louis (MO)': -1,
31 | 'San Francisco County (CA)': 1261,
32 | 'San Joaquin County (CA)': 1213,
33 | 'Santa Clara (CA)': 48,
34 | 'Santa Cruz (CA)': 2386,
35 | 'Suffolk County (MA)': 272,
36 | 'Tulsa County (OK)': 526,
37 | 'Washington County (AK)': -1,
38 | 'Washington DC': -1}
39 |
40 | def get_centermost_point(cluster):
41 | centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
42 | centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
43 | return tuple(centermost_point)
44 |
45 | def clustering_func(lat_list, lng_list):
46 | X = [[a,b] for a,b in zip(lat_list, lng_list)]
47 | coords = np.array(X)
48 |
49 | # https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/
50 | kms_per_radian = 6371.0088
51 | epsilon = 1.5 / kms_per_radian
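    # i.e., points within ~1.5 km of each other are neighbors; the haversine
    # metric expects coordinates in radians, hence np.radians(coords) below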
52 | db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
53 | cluster_labels = db.labels_
54 | num_clusters = len(set(cluster_labels))
55 | clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
56 |
57 | centermost_points = get_centermost_point(clusters[0])
58 | return centermost_points
59 |
60 | def plot_points(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None):
61 |
62 | plt.figure(figsize=(10,6))
63 | plt.title(title)
64 |
65 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5)
66 | if pred_lat is not None and pred_lng is not None:
67 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red')
68 |
69 | if target_lat_list is not None and target_lng_list is not None:
70 | plt.scatter(target_lng_list, target_lat_list, 10, c = 'blue')
71 | plt.show()
72 |
73 | def plot_points_basemap(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None):
74 |
75 | plt.figure(figsize=(10,6))
76 | plt.title(title)
77 |
78 | if len(lat_list) >0 and len(lng_list) > 0:
79 | anchor_lat, anchor_lng = lat_list[0], lng_list[0]
80 | elif target_lat_list is not None:
81 | anchor_lat, anchor_lng = target_lat_list[0], target_lng_list[0]
82 | else:
83 | anchor_lat, anchor_lng = 45, -100
84 |
85 | m = Basemap(projection='lcc', resolution=None,
86 | width=8E4, height=8E4,
87 | lat_0=anchor_lat, lon_0=anchor_lng)
88 | m.etopo(scale=0.5, alpha=0.5)
89 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True)
90 | # m.arcgisimage(service='ESRI_Imagery_World_2D',scale=0.5, alpha=0.5)
91 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True)
92 |
93 | lng_list, lat_list = m(lng_list, lat_list) # transform coordinates
94 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5)
95 |
96 |
97 | if target_lat_list is not None and target_lng_list is not None:
98 | target_lng_list, target_lat_list = m(target_lng_list, target_lat_list)
99 | plt.scatter(target_lng_list, target_lat_list, marker='o', c = 'blue',edgecolor='blue')
100 |
101 | if pred_lat is not None and pred_lng is not None:
102 | pred_lng, pred_lat = m(pred_lng, pred_lat)
103 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red', edgecolor='black')
104 |
105 | plt.show()
106 |
107 | def plotting_func(loc_sanborn_dir, pred_dict, lat_lng_dict, dataset_name, geocoding_name):
108 |
109 | for map_name, pred in pred_dict.items():
110 |
111 | title = dataset_name + '-' + geocoding_name + '-' + map_name
112 | lat_list = lat_lng_dict[map_name]['lat_list']
113 | lng_list = lat_lng_dict[map_name]['lng_list']
114 |
115 | if dataset_name == 'LoC_sanborn':
116 | xml_path = os.path.join(loc_sanborn_dir,map_name + '.tif.aux.xml')
117 | try:
118 | with open(xml_path) as fp:
119 | soup = BeautifulSoup(fp)
120 |
121 | target_gcp_list = soup.findAll("metadata")[1].targetgcps.findAll("double")
122 | except Exception as e:
123 | print(xml_path)
124 | continue
125 |
126 | xy_list = []
127 | for target_gcp in target_gcp_list:
128 | xy_list.append(float(target_gcp.contents[0]))
129 |
130 | x_list = xy_list[0::2]
131 | y_list = xy_list[1::2]
132 |
133 | lng2_list, lat2_list = [],[]
134 | for x1,y1 in zip(x_list, y_list):
135 |                 x2,y2 = transform(Proj(init='epsg:3857'), Proj(init='epsg:4326'), x1, y1)  # inProj/outProj from clustering() are not in scope here
136 | #print (x2,y2)
137 | lng2_list.append(x2)
138 | lat2_list.append(y2)
139 |
140 | plot_points(lat_list, lng_list, lat2_list, lng2_list, pred_lat = pred[0], pred_lng = pred[1], title=title)
141 | else:
142 | plot_points(lat_list, lng_list,pred_lat = pred[0], pred_lng = pred[1], title=title)
143 |
144 |
145 | def clustering(args):
146 | dataset_name = args.dataset_name
147 | geocoding_name = args.geocoding_name
148 | remove_duplicate_location = args.remove_duplicate_location
149 | visualize = args.visualize
150 |
151 | sanborn_output_dir = '/data2/sanborn_maps_output'
152 |
153 | input_dir=os.path.join(sanborn_output_dir, dataset_name, 'geocoding_suffix_testr', geocoding_name)
154 | if remove_duplicate_location:
155 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr_removeduplicate', geocoding_name)
156 | else:
157 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr', geocoding_name)
158 |
159 | county_boundary_path = '/home/zekun/Sanborn/cb_2018_us_county_500k/cb_2018_us_county_500k.shp'
160 |
161 | if not os.path.isdir(output_dir):
162 | os.makedirs(output_dir)
163 |
164 | inProj = Proj(init='epsg:3857')
165 | outProj = Proj(init='epsg:4326')
166 |
167 | county_boundary_df = gpd.read_file(county_boundary_path)
168 |
169 | if dataset_name == 'LoC_sanborn':
170 | loc_sanborn_dir = '/data2/sanborn_maps/Sanborn100_Georef/' # for comparing with GT
171 | metadata_tsv_path = '/home/zekun/Sanborn/Sheet_List.tsv'
172 | meta_df = pd.read_csv(metadata_tsv_path, sep='\t')
173 |
174 | file_list = os.listdir(input_dir)
175 |
176 | pred_dict = dict()
177 | lat_lng_dict = dict()
178 | for file_path in file_list:
179 |
180 | map_name = os.path.basename(file_path).split('.')[0]
181 | if dataset_name == 'LoC_sanborn':
182 | county_name = meta_df[meta_df['filename'] == map_name]['County'].values[0]
183 |         elif dataset_name in ('LA_sanborn', 'two_more'):
184 | county_name = 'Los Angeles County (CA)'
185 | else:
186 | raise NotImplementedError
187 |
188 | index = county_index_dict[county_name]
189 | if index >= 0:
190 | poly_geometry = county_boundary_df.iloc[index].geometry
191 |
192 | with open(os.path.join(input_dir,file_path), 'r') as f:
193 | data = f.readlines()
194 |
195 | lat_list = []
196 | lng_list = []
197 | for line in data:
198 |
199 | line_dict = json.loads(line)
200 | geocoding_dict = line_dict['geocoding']
201 | text = line_dict['text']
202 | score = line_dict['score']
203 | geometry = line_dict['geometry']
204 |
205 | if geocoding_dict is None:
206 | continue # if no geolocation returned by geocoder, then skip
207 |
208 | if 'lat' not in geocoding_dict or 'lng' not in geocoding_dict:
209 | #print(geocoding_dict)
210 | continue
211 |
212 | lat = float(geocoding_dict['lat'])
213 | lng = float(geocoding_dict['lng'])
214 |
215 | point = Point(lng, lat)
216 |
217 | if index >= 0:
218 | if point.within(poly_geometry): # geocoding point within county boundary
219 | lat_list.append(lat)
220 | lng_list.append(lng)
221 | else:
222 | pass
223 | else: # cluster based on all results
224 | lat_list.append(lat)
225 | lng_list.append(lng)
226 |
227 | if remove_duplicate_location:
228 | lat_list = list(set(lat_list))
229 | lng_list = list(set(lng_list))
230 |
231 | if len(lat_list) >0 and len(lng_list) > 0:
232 | pred = clustering_func(lat_list, lng_list)
233 | # print(pred)
234 | else:
235 |             print('No data to cluster')
236 |             continue
237 | print(map_name, pred)
238 | pred_dict[map_name] = pred
239 | lat_lng_dict[map_name]={'lat_list':lat_list, 'lng_list':lng_list}
240 |
241 | if visualize:
242 | plotting_func(loc_sanborn_dir = loc_sanborn_dir, pred_dict = pred_dict, lat_lng_dict = lat_lng_dict,
243 | dataset_name = dataset_name, geocoding_name = geocoding_name)
244 |
245 | with open(os.path.join(output_dir, 'pred_center.json'),'w') as f:
246 | json.dump(pred_dict, f)
247 |
248 |
249 | def main():
250 | parser = argparse.ArgumentParser()
251 |
252 | parser.add_argument('--dataset_name', type=str, default=None,
253 | choices=['LA_sanborn', 'LoC_sanborn',],
254 | help='dataset name, same as expt_name')
255 | parser.add_argument('--geocoding_name', type=str, default=None,
256 | choices=['google','arcgis','geonames','osm'],
257 | help='geocoder name')
258 | parser.add_argument('--visualize', default = False, action = 'store_true') # Enable this when in notebook
259 | parser.add_argument('--remove_duplicate_location', default=False, action='store_true') # whether remove duplicate geolocations for clustering
260 |
261 | # parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/')
262 | # parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson')
263 |
264 |
265 | args = parser.parse_args()
266 | print('\n')
267 | print(args)
268 | print('\n')
269 |
270 | clustering(args)
271 |
272 |
273 | if __name__ == '__main__':
274 |
275 | main()
276 |
--------------------------------------------------------------------------------
/m_sanborn/s3_gen_geojson.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/m_sanborn/s3_gen_geojson.py
--------------------------------------------------------------------------------
/metadata/davidrumsey/davidrumsey.py:
--------------------------------------------------------------------------------
1 | import requests
2 | import pandas as pd
3 |
4 |
5 | class DavidRumsey:
6 |
7 | csv_filename="davidrumsey_metadata.csv"
8 | df = pd.read_csv(csv_filename)
9 |
10 | def __init__(self, api_key):
11 | self.api_key = api_key
12 | self.headers = {
13 | 'Authorization': self.api_key,
14 | 'Content-Type': 'application/json',
15 | 'charset': 'utf-8'
16 | }
17 |
18 | def get_ground_control_points(self, external_id):
19 | """
20 | Get ground control points of map image via Oldmapsonline API.
21 | Args:
22 | external_id: str
23 | Returns:
24 | transform_method: str
25 | Transformation method
26 | e.g., "affine", "polynomial", "tps"
27 | gcps: list
28 | All pairs of ground control points
29 | e.g., [{'location': [-118.269356489, 34.063140276], 'pixel': [5629, 5064]},{'location': , 'pixel': }, ... ]
30 | """
31 |
32 | # 404 ERROR on many maps
33 | # 1. GET /maps/external/{external_id}
34 | # baseurl = "https://api.oldmapsonline.org/1.0/maps/external/" + external_id
35 | # res = requests.get(baseurl, headers=self.headers)
36 |         map_id = self.df[self.df['external_id']==external_id]['id'].values[0]  # take the scalar id, not a Series
37 |
38 | # 2. GET /maps/{id}/georeferences
39 | baseurl = "https://api.oldmapsonline.org/1.0/maps/" + map_id + "/georeferences"
40 | res = requests.get(baseurl, headers=self.headers)
41 |
42 | try:
43 | res.raise_for_status()
44 | except requests.exceptions.HTTPError as e:
45 | print(e)
46 | return None
47 |
48 | data = res.json()
49 | if not data['items']:
50 | return None
51 | else:
52 | transform_method = data['items'][0]['transformation_method']
53 | gcps = data['items'][0]['gcps']
54 | return transform_method, gcps
55 |
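
# Illustrative usage (assumes a valid API key and an external_id present in the CSV):
#   dr = DavidRumsey(api_key='<API_KEY>')
#   result = dr.get_ground_control_points('10057000')
#   if result is not None:
#       transform_method, gcps = result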
--------------------------------------------------------------------------------
/metadata/sanborn.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/metadata/sanborn.py
--------------------------------------------------------------------------------
/model_card_template:
--------------------------------------------------------------------------------
1 | ---
2 | license: cc-by-nc-2.0
3 | language:
4 | - en
5 | tags:
6 | - text spotting
7 | - scene text detection
8 | - maps
9 | - cultural heritage
10 | ---
11 | # Model Card for Model ID
12 |
13 |
14 |
15 |
16 | ## Model Details
17 |
18 | ### Model Description
19 |
20 |
21 |
22 |
23 |
24 | - **Developed by:** Knowledge Computing Lab, University of Minnesota: Leeje Jang, Jina Kim, Zekun Li, Yijun Lin, Min Namgung, Yao-Yi Chiang
25 | - **Shared by:** Machines Reading Maps
26 | - **Model type:** text spotter
27 | - **Language(s):** English
28 | - **License:** CC-BY-NC 2.0
29 |
30 | ### Model Sources [optional]
31 |
32 |
33 |
34 | - **Repository:** https://github.com/knowledge-computing/mapkurator-spotter
35 | - **Paper [optional]:** [More Information Needed]
36 | - **Documentation:** https://knowledge-computing.github.io/mapkurator-doc/#/
37 |
38 | ## Uses
39 |
40 |
41 |
42 | ### Direct Use
43 |
44 |
45 |
46 | The model detects and recognizes text on images. It was trained specifically to identify text on a wide range of historical maps with many styles printed between ca. 1500-2000 provided by the David Rumsey Map Collection.
47 | This version of the model was trained with an English language model.
48 |
49 |
50 | ### Downstream Use
51 |
52 |
53 | Using this model for new experiments will require attention to the style and language of text on images, including (possibly) the creation of new, synthetic or other training data.
54 |
55 |
56 | ### Out-of-Scope Use
57 |
58 |
59 |
60 |
61 | ## Bias, Risks, and Limitations
62 |
63 |
64 | This model will struggle to return high quality results for maps with complex fonts, low contrast images, complex background colors and textures, and non-English language words.
65 |
66 | [More Information Needed]
67 |
68 | ### Recommendations
69 |
70 |
71 |
72 | Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
73 |
74 | ## How to Get Started with the Model
75 |
76 | Please refer to the mapKurator documentation for details: https://knowledge-computing.github.io/mapkurator-doc/#/
77 |
78 | ## Training Details
79 |
80 | ### Training Data
81 |
82 |
83 |
84 | Synthetic training datasets:
85 | 1. SynthText: 40k text-free background images from COCO, used to generate synthetic text images. Code: https://github.com/ankush-me/SynthText; Dataset: TBD.
86 | 2. SynMap: "patches" of synthetic maps that mimic the text styles (e.g., font, spacing, orientation) and background styles of the real historical maps. Code: TBD; Dataset: TBD.
87 |
88 |
89 | ## Citation [optional]
90 |
91 |
92 |
93 | **BibTeX:**
94 |
95 | [More Information Needed]
96 |
97 | **APA:**
98 |
99 | [More Information Needed]
100 |
101 |
102 |
103 | ## Model Card Authors
104 |
105 | Yijun Lin, Katherine McDonough, Valeria Vitale
106 |
107 | ## Model Card Contact
108 |
109 | Yijun Lin, lin00786 at umn.edu
110 |
111 |
112 |
--------------------------------------------------------------------------------
/pipe_run.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_get_dimension --module_cropping
4 |
5 | python3 run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_2508.csv --expt_name Rerun_2_2508 --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn
6 |
7 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_img_geojson
8 |
9 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_geocoord_geojson
10 |
11 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_post_ocr
12 |
--------------------------------------------------------------------------------
/pipe_run_img.sh:
--------------------------------------------------------------------------------
1 | #!/bin/bash
2 |
3 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_get_dimension --module_cropping
4 |
5 | python3 run_img.py --sample_map_csv_path /data2/rumsey_output/sample_sb/data/ --expt_name sample_sb_opt --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn
6 |
7 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_img_geojson
8 |
9 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_geocoord_geojson
10 |
11 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_post_ocr
12 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/requirements.txt
--------------------------------------------------------------------------------
/run.py:
--------------------------------------------------------------------------------
1 | import os
2 | import subprocess
3 | import glob
4 | import argparse
5 | import time
6 | import logging
7 | import pandas as pd
8 | import pdb
9 | import datetime
10 | from PIL import Image
11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no
12 |
13 |
14 |
15 |
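# run.py drives the full mapKurator pipeline over a CSV/TSV list of maps.
# Each --module_* flag toggles one stage; every stage shells out to the
# corresponding per-module script (m1_geotiff, m2_detection_recognition, ...).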
16 | logging.basicConfig(level=logging.INFO)
17 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
18 |
19 | # def execute_command(command, if_print_command):
20 | # t1 = time.time()
21 |
22 | # if if_print_command:
23 | # print(command)
24 | # os.system(command)
25 |
26 | # t2 = time.time()
27 | # time_usage = t2 - t1
28 | # return time_usage
29 |
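# Runs a shell command; returns {'time_usage': seconds} on success or
# {'error': <stderr flattened to one line>} on failure. Callers branch on the keys.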
30 | def execute_command(command, if_print_command):
31 | t1 = time.time()
32 |
33 | if if_print_command:
34 | print(command)
35 |
36 | try:
37 |         subprocess.run(command, shell=True, check=True, capture_output=True)
38 | t2 = time.time()
39 | time_usage = t2 - t1
40 | return {'time_usage':time_usage}
41 | except subprocess.CalledProcessError as err:
42 | error = err.stderr.decode('utf8')
43 | # format error message to one line
44 | error = error.replace('\n','\t')
45 | error = error.replace(',',';')
46 | return {'error': error}
47 |
48 |
49 | def get_img_dimension(img_path):
50 | map_img = Image.open(img_path)
51 | width, height = map_img.size
52 |
53 | return width, height
54 |
55 |
56 | def run_pipeline(args):
57 | # ------------------------- Pass arguments -----------------------------------------
58 | map_kurator_system_dir = args.map_kurator_system_dir
59 | text_spotting_model_dir = args.text_spotting_model_dir
60 | sample_map_path = args.sample_map_csv_path
61 | expt_name = args.expt_name
62 | output_folder = args.output_folder
63 |
64 | module_get_dimension = args.module_get_dimension
65 | module_gen_geotiff = args.module_gen_geotiff
66 | module_cropping = args.module_cropping
67 | module_text_spotting = args.module_text_spotting
68 | module_img_geojson = args.module_img_geojson
69 | module_geocoord_geojson = args.module_geocoord_geojson
70 | module_entity_linking = args.module_entity_linking
71 | module_post_ocr = args.module_post_ocr
72 |
73 | spotter_model = args.spotter_model
74 | spotter_config = args.spotter_config
75 | spotter_expt_name = args.spotter_expt_name
76 | gpu_id = args.gpu_id
77 |
78 | if_print_command = args.print_command
79 |
80 |
81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
82 |
83 | # ------------------------- Read sample map list and prepare output dir ----------------
84 | input_csv_path = sample_map_path
85 | if input_csv_path[-4:] == '.csv':
86 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
87 | elif input_csv_path[-4:] == '.tsv':
88 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t')
89 | else:
90 | raise NotImplementedError
91 |
92 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path)
93 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path)
94 |
95 | # initialize error reason dict
96 | error_reason_dict = dict()
97 | for ex_id in unmatched_external_id_list:
98 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'}
99 |
100 |     # initialize time_usage_dict (per-map timing/size bookkeeping used by the modules below)
101 |     time_usage_dict = dict()
102 |     for ex_id in sample_map_df['external_id']:
103 |         time_usage_dict[ex_id] = {}
104 |
105 | expt_out_dir = os.path.join(output_folder, expt_name)
106 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff')
107 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
108 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name)
109 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name)
110 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name)
111 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/')
112 |
113 | if not os.path.isdir(expt_out_dir):
114 | os.makedirs(expt_out_dir)
115 |
116 | # ------------------------ Get image dimension ------------------------------
117 | if module_get_dimension:
118 | for index, record in sample_map_df.iterrows():
119 | external_id = record.external_id
120 | # pdb.set_trace()
121 | if external_id not in external_id_to_img_path_dict:
122 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
123 | continue
124 |
125 | img_path = external_id_to_img_path_dict[external_id]
126 | map_name = os.path.basename(img_path).split('.')[0]
127 |
128 | try:
129 | width, height = get_img_dimension(img_path)
130 | except Exception as e:
131 |                 error_reason_dict[external_id] = {'img_path':img_path, 'error': e }
132 |                 continue  # width/height are undefined on failure, so skip this record
133 | time_usage_dict[external_id]['img_w'] = width
134 | time_usage_dict[external_id]['img_h'] = height
135 |
136 |
137 | # ------------------------- Generate geotiff ------------------------------
138 | time_start = time.time()
139 | if module_gen_geotiff:
140 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
141 |
142 | if not os.path.isdir(geotiff_output_dir):
143 | os.makedirs(geotiff_output_dir)
144 |
145 | # use converted jpg folder instead of original sid folder
146 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
147 | exe_ret = execute_command(run_geotiff_command, if_print_command)
148 |         if 'error' in exe_ret:
149 |             error = exe_ret['error']
150 |         elif 'time_usage' in exe_ret:
151 |             time_usage = exe_ret['time_usage']
152 |             # geotiff conversion runs once over the whole CSV (not per map), so log the total
153 |             logging.info('Geotiff generation took %.1f seconds', time_usage)
154 |
155 |
156 | time_geotiff = time.time()
157 |
158 |
159 | # ------------------------- Image cropping ------------------------------
160 | if module_cropping:
161 | for index, record in sample_map_df.iterrows():
162 | external_id = record.external_id
163 |
164 | if external_id not in external_id_to_img_path_dict:
165 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
166 | continue
167 |
168 | img_path = external_id_to_img_path_dict[external_id]
169 | map_name = os.path.basename(img_path).split('.')[0]
170 |
171 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
172 | if not os.path.isdir(cropping_output_dir):
173 | os.makedirs(cropping_output_dir)
174 |
175 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
176 |
177 | exe_ret = execute_command(run_crop_command, if_print_command)
178 |
179 | if 'error' in exe_ret:
180 | error = exe_ret['error']
181 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
182 | elif 'time_usage' in exe_ret:
183 | time_usage = exe_ret['time_usage']
184 | time_usage_dict[external_id]['cropping'] = time_usage
185 | else:
186 | raise NotImplementedError
187 |
188 |
189 | time_cropping = time.time()
190 |
191 | # ------------------------- Text Spotting (patch level) ------------------------------
192 | if module_text_spotting:
193 | assert os.path.exists(spotter_config), "Config file for spotter must exist!"
194 | os.chdir(text_spotting_model_dir)
195 |
196 | for index, record in sample_map_df.iterrows():
197 |
198 | external_id = record.external_id
199 | if external_id not in external_id_to_img_path_dict:
200 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
201 | continue
202 |
203 | img_path = external_id_to_img_path_dict[external_id]
204 | map_name = os.path.basename(img_path).split('.')[0]
205 |
206 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name)
207 | if not os.path.isdir(map_spotting_output_dir):
208 | os.makedirs(map_spotting_output_dir)
209 |
210 | if spotter_model == 'abcnet':
211 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
212 | elif spotter_model == 'testr':
213 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}'
214 | elif spotter_model == 'spotter_v2':
215 | run_spotting_command = f'CUDA_VISIBLE_DEVICES={gpu_id} python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}'
216 | print(run_spotting_command)
217 | else:
218 | raise NotImplementedError
219 |
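                # silence the spotter's stdout; stderr is still captured by execute_command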
220 | run_spotting_command += ' 1> /dev/null'
221 |
222 |
223 |
224 | exe_ret = execute_command(run_spotting_command, if_print_command)
225 | if 'error' in exe_ret:
226 | error = exe_ret['error']
227 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
228 | # elif 'time_usage' in exe_ret:
229 | # time_usage = exe_ret['time_usage']
230 | # time_usage_dict[external_id]['spotting'] = time_usage
231 | # else:
232 | # raise NotImplementedError
233 |
234 | logging.info('Done text spotting for %s', map_name)
235 | time_text_spotting = time.time()
236 |
237 |
238 | # ------------------------- Image coord geojson (map level) ------------------------------
239 | if module_img_geojson:
240 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
241 |
242 | if not os.path.isdir(stitch_output_dir):
243 | os.makedirs(stitch_output_dir)
244 |
245 | for index, record in sample_map_df.iterrows():
246 | external_id = record.external_id
247 | if external_id not in external_id_to_img_path_dict:
248 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
249 | continue
250 |
251 | img_path = external_id_to_img_path_dict[external_id]
252 | map_name = os.path.basename(img_path).split('.')[0]
253 |
254 | stitch_input_dir = os.path.join(spotting_output_dir, map_name)
255 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
256 |
257 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
258 |
259 | exe_ret = execute_command(run_stitch_command, if_print_command)
260 |
261 | if 'error' in exe_ret:
262 | error = exe_ret['error']
263 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
264 | elif 'time_usage' in exe_ret:
265 | time_usage = exe_ret['time_usage']
266 | time_usage_dict[external_id]['stitch'] = time_usage
267 | else:
268 | raise NotImplementedError
269 |
270 | time_img_geojson = time.time()
271 |
272 | # ------------------------- post-OCR ------------------------------
273 | if module_post_ocr:
274 |
275 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr'))
276 |
277 | if not os.path.isdir(postocr_output_dir):
278 | os.makedirs(postocr_output_dir)
279 |
280 | for index, record in sample_map_df.iterrows():
281 |
282 | external_id = record.external_id
283 | if external_id not in external_id_to_img_path_dict:
284 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
285 | continue
286 |
287 | img_path = external_id_to_img_path_dict[external_id]
288 | map_name = os.path.basename(img_path).split('.')[0]
289 |
290 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson')
291 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson')
292 |
293 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file
294 |
295 | exe_ret = execute_command(run_postocr_command, if_print_command)
296 |
297 | if 'error' in exe_ret:
298 | error = exe_ret['error']
299 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
300 | elif 'time_usage' in exe_ret:
301 | time_usage = exe_ret['time_usage']
302 | time_usage_dict[external_id]['postocr'] = time_usage
303 | else:
304 | raise NotImplementedError
305 |
306 | time_post_ocr = time.time()
307 |
308 |
309 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------
310 | if module_geocoord_geojson:
311 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter'))
312 |
313 | if not os.path.isdir(geojson_output_dir):
314 | os.makedirs(geojson_output_dir)
315 |
316 | for index, record in sample_map_df.iterrows():
317 | external_id = record.external_id
318 | if external_id not in external_id_to_img_path_dict:
319 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
320 | continue
321 |             img_path = external_id_to_img_path_dict[external_id]  # needed for error logging below
322 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson"
323 |
324 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir)
325 |
326 | exe_ret = execute_command(run_converter_command, if_print_command)
327 |
328 | if 'error' in exe_ret:
329 | error = exe_ret['error']
330 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
331 | elif 'time_usage' in exe_ret:
332 | time_usage = exe_ret['time_usage']
333 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage
334 | else:
335 | raise NotImplementedError
336 |
337 | time_geocoord_geojson = time.time()
338 |
339 | # ------------------------- Link entities in OSM ------------------------------
340 | if module_entity_linking:
341 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker'))
342 |
343 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/')
344 | if not os.path.isdir(geojson_output_dir):
345 | os.makedirs(geojson_output_dir)
346 |
347 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir
348 | execute_command(run_linker_command, if_print_command)
349 |
350 | time_entity_linking = time.time()
351 |
352 |
353 | # --------------------- Time usage logging --------------------------
354 | # print('\n')
355 | # logging.info('Time for generating geotiff: %d', time_geotiff - time_start)
356 | # logging.info('Time for Cropping : %d',time_cropping - time_geotiff)
357 | # logging.info('Time for text spotting : %d',time_text_spotting - time_cropping)
358 | # logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting)
359 | # logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_img_geojson)
360 | # logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson)
361 | # logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson)
362 |
363 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index')
364 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv')
365 |
366 | # # check if exist time_usage log file
367 | # if os.path.isfile(time_usage_log_path):
368 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str})
369 | # # if exist duplicate columns, ret time usage values to the latest run
370 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns)
371 |
372 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use])
373 |
374 | # # make sure time_usage_expt_name.csv always have the latest time usage
375 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time
376 | # m_time = os.path.getmtime(time_usage_log_path)
377 | # dt_m = datetime.datetime.fromtimestamp(m_time)
378 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S")
379 |
380 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv')
381 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path
382 | # execute_command(run_command, if_print_command)
383 |
384 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id')
385 |
386 | # --------------------- Error logging --------------------------
387 | print('\n')
388 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p")
389 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index')
390 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv')
391 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id')
392 |
393 |
394 | def main():
395 | parser = argparse.ArgumentParser()
396 |
397 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/')
398 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/')
399 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv
400 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output
401 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
402 |
403 | parser.add_argument('--module_get_dimension', default=False, action='store_true')
404 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true')
405 | parser.add_argument('--module_cropping', default=False, action='store_true')
406 | parser.add_argument('--module_text_spotting', default=False, action='store_true')
407 | parser.add_argument('--module_img_geojson', default=False, action='store_true')
408 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true')
409 | parser.add_argument('--module_entity_linking', default=False, action='store_true')
410 | parser.add_argument('--module_post_ocr', default=False, action='store_true')
411 |
412 | parser.add_argument('--spotter_model', type=str, default='spotter_v2', choices=['abcnet', 'testr', 'spotter_v2'],
413 |                         help='Select text spotting model option from ["abcnet", "testr", "spotter_v2"]') # must match choices above
414 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml',
415 | help='Path to the config file for text spotting model')
416 | parser.add_argument('--spotter_expt_name', type=str, default='exp',
417 | help='Name of spotter experiment, if empty using config file name')
418 | # python run.py --text_spotting_model_dir /home/maplord/rumsey/testr_v2/TESTR/
419 | # --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_splits/luna_omo_metadata_56628_20220724.csv
420 | # --expt_name 57k_maps_r2 --module_text_spotting
421 | # --spotter_model testr_v2 --spotter_config /home/maplord/rumsey/testr_v2/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap
422 |
423 | parser.add_argument('--print_command', default=False, action='store_true')
424 | parser.add_argument('--gpu_id', type=int, default=0)
425 |
426 |
427 | args = parser.parse_args()
428 | print('\n')
429 | print(args)
430 | print('\n')
431 |
432 | run_pipeline(args)
433 |
434 |
435 |
436 | if __name__ == '__main__':
437 |
438 | main()
439 |
440 |
441 |
--------------------------------------------------------------------------------
/run_img.py:
--------------------------------------------------------------------------------
1 | import os
2 | import subprocess
3 | import glob
4 | import argparse
5 | import time
6 | import logging
7 | import pandas as pd
8 | import pdb
9 | import datetime
10 | from PIL import Image
11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no
12 |
13 |
14 |
15 | # This script handles the case where the input is a folder of images (rather than a metadata CSV).
16 | # Tested image folder: /home/maplord/rumsey/mapkurator-system/data/100_maps_crop/crop_leeje_2/test_run_img/
17 | logging.basicConfig(level=logging.INFO)
18 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
19 |
20 | # def execute_command(command, if_print_command):
21 | # t1 = time.time()
22 |
23 | # if if_print_command:
24 | # print(command)
25 | # os.system(command)
26 |
27 | # t2 = time.time()
28 | # time_usage = t2 - t1
29 | # return time_usage
30 |
31 | def execute_command(command, if_print_command):
32 | t1 = time.time()
33 |
34 | if if_print_command:
35 | print(command)
36 |
37 | try:
38 |         subprocess.run(command, shell=True, check=True, capture_output=True)
39 | t2 = time.time()
40 | time_usage = t2 - t1
41 | return {'time_usage':time_usage}
42 | except subprocess.CalledProcessError as err:
43 | error = err.stderr.decode('utf8')
44 | # format error message to one line
45 | error = error.replace('\n','\t')
46 | error = error.replace(',',';')
47 | return {'error': error}
48 |
49 |
50 | def get_img_dimension(img_path):
51 | map_img = Image.open(img_path)
52 | width, height = map_img.size
53 |
54 | return width, height
55 |
56 |
57 | def run_pipeline(args):
58 | # ------------------------- Pass arguments -----------------------------------------
59 | map_kurator_system_dir = args.map_kurator_system_dir
60 | text_spotting_model_dir = args.text_spotting_model_dir
61 | sample_map_path = args.sample_map_csv_path
62 | expt_name = args.expt_name
63 | output_folder = args.output_folder
64 |
65 | module_get_dimension = args.module_get_dimension
66 | module_gen_geotiff = args.module_gen_geotiff
67 | module_cropping = args.module_cropping
68 | module_text_spotting = args.module_text_spotting
69 | module_img_geojson = args.module_img_geojson
70 | module_geocoord_geojson = args.module_geocoord_geojson
71 | module_entity_linking = args.module_entity_linking
72 | module_post_ocr = args.module_post_ocr
73 |
74 | spotter_model = args.spotter_model
75 | spotter_config = args.spotter_config
76 | spotter_expt_name = args.spotter_expt_name
77 |
78 | if_print_command = args.print_command
79 |
80 |
81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
82 |
83 | # ------------------------- Read sample map list and prepare output dir ----------------
84 |
85 |
86 | # if input_csv_path[-4:] == '.csv':
87 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
88 | # elif input_csv_path[-4:] == '.tsv':
89 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t')
90 | # else:
91 | # raise NotImplementedError
92 |
93 |     input_img_path = sample_map_path
94 |     # build the map list from the image folder; os.path.join tolerates a missing trailing
95 |     # slash, and this avoids DataFrame.append, which is removed in pandas >= 2.0
96 |     img_paths = [os.path.join(input_img_path, f) for f in sorted(os.listdir(input_img_path))]
97 |     sample_map_df = pd.DataFrame({'external_id': img_paths})
98 |
99 | # ------------------------- Read image and prepare output dir ----------------
100 |
101 |
102 | # # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path)
103 | # external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path)
104 |
105 | # # initialize error reason dict
106 | # error_reason_dict = dict()
107 | # for ex_id in unmatched_external_id_list:
108 | # error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'}
109 |
110 | # initialize time_usage_dict
111 | # time_usage_dict = dict()
112 | # for ex_id in sample_map_df['external_id']:
113 | # time_usage_dict[ex_id] = {}
114 |
115 | expt_out_dir = os.path.join(output_folder, expt_name)
116 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff')
117 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
118 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name)
119 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name)
120 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name)
121 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/')
122 |
123 |
124 |
125 | if not os.path.isdir(expt_out_dir):
126 | os.makedirs(expt_out_dir)
127 |
128 | # ------------------------ Get image dimension ------------------------------
129 | if module_get_dimension:
130 | for index, record in sample_map_df.iterrows():
131 | external_id = record.external_id
132 | # pdb.set_trace()
133 | # if external_id not in external_id_to_img_path_dict:
134 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
135 | # continue
136 |
137 | img_path = sample_map_df['external_id'].iloc[index]
138 | # print("img_path",img_path)
139 | map_name = os.path.basename(img_path).split('.')[0]
140 | # print("map_name",map_name)
141 | width, height = get_img_dimension(img_path)
142 |
143 |
144 | # time_usage_dict[external_id]['img_w'] = width
145 | # time_usage_dict[external_id]['img_h'] = height
146 |
147 |
148 | # ------------------------- Generate geotiff ------------------------------
149 | time_start = time.time()
150 | if module_gen_geotiff:
151 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
152 |
153 | if not os.path.isdir(geotiff_output_dir):
154 | os.makedirs(geotiff_output_dir)
155 |
156 | # use converted jpg folder instead of original sid folder
157 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_img_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
158 | exe_ret = execute_command(run_geotiff_command, if_print_command)
159 | if 'error' in exe_ret:
160 | error = exe_ret['error']
161 | elif 'time_usage' in exe_ret:
162 | time_usage = exe_ret['time_usage']
163 |
164 | # time_usage_dict[external_id]['geotiff'] = time_usage
165 |
166 |
167 | time_geotiff = time.time()
168 |
169 |
170 | # ------------------------- Image cropping ------------------------------
171 | if module_cropping:
172 | for index, record in sample_map_df.iterrows():
173 | external_id = record.external_id
174 |
175 | # if external_id not in external_id_to_img_path_dict:
176 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
177 | # continue
178 |
179 | img_path = sample_map_df['external_id'].iloc[index]
180 | map_name = os.path.basename(img_path).split('.')[0]
181 |
182 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
183 | if not os.path.isdir(cropping_output_dir):
184 | os.makedirs(cropping_output_dir)
185 |
186 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
187 |
188 | exe_ret = execute_command(run_crop_command, if_print_command)
189 |
190 | # if 'error' in exe_ret:
191 | # error = exe_ret['error']
192 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
193 | # if 'time_usage' in exe_ret:
194 | # time_usage = exe_ret['time_usage']
195 | # time_usage_dict[external_id]['cropping'] = time_usage
196 | # else:
197 | # raise NotImplementedError
198 |
199 |
200 | time_cropping = time.time()
201 |
202 | # ------------------------- Text Spotting (patch level) ------------------------------
203 | if module_text_spotting:
204 | assert os.path.exists(spotter_config), "Config file for spotter must exist!"
205 | os.chdir(text_spotting_model_dir)
206 |
207 | for index, record in sample_map_df.iterrows():
208 |
209 | external_id = record.external_id
210 | # if external_id not in external_id_to_img_path_dict:
211 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
212 | # continue
213 |
214 | img_path = sample_map_df['external_id'].iloc[index]
215 | map_name = os.path.basename(img_path).split('.')[0]
216 |
217 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name)
218 | if not os.path.isdir(map_spotting_output_dir):
219 | os.makedirs(map_spotting_output_dir)
220 |
221 | print(os.path.join(cropping_output_dir,map_name))
222 | if spotter_model == 'abcnet':
223 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
224 | elif spotter_model == 'testr':
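                # --opts overrides MODEL.TRANSFORMER.INFERENCE_TH_TEST in the TESTR config,
                # presumably lowering the test-time score threshold to keep more detections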
225 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.TRANSFORMER.INFERENCE_TH_TEST 0.3'
226 | # print(run_spotting_command)
227 | else:
228 | raise NotImplementedError
229 |
230 | run_spotting_command += ' 1> /dev/null'
231 |
232 | exe_ret = execute_command(run_spotting_command, if_print_command)
233 |
234 | # if 'error' in exe_ret:
235 | # error = exe_ret['error']
236 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
237 | # elif 'time_usage' in exe_ret:
238 | # time_usage = exe_ret['time_usage']
239 | # time_usage_dict[external_id]['spotting'] = time_usage
240 | # else:
241 | # raise NotImplementedError
242 |
243 | logging.info('Done text spotting for %s', map_name)
244 | time_text_spotting = time.time()
245 |
246 |
247 | # ------------------------- Image coord geojson (map level) ------------------------------
248 | if module_img_geojson:
249 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
250 |
251 | if not os.path.isdir(stitch_output_dir):
252 | os.makedirs(stitch_output_dir)
253 |
254 | for index, record in sample_map_df.iterrows():
255 | external_id = record.external_id
256 | # if external_id not in external_id_to_img_path_dict:
257 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
258 | # continue
259 |
260 | img_path = sample_map_df['external_id'].iloc[index]
261 | map_name = os.path.basename(img_path).split('.')[0]
262 |
263 | stitch_input_dir = os.path.join(spotting_output_dir, map_name)
264 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
265 |
266 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
267 |
268 |
269 | exe_ret = execute_command(run_stitch_command, if_print_command)
270 |
271 | # if 'error' in exe_ret:
272 | # error = exe_ret['error']
273 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
274 | # elif 'time_usage' in exe_ret:
275 | # time_usage = exe_ret['time_usage']
276 | # time_usage_dict[external_id]['stitch'] = time_usage
277 | # else:
278 | # raise NotImplementedError
279 |
280 | time_img_geojson = time.time()
281 |
282 | # ------------------------- post-OCR ------------------------------
283 | if module_post_ocr:
284 |
285 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr'))
286 |
287 | if not os.path.isdir(postocr_output_dir):
288 | os.makedirs(postocr_output_dir)
289 |
290 | for index, record in sample_map_df.iterrows():
291 |
292 | external_id = record.external_id
293 | # if external_id not in external_id_to_img_path_dict:
294 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
295 | # continue
296 |
297 | img_path = sample_map_df['external_id'].iloc[index]
298 | map_name = os.path.basename(img_path).split('.')[0]
299 |
300 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson')
301 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson')
302 | print('input_geojson_file',input_geojson_file)
303 | print('geojson_postocr_output_file',geojson_postocr_output_file)
304 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file
305 |
306 | exe_ret = execute_command(run_postocr_command, if_print_command)
307 | print('exe_ret',exe_ret)
308 | # if 'error' in exe_ret:
309 | # error = exe_ret['error']
310 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
311 | # elif 'time_usage' in exe_ret:
312 | # time_usage = exe_ret['time_usage']
313 | # time_usage_dict[external_id]['postocr'] = time_usage
314 | # else:
315 | # raise NotImplementedError
316 |
317 | time_post_ocr = time.time()
318 |
319 |
320 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------
321 | if module_geocoord_geojson:
322 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter'))
323 |
324 | if not os.path.isdir(geojson_output_dir):
325 | os.makedirs(geojson_output_dir)
326 |
327 | for index, record in sample_map_df.iterrows():
328 | external_id = record.external_id
329 | # if external_id not in external_id_to_img_path_dict:
330 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
331 | # continue
332 |             # external_id holds the full image path here, so derive the map name from its basename
333 |             in_geojson = os.path.join(postocr_output_dir, os.path.basename(external_id).split('.')[0] + '.geojson')
334 |
335 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_img_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir)
336 |
337 | exe_ret = execute_command(run_converter_command, if_print_command)
338 |
339 | # if 'error' in exe_ret:
340 | # error = exe_ret['error']
341 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
342 | # elif 'time_usage' in exe_ret:
343 | # time_usage = exe_ret['time_usage']
344 | # time_usage_dict[external_id]['geocoord_geojson'] = time_usage
345 | # else:
346 | # raise NotImplementedError
347 |
348 | time_geocoord_geojson = time.time()
349 |
350 | # ------------------------- Link entities in OSM ------------------------------
351 | if module_entity_linking:
352 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker'))
353 |
354 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/')
355 | if not os.path.isdir(geojson_output_dir):
356 | os.makedirs(geojson_output_dir)
357 |
358 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_img_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir
359 | execute_command(run_linker_command, if_print_command)
360 |
361 | time_entity_linking = time.time()
362 |
363 |
364 | # --------------------- Time usage logging --------------------------
365 | print('\n')
366 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start)
367 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff)
368 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping)
369 |     logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting)
370 |     logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson)
371 |     logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_post_ocr)
372 |     logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson)
373 |
374 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index')
375 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv')
376 |
377 | # check if exist time_usage log file
378 | # if os.path.isfile(time_usage_log_path):
379 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str})
380 | # # if exist duplicate columns, ret time usage values to the latest run
381 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns)
382 |
383 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use])
384 |
385 | # # make sure time_usage_expt_name.csv always have the latest time usage
386 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time
387 | # m_time = os.path.getmtime(time_usage_log_path)
388 | # dt_m = datetime.datetime.fromtimestamp(m_time)
389 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S")
390 |
391 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv')
392 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path
393 | # execute_command(run_command, if_print_command)
394 |
395 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id')
396 |
397 | # --------------------- Error logging --------------------------
398 | # print('\n')
399 | # current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p")
400 | # error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index')
401 | # error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv')
402 | # error_reason_df.to_csv(error_reason_log_path, index_label='external_id')
403 |
404 |
405 | def main():
406 | parser = argparse.ArgumentParser()
407 |
408 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/')
409 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/')
410 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv
411 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output
412 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
413 |
414 | parser.add_argument('--module_get_dimension', default=False, action='store_true')
415 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true')
416 | parser.add_argument('--module_cropping', default=False, action='store_true')
417 | parser.add_argument('--module_text_spotting', default=False, action='store_true')
418 | parser.add_argument('--module_img_geojson', default=False, action='store_true')
419 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true')
420 | parser.add_argument('--module_entity_linking', default=False, action='store_true')
421 | parser.add_argument('--module_post_ocr', default=False, action='store_true')
422 |
423 |
424 | parser.add_argument('--spotter_model', type=str, default='testr', choices=['abcnet', 'testr'],
425 | help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model
426 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml',
427 | help='Path to the config file for text spotting model')
428 | parser.add_argument('--spotter_expt_name', type=str, default='testr_syn',
429 | help='Name of spotter experiment, if empty using config file name')
430 | # python run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv --expt_name 57k_maps --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap
431 |
432 | parser.add_argument('--print_command', default=False, action='store_true')
433 |
434 |
435 | args = parser.parse_args()
436 | print('\n')
437 | print(args)
438 | print('\n')
439 |
440 | run_pipeline(args)
441 |
442 |
443 |
444 | if __name__ == '__main__':
445 |
446 | main()
447 |
448 |
449 |
--------------------------------------------------------------------------------
/run_leeje.py:
--------------------------------------------------------------------------------
1 | import os
2 | import subprocess
3 | import glob
4 | import argparse
5 | import time
6 | import logging
7 | import pandas as pd
8 | import pdb
9 | import datetime
10 | from PIL import Image
11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no
12 |
13 |
14 |
15 | ## Input: tiff images plus a metadata csv. Steps:
16 | # 1. read the csv
17 | # 2. convert images to the expected (tiff) format
18 | # 3. resolve image paths
19 | # 4. execute all modules
20 |
21 | logging.basicConfig(level=logging.INFO)
22 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
23 |
24 | # def execute_command(command, if_print_command):
25 | # t1 = time.time()
26 |
27 | # if if_print_command:
28 | # print(command)
29 | # os.system(command)
30 |
31 | # t2 = time.time()
32 | # time_usage = t2 - t1
33 | # return time_usage
34 |
35 | def execute_command(command, if_print_command):
36 | t1 = time.time()
37 |
38 | if if_print_command:
39 | print(command)
40 |
41 | try:
42 |         subprocess.run(command, shell=True, check=True, capture_output=True)
43 | t2 = time.time()
44 | time_usage = t2 - t1
45 | return {'time_usage':time_usage}
46 | except subprocess.CalledProcessError as err:
47 | error = err.stderr.decode('utf8')
48 | # format error message to one line
49 | error = error.replace('\n','\t')
50 | error = error.replace(',',';')
51 | return {'error': error}
52 |
53 |
54 | def get_img_dimension(img_path):
55 | map_img = Image.open(img_path)
56 | width, height = map_img.size
57 |
58 | return width, height
59 |
60 |
61 | def run_pipeline(args):
62 | # ------------------------- Pass arguments -----------------------------------------
63 | map_kurator_system_dir = args.map_kurator_system_dir
64 | text_spotting_model_dir = args.text_spotting_model_dir
65 | sample_map_path = args.sample_map_csv_path
66 | expt_name = args.expt_name
67 | output_folder = args.output_folder
68 |
69 | module_get_dimension = args.module_get_dimension
70 | module_gen_geotiff = args.module_gen_geotiff
71 | module_cropping = args.module_cropping
72 | module_text_spotting = args.module_text_spotting
73 | module_img_geojson = args.module_img_geojson
74 | module_geocoord_geojson = args.module_geocoord_geojson
75 | module_entity_linking = args.module_entity_linking
76 | module_post_ocr = args.module_post_ocr
77 |
78 | spotter_model = args.spotter_model
79 | spotter_config = args.spotter_config
80 | spotter_expt_name = args.spotter_expt_name
81 |
82 | if_print_command = args.print_command
83 |
84 |
85 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
86 |
87 | # ------------------------- Read sample map list and prepare output dir ----------------
88 | input_csv_path = sample_map_path
89 | if input_csv_path[-4:] == '.csv':
90 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
91 | elif input_csv_path[-4:] == '.tsv':
92 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t')
93 | else:
94 | raise NotImplementedError
95 |
96 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path)
97 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path)
98 |
99 | # initialize error reason dict
100 | error_reason_dict = dict()
101 | for ex_id in unmatched_external_id_list:
102 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'}
103 |
104 | # initialize time_usage_dict
105 | time_usage_dict = dict()
106 | for ex_id in sample_map_df['external_id']:
107 | time_usage_dict[ex_id] = {}
108 |
109 | expt_out_dir = os.path.join(output_folder, expt_name)
110 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff')
111 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
112 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name)
113 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name)
114 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name)
115 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/')
116 |
117 |
118 |
119 | if not os.path.isdir(expt_out_dir):
120 | os.makedirs(expt_out_dir)
121 |
122 | # ------------------------ Get image dimension ------------------------------
123 | if module_get_dimension:
124 | for index, record in sample_map_df.iterrows():
125 | external_id = record.external_id
126 | # pdb.set_trace()
127 | if external_id not in external_id_to_img_path_dict:
128 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
129 | continue
130 |
131 | img_path = external_id_to_img_path_dict[external_id]
132 | map_name = os.path.basename(img_path).split('.')[0]
133 |
134 | try:
135 | width, height = get_img_dimension(img_path)
136 | except Exception as e:
137 |                 error_reason_dict[external_id] = {'img_path':img_path, 'error': e }
138 |                 continue  # width/height are undefined on failure, so skip this record
139 | time_usage_dict[external_id]['img_w'] = width
140 | time_usage_dict[external_id]['img_h'] = height
141 |
142 |
143 | # ------------------------- Generate geotiff ------------------------------
144 | time_start = time.time()
145 | if module_gen_geotiff:
146 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
147 |
148 | if not os.path.isdir(geotiff_output_dir):
149 | os.makedirs(geotiff_output_dir)
150 |
151 | # use converted jpg folder instead of original sid folder
152 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
153 | exe_ret = execute_command(run_geotiff_command, if_print_command)
154 |         if 'error' in exe_ret:
155 |             error = exe_ret['error']
156 |         elif 'time_usage' in exe_ret:
157 |             time_usage = exe_ret['time_usage']
158 |             # geotiff conversion runs once over the whole CSV (not per map), so log the total
159 |             logging.info('Geotiff generation took %.1f seconds', time_usage)
160 |
161 |
162 | time_geotiff = time.time()
163 |
164 |
165 | # ------------------------- Image cropping ------------------------------
166 | if module_cropping:
167 | for index, record in sample_map_df.iterrows():
168 | external_id = record.external_id
169 |
170 | if external_id not in external_id_to_img_path_dict:
171 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
172 | continue
173 |
174 | img_path = external_id_to_img_path_dict[external_id]
175 | map_name = os.path.basename(img_path).split('.')[0]
176 |
177 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
178 | if not os.path.isdir(cropping_output_dir):
179 | os.makedirs(cropping_output_dir)
180 |
181 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
182 |
183 | exe_ret = execute_command(run_crop_command, if_print_command)
184 |
185 | if 'error' in exe_ret:
186 | error = exe_ret['error']
187 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
188 | elif 'time_usage' in exe_ret:
189 | time_usage = exe_ret['time_usage']
190 | time_usage_dict[external_id]['cropping'] = time_usage
191 | else:
192 | raise NotImplementedError
193 |
194 |
195 | time_cropping = time.time()
196 |
197 | # ------------------------- Text Spotting (patch level) ------------------------------
198 | if module_text_spotting:
199 | assert os.path.exists(spotter_config), "Config file for spotter must exist!"
200 | os.chdir(text_spotting_model_dir)
201 |
202 | for index, record in sample_map_df.iterrows():
203 |
204 | external_id = record.external_id
205 | if external_id not in external_id_to_img_path_dict:
206 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
207 | continue
208 |
209 | img_path = external_id_to_img_path_dict[external_id]
210 | map_name = os.path.basename(img_path).split('.')[0]
211 |
212 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name)
213 | if not os.path.isdir(map_spotting_output_dir):
214 | os.makedirs(map_spotting_output_dir)
215 |
216 | if spotter_model == 'abcnet':
217 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
218 | elif spotter_model == 'testr':
219 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}'
220 | else:
221 | raise NotImplementedError
222 |
223 | run_spotting_command += ' 1> /dev/null'
224 |
225 | exe_ret = execute_command(run_spotting_command, if_print_command)
226 |
227 | if 'error' in exe_ret:
228 | error = exe_ret['error']
229 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
230 | elif 'time_usage' in exe_ret:
231 | time_usage = exe_ret['time_usage']
232 | time_usage_dict[external_id]['spotting'] = time_usage
233 | else:
234 | raise NotImplementedError
235 |
236 | logging.info('Done text spotting for %s', map_name)
237 | time_text_spotting = time.time()
238 |
239 |
240 | # ------------------------- Image coord geojson (map level) ------------------------------
241 | if module_img_geojson:
242 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
243 |
244 | if not os.path.isdir(stitch_output_dir):
245 | os.makedirs(stitch_output_dir)
246 |
247 | for index, record in sample_map_df.iterrows():
248 | external_id = record.external_id
249 | if external_id not in external_id_to_img_path_dict:
250 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
251 | continue
252 |
253 | img_path = external_id_to_img_path_dict[external_id]
254 | map_name = os.path.basename(img_path).split('.')[0]
255 |
256 | stitch_input_dir = os.path.join(spotting_output_dir, map_name)
257 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
258 |
259 | run_stitch_command = 'python stitch_output.py --eval_only --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
260 |
261 | exe_ret = execute_command(run_stitch_command, if_print_command)
262 |
263 | if 'error' in exe_ret:
264 | error = exe_ret['error']
265 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
266 | elif 'time_usage' in exe_ret:
267 | time_usage = exe_ret['time_usage']
268 | time_usage_dict[external_id]['stitch'] = time_usage
269 | else:
270 | raise NotImplementedError
271 |
272 | time_img_geojson = time.time()
273 |
274 | # ------------------------- post-OCR ------------------------------
275 | if module_post_ocr:
276 |
277 |         # Skip maps whose post-OCR geojson already exists (resume support; an expt-specific
278 |         # path was previously hardcoded here)
279 |         geojson_postocr_output_dir_check = postocr_output_dir
279 | file_list = glob.glob(geojson_postocr_output_dir_check + '/*.geojson')
280 | file_list = sorted(file_list)
281 |
282 | existed = []
283 | for file in file_list:
284 | name = file.split('/')[-1].split('.')[0]
285 | existed.append(name)
286 | #####
287 |
288 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr'))
289 |
290 |         sample_map_df2 = sample_map_df.copy()  # copy so the original frame is not mutated
291 | sample_map_df2['external_id_process'] = sample_map_df2['external_id']
292 | sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.strip("'")
293 |         sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.replace('.', '', regex=False)  # literal dot, not a regex
294 | sample_map_df2 = sample_map_df2[~sample_map_df2['external_id_process'].isin(existed)]
295 |
296 | print(len(sample_map_df2))
297 | print(len(existed))
298 |
299 | #####
300 | for index, record in sample_map_df2.iterrows():
301 |
302 | external_id = record.external_id
303 | if external_id not in external_id_to_img_path_dict:
304 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
305 | continue
306 |
307 | img_path = external_id_to_img_path_dict[external_id]
308 | map_name = os.path.basename(img_path).split('.')[0]
309 |
310 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson')
311 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson')
312 |
313 |             if os.path.isfile(input_geojson_file):
314 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file
315 | exe_ret = execute_command(run_postocr_command, if_print_command)
316 |
317 | if 'error' in exe_ret:
318 | error = exe_ret['error']
319 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
320 | elif 'time_usage' in exe_ret:
321 | time_usage = exe_ret['time_usage']
322 | time_usage_dict[external_id]['postocr'] = time_usage
323 | else:
324 | raise NotImplementedError
325 |
326 | else:
327 | continue
328 |
329 | time_post_ocr = time.time()
330 |
331 |
332 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------
333 | if module_geocoord_geojson:
334 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter'))
335 |
336 | if not os.path.isdir(geojson_output_dir):
337 | os.makedirs(geojson_output_dir)
338 |
339 | for index, record in sample_map_df.iterrows():
340 | external_id = record.external_id
341 | if external_id not in external_id_to_img_path_dict:
342 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'}
343 | continue
344 |             img_path = external_id_to_img_path_dict[external_id]  # needed for error logging below
345 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson"
346 |
347 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir)
348 |
349 | exe_ret = execute_command(run_converter_command, if_print_command)
350 |
351 | if 'error' in exe_ret:
352 | error = exe_ret['error']
353 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error }
354 | elif 'time_usage' in exe_ret:
355 | time_usage = exe_ret['time_usage']
356 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage
357 | else:
358 | raise NotImplementedError
359 |
360 | time_geocoord_geojson = time.time()
361 |
362 | # ------------------------- Link entities in OSM ------------------------------
363 | if module_entity_linking:
364 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker'))
365 |
366 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/')
367 | if not os.path.isdir(geojson_output_dir):
368 | os.makedirs(geojson_output_dir)
369 |
370 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir
371 | execute_command(run_linker_command, if_print_command)
372 |
373 | time_entity_linking = time.time()
374 |
375 |
376 | # --------------------- Time usage logging --------------------------
377 | print('\n')
378 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start)
379 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff)
380 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping)
381 |     logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting)
382 |     logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson)
383 |     logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_post_ocr)
384 |     logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson)
385 |
386 | time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index')
387 | time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv')
388 |
389 |     # check whether a time_usage log file already exists
390 | if os.path.isfile(time_usage_log_path):
391 | existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str})
392 |         # if the old and new logs share columns, keep the values from the latest run
393 | cols_to_use = existing_df.columns.difference(time_usage_df.columns)
394 |
395 | time_usage_df = time_usage_df.join(existing_df[cols_to_use])
396 |
397 |         # make sure time_usage.csv always has the latest time usage
398 |         # move the old time_usage.csv to time_usage_[timestamp].csv, where timestamp is the previous run's modification time
399 | m_time = os.path.getmtime(time_usage_log_path)
400 | dt_m = datetime.datetime.fromtimestamp(m_time)
401 | timestr = dt_m.strftime("%Y%m%d-%H%M%S")
402 |
403 | deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv')
404 | run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path
405 | execute_command(run_command, if_print_command)
406 |
407 | time_usage_df.to_csv(time_usage_log_path, index_label='external_id')
408 |
409 | # --------------------- Error logging --------------------------
410 | print('\n')
411 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p")
412 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index')
413 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv')
414 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id')
415 |
416 |
417 | def main():
418 | parser = argparse.ArgumentParser()
419 |
420 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/')
421 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/')
422 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv
423 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output
424 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
425 |
426 | parser.add_argument('--module_get_dimension', default=False, action='store_true')
427 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true')
428 | parser.add_argument('--module_cropping', default=False, action='store_true')
429 | parser.add_argument('--module_text_spotting', default=False, action='store_true')
430 | parser.add_argument('--module_img_geojson', default=False, action='store_true')
431 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true')
432 | parser.add_argument('--module_entity_linking', default=False, action='store_true')
433 | parser.add_argument('--module_post_ocr', default=False, action='store_true')
434 |
435 |
436 | parser.add_argument('--spotter_model', type=str, default='testr', choices=['abcnet', 'testr'],
437 | help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model
438 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml',
439 | help='Path to the config file for text spotting model')
440 | parser.add_argument('--spotter_expt_name', type=str, default='testr_syn',
help='Name of spotter experiment; if empty, the config file name is used')
442 | # python run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv --expt_name 57k_maps --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap
443 |
444 | parser.add_argument('--print_command', default=False, action='store_true')
445 |
446 |
447 | args = parser.parse_args()
448 | print('\n')
449 | print(args)
450 | print('\n')
451 |
452 | run_pipeline(args)
453 |
454 |
455 |
456 | if __name__ == '__main__':
457 |
458 | main()
459 |
460 |
461 |
--------------------------------------------------------------------------------
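
Note: run.py branches on whether execute_command returns a dict containing an 'error' key or a 'time_usage' key. The helper itself is defined earlier in run.py; the following is only a minimal sketch of that contract, using subprocess instead of os.system (an assumption — run_sanborn.py's version below simply calls os.system and returns the elapsed time):

import subprocess
import time

def execute_command(command, if_print_command=False):
    """Run a shell command; return {'time_usage': seconds} on success
    or {'error': stderr_text} on failure (sketch of run.py's contract)."""
    if if_print_command:
        print(command)
    t1 = time.time()
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        return {'error': proc.stderr.strip()}
    return {'time_usage': time.time() - t1}
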
/run_only_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 |
4 | map_kurator_system_dir = '/home/zekun/dr_maps/mapkurator-system/'
5 | text_spotting_model_dir = '/home/zekun/antique_names/model/AdelaiDet/'
6 | sample_map_path = 'm1_geotiff/data/sample_US_jp2_100_maps.csv'
7 |
8 | # # run module1 to generate geotiff
9 | # os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
10 | # input_csv = os.path.join(map_kurator_system_dir ,sample_map_path)
11 | geotiff_output_dir = os.path.join(map_kurator_system_dir ,'m1_geotiff/data/geotiff')  # needed by the cropping loop below even when the geotiff step stays commented out
12 | # if not os.path.isdir(geotiff_output_dir):
13 | # os.makedirs(geotiff_output_dir)
14 |
15 | # run_geotiff_command = 'python convert_image_to_geotiff.py --sample_map_path '+ input_csv +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
16 | # print(run_geotiff_command)
17 | # #os.system(run_geotiff_command)
18 |
19 |
20 | # run module2: image cropping
21 |
22 | geotiff_path_list = glob.glob(os.path.join(geotiff_output_dir, '*.geotiff'))
23 | assert(len(geotiff_path_list) != 0)
24 |
25 | for geotiff_path in geotiff_path_list:
26 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
27 | map_name = os.path.basename(geotiff_path).split('.')[0]
28 |
29 | cropping_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop/')
30 | if not os.path.isdir(cropping_output_dir):
31 | os.makedirs(cropping_output_dir)
32 | run_crop_command = 'python crop_img.py --img_path '+geotiff_path + ' --output_dir '+ cropping_output_dir
33 | print(run_crop_command)
34 | os.system(run_crop_command)
35 |
36 | # run module2: text spotting
37 | os.chdir(text_spotting_model_dir)
38 | map_name = os.path.basename(geotiff_path).split('.')[0]
39 |
40 | spotting_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/',map_name)
41 | if not os.path.isdir(spotting_output_dir):
42 | os.makedirs(spotting_output_dir)
43 |
44 | run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input '+ map_kurator_system_dir+'/m2_detection_recognition/data/100_maps_crop/'+map_name+' --output '+ spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
45 | run_spotting_command += ' 1> /dev/null'
46 | print(run_spotting_command)
47 | os.system(run_spotting_command)
48 |
49 | #break
50 |
51 |
52 | # run module2: geojson stitching
53 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
54 |
55 |
56 | stitch_input_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/')
57 | stitch_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_geojson_abc/')
58 | if not os.path.isdir(stitch_output_dir):
59 | os.makedirs(stitch_output_dir)
60 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_dir ' + stitch_output_dir
61 | print(run_stitch_command)
62 | os.system(run_stitch_command)
63 |
64 |
--------------------------------------------------------------------------------
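
The scripts above derive map_name with os.path.basename(path).split('.')[0], which truncates any filename that itself contains a dot (utils.py works around this by stripping dots from external IDs). A hedged alternative, shown only as a sketch with a hypothetical helper name, keeps everything before the final extension:

import os

def map_name_from_path(path):
    # os.path.splitext removes only the last extension, so
    # 'g4362.ct002055.jp2' -> 'g4362.ct002055' instead of 'g4362'
    return os.path.splitext(os.path.basename(path))[0]
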
/run_sanborn.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import argparse
4 | import time
5 | import logging
6 | import pandas as pd
7 | import pdb
8 | import datetime
9 | from PIL import Image
10 | from utils import get_img_path_from_external_id
11 |
12 | logging.basicConfig(level=logging.INFO)
13 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
14 |
15 | '''
16 | This Sanborn processing pipeline shares several modules with the DR processing pipeline, including cropping and text spotting.
17 | The unique modules are geocoding, clustering, and output geojson generation.
18 | The GeoTIFF conversion, image dimension retrieval, img-to-geo coordinate conversion, and entity linking modules are removed.
19 | Time usage analysis and error reason logging are also removed.
20 | '''
21 |
22 | def execute_command(command, if_print_command):
23 | t1 = time.time()
24 |
25 | if if_print_command:
26 | print(command)
27 | os.system(command)
28 |
29 | t2 = time.time()
30 | time_usage = t2 - t1
31 | return time_usage
32 |
33 | def get_img_dimension(img_path):
34 | map_img = Image.open(img_path)
35 | width, height = map_img.size
36 |
37 | return width, height
38 |
39 |
40 | def run_pipeline(args):
41 | # ------------------------- Pass arguments -----------------------------------------
42 | map_kurator_system_dir = args.map_kurator_system_dir
43 | text_spotting_model_dir = args.text_spotting_model_dir
44 | # sample_map_path = args.sample_map_csv_path
45 | expt_name = args.expt_name
46 | output_folder = args.output_folder
47 | input_map_dir = args.input_map_dir
48 |
49 | module_get_dimension = args.module_get_dimension
50 | # module_gen_geotiff = args.module_gen_geotiff
51 | module_cropping = args.module_cropping
52 | module_text_spotting = args.module_text_spotting
53 | module_img_geojson = args.module_img_geojson
54 | # module_geocoord_geojson = args.module_geocoord_geojson
55 | # module_entity_linking = args.module_entity_linking
56 | module_geocoding = args.module_geocoding
57 | module_clustering = args.module_clustering
58 |
59 | spotter_option = args.spotter_option
60 | geocoder_option = args.geocoder_option
61 | api_key = args.api_key
62 | user_name = args.user_name
63 |
64 | metadata_tsv_path = args.metadata_tsv_path
65 |
66 | if_print_command = args.print_command
67 |
68 | sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
69 |
70 | file_list = os.listdir(input_map_dir)
71 |
72 |     file_list = [f for f in file_list if os.path.basename(f).split('.')[-1] in ['sid','jp2','png','jpg','jpeg','tiff','tif','geotiff']]
73 |
74 | print(len(file_list))
75 |
76 |
77 |
78 | # pdb.set_trace()
79 | # ------------------------- Read sample map list and prepare output dir ----------------
80 |
81 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
82 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_option)
83 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_option)
84 | # geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_option + '/')
85 | geocoding_output_dir = os.path.join(output_folder, expt_name, 'geocoding_suffix_' + spotter_option)
86 | clustering_output_dir = os.path.join(output_folder, expt_name, 'cluster_' + spotter_option + '/')
87 |
88 |
89 | # ------------------------- Image cropping ------------------------------
90 | if module_cropping:
91 | # for index, record in sample_map_df.iterrows():
92 | for file_path in file_list:
93 | img_path = os.path.join(input_map_dir, file_path)
94 | print(img_path)
95 | # external_id = record.external_id
96 | # img_path = external_id_to_img_path_dict[external_id]
97 |
98 | map_name = os.path.basename(img_path).split('.')[0]
99 |
100 |
101 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
102 | if not os.path.isdir(cropping_output_dir):
103 | os.makedirs(cropping_output_dir)
104 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
105 |
106 | time_usage = execute_command(run_crop_command, if_print_command)
107 | # time_usage_dict[external_id]['cropping'] = time_usage
108 |
109 | time_cropping = time.time()
110 |
111 | # # ------------------------- Text Spotting (patch level) ------------------------------
112 | if module_text_spotting:
113 | os.chdir(text_spotting_model_dir)
114 |
115 | # for index, record in sample_map_df.iterrows():
116 | for file_path in file_list:
117 | map_name = os.path.basename(file_path).split('.')[0]
118 |
119 | map_spotting_output_dir = os.path.join(spotting_output_dir,map_name)
120 | if not os.path.isdir(map_spotting_output_dir):
121 | os.makedirs(map_spotting_output_dir)
122 |
123 | if spotter_option == 'abcnet':
124 | run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+ map_spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
125 | elif spotter_option == 'testr':
126 | run_spotting_command = 'python demo/demo.py --output_json --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+map_spotting_output_dir +' --opts MODEL.WEIGHTS icdar15_testr_R_50_polygon.pth'
127 | else:
128 | raise NotImplementedError
129 |
130 | run_spotting_command += ' 1> /dev/null'
131 |
132 | time_usage = execute_command(run_spotting_command, if_print_command)
133 |
134 | logging.info('Done text spotting for %s', map_name)
135 |
136 | time_text_spotting = time.time()
137 |
138 |
139 | # # ------------------------- Image coord geojson (map level) ------------------------------
140 | if module_img_geojson:
141 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
142 | if not os.path.isdir(stitch_output_dir):
143 | os.makedirs(stitch_output_dir)
144 |
145 | for file_path in file_list:
146 | map_name = os.path.basename(file_path).split('.')[0]
147 |
148 | stitch_input_dir = os.path.join(spotting_output_dir, map_name)
149 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
150 |
151 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
152 | time_usage = execute_command(run_stitch_command, if_print_command)
153 | # time_usage_dict[external_id]['imgcoord_geojson'] = time_usage
154 |
155 | time_img_geojson = time.time()
156 |
157 | # # ------------------------- Geocoding ------------------------------
158 | if module_geocoding:
159 | os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
160 |
161 | if metadata_tsv_path is not None:
162 | map_df = pd.read_csv(metadata_tsv_path, sep='\t')
163 |
164 | if not os.path.isdir(geocoding_output_dir):
165 | os.makedirs(geocoding_output_dir)
166 |
167 | for file_path in file_list:
168 | map_name = os.path.basename(file_path).split('.')[0]
169 | if metadata_tsv_path is not None:
170 | suffix = map_df[map_df['filename'] == map_name]['City'].values[0] # LoC sanborn
171 | suffix = ', ' + suffix
172 | else:
173 | suffix = ', Los Angeles' # LA sanborn
174 |
175 | run_geocoding_command = 'python3 s1_geocoding.py --input_map_geojson_path='+ os.path.join(stitch_output_dir,map_name + '.geojson') + ' --output_folder=' + geocoding_output_dir + \
176 | ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
177 |
178 | time_usage = execute_command(run_geocoding_command, if_print_command)
179 |
180 | # break
181 |
182 | logging.info('Done geocoding for %s', map_name)
183 |
184 |
185 | time_geocoding = time.time()
186 |
187 |
188 | if module_clustering:
189 | os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
190 |
191 | if not os.path.isdir(clustering_output_dir):
192 | os.makedirs(clustering_output_dir)
193 |
194 | # for file_path in file_list:
195 | # map_name = os.path.basename(file_path).split('.')[0]
196 |
197 | # run_clustering_command = 'python3 s2_clustering.py --dataset_name='+ expt_name + ' --output_folder=' + geocoding_output_dir + \
198 | # ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
199 |
200 | # time_usage = execute_command(run_clustering_command, if_print_command)
201 |
202 |
203 |             # logging.info('Done clustering for %s', map_name)
204 |
205 |
206 | def main():
207 | parser = argparse.ArgumentParser()
208 |
209 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/zekun/dr_maps/mapkurator-system/')
210 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/zekun/antique_names/model/AdelaiDet/')
211 |
212 | parser.add_argument('--input_map_dir', type=str, default='/data2/mrm_sanborn_maps/LA_sanborn')
213 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output')
214 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
215 |
216 | parser.add_argument('--module_get_dimension', default=False, action='store_true')
217 | # parser.add_argument('--module_gen_geotiff', default=False, action='store_true') # only supports dr maps
218 | parser.add_argument('--module_cropping', default=False, action='store_true')
219 | parser.add_argument('--module_text_spotting', default=False, action='store_true')
220 | parser.add_argument('--module_img_geojson', default=False, action='store_true')
221 | parser.add_argument('--module_geocoding', default=False, action='store_true') # only supports sanborn
222 | # parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') # only supports dr maps
223 | # parser.add_argument('--module_entity_linking', default=False, action='store_true') # only supports dr maps
224 |     parser.add_argument('--module_clustering', default=False, action='store_true') # only supports sanborn maps
225 |
226 | parser.add_argument('--print_command', default=False, action='store_true')
227 |
228 | parser.add_argument('--spotter_option', type=str, default='testr',
229 | choices=['abcnet', 'testr'],
230 | help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model
231 |
232 | parser.add_argument('--geocoder_option', type=str, default='arcgis',
233 | choices=['arcgis', 'google','geonames','osm'],
234 |                         help='Select geocoder option from ["arcgis","google","geonames","osm"]') # select geocoder
235 |
236 | # params for geocoder:
237 | parser.add_argument('--api_key', type=str, default=None, help='api_key for geocoder. can be None if not running geocoding module')
238 | parser.add_argument('--user_name', type=str, default=None, help='user_name for geocoder. can be None if not running geocoding module')
239 | parser.add_argument('--metadata_tsv_path', type=str, default=None) # '/home/zekun/Sanborn/Sheet_List.tsv'
240 |
241 |
242 | args = parser.parse_args()
243 | print('\n')
244 | print(args)
245 | print('\n')
246 |
247 | run_pipeline(args)
248 |
249 |
250 | if __name__ == '__main__':
251 |
252 | main()
253 |
254 |
255 |
--------------------------------------------------------------------------------
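
For reference, a typical run_sanborn.py invocation, mirroring the example command embedded in run.py, might look like the following; the experiment name, key, and user are placeholders, not values from the repo, and every flag shown comes from the argparse definition above:

python run_sanborn.py --input_map_dir /data2/mrm_sanborn_maps/LA_sanborn \
    --expt_name la_sanborn --module_cropping --module_text_spotting \
    --module_img_geojson --module_geocoding --spotter_option testr \
    --geocoder_option arcgis --api_key <API_KEY> --user_name <USER> --print_command
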
/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import pandas as pd
4 | import ast
5 | import argparse
6 | import logging
7 | import pdb
8 |
9 | logging.basicConfig(level=logging.INFO)
10 |
11 | def func_file_to_fullpath_dict(file_path_list):
12 |
13 | file_fullpath_dict = dict()
14 | for file_path in file_path_list:
15 | file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path
16 |
17 | return file_fullpath_dict
18 |
19 | def get_img_path_from_external_id(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
20 |     # returns (1) a dict with external-id as key and full image path as value, (2) a list of external-ids for which no image path was found
21 |
22 | jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
23 | sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg'))
24 | add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
25 |
26 | jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
27 | sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
28 | add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
29 |
30 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
31 |
32 | external_id_to_img_path_dict = {}
33 |
34 | unmatched_external_id_list = []
35 |
36 | for index, record in sample_map_df.iterrows():
37 | external_id = record.external_id
38 | filename_without_extension = external_id.strip("'").replace('.','')
39 |
40 | full_path = ''
41 | if filename_without_extension in jp2_file_fullpath_dict:
42 | full_path = jp2_file_fullpath_dict[filename_without_extension]
43 | elif filename_without_extension in sid_file_fullpath_dict:
44 | full_path = sid_file_fullpath_dict[filename_without_extension]
45 | elif filename_without_extension in add_file_fullpath_dict:
46 | full_path = add_file_fullpath_dict[filename_without_extension]
47 | else:
48 | # print('image with external_id not found in image_dir:', external_id)
49 | unmatched_external_id_list.append(external_id)
50 | continue
51 | assert (len(full_path)!=0)
52 |
53 | external_id_to_img_path_dict[external_id] = full_path
54 |
55 | return external_id_to_img_path_dict, unmatched_external_id_list
56 |
57 | def get_img_path_from_external_id_and_image_no(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
58 |     # returns (1) a dict with external-id as key and full image path as value, (2) a list of external-ids for which no image path was found
59 |
60 | jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
61 | sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly
62 | add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
63 |
64 | jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
65 | sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
66 | add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
67 |
68 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
69 |
70 | external_id_to_img_path_dict = {}
71 |
72 | unmatched_external_id_list = []
73 | for index, record in sample_map_df.iterrows():
74 | external_id = record.external_id
75 | image_no = record.image_no
76 | # filename_without_extension = external_id.strip("'").replace('.','')
77 | filename_without_extension = image_no.strip("'").split('.')[0]
78 |
79 | full_path = ''
80 | if filename_without_extension in jp2_file_fullpath_dict:
81 | full_path = jp2_file_fullpath_dict[filename_without_extension]
82 | elif filename_without_extension in sid_file_fullpath_dict:
83 | full_path = sid_file_fullpath_dict[filename_without_extension]
84 | elif filename_without_extension in add_file_fullpath_dict:
85 | full_path = add_file_fullpath_dict[filename_without_extension]
86 | else:
87 | print('image with external_id not found in image_dir:', external_id)
88 | unmatched_external_id_list.append(external_id)
89 | continue
90 | assert (len(full_path)!=0)
91 |
92 | external_id_to_img_path_dict[external_id] = full_path
93 |
94 | return external_id_to_img_path_dict, unmatched_external_id_list
95 |
96 |
97 | if __name__ == '__main__':
98 |
99 | parser = argparse.ArgumentParser()
100 | parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/',
101 | help='image dir of jp2 files.')
102 | parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/',
103 | help='image dir of sid files.')
104 | parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/',
105 | help='image dir of additional luna files.')
106 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
107 | help='path to sample map csv, which contains gcps info')
108 | parser.add_argument('--external_id_key', type=str, default='external_id',
109 | help='key string for external id, could be external_id or ListNo')
110 |
111 | args = parser.parse_args()
112 | print(args)
113 |
114 | # get_img_path_from_external_id(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
115 | # sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
116 |
117 | get_img_path_from_external_id_and_image_no(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
118 | sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
119 |
--------------------------------------------------------------------------------
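
A minimal usage sketch for the helper above; the CSV path is the argparse default from utils.py, and the call assumes the default jp2/sid/luna image roots exist on disk:

from utils import get_img_path_from_external_id

# map external IDs from a sample CSV to on-disk image paths;
# IDs with no matching jp2/sid/luna image come back in the second list
id_to_path, unmatched = get_img_path_from_external_id(
    sample_map_path='data/initial_US_100_maps.csv')
print(len(id_to_path), 'matched;', len(unmatched), 'unmatched')
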