├── .gitignore ├── README.md ├── external_id_search └── script.py ├── m0_preprocessing └── convert_sid_to_jpg.py ├── m1_geotiff └── convert_image_to_geotiff.py ├── m2_detection_recognition └── crop_img.py ├── m3_image_geojson └── stitch_output.py ├── m4_geocoordinate_converter └── convert_geojson_to_geocoord.py ├── m5_entity_linker └── entity_linker.py ├── m6_post_ocr └── lexical_search.py ├── m_sanborn ├── s1_geocoding.py ├── s2_clustering.py └── s3_gen_geojson.py ├── metadata ├── davidrumsey │ ├── davidrumsey.py │ └── davidrumsey_metadata.csv └── sanborn.py ├── model_card_template ├── pipe_run.sh ├── pipe_run_img.sh ├── requirements.txt ├── run.py ├── run_img.py ├── run_leeje.py ├── run_only_eval.py ├── run_sanborn.py └── utils.py /.gitignore: -------------------------------------------------------------------------------- 1 | data/ 2 | data0/ 3 | data1/ 4 | rumsey_output/ 5 | .idea/ 6 | .env 7 | MrSID* 8 | __pycache__ 9 | debug/ 10 | .ipynb_checkpoints/ 11 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | --- 3 | 4 | # Table of Contents 5 | - [Dataset Card](#dataset-card) 6 | - [Dataset Description](#dataset-description) 7 | - [Dataset Download Link](#dataset-download-link) 8 | - [Dataset Languages](#dataset-languages) 9 | - [Dataset Structure](#dataset-structure) 10 | - [Data Fields](#data-fields) 11 | - [Model Card](#model-card) 12 | - [Model Description](#model-description) 13 | - [Model Summary](#model-summary) 14 | - [Model Tags](#model-tags) 15 | - [Model Input and Output](#model-input-and-output) 16 | - [Additional Information](#additional-information) 17 | - [Licensing Information](#licensing-information) 18 | - [Contributions](#contributions) 19 | 20 | 21 | # Dataset Card 22 | 23 | ## Dataset Description 24 | 25 | Map text recognized from the georeferenced Rumsey historical map collection. 26 | 27 | ### Dataset Download Link 28 | 29 | - **Original Map Images:** https://www.davidrumsey.com/ 30 | - **Processed Output:** https://s3.msi.umn.edu/rumsey_output/geojson_testr_syn_54119.zip 31 | 32 | ### Dataset Languages 33 | 34 | English 35 | 36 | ### Language Creators: 37 | 38 | Machine-generated 39 | 40 | ## Dataset Structure 41 | 42 | ### Data Fields 43 | 44 | 45 | 46 | ### Output File Name 47 | 48 | Output geojson file is named after the external ID of origina map image. 49 | 50 | 51 | 52 | 53 | 54 | # Model Card 55 | 56 | ## Model Description 57 | 58 | A **fully automatic** pipeline to process a large amount of scanned historical map images. **Outputs** include the recognized text labels, label bounding polygons, labels after post-OCR correction and geo-entity identifier in OSM database. 59 | 60 | ### Model Summary 61 | 62 | - **Orange boxes:** Modules in the pipeline 63 | - **Blue boxes:** Inputs of the modules 64 | - **Green boxes:** Outputs of the modules 65 | 66 | image 67 | 68 | ### Model Details 69 | - **ImageCropping** module divides huge map images (>10K pixels) to smaller image patches (1K pixels) so that TextSpotter could process. 70 | 71 | - **PatchTextSpotter** uses a state-of-the-art network architecture [TESTR](https://github.com/mlpc-ucsd/TESTR) for detecting and recognizing text labels on image patches. Due to the lack of annotated samples for training, we create a set of synthetic maps to mimic the text styles (e.g., font, spacing, orientation) in the real historical maps. 
We place the location names from OpenStreetMap on a map by considering the shape of the location geometry and merge the text with various background styles extracted from the Rumsey collection maps. We train the model with these unlimited synthetic maps and apply the model to the historical maps. 72 | 73 | - **PatchtoMapMerging** is the module to merge the patch-level spotting results into map-level. 74 | 75 | - **GeocoordinateConverter** converts the text label bounding polygons from image coordinates system to geocoordinates system. Note: polygons in both coordinate systems are saved in the output. 76 | 77 | - **PostOCR** helps to verify the output and correct misspelled words from PatchTextSpotter using the OpenStreetMap dictionary. PostOCR module finds words' candidates using [fuzzy query function](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html) from elasticsearch, which contains the place name attribute from the Openstreetmap dictionary. Once PostOCR module identifies words' candidates, the module picks one candidate by the word popularity from the dictionary. 78 | 79 | - **EntityLinker** links each map text to the candidate geo-entities in the OpenStreetMap. The entity linking retrieves the candidates that satisfy two criteria: 1) the recognized text from the text spotter contains the geo-entities' name 2) the geocoordinates of detected bounding polygons intersect with the geo-entities' geometry. (Geo-coordinates are obtained from GeocoordConverter) 80 | 81 | 82 | ### How To Use 83 | All the modules can be launched from `run.py`. All the outputs will be saved in the `expt_name` subfolder in `output_folder` specified in the input arguments. 84 | 85 | ``` 86 | usage: run.py [-h] [--map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR] [--text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR] 87 | [--sample_map_csv_path SAMPLE_MAP_CSV_PATH] [--output_folder OUTPUT_FOLDER] [--expt_name EXPT_NAME] [--module_get_dimension] 88 | [--module_gen_geotiff] [--module_cropping] [--module_text_spotting] [--module_img_geojson] [--module_geocoord_geojson] [--module_entity_linking] 89 | [--module_post_ocr] [--spotter_model {abcnet,testr}] [--spotter_config SPOTTER_CONFIG] [--spotter_expt_name SPOTTER_EXPT_NAME] [--print_command] 90 | 91 | optional arguments: 92 | -h, --help show this help message and exit 93 | --map_kurator_system_dir MAP_KURATOR_SYSTEM_DIR 94 | --text_spotting_model_dir TEXT_SPOTTING_MODEL_DIR 95 | --sample_map_csv_path SAMPLE_MAP_CSV_PATH 96 | --output_folder OUTPUT_FOLDER 97 | --expt_name EXPT_NAME 98 | --module_get_dimension 99 | --module_gen_geotiff 100 | --module_cropping 101 | --module_text_spotting 102 | --module_img_geojson 103 | --module_geocoord_geojson 104 | --module_entity_linking 105 | --module_post_ocr 106 | --spotter_model {abcnet,testr} 107 | Select text spotting model option from ["abcnet","testr"] 108 | --spotter_config SPOTTER_CONFIG 109 | Path to the config file for text spotting model 110 | --spotter_expt_name SPOTTER_EXPT_NAME 111 | Name of spotter experiment, if empty using config file name 112 | --print_command 113 | ``` 114 | 115 | ### Model Tags 116 | - Text spotting 117 | - Entity Linking 118 | - Historical maps 119 | 120 | 121 | # Additional Information 122 | 123 | ### Licensing Information 124 | 125 | MIT License 126 | 127 | ### Contribution and Acknowledgement 128 | 129 | Thanks to [@zekun-li](https://zekun-li.github.io/),[@Jina-Kim](https://github.com/Jina-Kim), [@MinNamgung](https://github.com/MinNamgung) and 
[@linyijun](https://github.com/linyijun) for adding this dataset and models. 130 | 131 | Thanks to [TESTR](https://github.com/mlpc-ucsd/TESTR) for an open-source text spotting model. 132 | -------------------------------------------------------------------------------- /external_id_search/script.py: -------------------------------------------------------------------------------- 1 | from elasticsearch_dsl import Search, Q 2 | from elasticsearch import Elasticsearch, helpers 3 | from elasticsearch import RequestsHttpConnection 4 | import argparse 5 | import os 6 | import glob 7 | import json 8 | import nltk 9 | import logging 10 | from dotenv import load_dotenv 11 | 12 | import pandas as pd 13 | import numpy as np 14 | import logging 15 | import re 16 | import warnings 17 | warnings.filterwarnings("ignore") 18 | 19 | 20 | 21 | def db_connect(): 22 | """Elasticsearch Connection on Sansa""" 23 | load_dotenv() 24 | 25 | DB_HOST = os.getenv("DB_HOST") 26 | USER_NAME = os.getenv("DB_USERNAME") 27 | PASSWORD = os.getenv("DB_PASSWORD") 28 | 29 | es = Elasticsearch([DB_HOST], connection_class=RequestsHttpConnection, http_auth=(USER_NAME, PASSWORD), verify_certs=False) 30 | return es 31 | 32 | 33 | def query(target): 34 | es = db_connect() 35 | inputs = target.upper() 36 | query = {"query": {"match": {"text": f"{inputs}"}}} 37 | test = es.search(index="meta", body=query, size=10000)["hits"]["hits"] 38 | 39 | id_list = [] 40 | if len(test) != 0 : 41 | for i in range(len(test)): 42 | map_id = test[i]['_source']['external_id'] 43 | id_list.append(map_id) 44 | 45 | 46 | result = sorted(list(set(id_list))) 47 | return result 48 | 49 | 50 | def main(args): 51 | keyword = args.target 52 | metadata_path = args.metadata 53 | meta_df = pd.read_csv(metadata_path) 54 | meta_df['tmp'] = meta_df['image_no'].str.split(".").str[0] 55 | 56 | results = query(keyword) 57 | # print(f' "{keyword}" exist in: {results}') 58 | 59 | tmp_df = meta_df[meta_df.tmp.isin(results)] 60 | 61 | print(f'"{keyword}" exist in:') 62 | for index, row in tmp_df.iterrows(): 63 | print(f'{row.tmp} \t {row.title}') 64 | 65 | 66 | if __name__ == '__main__': 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--target', type=str, default='east', help='') 69 | parser.add_argument('--metadata', type=str, default='/home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv', help='') 70 | 71 | args = parser.parse_args() 72 | print(args) 73 | 74 | main(args) 75 | -------------------------------------------------------------------------------- /m0_preprocessing/convert_sid_to_jpg.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import time 4 | import multiprocessing 5 | 6 | sid_dir = '/data/rumsey-sid' 7 | sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 8 | num_process = 20 9 | if_print_command = True 10 | 11 | sid_list = glob.glob(os.path.join(sid_dir, '*/*.sid')) 12 | 13 | def execute_command(command, if_print_command): 14 | t1 = time.time() 15 | 16 | if if_print_command: 17 | print(command) 18 | os.system(command) 19 | 20 | t2 = time.time() 21 | time_usage = t2 - t1 22 | return time_usage 23 | 24 | 25 | def conversion(img_path): 26 | mrsiddecode_executable="/home/zekun/dr_maps/mapkurator-system/m1_geotiff/MrSID_DSDK-9.5.4.4709-rhel6.x86-64.gcc531/Raster_DSDK/bin/mrsiddecode" 27 | map_name = os.path.basename(img_path)[:-4] 28 | 29 | redirected_path = os.path.join(sid_to_jpg_dir, map_name + '.jpg') 30 | 31 | run_sid_to_jpg_command = mrsiddecode_executable + ' 
-quiet -i '+ img_path + ' -o '+redirected_path 32 | time_usage = execute_command(run_sid_to_jpg_command, if_print_command) 33 | 34 | 35 | 36 | if __name__ == "__main__": 37 | pool = multiprocessing.Pool(num_process) 38 | start_time = time.perf_counter() 39 | processes = [pool.apply_async(conversion, args=(sid_path,)) for sid_path in sid_list] 40 | result = [p.get() for p in processes] 41 | finish_time = time.perf_counter() 42 | print(f"Program finished in {finish_time-start_time} seconds") 43 | 44 | -------------------------------------------------------------------------------- /m1_geotiff/convert_image_to_geotiff.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import pandas as pd 4 | import ast 5 | import argparse 6 | import logging 7 | import pdb 8 | 9 | logging.basicConfig(level=logging.INFO) 10 | 11 | def func_file_to_fullpath_dict(file_path_list): 12 | 13 | file_fullpath_dict = dict() 14 | for file_path in file_path_list: 15 | file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path 16 | 17 | return file_fullpath_dict 18 | 19 | def main(args): 20 | 21 | jp2_root_dir = args.jp2_root_dir 22 | sid_root_dir = args.sid_root_dir 23 | additional_root_dir = args.additional_root_dir 24 | out_geotiff_dir = args.out_geotiff_dir 25 | 26 | sample_map_path = args.sample_map_path 27 | external_id_key = args.external_id_key 28 | 29 | jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2')) 30 | sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly 31 | add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*')) 32 | 33 | jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list) 34 | sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list) 35 | add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list) 36 | 37 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 38 | 39 | 40 | for index, record in sample_map_df.iterrows(): 41 | external_id = record.external_id 42 | transform_method = record.transformation_method 43 | gcps = record.gcps 44 | filename_without_extension = external_id.strip("'").replace('.','') 45 | 46 | full_path = '' 47 | if filename_without_extension in jp2_file_fullpath_dict: 48 | full_path = jp2_file_fullpath_dict[filename_without_extension] 49 | elif filename_without_extension in sid_file_fullpath_dict: 50 | full_path = sid_file_fullpath_dict[filename_without_extension] 51 | elif filename_without_extension in add_file_fullpath_dict: 52 | full_path = add_file_fullpath_dict[filename_without_extension] 53 | else: 54 | print('image with external_id not found in image_dir:', external_id) 55 | continue 56 | assert (len(full_path)!=0) 57 | 58 | gcps = ast.literal_eval(gcps) 59 | 60 | gcp_str = '' 61 | for gcp in gcps: 62 | lng, lat = gcp['location'] 63 | x, y = gcp['pixel'] 64 | gcp_str += '-gcp '+str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' ' 65 | 66 | # gdal_translate to add GCP to raw image 67 | gdal_command = 'gdal_translate -of Gtiff '+gcp_str + full_path + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' 68 | print(gdal_command) 69 | os.system(gdal_command) 70 | 71 | 72 | assert transform_method in ['affine','polynomial','tps'] 73 | 74 | # reprojection with gdal_warp 75 | if transform_method == 'affine': 76 | # first order 77 | 78 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 1 -of GTiff ' + 
os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 79 | 80 | elif transform_method == 'polynomial': 81 | # second order 82 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -order 2 -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 83 | 84 | elif transform_method == 'tps': 85 | # Thin plate spline #debug/11558008.geotiff #10057000.geotiff 86 | warp_command = 'gdalwarp -s_srs EPSG:4326 -t_srs EPSG:3857 -r near -tps -of GTiff '+ os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff' + ' ' + os.path.join(out_geotiff_dir, filename_without_extension) + '.geotiff' 87 | 88 | else: 89 | raise NotImplementedError 90 | print(warp_command) 91 | os.system(warp_command) 92 | # remove temporary tiff file 93 | # os.system('rm ' + os.path.join(out_geotiff_dir, filename_without_extension) + '_temp.geotiff') 94 | 95 | 96 | logging.info('Done generating geotiff for %s', external_id) 97 | 98 | 99 | if __name__ == '__main__': 100 | 101 | parser = argparse.ArgumentParser() 102 | parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/', 103 | help='image dir of jp2 files.') 104 | parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/', 105 | help='image dir of sid files.') 106 | parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/', 107 | help='image dir of additional luna files.') 108 | parser.add_argument('--out_geotiff_dir', type=str, default='data/geotiff/', 109 | help='output dir for geotiff') 110 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 111 | help='path to sample map csv, which contains gcps info') 112 | parser.add_argument('--external_id_key', type=str, default='external_id', 113 | help='key string for external id, could be external_id or ListNo') 114 | 115 | args = parser.parse_args() 116 | print(args) 117 | 118 | 119 | main(args) 120 | -------------------------------------------------------------------------------- /m2_detection_recognition/crop_img.py: -------------------------------------------------------------------------------- 1 | import sys 2 | import os 3 | from PIL import Image, ImageFile 4 | import numpy as np 5 | import argparse 6 | import logging 7 | 8 | logging.basicConfig(level=logging.INFO) 9 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 10 | 11 | #add this one line and import ImageFile above 12 | ImageFile.LOAD_TRUNCATED_IMAGES = True 13 | 14 | def main(args): 15 | 16 | img_path = args.img_path 17 | output_dir = args.output_dir 18 | 19 | map_name = os.path.basename(img_path).split('.')[0] # get the map name without extension 20 | output_dir = os.path.join(output_dir, map_name) 21 | 22 | if not os.path.isdir(output_dir): 23 | os.makedirs(output_dir) 24 | 25 | map_img = Image.open(img_path) 26 | width, height = map_img.size 27 | 28 | #print(width, height) 29 | 30 | shift_size = 1000 31 | 32 | # pad the image to the size divisible by shift-size 33 | num_tiles_w = int(np.ceil(1. * width / shift_size)) 34 | num_tiles_h = int(np.ceil(1. 
* height / shift_size)) 35 | enlarged_width = int(shift_size * num_tiles_w) 36 | enlarged_height = int(shift_size * num_tiles_h) 37 | 38 | enlarged_map = Image.new(mode="RGB", size=(enlarged_width, enlarged_height)) 39 | # paste map_imge to enlarged_map 40 | enlarged_map.paste(map_img) 41 | 42 | for idx in range(0, num_tiles_h): 43 | for jdx in range(0, num_tiles_w): 44 | img_clip = enlarged_map.crop((jdx * shift_size, idx * shift_size,(jdx + 1) * shift_size, (idx + 1) * shift_size, )) 45 | 46 | out_path = os.path.join(output_dir, 'h' + str(idx) + '_w' + str(jdx) + '.jpg') 47 | img_clip.save(out_path) 48 | 49 | logging.info('Done cropping %s' %img_path ) 50 | 51 | 52 | if __name__ == '__main__': 53 | 54 | parser = argparse.ArgumentParser() 55 | parser.add_argument('--img_path', type=str, default='../data/100_maps/8628000.jp2', 56 | help='path to image file.') 57 | parser.add_argument('--output_dir', type=str, default='../data/100_maps_crop/', 58 | help='path to output dir') 59 | 60 | args = parser.parse_args() 61 | print(args) 62 | 63 | 64 | # if not os.path.isdir(args.output_dir): 65 | # os.makedirs(args.output_dir) 66 | # print('created dir',args.output_dir) 67 | 68 | main(args) 69 | -------------------------------------------------------------------------------- /m3_image_geojson/stitch_output.py: -------------------------------------------------------------------------------- 1 | import os 2 | import glob 3 | import pandas as pd 4 | import numpy as np 5 | import argparse 6 | from geojson import Polygon, Feature, FeatureCollection, dump 7 | import logging 8 | import pdb 9 | 10 | logging.basicConfig(level=logging.INFO) 11 | pd.options.mode.chained_assignment = None 12 | 13 | def concatenate_and_convert_to_geojson(args): 14 | map_subdir = args.input_dir 15 | output_geojson = args.output_geojson 16 | shift_size = args.shift_size 17 | eval_bool = args.eval_only 18 | 19 | file_list = glob.glob(map_subdir + '/*.json') 20 | file_list = sorted(file_list) 21 | if len(file_list) == 0: 22 | logging.warning('No files found for %s' % map_subdir) 23 | 24 | map_data = [] 25 | for file_path in file_list: 26 | patch_index_h, patch_index_w = os.path.basename(file_path).split('.')[0].split('_') 27 | patch_index_h = int(patch_index_h[1:]) 28 | patch_index_w = int(patch_index_w[1:]) 29 | try: 30 | df = pd.read_json(file_path) 31 | except pd.errors.EmptyDataError: 32 | logging.warning('%s is empty. Skipping.' % file_path) 33 | 34 | 35 | for index, line_data in df.iterrows(): 36 | df['polygon_x'][index] = np.array(df['polygon_x'][index]) + shift_size * patch_index_w 37 | df['polygon_y'][index] = np.array(df['polygon_y'][index]) + shift_size * patch_index_h 38 | map_data.append(df) 39 | 40 | map_df = pd.concat(map_data) 41 | 42 | features = [] 43 | for index, line_data in map_df.iterrows(): 44 | polygon_x, polygon_y = list(line_data['polygon_x']), list(line_data['polygon_y']) 45 | 46 | if eval_bool == False: 47 | # y is kept to be positive. 
Needs to be negative for QGIS visualization 48 | polygon = Polygon([[[x,-y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], -polygon_y[0]]]]) 49 | else: 50 | polygon = Polygon([[[x,y] for x,y in zip(polygon_x, polygon_y)]+[[polygon_x[0], polygon_y[0]]]]) 51 | 52 | text = line_data['text'] 53 | score = line_data['score'] 54 | features.append(Feature(geometry = polygon, properties={"text": text, "score": score} )) 55 | 56 | feature_collection = FeatureCollection(features) 57 | # with open(os.path.join(output_dir, map_subdir +'.geojson'), 'w') as f: 58 | # dump(feature_collection, f) 59 | with open(output_geojson, 'w') as f: 60 | dump(feature_collection, f) 61 | 62 | logging.info('Done generating geojson (img coord) for %s', map_subdir) 63 | 64 | 65 | if __name__ == '__main__': 66 | 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--input_dir', type=str, default='data/100_maps_crop_abc/0063014', 69 | help='path to input json path.') 70 | 71 | parser.add_argument('--output_geojson', type=str, default='data/100_maps_geojson_abc/0063014.geojson', 72 | help='path to output geojson path') 73 | 74 | parser.add_argument('--shift_size', type=int, default = 1000, 75 | help='image patch size and shift size.') 76 | 77 | # This can not be of string type. Otherwise it will be interpreted to True all the time. 78 | parser.add_argument('--eval_only', default = False, action='store_true', 79 | help='keep positive coordinate') 80 | 81 | args = parser.parse_args() 82 | print(args) 83 | 84 | concatenate_and_convert_to_geojson(args) 85 | 86 | 87 | 88 | 89 | 90 | -------------------------------------------------------------------------------- /m4_geocoordinate_converter/convert_geojson_to_geocoord.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import logging 4 | import ast 5 | 6 | import pandas as pd 7 | import numpy as np 8 | import geojson 9 | 10 | logging.basicConfig(level=logging.INFO) 11 | 12 | 13 | def main(args): 14 | geojson_file = args.in_geojson_file 15 | output_dir = args.out_geojson_dir 16 | 17 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str}) 18 | sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '', regex=True) 19 | geojson_filename_id = geojson_file.split(".")[0].split("/")[-1] 20 | 21 | row = sample_map_df[sample_map_df['external_id'] == geojson_filename_id] 22 | if not row.empty: 23 | gcps = ast.literal_eval(row.iloc[0]['gcps']) 24 | gcp_str = '' 25 | for gcp in gcps: 26 | lng, lat = gcp['location'] 27 | x, y = gcp['pixel'] 28 | gcp_str += '-gcp ' + str(x) + ' ' + str(y) + ' ' + str(lng) + ' ' + str(lat) + ' ' 29 | 30 | transform_method = row.iloc[0]['transformation_method'] 31 | assert transform_method in ['affine', 'polynomial', 'tps'] 32 | 33 | output = '"' + output_dir + geojson_filename_id + '.geojson"' 34 | input = '"' + geojson_file + '"' 35 | 36 | if transform_method == 'affine': 37 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 1 ' + gcp_str 38 | 39 | elif transform_method == 'polynomial': 40 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -order 2 ' + gcp_str 41 | 42 | elif transform_method == 'tps': 43 | gecoord_convert_command = 'ogr2ogr -f "GeoJSON" ' + output + " " + input + ' -tps ' + gcp_str 44 | 45 | else: 46 | raise NotImplementedError 47 | 48 | ret_value = os.system(gecoord_convert_command) 49 | if ret_value != 0: 50 | logging.info('Failed 
generating geocoord geojson for %s', geojson_file) 51 | else: 52 | with open(geojson_file) as img_geojson, open(output_dir + geojson_filename_id + '.geojson', 53 | 'r+') as geocoord_geojson: 54 | img_data = geojson.load(img_geojson) 55 | geocoord_data = geojson.load(geocoord_geojson) 56 | for img_feature, geocoord_feature in zip(img_data['features'], geocoord_data['features']): 57 | geocoord_feature['properties']['img_coordinates'] = np.array(img_feature['geometry']['coordinates'], 58 | dtype=np.int32).reshape(-1, 2).tolist() 59 | 60 | with open(output_dir + geojson_filename_id + '.geojson', 'w') as geocoord_geojson: 61 | geojson.dump(geocoord_data, geocoord_geojson) 62 | 63 | logging.info('Done generating geocoord geojson for %s', geojson_file) 64 | 65 | 66 | if __name__ == '__main__': 67 | parser = argparse.ArgumentParser() 68 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 69 | help='path to sample map csv, which contains gcps info') 70 | parser.add_argument('--in_geojson_file', type=str, 71 | help='input geojson file; results of M2') 72 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/', 73 | help='output dir for converted geojson files') 74 | 75 | args = parser.parse_args() 76 | 77 | main(args) -------------------------------------------------------------------------------- /m5_entity_linker/entity_linker.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import ast 4 | from dotenv import load_dotenv 5 | 6 | import pandas as pd 7 | import numpy as np 8 | 9 | import geojson 10 | 11 | import sqlalchemy 12 | from sqlalchemy import create_engine 13 | 14 | import geocoder 15 | from shapely.geometry import Polygon 16 | 17 | import re 18 | 19 | load_dotenv() 20 | 21 | DB_HOST = os.getenv("DB_HOST") 22 | DB_USERNAME = os.getenv("DB_USERNAME") 23 | DB_PASSWORD = os.getenv("DB_PASSWORD") 24 | DB_NAME = os.getenv("DB_NAME") 25 | 26 | connection_string = f'postgresql://postgres:{DB_USERNAME}:{DB_PASSWORD}@{DB_HOST}:5432/{DB_NAME}' 27 | 28 | 29 | def main(args): 30 | 31 | # check if first pair of gcps is in midwest-US 32 | regex = re.compile('[^a-zA-Z]') 33 | conn = create_engine(connection_string, echo=False) 34 | sample_map_df = pd.read_csv(args.sample_map_path, dtype={'external_id': str}) 35 | sample_map_df['external_id'] = sample_map_df['external_id'].str.strip("'").str.replace('.', '') 36 | midwest = ["Illinois", "Missouri", "Kansas", "Iowa", "South Dakota", "Indiana", "Ohio", "Wisconsin", "Minnesota", "Michigan"] 37 | 38 | geojson_files = os.listdir(args.in_geojson_dir) 39 | for i, geojson_file in enumerate(geojson_files): 40 | row = sample_map_df[sample_map_df['external_id']==geojson_file.split(".")[0]] 41 | gcps = ast.literal_eval(row.iloc[0]['gcps']) 42 | geocode = geocoder.osm(gcps[0]['location'][::-1], method='reverse') 43 | 44 | if geocode.state in midwest: 45 | with open(args.in_geojson_dir+geojson_file) as f: 46 | data = geojson.load(f) 47 | for feature_data in data['features']: 48 | pts = np.array(feature_data['geometry']['coordinates']).reshape(-1, 2) 49 | map_polygon = Polygon(pts) 50 | map_text = str(feature_data['properties']['text']).lower() 51 | map_text = regex.sub(' ', map_text) # remove all non-alphabetic characters 52 | 53 | query = f"""SELECT p.ogc_fid 54 | FROM polygon_features p 55 | WHERE LOWER(p.name) LIKE '%%{map_text}%%' 56 | AND 
ST_INTERSECTS(ST_TRANSFORM(ST_SetSRID(ST_MakeValid('{map_polygon}'::geometry), 4326)::geometry, 4326), p.wkb_geometry); 57 | """ 58 | 59 | try: 60 | intersect_df = pd.read_sql(query, con=conn) 61 | except sqlalchemy.exc.InternalError: 62 | continue 63 | 64 | if not intersect_df.empty: 65 | feature_data['properties']['osm_ogc_fid'] = intersect_df['ogc_fid'].values.tolist() 66 | # else: 67 | # feature_data['properties']['osm_ogc_fid'] = [] 68 | 69 | with open(args.out_geojson_dir+geojson_file, 'w') as output_geojson: 70 | geojson.dump(data, output_geojson) 71 | 72 | 73 | if __name__ == '__main__': 74 | parser = argparse.ArgumentParser() 75 | parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv', 76 | help='path to sample map csv, which contains gcps info') 77 | parser.add_argument('--in_geojson_dir', type=str, default='data/100_maps_geojson_abc_geocoord/', 78 | help='input dir for results of M2') 79 | parser.add_argument('--out_geojson_dir', type=str, default='data/100_maps_geojson_abc_linked/', 80 | help='output dir for converted geojson files') 81 | 82 | args = parser.parse_args() 83 | 84 | main(args) 85 | -------------------------------------------------------------------------------- /m6_post_ocr/lexical_search.py: -------------------------------------------------------------------------------- 1 | #-*-coding utf-8-*- 2 | import logging 3 | import requests 4 | import json 5 | import argparse 6 | import http.client as http_client 7 | import nltk 8 | import re 9 | import glob 10 | import os 11 | 12 | # set the debug level 13 | http_client.HTTPConnection.debuglevel = 1 14 | logging.basicConfig(level=logging.INFO) 15 | warnings.filterwarnings("ignore") 16 | 17 | headers = { 18 | 'Content-Type': 'application/json', 19 | } 20 | 21 | def query(args): 22 | """ Query candidates and save them as 'postocr_label' """ 23 | 24 | input_dir = args.in_geojson_dir 25 | output_geojson = args.out_geojson_dir 26 | 27 | map_name_output = input_dir.split('/')[-1] 28 | 29 | with open(input_dir) as json_file: 30 | json_df = json.load(json_file) 31 | 32 | if json_df != {}: 33 | query_result = [] 34 | for i in range(len(json_df["features"])): 35 | target_text = json_df['features'][i]["properties"]["text"] 36 | target_pts = json_df['features'][i]["geometry"]["coordinates"] 37 | 38 | clean_txt = [] 39 | if type(target_text) == str: 40 | for t in range(len(target_text)): 41 | txt = target_text[t] 42 | if txt.isalpha(): 43 | clean_txt.append(txt) 44 | 45 | temp_label = ''.join([str(item) for item in clean_txt]) 46 | if len(temp_label) != 0: 47 | target_text = temp_label 48 | 49 | process = re.findall('[A-Z][^A-Z]*', target_text) 50 | if all(c.isupper() for c in process) or len(process) == 1: 51 | 52 | if type(target_text) == str and any(c.isalpha() for c in target_text): 53 | # edist 0 54 | inputs = target_text.lower() 55 | q1 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "0"}}}}' 56 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 57 | data=q1.encode("utf-8"), \ 58 | headers = headers) 59 | resp_json = json.loads(resp.text) 60 | test = resp_json["hits"]["hits"] 61 | 62 | edist = [] 63 | edist_update = [] 64 | 65 | edd_min_find = 0 66 | min_candidates = False 67 | 68 | if test != 'NaN': 69 | for tt in range(len(test)): 70 | if 'name' in test[tt]['_source']: 71 | candidate = test[tt]['_source']['name'] 72 | edist.append(candidate) 73 | 74 | for e in range(len(edist)): 75 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 76 | 77 | 
if edd == 0: 78 | edist_update.append(edist[e]) 79 | min_candidates = edist[e] 80 | edd_min_find = 1 81 | 82 | # edd 1 83 | if edd_min_find != 1: 84 | # edist 1 85 | q2 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "1"}}}}' 86 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 87 | data=q2.encode("utf-8"), \ 88 | headers = headers) 89 | resp_json = json.loads(resp.text) 90 | test = resp_json["hits"]["hits"] 91 | 92 | edist = [] 93 | edist_count = [] 94 | edist_update = [] 95 | edist_count_update = [] 96 | 97 | if test != 'NaN': 98 | for tt in range(len(test)): 99 | if 'name' in test[tt]['_source']: 100 | candidate = test[tt]['_source']['message'] 101 | cand = candidate.split(',')[0] 102 | count = candidate.split(',')[1] 103 | edist.append(cand) 104 | edist_count.append(count) 105 | 106 | for e in range(len(edist)): 107 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 108 | 109 | if edd == 1: 110 | edist_update.append(edist[e]) 111 | edist_count_update.append(edist_count[e]) 112 | 113 | if len(edist_update) != 0: 114 | index = edist_count_update.index(max(edist_count_update)) 115 | min_candidates = edist_update[index] 116 | edd_min_find = 1 117 | 118 | # edd 2 119 | if edd_min_find != 1: 120 | # edist 2 121 | q3 = '{"query": {"fuzzy": {"name": {"value": "'+ inputs +'", "fuzziness": "2"}}}}' 122 | resp = requests.get(f'http://localhost:9200/osm-voca/_search?', \ 123 | data=q3.encode("utf-8"), \ 124 | headers = headers) 125 | resp_json = json.loads(resp.text) 126 | test = resp_json["hits"]["hits"] 127 | 128 | edist = [] 129 | edist_count = [] 130 | edist_update = [] 131 | edist_count_update = [] 132 | 133 | if test != 'NaN': 134 | for tt in range(len(test)): 135 | if 'name' in test[tt]['_source']: 136 | candidate = test[tt]['_source']['message'] 137 | cand = candidate.split(',')[0] 138 | count = candidate.split(',')[1] 139 | edist.append(cand) 140 | edist_count.append(count) 141 | 142 | for e in range(len(edist)): 143 | edd = nltk.edit_distance(inputs.upper(), edist[e].upper()) 144 | 145 | if edd == 2: 146 | edist_update.append(edist[e]) 147 | edist_count_update.append(edist_count[e]) 148 | 149 | if len(edist_update) != 0: 150 | index = edist_count_update.index(max(edist_count_update)) 151 | min_candidates = edist_update[index] 152 | edd_min_find = 1 153 | 154 | if edd_min_find != 1: 155 | min_candidates = False 156 | 157 | 158 | if min_candidates != False: 159 | json_df['features'][i]["properties"]["postocr_label"] = str(min_candidates) 160 | else: 161 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 162 | 163 | else: # added 164 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 165 | 166 | else: 167 | # only numeric pred_text 168 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 169 | 170 | else: 171 | json_df['features'][i]["properties"]["postocr_label"] = str(target_text) 172 | 173 | # Save 174 | with open(output_geojson, 'w') as json_file: 175 | json.dump(json_df, json_file, ensure_ascii=False) 176 | 177 | logging.info('Done generating post-OCR geojson for %s', map_name_output) 178 | 179 | 180 | def main(args): 181 | query(args) 182 | 183 | 184 | if __name__ == '__main__': 185 | 186 | 187 | parser = argparse.ArgumentParser() 188 | parser.add_argument('--in_geojson_dir', type=str, default='/data2/rumsey_output/test2/', 189 | help='input dir for post-OCR module (= the output of M4) /crop_MN/output_stitch/') 190 | parser.add_argument('--out_geojson_dir', 
type=str, default='/data2/rumsey_output/out/', 191 | help='post-OCR result') 192 | 193 | args = parser.parse_args() 194 | print(args) 195 | 196 | # if not os.path.isdir(args.out_geojson_dir): 197 | # os.makedirs(args.out_geojson_dir) 198 | # print('created dir',args.out_geojson_dir) 199 | 200 | main(args) -------------------------------------------------------------------------------- /m_sanborn/s1_geocoding.py: -------------------------------------------------------------------------------- 1 | import os 2 | import argparse 3 | import geojson 4 | import geocoder 5 | import json 6 | import time 7 | import pdb 8 | 9 | 10 | def arcgic_geocoding(place_name, maxRows = 5): 11 | try: 12 | response = geocoder.arcgis(place_name,maxRows=maxRows) 13 | return response.json 14 | except exception as e: 15 | print(e) 16 | return -1 17 | 18 | 19 | def google_geocoding(place_name, api_key = None, maxRows = 5): 20 | try: 21 | response = geocoder.google(place_name, key=api_key, maxRows = maxRows) 22 | return response.json 23 | except exception as e: 24 | print(e) 25 | return -1 26 | 27 | def osm_geocoding(place_name, maxRows = 5): 28 | try: 29 | response = geocoder.osm(place_name, maxRows = maxRows) 30 | return response.json 31 | except exception as e: 32 | print(e) 33 | return -1 34 | 35 | 36 | def geonames_geocoding(place_name, user_name = None, maxRows = 5): 37 | try: 38 | response = geocoder.geonames(place_name, key = user_name, maxRows=maxRows) 39 | # hourly limit of 1000 credits 40 | time.sleep(4) 41 | return response.json 42 | except exception as e: 43 | print(e) 44 | return -1 45 | 46 | 47 | def geocoding(args): 48 | output_folder = args.output_folder 49 | input_map_geojson_path = args.input_map_geojson_path 50 | api_key = args.api_key 51 | user_name = args.user_name 52 | geocoder_option = args.geocoder_option 53 | max_results = args.max_results 54 | suffix = args.suffix 55 | 56 | with open(input_map_geojson_path, 'r') as f: 57 | data = geojson.load(f) 58 | 59 | map_name = os.path.basename(input_map_geojson_path).split('.')[0] 60 | output_folder = os.path.join(output_folder, geocoder_option) 61 | 62 | if not os.path.isdir(output_folder): 63 | os.makedirs(output_folder) 64 | 65 | output_path = os.path.join(output_folder, map_name) + '.json' 66 | 67 | with open(output_path, 'w') as f: 68 | pass # flush output file 69 | 70 | features = data['features'] 71 | for feature in features: # iterate through all the detected text labels 72 | geometry = feature['geometry'] 73 | text = feature['properties']['text'] 74 | score = feature['properties']['score'] 75 | 76 | # suffix = ', Los Angeles' 77 | text = str(text) + suffix 78 | 79 | print(text) 80 | 81 | if geocoder_option == 'arcgis': 82 | results = arcgic_geocoding(text, maxRows = max_results) 83 | elif geocoder_option == 'google': 84 | results = google_geocoding(text, api_key = api_key, maxRows = max_results) 85 | elif geocoder_option == 'geonames': 86 | results = geonames_geocoding(text, user_name = user_name, maxRows = max_results) 87 | elif geocoder_option == 'osm': 88 | results = osm_geocoding(text, maxRows = max_results) 89 | else: 90 | raise NotImplementedError 91 | 92 | if results == -1: 93 | # geocoder can not find match 94 | pass 95 | else: 96 | # save results 97 | with open(output_path, 'a') as f: 98 | json.dump({'text':text, 'score':score, 'geometry': geometry, 'geocoding':results}, f) 99 | f.write('\n') 100 | 101 | # pdb.set_trace() 102 | 103 | 104 | def main(): 105 | parser = argparse.ArgumentParser() 106 | 107 | 
parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/') 108 | parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson') 109 | parser.add_argument('--api_key', type=str, default=None, help='Specify API key if needed') 110 | parser.add_argument('--user_name', type=str, default=None, help='Specify user name if needed') 111 | 112 | parser.add_argument('--suffix', type=str, default=None, help='placename suffix (e.g. city name)') 113 | 114 | parser.add_argument('--max_results', type=int, default=5, help='max number of results returend by geocoder') 115 | 116 | parser.add_argument('--geocoder_option', type=str, default='arcgis', 117 | choices=['arcgis', 'google','geonames','osm'], 118 | help='Select text spotting model option from ["arcgis","google","geonames","osm"]') # select text spotting model 119 | 120 | 121 | args = parser.parse_args() 122 | print('\n') 123 | print(args) 124 | print('\n') 125 | 126 | if not os.path.isdir(args.output_folder): 127 | os.makedirs(args.output_folder) 128 | 129 | geocoding(args) 130 | 131 | 132 | if __name__ == '__main__': 133 | 134 | main() 135 | 136 | 137 | 138 | -------------------------------------------------------------------------------- /m_sanborn/s2_clustering.py: -------------------------------------------------------------------------------- 1 | import os 2 | import json 3 | import argparse 4 | from sklearn.cluster import DBSCAN 5 | from matplotlib import pyplot as plt 6 | import geopandas as gpd 7 | import pandas as pd 8 | from bs4 import BeautifulSoup 9 | from mpl_toolkits.basemap import Basemap 10 | from pyproj import Proj, transform 11 | 12 | from shapely.geometry import Point 13 | from shapely.geometry.polygon import Polygon 14 | import numpy as np 15 | from shapely.geometry import MultiPoint 16 | from geopy.distance import great_circle 17 | 18 | 19 | county_index_dict = {'Cuyahoga County (OH)': 193, 20 | 'Fulton County (GA)': 73, 21 | 'Kern County (CA)': 2872, 22 | 'Lancaster County (NE)': 1629, 23 | 'Los Angeles County (CA)': 44, 24 | 'Mexico': -1, 25 | 'Nevada County (CA)': 46, 26 | 'New Orleans (LA)': -1, 27 | 'Pima County (AZ)': 2797, 28 | 'Placer County (CA)': 1273, 29 | 'Providence County (RI)\xa0': 1124, 30 | 'Saint Louis (MO)': -1, 31 | 'San Francisco County (CA)': 1261, 32 | 'San Joaquin County (CA)': 1213, 33 | 'Santa Clara (CA)': 48, 34 | 'Santa Cruz (CA)': 2386, 35 | 'Suffolk County (MA)': 272, 36 | 'Tulsa County (OK)': 526, 37 | 'Washington County (AK)': -1, 38 | 'Washington DC': -1} 39 | 40 | def get_centermost_point(cluster): 41 | centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y) 42 | centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m) 43 | return tuple(centermost_point) 44 | 45 | def clustering_func(lat_list, lng_list): 46 | X = [[a,b] for a,b in zip(lat_list, lng_list)] 47 | coords = np.array(X) 48 | 49 | # https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/ 50 | kms_per_radian = 6371.0088 51 | epsilon = 1.5 / kms_per_radian 52 | db = DBSCAN(eps=epsilon, min_samples=1, algorithm='ball_tree', metric='haversine').fit(np.radians(coords)) 53 | cluster_labels = db.labels_ 54 | num_clusters = len(set(cluster_labels)) 55 | clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)]) 56 | 57 | centermost_points = 
get_centermost_point(clusters[0]) 58 | return centermost_points 59 | 60 | def plot_points(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None): 61 | 62 | plt.figure(figsize=(10,6)) 63 | plt.title(title) 64 | 65 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5) 66 | if pred_lat is not None and pred_lng is not None: 67 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red') 68 | 69 | if target_lat_list is not None and target_lng_list is not None: 70 | plt.scatter(target_lng_list, target_lat_list, 10, c = 'blue') 71 | plt.show() 72 | 73 | def plot_points_basemap(lat_list, lng_list, target_lat_list=None, target_lng_list = None, pred_lat=None, pred_lng = None, title = None): 74 | 75 | plt.figure(figsize=(10,6)) 76 | plt.title(title) 77 | 78 | if len(lat_list) >0 and len(lng_list) > 0: 79 | anchor_lat, anchor_lng = lat_list[0], lng_list[0] 80 | elif target_lat_list is not None: 81 | anchor_lat, anchor_lng = target_lat_list[0], target_lng_list[0] 82 | else: 83 | anchor_lat, anchor_lng = 45, -100 84 | 85 | m = Basemap(projection='lcc', resolution=None, 86 | width=8E4, height=8E4, 87 | lat_0=anchor_lat, lon_0=anchor_lng) 88 | m.etopo(scale=0.5, alpha=0.5) 89 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True) 90 | # m.arcgisimage(service='ESRI_Imagery_World_2D',scale=0.5, alpha=0.5) 91 | # m.arcgisimage(service='ESRI_Imagery_World_2D', xpixels = 2000, verbose= True) 92 | 93 | lng_list, lat_list = m(lng_list, lat_list) # transform coordinates 94 | plt.scatter(lng_list, lat_list, marker='o', c = 'violet', alpha=0.5) 95 | 96 | 97 | if target_lat_list is not None and target_lng_list is not None: 98 | target_lng_list, target_lat_list = m(target_lng_list, target_lat_list) 99 | plt.scatter(target_lng_list, target_lat_list, marker='o', c = 'blue',edgecolor='blue') 100 | 101 | if pred_lat is not None and pred_lng is not None: 102 | pred_lng, pred_lat = m(pred_lng, pred_lat) 103 | plt.scatter(pred_lng, pred_lat, marker='o', c = 'red', edgecolor='black') 104 | 105 | plt.show() 106 | 107 | def plotting_func(loc_sanborn_dir, pred_dict, lat_lng_dict, dataset_name, geocoding_name): 108 | 109 | for map_name, pred in pred_dict.items(): 110 | 111 | title = dataset_name + '-' + geocoding_name + '-' + map_name 112 | lat_list = lat_lng_dict[map_name]['lat_list'] 113 | lng_list = lat_lng_dict[map_name]['lng_list'] 114 | 115 | if dataset_name == 'LoC_sanborn': 116 | xml_path = os.path.join(loc_sanborn_dir,map_name + '.tif.aux.xml') 117 | try: 118 | with open(xml_path) as fp: 119 | soup = BeautifulSoup(fp) 120 | 121 | target_gcp_list = soup.findAll("metadata")[1].targetgcps.findAll("double") 122 | except Exception as e: 123 | print(xml_path) 124 | continue 125 | 126 | xy_list = [] 127 | for target_gcp in target_gcp_list: 128 | xy_list.append(float(target_gcp.contents[0])) 129 | 130 | x_list = xy_list[0::2] 131 | y_list = xy_list[1::2] 132 | 133 | lng2_list, lat2_list = [],[] 134 | for x1,y1 in zip(x_list, y_list): 135 | x2,y2 = transform(inProj,outProj,x1,y1) 136 | #print (x2,y2) 137 | lng2_list.append(x2) 138 | lat2_list.append(y2) 139 | 140 | plot_points(lat_list, lng_list, lat2_list, lng2_list, pred_lat = pred[0], pred_lng = pred[1], title=title) 141 | else: 142 | plot_points(lat_list, lng_list,pred_lat = pred[0], pred_lng = pred[1], title=title) 143 | 144 | 145 | def clustering(args): 146 | dataset_name = args.dataset_name 147 | geocoding_name = args.geocoding_name 148 | remove_duplicate_location = 
args.remove_duplicate_location 149 | visualize = args.visualize 150 | 151 | sanborn_output_dir = '/data2/sanborn_maps_output' 152 | 153 | input_dir=os.path.join(sanborn_output_dir, dataset_name, 'geocoding_suffix_testr', geocoding_name) 154 | if remove_duplicate_location: 155 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr_removeduplicate', geocoding_name) 156 | else: 157 | output_dir = os.path.join(sanborn_output_dir, dataset_name, 'clustering_testr', geocoding_name) 158 | 159 | county_boundary_path = '/home/zekun/Sanborn/cb_2018_us_county_500k/cb_2018_us_county_500k.shp' 160 | 161 | if not os.path.isdir(output_dir): 162 | os.makedirs(output_dir) 163 | 164 | inProj = Proj(init='epsg:3857') 165 | outProj = Proj(init='epsg:4326') 166 | 167 | county_boundary_df = gpd.read_file(county_boundary_path) 168 | 169 | if dataset_name == 'LoC_sanborn': 170 | loc_sanborn_dir = '/data2/sanborn_maps/Sanborn100_Georef/' # for comparing with GT 171 | metadata_tsv_path = '/home/zekun/Sanborn/Sheet_List.tsv' 172 | meta_df = pd.read_csv(metadata_tsv_path, sep='\t') 173 | 174 | file_list = os.listdir(input_dir) 175 | 176 | pred_dict = dict() 177 | lat_lng_dict = dict() 178 | for file_path in file_list: 179 | 180 | map_name = os.path.basename(file_path).split('.')[0] 181 | if dataset_name == 'LoC_sanborn': 182 | county_name = meta_df[meta_df['filename'] == map_name]['County'].values[0] 183 | elif dataset_name == 'LA_sanborn' or 'two_more': 184 | county_name = 'Los Angeles County (CA)' 185 | else: 186 | raise NotImplementedError 187 | 188 | index = county_index_dict[county_name] 189 | if index >= 0: 190 | poly_geometry = county_boundary_df.iloc[index].geometry 191 | 192 | with open(os.path.join(input_dir,file_path), 'r') as f: 193 | data = f.readlines() 194 | 195 | lat_list = [] 196 | lng_list = [] 197 | for line in data: 198 | 199 | line_dict = json.loads(line) 200 | geocoding_dict = line_dict['geocoding'] 201 | text = line_dict['text'] 202 | score = line_dict['score'] 203 | geometry = line_dict['geometry'] 204 | 205 | if geocoding_dict is None: 206 | continue # if no geolocation returned by geocoder, then skip 207 | 208 | if 'lat' not in geocoding_dict or 'lng' not in geocoding_dict: 209 | #print(geocoding_dict) 210 | continue 211 | 212 | lat = float(geocoding_dict['lat']) 213 | lng = float(geocoding_dict['lng']) 214 | 215 | point = Point(lng, lat) 216 | 217 | if index >= 0: 218 | if point.within(poly_geometry): # geocoding point within county boundary 219 | lat_list.append(lat) 220 | lng_list.append(lng) 221 | else: 222 | pass 223 | else: # cluster based on all results 224 | lat_list.append(lat) 225 | lng_list.append(lng) 226 | 227 | if remove_duplicate_location: 228 | lat_list = list(set(lat_list)) 229 | lng_list = list(set(lng_list)) 230 | 231 | if len(lat_list) >0 and len(lng_list) > 0: 232 | pred = clustering_func(lat_list, lng_list) 233 | # print(pred) 234 | else: 235 | print('No data to cluster') 236 | 237 | print(map_name, pred) 238 | pred_dict[map_name] = pred 239 | lat_lng_dict[map_name]={'lat_list':lat_list, 'lng_list':lng_list} 240 | 241 | if visualize: 242 | plotting_func(loc_sanborn_dir = loc_sanborn_dir, pred_dict = pred_dict, lat_lng_dict = lat_lng_dict, 243 | dataset_name = dataset_name, geocoding_name = geocoding_name) 244 | 245 | with open(os.path.join(output_dir, 'pred_center.json'),'w') as f: 246 | json.dump(pred_dict, f) 247 | 248 | 249 | def main(): 250 | parser = argparse.ArgumentParser() 251 | 252 | parser.add_argument('--dataset_name', type=str, 
default=None, 253 | choices=['LA_sanborn', 'LoC_sanborn',], 254 | help='dataset name, same as expt_name') 255 | parser.add_argument('--geocoding_name', type=str, default=None, 256 | choices=['google','arcgis','geonames','osm'], 257 | help='geocoder name') 258 | parser.add_argument('--visualize', default = False, action = 'store_true') # Enable this when in notebook 259 | parser.add_argument('--remove_duplicate_location', default=False, action='store_true') # whether remove duplicate geolocations for clustering 260 | 261 | # parser.add_argument('--output_folder', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geocoding/') 262 | # parser.add_argument('--input_map_geojson_path', type=str, default='/data2/sanborn_maps_output/LA_sanborn/geojson_testr/service-gmd-gmd436m-g4364m-g4364lm-g4364lm_g00656189401-00656_01_1894-0001l.geojson') 263 | 264 | 265 | args = parser.parse_args() 266 | print('\n') 267 | print(args) 268 | print('\n') 269 | 270 | clustering(args) 271 | 272 | 273 | if __name__ == '__main__': 274 | 275 | main() 276 | -------------------------------------------------------------------------------- /m_sanborn/s3_gen_geojson.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/m_sanborn/s3_gen_geojson.py -------------------------------------------------------------------------------- /metadata/davidrumsey/davidrumsey.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import pandas as pd 3 | 4 | 5 | class DavidRumsey: 6 | 7 | csv_filename="davidrumsey_metadata.csv" 8 | df = pd.read_csv(csv_filename) 9 | 10 | def __init__(self, api_key): 11 | self.api_key = api_key 12 | self.headers = { 13 | 'Authorization': self.api_key, 14 | 'Content-Type': 'application/json', 15 | 'charset': 'utf-8' 16 | } 17 | 18 | def get_ground_control_points(self, external_id): 19 | """ 20 | Get ground control points of map image via Oldmapsonline API. 21 | Args: 22 | external_id: str 23 | Returns: 24 | transform_method: str 25 | Transformation method 26 | e.g., "affine", "polynomial", "tps" 27 | gcps: list 28 | All pairs of ground control points 29 | e.g., [{'location': [-118.269356489, 34.063140276], 'pixel': [5629, 5064]},{'location': , 'pixel': }, ... ] 30 | """ 31 | 32 | # 404 ERROR on many maps 33 | # 1. GET /maps/external/{external_id} 34 | # baseurl = "https://api.oldmapsonline.org/1.0/maps/external/" + external_id 35 | # res = requests.get(baseurl, headers=self.headers) 36 | map_id = self.df[self.df['external_id']==external_id]['id'] 37 | 38 | # 2. 
GET /maps/{id}/georeferences 39 | baseurl = "https://api.oldmapsonline.org/1.0/maps/" + map_id + "/georeferences" 40 | res = requests.get(baseurl, headers=self.headers) 41 | 42 | try: 43 | res.raise_for_status() 44 | except requests.exceptions.HTTPError as e: 45 | print(e) 46 | return None 47 | 48 | data = res.json() 49 | if not data['items']: 50 | return None 51 | else: 52 | transform_method = data['items'][0]['transformation_method'] 53 | gcps = data['items'][0]['gcps'] 54 | return transform_method, gcps 55 | -------------------------------------------------------------------------------- /metadata/sanborn.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/metadata/sanborn.py -------------------------------------------------------------------------------- /model_card_template: -------------------------------------------------------------------------------- 1 | --- 2 | license: cc-by-nc-2.0 3 | language: 4 | - en 5 | tags: 6 | - text spotting 7 | - scene text detection 8 | - maps 9 | - cultural heritage 10 | --- 11 | # Model Card for Model ID 12 | 13 | 14 | 15 | 16 | ## Model Details 17 | 18 | ### Model Description 19 | 20 | 21 | 22 | 23 | 24 | - **Developed by:** Knowledge Computing Lab, University of Minnesota: Leeje Jang, Jina Kim, Zekun Li, Yijun Lin, Min Namgung, Yao-Yi Chiang 25 | - **Shared by:** Machines Reading Maps 26 | - **Model type:** text spotter 27 | - **Language(s):** English 28 | - **License:** CC-BY-NC 2.0 29 | 30 | ### Model Sources [optional] 31 | 32 | 33 | 34 | - **Repository:** https://github.com/knowledge-computing/mapkurator-spotter 35 | - **Paper [optional]:** [More Information Needed] 36 | - **Documentation:** https://knowledge-computing.github.io/mapkurator-doc/#/ 37 | 38 | ## Uses 39 | 40 | 41 | 42 | ### Direct Use 43 | 44 | 45 | 46 | The model detects and recognizes text on images. It was trained specifically to identify text on a wide range of historical maps with many styles printed between ca. 1500-2000 provided by the David Rumsey Map Collection. 47 | This version of the model was trained with an English language model. 48 | 49 | 50 | ### Downstream Use 51 | 52 | 53 | Using this model for new experiments will require attention to the style and language of text on images, including (possibly) the creation of new, synthetic or other training data. 54 | 55 | 56 | ### Out-of-Scope Use 57 | 58 | 59 | 60 | 61 | ## Bias, Risks, and Limitations 62 | 63 | 64 | This model will struggle to return high quality results for maps with complex fonts, low contrast images, complex background colors and textures, and non-English language words. 65 | 66 | [More Information Needed] 67 | 68 | ### Recommendations 69 | 70 | 71 | 72 | Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. 73 | 74 | ## How to Get Started with the Model 75 | 76 | Please refer to the mapKurator documentation for details: https://knowledge-computing.github.io/mapkurator-doc/#/ 77 | 78 | ## Training Details 79 | 80 | ### Training Data 81 | 82 | 83 | 84 | Synthetic training datasets: 85 | 1. SynthText: 40k text-free background images from COCO and use them to generate synthetic text images (see the left image). Code: https://github.com/ankush-me/SynthText; Dataset: TBD. 86 | 2. 
SynMap: "patches" of synthetic maps that mimic the text (e.g., font, spacing, orientation) and background styles in the real historical maps (see the right image). Code: TBD; Dataset: TBD. 87 | 88 | 89 | ## Citation [optional] 90 | 91 | 92 | 93 | **BibTeX:** 94 | 95 | [More Information Needed] 96 | 97 | **APA:** 98 | 99 | [More Information Needed] 100 | 101 | 102 | 103 | ## Model Card Authors 104 | 105 | Yijun Lin, Katherine McDonough, Valeria Vitale 106 | 107 | ## Model Card Contact 108 | 109 | Yijun Lin, lin00786 at umn.edu 110 | 111 | 112 | -------------------------------------------------------------------------------- /pipe_run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_get_dimension --module_cropping 4 | 5 | python run.py --sample_map_csv_path //home/maplord/maplist_csv/luna_omo_metadata_2508.csv --expt_name Rerun_2_2508 --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn 6 | 7 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_img_geojson 8 | 9 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_geocoord_geojson 10 | 11 | python3 run.py --sample_map_csv_path='/home/maplord/maplist_csv/luna_omo_metadata_2508.csv' --expt_name='Rerun_2_2508' --module_post_ocr 12 | -------------------------------------------------------------------------------- /pipe_run_img.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_get_dimension --module_cropping 4 | 5 | python run_img.py --sample_map_csv_path /data2/rumsey_output/sample_sb/data/ --expt_name sample_sb_opt --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --output_folder /data2/rumsey_output/ --spotter_expt_name testr_syn 6 | 7 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_img_geojson 8 | 9 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_geocoord_geojson 10 | 11 | python3 run_img.py --sample_map_csv_path='/data2/rumsey_output/sample_sb/data/' --expt_name='sample_sb_opt' --module_post_ocr 12 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/machines-reading-maps/mapkurator-system/fc3bf223a6806bd3d5beaa3bc1e4640449caa482/requirements.txt -------------------------------------------------------------------------------- /run.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | 
14 | 15 | 16 | logging.basicConfig(level=logging.INFO) 17 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 18 | 19 | # def execute_command(command, if_print_command): 20 | # t1 = time.time() 21 | 22 | # if if_print_command: 23 | # print(command) 24 | # os.system(command) 25 | 26 | # t2 = time.time() 27 | # time_usage = t2 - t1 28 | # return time_usage 29 | 30 | def execute_command(command, if_print_command): 31 | t1 = time.time() 32 | 33 | if if_print_command: 34 | print(command) 35 | 36 | try: 37 | subprocess.run(command, shell=True, check=True, capture_output=True) #stderr=subprocess.STDOUT) 38 | t2 = time.time() 39 | time_usage = t2 - t1 40 | return {'time_usage':time_usage} 41 | except subprocess.CalledProcessError as err: 42 | error = err.stderr.decode('utf8') 43 | # format error message to one line 44 | error = error.replace('\n','\t') 45 | error = error.replace(',',';') 46 | return {'error': error} 47 | 48 | 49 | def get_img_dimension(img_path): 50 | map_img = Image.open(img_path) 51 | width, height = map_img.size 52 | 53 | return width, height 54 | 55 | 56 | def run_pipeline(args): 57 | # ------------------------- Pass arguments ----------------------------------------- 58 | map_kurator_system_dir = args.map_kurator_system_dir 59 | text_spotting_model_dir = args.text_spotting_model_dir 60 | sample_map_path = args.sample_map_csv_path 61 | expt_name = args.expt_name 62 | output_folder = args.output_folder 63 | 64 | module_get_dimension = args.module_get_dimension 65 | module_gen_geotiff = args.module_gen_geotiff 66 | module_cropping = args.module_cropping 67 | module_text_spotting = args.module_text_spotting 68 | module_img_geojson = args.module_img_geojson 69 | module_geocoord_geojson = args.module_geocoord_geojson 70 | module_entity_linking = args.module_entity_linking 71 | module_post_ocr = args.module_post_ocr 72 | 73 | spotter_model = args.spotter_model 74 | spotter_config = args.spotter_config 75 | spotter_expt_name = args.spotter_expt_name 76 | gpu_id = args.gpu_id 77 | 78 | if_print_command = args.print_command 79 | 80 | 81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 82 | 83 | # ------------------------- Read sample map list and prepare output dir ---------------- 84 | input_csv_path = sample_map_path 85 | if input_csv_path[-4:] == '.csv': 86 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 87 | elif input_csv_path[-4:] == '.tsv': 88 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 89 | else: 90 | raise NotImplementedError 91 | 92 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 93 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 94 | 95 | # initialize error reason dict 96 | error_reason_dict = dict() 97 | for ex_id in unmatched_external_id_list: 98 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 99 | 100 | # initialize time_usage_dict (must be defined: the modules below record per-map timings in it) 101 | time_usage_dict = dict() 102 | for ex_id in sample_map_df['external_id']: 103 | time_usage_dict[ex_id] = {} 104 | 105 | expt_out_dir = os.path.join(output_folder, expt_name) 106 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 107 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 108 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 109 | stitch_output_dir = 
os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 110 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 111 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 112 | 113 | if not os.path.isdir(expt_out_dir): 114 | os.makedirs(expt_out_dir) 115 | 116 | # ------------------------ Get image dimension ------------------------------ 117 | if module_get_dimension: 118 | for index, record in sample_map_df.iterrows(): 119 | external_id = record.external_id 120 | # pdb.set_trace() 121 | if external_id not in external_id_to_img_path_dict: 122 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 123 | continue 124 | 125 | img_path = external_id_to_img_path_dict[external_id] 126 | map_name = os.path.basename(img_path).split('.')[0] 127 | 128 | try: 129 | width, height = get_img_dimension(img_path) 130 | except Exception as e: 131 | error_reason_dict[external_id] = {'img_path':img_path, 'error': str(e) } 132 | continue # skip this map: width/height are undefined when the image cannot be opened 133 | time_usage_dict[external_id]['img_w'] = width 134 | time_usage_dict[external_id]['img_h'] = height 135 | 136 | 137 | # ------------------------- Generate geotiff ------------------------------ 138 | time_start = time.time() 139 | if module_gen_geotiff: 140 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 141 | 142 | if not os.path.isdir(geotiff_output_dir): 143 | os.makedirs(geotiff_output_dir) 144 | 145 | # use converted jpg folder instead of original sid folder 146 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 147 | exe_ret = execute_command(run_geotiff_command, if_print_command) 148 | if 'error' in exe_ret: 149 | error = exe_ret['error'] 150 | elif 'time_usage' in exe_ret: 151 | time_usage = exe_ret['time_usage'] 152 | 153 | logging.info('GeoTIFF generation for the whole map list took %.1f seconds', time_usage) # batch-level step, so no per-map external_id applies here 154 | 155 | 156 | time_geotiff = time.time() 157 | 158 | 159 | # ------------------------- Image cropping ------------------------------ 160 | if module_cropping: 161 | for index, record in sample_map_df.iterrows(): 162 | external_id = record.external_id 163 | 164 | if external_id not in external_id_to_img_path_dict: 165 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 166 | continue 167 | 168 | img_path = external_id_to_img_path_dict[external_id] 169 | map_name = os.path.basename(img_path).split('.')[0] 170 | 171 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 172 | if not os.path.isdir(cropping_output_dir): 173 | os.makedirs(cropping_output_dir) 174 | 175 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 176 | 177 | exe_ret = execute_command(run_crop_command, if_print_command) 178 | 179 | if 'error' in exe_ret: 180 | error = exe_ret['error'] 181 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 182 | elif 'time_usage' in exe_ret: 183 | time_usage = exe_ret['time_usage'] 184 | time_usage_dict[external_id]['cropping'] = time_usage 185 | else: 186 | raise NotImplementedError 187 | 188 | 189 | time_cropping = time.time() 190 | 191 | # ------------------------- Text Spotting (patch level) ------------------------------ 192 | if module_text_spotting: 193 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 
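Every module in this script shells out through `execute_command()`, whose contract is: a dict with `time_usage` on success, or a dict carrying a one-line `error` string on failure. A self-contained sketch of that contract, runnable on its own (the two demo commands assume a POSIX shell):

```python
import subprocess
import time

def execute_command(command, if_print_command=False):
    """Run a shell command; return {'time_usage': s} on success, {'error': msg} on failure."""
    t1 = time.time()
    if if_print_command:
        print(command)
    try:
        subprocess.run(command, shell=True, check=True, capture_output=True)
        return {'time_usage': time.time() - t1}
    except subprocess.CalledProcessError as err:
        # flatten stderr to one line so it fits a single cell of the error-reason CSV
        return {'error': err.stderr.decode('utf8').replace('\n', '\t').replace(',', ';')}

print(execute_command('true'))          # -> {'time_usage': ...}
print(execute_command('ls /no/such'))   # -> {'error': '...'}
```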
194 | os.chdir(text_spotting_model_dir) 195 | 196 | for index, record in sample_map_df.iterrows(): 197 | 198 | external_id = record.external_id 199 | if external_id not in external_id_to_img_path_dict: 200 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 201 | continue 202 | 203 | img_path = external_id_to_img_path_dict[external_id] 204 | map_name = os.path.basename(img_path).split('.')[0] 205 | 206 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 207 | if not os.path.isdir(map_spotting_output_dir): 208 | os.makedirs(map_spotting_output_dir) 209 | 210 | if spotter_model == 'abcnet': 211 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 212 | elif spotter_model == 'testr': 213 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 214 | elif spotter_model == 'spotter_v2': 215 | run_spotting_command = f'CUDA_VISIBLE_DEVICES={gpu_id} python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 216 | print(run_spotting_command) 217 | else: 218 | raise NotImplementedError 219 | 220 | run_spotting_command += ' 1> /dev/null' 221 | 222 | 223 | 224 | exe_ret = execute_command(run_spotting_command, if_print_command) 225 | if 'error' in exe_ret: 226 | error = exe_ret['error'] 227 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 228 | # elif 'time_usage' in exe_ret: 229 | # time_usage = exe_ret['time_usage'] 230 | # time_usage_dict[external_id]['spotting'] = time_usage 231 | # else: 232 | # raise NotImplementedError 233 | 234 | logging.info('Done text spotting for %s', map_name) 235 | time_text_spotting = time.time() 236 | 237 | 238 | # ------------------------- Image coord geojson (map level) ------------------------------ 239 | if module_img_geojson: 240 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 241 | 242 | if not os.path.isdir(stitch_output_dir): 243 | os.makedirs(stitch_output_dir) 244 | 245 | for index, record in sample_map_df.iterrows(): 246 | external_id = record.external_id 247 | if external_id not in external_id_to_img_path_dict: 248 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 249 | continue 250 | 251 | img_path = external_id_to_img_path_dict[external_id] 252 | map_name = os.path.basename(img_path).split('.')[0] 253 | 254 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 255 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 256 | 257 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 258 | 259 | exe_ret = execute_command(run_stitch_command, if_print_command) 260 | 261 | if 'error' in exe_ret: 262 | error = exe_ret['error'] 263 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 264 | elif 'time_usage' in exe_ret: 265 | time_usage = exe_ret['time_usage'] 266 | time_usage_dict[external_id]['stitch'] = time_usage 267 | else: 268 | raise NotImplementedError 269 | 270 | time_img_geojson = time.time() 271 | 272 | # ------------------------- post-OCR ------------------------------ 273 | if module_post_ocr: 274 | 
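The post-OCR block below delegates to `lexical_search.py`, which (per the README) retrieves spelling candidates with Elasticsearch's fuzzy query over OpenStreetMap place names. A hedged sketch of that kind of lookup, in the elasticsearch-py 7.x call style; the index name `osm_names`, the field `name`, and the local endpoint are assumptions for illustration, not the module's actual configuration:

```python
from elasticsearch import Elasticsearch

def fuzzy_candidates(es, word, index='osm_names', field='name', size=10):
    # 'AUTO' lets Elasticsearch pick the allowed edit distance from the word length
    query = {'query': {'fuzzy': {field: {'value': word, 'fuzziness': 'AUTO'}}}}
    res = es.search(index=index, body=query, size=size)
    return [hit['_source'][field] for hit in res['hits']['hits']]

es = Elasticsearch('http://localhost:9200')  # assumed local instance
print(fuzzy_candidates(es, 'Mineapolis'))    # e.g. -> ['Minneapolis', ...]
```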
275 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 276 | 277 | if not os.path.isdir(postocr_output_dir): 278 | os.makedirs(postocr_output_dir) 279 | 280 | for index, record in sample_map_df.iterrows(): 281 | 282 | external_id = record.external_id 283 | if external_id not in external_id_to_img_path_dict: 284 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 285 | continue 286 | 287 | img_path = external_id_to_img_path_dict[external_id] 288 | map_name = os.path.basename(img_path).split('.')[0] 289 | 290 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 291 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 292 | 293 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 294 | 295 | exe_ret = execute_command(run_postocr_command, if_print_command) 296 | 297 | if 'error' in exe_ret: 298 | error = exe_ret['error'] 299 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 300 | elif 'time_usage' in exe_ret: 301 | time_usage = exe_ret['time_usage'] 302 | time_usage_dict[external_id]['postocr'] = time_usage 303 | else: 304 | raise NotImplementedError 305 | 306 | time_post_ocr = time.time() 307 | 308 | 309 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 310 | if module_geocoord_geojson: 311 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 312 | 313 | if not os.path.isdir(geojson_output_dir): 314 | os.makedirs(geojson_output_dir) 315 | 316 | for index, record in sample_map_df.iterrows(): 317 | external_id = record.external_id 318 | if external_id not in external_id_to_img_path_dict: 319 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 320 | continue 321 | 322 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 323 | 324 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 325 | 326 | exe_ret = execute_command(run_converter_command, if_print_command) 327 | 328 | if 'error' in exe_ret: 329 | error = exe_ret['error'] 330 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 331 | elif 'time_usage' in exe_ret: 332 | time_usage = exe_ret['time_usage'] 333 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage 334 | else: 335 | raise NotImplementedError 336 | 337 | time_geocoord_geojson = time.time() 338 | 339 | # ------------------------- Link entities in OSM ------------------------------ 340 | if module_entity_linking: 341 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 342 | 343 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 344 | if not os.path.isdir(geojson_output_dir): 345 | os.makedirs(geojson_output_dir) 346 | 347 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 348 | execute_command(run_linker_command, if_print_command) 349 | 350 | time_entity_linking = time.time() 351 | 352 | 353 | # 
--------------------- Time usage logging -------------------------- 354 | # print('\n') 355 | # logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 356 | # logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 357 | # logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 358 | # logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 359 | # logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_img_geojson) 360 | # logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 361 | # logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 362 | 363 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 364 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 365 | 366 | # # check if exist time_usage log file 367 | # if os.path.isfile(time_usage_log_path): 368 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 369 | # # if exist duplicate columns, ret time usage values to the latest run 370 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns) 371 | 372 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 373 | 374 | # # make sure time_usage_expt_name.csv always have the latest time usage 375 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 376 | # m_time = os.path.getmtime(time_usage_log_path) 377 | # dt_m = datetime.datetime.fromtimestamp(m_time) 378 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S") 379 | 380 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 381 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 382 | # execute_command(run_command, if_print_command) 383 | 384 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 385 | 386 | # --------------------- Error logging -------------------------- 387 | print('\n') 388 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 389 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 390 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 391 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 392 | 393 | 394 | def main(): 395 | parser = argparse.ArgumentParser() 396 | 397 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 398 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 399 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 400 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output 401 | parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix 402 | 403 | parser.add_argument('--module_get_dimension', default=False, action='store_true') 404 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true') 405 | parser.add_argument('--module_cropping', default=False, action='store_true') 406 | parser.add_argument('--module_text_spotting', default=False, action='store_true') 407 | parser.add_argument('--module_img_geojson', default=False, 
action='store_true') 408 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') 409 | parser.add_argument('--module_entity_linking', default=False, action='store_true') 410 | parser.add_argument('--module_post_ocr', default=False, action='store_true') 411 | 412 | parser.add_argument('--spotter_model', type=str, default='spotter_v2', choices=['abcnet', 'testr', 'spotter_v2'], 413 | help='Select text spotting model option from ["abcnet", "testr", "spotter_v2"]') # select text spotting model 414 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml', 415 | help='Path to the config file for text spotting model') 416 | parser.add_argument('--spotter_expt_name', type=str, default='exp', 417 | help='Name of spotter experiment, if empty using config file name') 418 | # python run.py --text_spotting_model_dir /home/maplord/rumsey/testr_v2/TESTR/ 419 | # --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_splits/luna_omo_metadata_56628_20220724.csv 420 | # --expt_name 57k_maps_r2 --module_text_spotting 421 | # --spotter_model spotter_v2 --spotter_config /home/maplord/rumsey/testr_v2/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap 422 | 423 | parser.add_argument('--print_command', default=False, action='store_true') 424 | parser.add_argument('--gpu_id', type=int, default=0) 425 | 426 | 427 | args = parser.parse_args() 428 | print('\n') 429 | print(args) 430 | print('\n') 431 | 432 | run_pipeline(args) 433 | 434 | 435 | 436 | if __name__ == '__main__': 437 | 438 | main() 439 | 440 | 441 | -------------------------------------------------------------------------------- /run_img.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | 14 | 15 | # This script handles the case where the input is a folder of images (rather than a metadata CSV).
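Unlike run.py, this script does not read a metadata CSV; it builds its work list directly from the folder contents, storing each image path in the `external_id` column. A compact sketch of that step; the extension filter is an added assumption (the script itself takes every file in the folder):

```python
import os
import pandas as pd

input_img_path = '/data2/rumsey_output/sample_sb/data/'  # placeholder folder
exts = ('.jpg', '.jpeg', '.png', '.tif', '.tiff')
rows = [{'external_id': os.path.join(input_img_path, f)}
        for f in sorted(os.listdir(input_img_path)) if f.lower().endswith(exts)]
sample_map_df = pd.DataFrame(rows, columns=['external_id'])
print(sample_map_df.head())
```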
16 | # tested image: /home/maplord/rumsey/mapkurator-system/data/100_maps_crop/crop_leeje_2/test_run_img/ 17 | logging.basicConfig(level=logging.INFO) 18 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 19 | 20 | # def execute_command(command, if_print_command): 21 | # t1 = time.time() 22 | 23 | # if if_print_command: 24 | # print(command) 25 | # os.system(command) 26 | 27 | # t2 = time.time() 28 | # time_usage = t2 - t1 29 | # return time_usage 30 | 31 | def execute_command(command, if_print_command): 32 | t1 = time.time() 33 | 34 | if if_print_command: 35 | print(command) 36 | 37 | try: 38 | subprocess.run(command, shell=True, check=True, capture_output=True) #stderr=subprocess.STDOUT) 39 | t2 = time.time() 40 | time_usage = t2 - t1 41 | return {'time_usage':time_usage} 42 | except subprocess.CalledProcessError as err: 43 | error = err.stderr.decode('utf8') 44 | # format error message to one line 45 | error = error.replace('\n','\t') 46 | error = error.replace(',',';') 47 | return {'error': error} 48 | 49 | 50 | def get_img_dimension(img_path): 51 | map_img = Image.open(img_path) 52 | width, height = map_img.size 53 | 54 | return width, height 55 | 56 | 57 | def run_pipeline(args): 58 | # ------------------------- Pass arguments ----------------------------------------- 59 | map_kurator_system_dir = args.map_kurator_system_dir 60 | text_spotting_model_dir = args.text_spotting_model_dir 61 | sample_map_path = args.sample_map_csv_path 62 | expt_name = args.expt_name 63 | output_folder = args.output_folder 64 | 65 | module_get_dimension = args.module_get_dimension 66 | module_gen_geotiff = args.module_gen_geotiff 67 | module_cropping = args.module_cropping 68 | module_text_spotting = args.module_text_spotting 69 | module_img_geojson = args.module_img_geojson 70 | module_geocoord_geojson = args.module_geocoord_geojson 71 | module_entity_linking = args.module_entity_linking 72 | module_post_ocr = args.module_post_ocr 73 | 74 | spotter_model = args.spotter_model 75 | spotter_config = args.spotter_config 76 | spotter_expt_name = args.spotter_expt_name 77 | 78 | if_print_command = args.print_command 79 | 80 | 81 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 82 | 83 | # ------------------------- Read sample map list and prepare output dir ---------------- 84 | 85 | 86 | # if input_csv_path[-4:] == '.csv': 87 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 88 | # elif input_csv_path[-4:] == '.tsv': 89 | # sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 90 | # else: 91 | # raise NotImplementedError 92 | 93 | input_img_path = sample_map_path 94 | sample_map_df = pd.DataFrame(columns = ["external_id"]) 95 | for images in os.listdir(input_img_path): 96 | tmp_path = {"external_id": os.path.join(input_img_path, images)} 97 | sample_map_df = pd.concat([sample_map_df, pd.DataFrame([tmp_path])], ignore_index=True) # DataFrame.append was removed in pandas 2.0 98 | 99 | # ------------------------- Read image and prepare output dir ---------------- 100 | 101 | 102 | # # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 103 | # external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 104 | 105 | # # initialize error reason dict 106 | # error_reason_dict = dict() 107 | # for ex_id in unmatched_external_id_list: 108 | # error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 109 | 110 | # initialize time_usage_dict 111 | # time_usage_dict = dict() 112 | # for 
ex_id in sample_map_df['external_id']: 113 | # time_usage_dict[ex_id] = {} 114 | 115 | expt_out_dir = os.path.join(output_folder, expt_name) 116 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 117 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 118 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 119 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 120 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 121 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 122 | 123 | 124 | 125 | if not os.path.isdir(expt_out_dir): 126 | os.makedirs(expt_out_dir) 127 | 128 | # ------------------------ Get image dimension ------------------------------ 129 | if module_get_dimension: 130 | for index, record in sample_map_df.iterrows(): 131 | external_id = record.external_id 132 | # pdb.set_trace() 133 | # if external_id not in external_id_to_img_path_dict: 134 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 135 | # continue 136 | 137 | img_path = sample_map_df['external_id'].iloc[index] 138 | # print("img_path",img_path) 139 | map_name = os.path.basename(img_path).split('.')[0] 140 | # print("map_name",map_name) 141 | width, height = get_img_dimension(img_path) 142 | 143 | 144 | # time_usage_dict[external_id]['img_w'] = width 145 | # time_usage_dict[external_id]['img_h'] = height 146 | 147 | 148 | # ------------------------- Generate geotiff ------------------------------ 149 | time_start = time.time() 150 | if module_gen_geotiff: 151 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 152 | 153 | if not os.path.isdir(geotiff_output_dir): 154 | os.makedirs(geotiff_output_dir) 155 | 156 | # use converted jpg folder instead of original sid folder 157 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_img_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 158 | exe_ret = execute_command(run_geotiff_command, if_print_command) 159 | if 'error' in exe_ret: 160 | error = exe_ret['error'] 161 | elif 'time_usage' in exe_ret: 162 | time_usage = exe_ret['time_usage'] 163 | 164 | # time_usage_dict[external_id]['geotiff'] = time_usage 165 | 166 | 167 | time_geotiff = time.time() 168 | 169 | 170 | # ------------------------- Image cropping ------------------------------ 171 | if module_cropping: 172 | for index, record in sample_map_df.iterrows(): 173 | external_id = record.external_id 174 | 175 | # if external_id not in external_id_to_img_path_dict: 176 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 177 | # continue 178 | 179 | img_path = sample_map_df['external_id'].iloc[index] 180 | map_name = os.path.basename(img_path).split('.')[0] 181 | 182 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 183 | if not os.path.isdir(cropping_output_dir): 184 | os.makedirs(cropping_output_dir) 185 | 186 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 187 | 188 | exe_ret = execute_command(run_crop_command, if_print_command) 189 | 190 | # if 'error' in exe_ret: 191 | # error = exe_ret['error'] 192 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 193 | # if 'time_usage' in 
exe_ret: 194 | # time_usage = exe_ret['time_usage'] 195 | # time_usage_dict[external_id]['cropping'] = time_usage 196 | # else: 197 | # raise NotImplementedError 198 | 199 | 200 | time_cropping = time.time() 201 | 202 | # ------------------------- Text Spotting (patch level) ------------------------------ 203 | if module_text_spotting: 204 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 205 | os.chdir(text_spotting_model_dir) 206 | 207 | for index, record in sample_map_df.iterrows(): 208 | 209 | external_id = record.external_id 210 | # if external_id not in external_id_to_img_path_dict: 211 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 212 | # continue 213 | 214 | img_path = sample_map_df['external_id'].iloc[index] 215 | map_name = os.path.basename(img_path).split('.')[0] 216 | 217 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 218 | if not os.path.isdir(map_spotting_output_dir): 219 | os.makedirs(map_spotting_output_dir) 220 | 221 | print(os.path.join(cropping_output_dir,map_name)) 222 | if spotter_model == 'abcnet': 223 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 224 | elif spotter_model == 'testr': 225 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.TRANSFORMER.INFERENCE_TH_TEST 0.3' 226 | # print(run_spotting_command) 227 | else: 228 | raise NotImplementedError 229 | 230 | run_spotting_command += ' 1> /dev/null' 231 | 232 | exe_ret = execute_command(run_spotting_command, if_print_command) 233 | 234 | # if 'error' in exe_ret: 235 | # error = exe_ret['error'] 236 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 237 | # elif 'time_usage' in exe_ret: 238 | # time_usage = exe_ret['time_usage'] 239 | # time_usage_dict[external_id]['spotting'] = time_usage 240 | # else: 241 | # raise NotImplementedError 242 | 243 | logging.info('Done text spotting for %s', map_name) 244 | time_text_spotting = time.time() 245 | 246 | 247 | # ------------------------- Image coord geojson (map level) ------------------------------ 248 | if module_img_geojson: 249 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 250 | 251 | if not os.path.isdir(stitch_output_dir): 252 | os.makedirs(stitch_output_dir) 253 | 254 | for index, record in sample_map_df.iterrows(): 255 | external_id = record.external_id 256 | # if external_id not in external_id_to_img_path_dict: 257 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 258 | # continue 259 | 260 | img_path = sample_map_df['external_id'].iloc[index] 261 | map_name = os.path.basename(img_path).split('.')[0] 262 | 263 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 264 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 265 | 266 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 267 | 268 | 269 | exe_ret = execute_command(run_stitch_command, if_print_command) 270 | 271 | # if 'error' in exe_ret: 272 | # error = exe_ret['error'] 273 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 274 | # elif 'time_usage' in exe_ret: 275 | # 
time_usage = exe_ret['time_usage'] 276 | # time_usage_dict[external_id]['stitch'] = time_usage 277 | # else: 278 | # raise NotImplementedError 279 | 280 | time_img_geojson = time.time() 281 | 282 | # ------------------------- post-OCR ------------------------------ 283 | if module_post_ocr: 284 | 285 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 286 | 287 | if not os.path.isdir(postocr_output_dir): 288 | os.makedirs(postocr_output_dir) 289 | 290 | for index, record in sample_map_df.iterrows(): 291 | 292 | external_id = record.external_id 293 | # if external_id not in external_id_to_img_path_dict: 294 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 295 | # continue 296 | 297 | img_path = sample_map_df['external_id'].iloc[index] 298 | map_name = os.path.basename(img_path).split('.')[0] 299 | 300 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 301 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 302 | print('input_geojson_file',input_geojson_file) 303 | print('geojson_postocr_output_file',geojson_postocr_output_file) 304 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 305 | 306 | exe_ret = execute_command(run_postocr_command, if_print_command) 307 | print('exe_ret',exe_ret) 308 | # if 'error' in exe_ret: 309 | # error = exe_ret['error'] 310 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 311 | # elif 'time_usage' in exe_ret: 312 | # time_usage = exe_ret['time_usage'] 313 | # time_usage_dict[external_id]['postocr'] = time_usage 314 | # else: 315 | # raise NotImplementedError 316 | 317 | time_post_ocr = time.time() 318 | 319 | 320 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 321 | if module_geocoord_geojson: 322 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 323 | 324 | if not os.path.isdir(geojson_output_dir): 325 | os.makedirs(geojson_output_dir) 326 | 327 | for index, record in sample_map_df.iterrows(): 328 | external_id = record.external_id 329 | # if external_id not in external_id_to_img_path_dict: 330 | # error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 331 | # continue 332 | 333 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 334 | 335 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_img_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 336 | 337 | exe_ret = execute_command(run_converter_command, if_print_command) 338 | 339 | # if 'error' in exe_ret: 340 | # error = exe_ret['error'] 341 | # error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 342 | # elif 'time_usage' in exe_ret: 343 | # time_usage = exe_ret['time_usage'] 344 | # time_usage_dict[external_id]['geocoord_geojson'] = time_usage 345 | # else: 346 | # raise NotImplementedError 347 | 348 | time_geocoord_geojson = time.time() 349 | 350 | # ------------------------- Link entities in OSM ------------------------------ 351 | if module_entity_linking: 352 | os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 353 | 354 | geojson_linked_output_dir = 
os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 355 | if not os.path.isdir(geojson_output_dir): 356 | os.makedirs(geojson_output_dir) 357 | 358 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_img_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 359 | execute_command(run_linker_command, if_print_command) 360 | 361 | time_entity_linking = time.time() 362 | 363 | 364 | # --------------------- Time usage logging -------------------------- 365 | print('\n') 366 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 367 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 368 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 369 | logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 370 | logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_img_geojson) 371 | logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 372 | logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 373 | 374 | # time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 375 | # time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 376 | 377 | # check if exist time_usage log file 378 | # if os.path.isfile(time_usage_log_path): 379 | # existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 380 | # # if exist duplicate columns, ret time usage values to the latest run 381 | # cols_to_use = existing_df.columns.difference(time_usage_df.columns) 382 | 383 | # time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 384 | 385 | # # make sure time_usage_expt_name.csv always have the latest time usage 386 | # # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 387 | # m_time = os.path.getmtime(time_usage_log_path) 388 | # dt_m = datetime.datetime.fromtimestamp(m_time) 389 | # timestr = dt_m.strftime("%Y%m%d-%H%M%S") 390 | 391 | # deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 392 | # run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 393 | # execute_command(run_command, if_print_command) 394 | 395 | # time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 396 | 397 | # --------------------- Error logging -------------------------- 398 | # print('\n') 399 | # current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 400 | # error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 401 | # error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 402 | # error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 403 | 404 | 405 | def main(): 406 | parser = argparse.ArgumentParser() 407 | 408 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 409 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 410 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 411 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') # Original: /data2/rumsey_output 412 | 
parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix 413 | 414 | parser.add_argument('--module_get_dimension', default=False, action='store_true') 415 | parser.add_argument('--module_gen_geotiff', default=False, action='store_true') 416 | parser.add_argument('--module_cropping', default=False, action='store_true') 417 | parser.add_argument('--module_text_spotting', default=False, action='store_true') 418 | parser.add_argument('--module_img_geojson', default=False, action='store_true') 419 | parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') 420 | parser.add_argument('--module_entity_linking', default=False, action='store_true') 421 | parser.add_argument('--module_post_ocr', default=False, action='store_true') 422 | 423 | 424 | parser.add_argument('--spotter_model', type=str, default='testr', choices=['abcnet', 'testr'], 425 | help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model 426 | parser.add_argument('--spotter_config', type=str, default='/home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml', 427 | help='Path to the config file for text spotting model') 428 | parser.add_argument('--spotter_expt_name', type=str, default='testr_syn', 429 | help='Name of spotter experiment, if empty using config file name') 430 | # python run.py --sample_map_csv_path /home/maplord/maplist_csv/luna_omo_metadata_56628_20220724.csv --expt_name 57k_maps --module_text_spotting --spotter_model testr --spotter_config /home/maplord/rumsey/TESTR/configs/TESTR/SynMap/SynMap_Polygon.yaml --spotter_expt_name testr_synmap 431 | 432 | parser.add_argument('--print_command', default=False, action='store_true') 433 | 434 | 435 | args = parser.parse_args() 436 | print('\n') 437 | print(args) 438 | print('\n') 439 | 440 | run_pipeline(args) 441 | 442 | 443 | 444 | if __name__ == '__main__': 445 | 446 | main() 447 | 448 | 449 | -------------------------------------------------------------------------------- /run_leeje.py: -------------------------------------------------------------------------------- 1 | import os 2 | import subprocess 3 | import glob 4 | import argparse 5 | import time 6 | import logging 7 | import pandas as pd 8 | import pdb 9 | import datetime 10 | from PIL import Image 11 | from utils import get_img_path_from_external_id, get_img_path_from_external_id_and_image_no 12 | 13 | import subprocess 14 | 15 | ##input file : tiff and csv 16 | # read csv 17 | # change to format (tiff) 18 | # change the path 19 | # execute all module 20 | 21 | logging.basicConfig(level=logging.INFO) 22 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images 23 | 24 | # def execute_command(command, if_print_command): 25 | # t1 = time.time() 26 | 27 | # if if_print_command: 28 | # print(command) 29 | # os.system(command) 30 | 31 | # t2 = time.time() 32 | # time_usage = t2 - t1 33 | # return time_usage 34 | 35 | def execute_command(command, if_print_command): 36 | t1 = time.time() 37 | 38 | if if_print_command: 39 | print(command) 40 | 41 | try: 42 | subprocess.run(command, shell=True,check=True, capture_output = True) #stderr=subprocess.STDOUT) 43 | t2 = time.time() 44 | time_usage = t2 - t1 45 | return {'time_usage':time_usage} 46 | except subprocess.CalledProcessError as err: 47 | error = err.stderr.decode('utf8') 48 | # format error message to one line 49 | error = error.replace('\n','\t') 50 | error = error.replace(',',';') 51 | return {'error': error} 52 | 53 | 54 | def get_img_dimension(img_path): 55 | 
map_img = Image.open(img_path) 56 | width, height = map_img.size 57 | 58 | return width, height 59 | 60 | 61 | def run_pipeline(args): 62 | # ------------------------- Pass arguments ----------------------------------------- 63 | map_kurator_system_dir = args.map_kurator_system_dir 64 | text_spotting_model_dir = args.text_spotting_model_dir 65 | sample_map_path = args.sample_map_csv_path 66 | expt_name = args.expt_name 67 | output_folder = args.output_folder 68 | 69 | module_get_dimension = args.module_get_dimension 70 | module_gen_geotiff = args.module_gen_geotiff 71 | module_cropping = args.module_cropping 72 | module_text_spotting = args.module_text_spotting 73 | module_img_geojson = args.module_img_geojson 74 | module_geocoord_geojson = args.module_geocoord_geojson 75 | module_entity_linking = args.module_entity_linking 76 | module_post_ocr = args.module_post_ocr 77 | 78 | spotter_model = args.spotter_model 79 | spotter_config = args.spotter_config 80 | spotter_expt_name = args.spotter_expt_name 81 | 82 | if_print_command = args.print_command 83 | 84 | 85 | # sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/' 86 | 87 | # ------------------------- Read sample map list and prepare output dir ---------------- 88 | input_csv_path = sample_map_path 89 | if input_csv_path[-4:] == '.csv': 90 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}) 91 | elif input_csv_path[-4:] == '.tsv': 92 | sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str}, sep='\t') 93 | else: 94 | raise NotImplementedError 95 | 96 | # external_id_to_img_path_dict = get_img_path_from_external_id( sample_map_path = input_csv_path) 97 | external_id_to_img_path_dict, unmatched_external_id_list = get_img_path_from_external_id_and_image_no( sample_map_path = input_csv_path) 98 | 99 | # initialize error reason dict 100 | error_reason_dict = dict() 101 | for ex_id in unmatched_external_id_list: 102 | error_reason_dict[ex_id] = {'img_path':None, 'error':'Can not find image given external_id.'} 103 | 104 | # initialize time_usage_dict 105 | time_usage_dict = dict() 106 | for ex_id in sample_map_df['external_id']: 107 | time_usage_dict[ex_id] = {} 108 | 109 | expt_out_dir = os.path.join(output_folder, expt_name) 110 | geotiff_output_dir = os.path.join(output_folder, expt_name, 'geotiff') 111 | cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/') 112 | spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_expt_name) 113 | stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_expt_name) 114 | postocr_output_dir = os.path.join(output_folder, expt_name, 'postocr/'+ spotter_expt_name) 115 | geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_expt_name + '/') 116 | 117 | 118 | 119 | if not os.path.isdir(expt_out_dir): 120 | os.makedirs(expt_out_dir) 121 | 122 | # ------------------------ Get image dimension ------------------------------ 123 | if module_get_dimension: 124 | for index, record in sample_map_df.iterrows(): 125 | external_id = record.external_id 126 | # pdb.set_trace() 127 | if external_id not in external_id_to_img_path_dict: 128 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 129 | continue 130 | 131 | img_path = external_id_to_img_path_dict[external_id] 132 | map_name = os.path.basename(img_path).split('.')[0] 133 | 134 | try: 135 | width, height = get_img_dimension(img_path) 136 | except Exception as e: 137 | error_reason_dict[external_id] 
= {'img_path':img_path, 'error': str(e) } 138 | continue # skip this map: width/height are undefined when the image cannot be opened 139 | time_usage_dict[external_id]['img_w'] = width 140 | time_usage_dict[external_id]['img_h'] = height 141 | 142 | 143 | # ------------------------- Generate geotiff ------------------------------ 144 | time_start = time.time() 145 | if module_gen_geotiff: 146 | os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff')) 147 | 148 | if not os.path.isdir(geotiff_output_dir): 149 | os.makedirs(geotiff_output_dir) 150 | 151 | # use converted jpg folder instead of original sid folder 152 | run_geotiff_command = 'python convert_image_to_geotiff.py --sid_root_dir /data2/rumsey_sid_to_jpg/ --sample_map_path '+ input_csv_path +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse 153 | exe_ret = execute_command(run_geotiff_command, if_print_command) 154 | if 'error' in exe_ret: 155 | error = exe_ret['error'] 156 | elif 'time_usage' in exe_ret: 157 | time_usage = exe_ret['time_usage'] 158 | 159 | logging.info('GeoTIFF generation for the whole map list took %.1f seconds', time_usage) # batch-level step, so no per-map external_id applies here 160 | 161 | 162 | time_geotiff = time.time() 163 | 164 | 165 | # ------------------------- Image cropping ------------------------------ 166 | if module_cropping: 167 | for index, record in sample_map_df.iterrows(): 168 | external_id = record.external_id 169 | 170 | if external_id not in external_id_to_img_path_dict: 171 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 172 | continue 173 | 174 | img_path = external_id_to_img_path_dict[external_id] 175 | map_name = os.path.basename(img_path).split('.')[0] 176 | 177 | os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition')) 178 | if not os.path.isdir(cropping_output_dir): 179 | os.makedirs(cropping_output_dir) 180 | 181 | run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir 182 | 183 | exe_ret = execute_command(run_crop_command, if_print_command) 184 | 185 | if 'error' in exe_ret: 186 | error = exe_ret['error'] 187 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 188 | elif 'time_usage' in exe_ret: 189 | time_usage = exe_ret['time_usage'] 190 | time_usage_dict[external_id]['cropping'] = time_usage 191 | else: 192 | raise NotImplementedError 193 | 194 | 195 | time_cropping = time.time() 196 | 197 | # ------------------------- Text Spotting (patch level) ------------------------------ 198 | if module_text_spotting: 199 | assert os.path.exists(spotter_config), "Config file for spotter must exist!" 
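The spotter invoked below emits patch-level results only; `stitch_output.py` (module m3) later translates them back into full-map pixel coordinates by adding each patch's offset within the original image. A toy illustration of that translation step; the offsets here are made-up numbers, and the real offset bookkeeping lives in `crop_img.py` / `stitch_output.py`:

```python
def shift_polygon(polygon, x_off, y_off):
    """Translate a patch-level polygon [[x, y], ...] into map-level pixel coordinates."""
    return [[x + x_off, y + y_off] for x, y in polygon]

patch_poly = [[10, 12], [40, 12], [40, 30], [10, 30]]     # detection inside one 1K patch
print(shift_polygon(patch_poly, x_off=1000, y_off=2000))  # same shape, map-level pixels
```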
200 | os.chdir(text_spotting_model_dir) 201 | 202 | for index, record in sample_map_df.iterrows(): 203 | 204 | external_id = record.external_id 205 | if external_id not in external_id_to_img_path_dict: 206 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 207 | continue 208 | 209 | img_path = external_id_to_img_path_dict[external_id] 210 | map_name = os.path.basename(img_path).split('.')[0] 211 | 212 | map_spotting_output_dir = os.path.join(spotting_output_dir, map_name) 213 | if not os.path.isdir(map_spotting_output_dir): 214 | os.makedirs(map_spotting_output_dir) 215 | 216 | if spotter_model == 'abcnet': 217 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir} --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth' 218 | elif spotter_model == 'testr': 219 | run_spotting_command = f'python demo/demo.py --config-file {spotter_config} --output_json --input {os.path.join(cropping_output_dir,map_name)} --output {map_spotting_output_dir}' 220 | else: 221 | raise NotImplementedError 222 | 223 | run_spotting_command += ' 1> /dev/null' 224 | 225 | exe_ret = execute_command(run_spotting_command, if_print_command) 226 | 227 | if 'error' in exe_ret: 228 | error = exe_ret['error'] 229 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 230 | elif 'time_usage' in exe_ret: 231 | time_usage = exe_ret['time_usage'] 232 | time_usage_dict[external_id]['spotting'] = time_usage 233 | else: 234 | raise NotImplementedError 235 | 236 | logging.info('Done text spotting for %s', map_name) 237 | time_text_spotting = time.time() 238 | 239 | 240 | # ------------------------- Image coord geojson (map level) ------------------------------ 241 | if module_img_geojson: 242 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) 243 | 244 | if not os.path.isdir(stitch_output_dir): 245 | os.makedirs(stitch_output_dir) 246 | 247 | for index, record in sample_map_df.iterrows(): 248 | external_id = record.external_id 249 | if external_id not in external_id_to_img_path_dict: 250 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 251 | continue 252 | 253 | img_path = external_id_to_img_path_dict[external_id] 254 | map_name = os.path.basename(img_path).split('.')[0] 255 | 256 | stitch_input_dir = os.path.join(spotting_output_dir, map_name) 257 | output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson') 258 | 259 | run_stitch_command = 'python stitch_output.py --eval_only --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson 260 | 261 | exe_ret = execute_command(run_stitch_command, if_print_command) 262 | 263 | if 'error' in exe_ret: 264 | error = exe_ret['error'] 265 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 266 | elif 'time_usage' in exe_ret: 267 | time_usage = exe_ret['time_usage'] 268 | time_usage_dict[external_id]['stitch'] = time_usage 269 | else: 270 | raise NotImplementedError 271 | 272 | time_img_geojson = time.time() 273 | 274 | # ------------------------- post-OCR ------------------------------ 275 | if module_post_ocr: 276 | 277 | # Check if the geojson has been recorded 278 | geojson_postocr_output_dir_check = os.path.join(output_folder, '57k_maps', 'postocr/testr_syn') 279 | file_list = glob.glob(geojson_postocr_output_dir_check + '/*.geojson') 280 | file_list = sorted(file_list) 281 | 282 | existed = [] 283 | for 
file in file_list: 284 | name = file.split('/')[-1].split('.')[0] 285 | existed.append(name) 286 | ##### 287 | 288 | os.chdir(os.path.join(map_kurator_system_dir, 'm6_post_ocr')) 289 | 290 | sample_map_df2 = sample_map_df.copy() 291 | sample_map_df2['external_id_process'] = sample_map_df2['external_id'] 292 | sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.strip("'") 293 | sample_map_df2['external_id_process'] = sample_map_df2['external_id_process'].str.replace('.', '', regex=False) 294 | sample_map_df2 = sample_map_df2[~sample_map_df2['external_id_process'].isin(existed)] 295 | 296 | print(len(sample_map_df2)) 297 | print(len(existed)) 298 | 299 | ##### 300 | for index, record in sample_map_df2.iterrows(): 301 | 302 | external_id = record.external_id 303 | if external_id not in external_id_to_img_path_dict: 304 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 305 | continue 306 | 307 | img_path = external_id_to_img_path_dict[external_id] 308 | map_name = os.path.basename(img_path).split('.')[0] 309 | 310 | input_geojson_file = os.path.join(stitch_output_dir, map_name + '.geojson') 311 | geojson_postocr_output_file = os.path.join(postocr_output_dir, map_name + '.geojson') 312 | 313 | if os.path.isfile(input_geojson_file): 314 | run_postocr_command = 'python lexical_search.py --in_geojson_dir '+ input_geojson_file +' --out_geojson_dir '+ geojson_postocr_output_file 315 | exe_ret = execute_command(run_postocr_command, if_print_command) 316 | 317 | if 'error' in exe_ret: 318 | error = exe_ret['error'] 319 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 320 | elif 'time_usage' in exe_ret: 321 | time_usage = exe_ret['time_usage'] 322 | time_usage_dict[external_id]['postocr'] = time_usage 323 | else: 324 | raise NotImplementedError 325 | 326 | else: 327 | continue 328 | 329 | time_post_ocr = time.time() 330 | 331 | 332 | # ------------------------- Convert image coordinates to geocoordinates ------------------------------ 333 | if module_geocoord_geojson: 334 | os.chdir(os.path.join(map_kurator_system_dir, 'm4_geocoordinate_converter')) 335 | 336 | if not os.path.isdir(geojson_output_dir): 337 | os.makedirs(geojson_output_dir) 338 | 339 | for index, record in sample_map_df.iterrows(): 340 | external_id = record.external_id 341 | if external_id not in external_id_to_img_path_dict: 342 | error_reason_dict[external_id] = {'img_path':None, 'error':'key not in external_id_to_img_path_dict'} 343 | continue 344 | 345 | in_geojson = os.path.join(output_folder, postocr_output_dir+'/') + external_id.strip("'").replace('.', '') + ".geojson" 346 | 347 | run_converter_command = 'python convert_geojson_to_geocoord.py --sample_map_path '+ os.path.join(map_kurator_system_dir, input_csv_path) +' --in_geojson_file '+ in_geojson +' --out_geojson_dir '+ os.path.join(map_kurator_system_dir, geojson_output_dir) 348 | 349 | exe_ret = execute_command(run_converter_command, if_print_command) 350 | 351 | if 'error' in exe_ret: 352 | error = exe_ret['error'] 353 | error_reason_dict[external_id] = {'img_path':img_path, 'error': error } 354 | elif 'time_usage' in exe_ret: 355 | time_usage = exe_ret['time_usage'] 356 | time_usage_dict[external_id]['geocoord_geojson'] = time_usage 357 | else: 358 | raise NotImplementedError 359 | 360 | time_geocoord_geojson = time.time() 361 | 362 | # ------------------------- Link entities in OSM ------------------------------ 363 | if module_entity_linking: 364 | 
os.chdir(os.path.join(map_kurator_system_dir, 'm5_entity_linker')) 365 | 366 | geojson_linked_output_dir = os.path.join(map_kurator_system_dir, 'm5_entity_linker', 'data/100_maps_geojson_abc_linked/') 367 | if not os.path.isdir(geojson_output_dir): 368 | os.makedirs(geojson_output_dir) 369 | 370 | run_linker_command = 'python entity_linker.py --sample_map_path '+ input_csv_path +' --in_geojson_dir '+ geojson_output_dir +' --out_geojson_dir '+ geojson_linked_output_dir 371 | execute_command(run_linker_command, if_print_command) 372 | 373 | time_entity_linking = time.time() 374 | 375 | 376 | # --------------------- Time usage logging -------------------------- 377 | print('\n') 378 | logging.info('Time for generating geotiff: %d', time_geotiff - time_start) 379 | logging.info('Time for Cropping : %d',time_cropping - time_geotiff) 380 | logging.info('Time for text spotting : %d',time_text_spotting - time_cropping) 381 | logging.info('Time for generating geojson in img coordinate : %d',time_img_geojson - time_text_spotting) 382 | logging.info('Time for generating geojson in geo coordinate : %d',time_geocoord_geojson - time_post_ocr) 383 | logging.info('Time for entity linking : %d',time_entity_linking - time_geocoord_geojson) 384 | logging.info('Time for post OCR : %d',time_post_ocr - time_img_geojson) 385 | 386 | time_usage_df = pd.DataFrame.from_dict(time_usage_dict, orient='index') 387 | time_usage_log_path = os.path.join(output_folder, expt_name, 'time_usage.csv') 388 | 389 | # check if a time_usage log file already exists 390 | if os.path.isfile(time_usage_log_path): 391 | existing_df = pd.read_csv(time_usage_log_path, index_col='external_id', dtype={'external_id':str}) 392 | # if duplicate columns exist, keep the time usage values from the latest run 393 | cols_to_use = existing_df.columns.difference(time_usage_df.columns) 394 | 395 | time_usage_df = time_usage_df.join(existing_df[cols_to_use]) 396 | 397 | # make sure time_usage.csv always has the latest time usage 398 | # move the old time_usage.csv to time_usage[timestamp].csv where timestamp is the last expt running time 399 | m_time = os.path.getmtime(time_usage_log_path) 400 | dt_m = datetime.datetime.fromtimestamp(m_time) 401 | timestr = dt_m.strftime("%Y%m%d-%H%M%S") 402 | 403 | deprecated_path = os.path.join(output_folder, expt_name, 'time_usage_' + timestr +'.csv') 404 | run_command = 'mv ' + time_usage_log_path + ' ' + deprecated_path 405 | execute_command(run_command, if_print_command) 406 | 407 | time_usage_df.to_csv(time_usage_log_path, index_label='external_id') 408 | 409 | # --------------------- Error logging -------------------------- 410 | print('\n') 411 | current_time = datetime.datetime.now().strftime("%Y_%m_%d-%I:%M:%S_%p") 412 | error_reason_df = pd.DataFrame.from_dict(error_reason_dict, orient='index') 413 | error_reason_log_path = os.path.join(output_folder, expt_name, 'error_reason_' + current_time +'.csv') 414 | error_reason_df.to_csv(error_reason_log_path, index_label='external_id') 415 | 416 | 417 | def main(): 418 | parser = argparse.ArgumentParser() 419 | 420 | parser.add_argument('--map_kurator_system_dir', type=str, default='/home/maplord/rumsey/mapkurator-system/') 421 | parser.add_argument('--text_spotting_model_dir', type=str, default='/home/maplord/rumsey/TESTR/') 422 | parser.add_argument('--sample_map_csv_path', type=str, default='m1_geotiff/data/sample_US_jp2_100_maps.csv') # Original: sample_US_jp2_100_maps.csv 423 | parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output') 
--------------------------------------------------------------------------------
/run_only_eval.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 |
4 | map_kurator_system_dir = '/home/zekun/dr_maps/mapkurator-system/'
5 | text_spotting_model_dir = '/home/zekun/antique_names/model/AdelaiDet/'
6 | sample_map_path = 'm1_geotiff/data/sample_US_jp2_100_maps.csv'
7 |
8 | # # run module1 to generate geotiff
9 | # os.chdir(os.path.join(map_kurator_system_dir ,'m1_geotiff'))
10 | # input_csv = os.path.join(map_kurator_system_dir ,sample_map_path)
11 | geotiff_output_dir = os.path.join(map_kurator_system_dir ,'m1_geotiff/data/geotiff') # needed by the cropping step below even when module1 is skipped
12 | if not os.path.isdir(geotiff_output_dir):
13 |     os.makedirs(geotiff_output_dir)
14 |
15 | # run_geotiff_command = 'python convert_image_to_geotiff.py --sample_map_path '+ input_csv +' --out_geotiff_dir '+geotiff_output_dir # can change params in argparse
16 | # print(run_geotiff_command)
17 | # #os.system(run_geotiff_command)
18 |
19 |
20 | # run module2: image cropping
21 |
22 | geotiff_path_list = glob.glob(os.path.join(geotiff_output_dir, '*.geotiff'))
23 | assert(len(geotiff_path_list) != 0)
24 |
25 | for geotiff_path in geotiff_path_list:
26 |     os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
27 |     map_name = os.path.basename(geotiff_path).split('.')[0]
28 |
29 |     cropping_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop/')
30 |     if not os.path.isdir(cropping_output_dir):
31 |         os.makedirs(cropping_output_dir)
32 |     run_crop_command = 'python crop_img.py --img_path '+geotiff_path + ' --output_dir '+ cropping_output_dir
33 |     print(run_crop_command)
34 |     os.system(run_crop_command)
35 |
36 |     # run module2: text spotting
37 |     os.chdir(text_spotting_model_dir)
38 |     map_name = os.path.basename(geotiff_path).split('.')[0]
39 |
40 |     spotting_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/',map_name)
41 |     if not os.path.isdir(spotting_output_dir):
42 |         os.makedirs(spotting_output_dir)
43 |
44 |     run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input '+ map_kurator_system_dir+'/m2_detection_recognition/data/100_maps_crop/'+map_name+' --output '+ spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
45 |     run_spotting_command += ' 1> /dev/null'
46 |     print(run_spotting_command)
47 |     os.system(run_spotting_command)
48 |
49 |     #break
50 |
51 |
52 | # run module3: geojson stitching
53 | os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson')) # stitch_output.py lives in m3_image_geojson
54 |
55 |
56 | stitch_input_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_crop_outabc/')
57 | stitch_output_dir = os.path.join(map_kurator_system_dir, 'm2_detection_recognition', 'data/100_maps_geojson_abc/')
58 | if not os.path.isdir(stitch_output_dir):
59 |     os.makedirs(stitch_output_dir)
60 | run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_dir ' + stitch_output_dir
61 | print(run_stitch_command)
62 | os.system(run_stitch_command)
63 |
64 |
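For orientation, this eval script reads and writes everything under m2_detection_recognition/data/ (the stitch script itself sits in m3_image_geojson). The layout it assumes is roughly the following; map names are illustrative:

```
m2_detection_recognition/data/
├── 100_maps_crop/            # image patches written by crop_img.py, one subfolder per map
│   └── <map_name>/
├── 100_maps_crop_outabc/     # per-patch ABCNet spotting results from demo/demo.py
│   └── <map_name>/
└── 100_maps_geojson_abc/     # stitched map-level geojson from stitch_output.py
```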
--------------------------------------------------------------------------------
/run_sanborn.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import argparse
4 | import time
5 | import logging
6 | import pandas as pd
7 | import pdb
8 | import datetime
9 | from PIL import Image
10 | from utils import get_img_path_from_external_id
11 |
12 | logging.basicConfig(level=logging.INFO)
13 | Image.MAX_IMAGE_PIXELS=None # allow reading huge images
14 |
15 | '''
16 | This Sanborn processing pipeline shares some common modules with the DR processing pipeline, including cropping and text spotting.
17 | The unique modules are geocoding, clustering, and output geojson generation.
18 | The GeoTiff conversion, image dimension retrieval, image-to-geo coordinate conversion, and entity linking modules are removed.
19 | Time usage analysis and error reason logging are removed.
20 | '''
21 |
22 | def execute_command(command, if_print_command):
23 |     t1 = time.time()
24 |
25 |     if if_print_command:
26 |         print(command)
27 |     os.system(command)
28 |
29 |     t2 = time.time()
30 |     time_usage = t2 - t1
31 |     return time_usage
32 |
33 | def get_img_dimension(img_path):
34 |     map_img = Image.open(img_path)
35 |     width, height = map_img.size
36 |
37 |     return width, height
38 |
39 |
40 | def run_pipeline(args):
41 |     # ------------------------- Pass arguments -----------------------------------------
42 |     map_kurator_system_dir = args.map_kurator_system_dir
43 |     text_spotting_model_dir = args.text_spotting_model_dir
44 |     # sample_map_path = args.sample_map_csv_path
45 |     expt_name = args.expt_name
46 |     output_folder = args.output_folder
47 |     input_map_dir = args.input_map_dir
48 |
49 |     module_get_dimension = args.module_get_dimension
50 |     # module_gen_geotiff = args.module_gen_geotiff
51 |     module_cropping = args.module_cropping
52 |     module_text_spotting = args.module_text_spotting
53 |     module_img_geojson = args.module_img_geojson
54 |     # module_geocoord_geojson = args.module_geocoord_geojson
55 |     # module_entity_linking = args.module_entity_linking
56 |     module_geocoding = args.module_geocoding
57 |     module_clustering = args.module_clustering
58 |
59 |     spotter_option = args.spotter_option
60 |     geocoder_option = args.geocoder_option
61 |     api_key = args.api_key
62 |     user_name = args.user_name
63 |
64 |     metadata_tsv_path = args.metadata_tsv_path
65 |
66 |     if_print_command = args.print_command
67 |
68 |     sid_to_jpg_dir = '/data2/rumsey_sid_to_jpg/'
69 |
70 |     file_list = os.listdir(input_map_dir)
71 |
72 |     file_list = [f for f in file_list if os.path.basename(f).split('.')[-1] in ['sid','jp2','png','jpg','jpeg','tiff','tif','geotiff']]
73 |
74 |     print(len(file_list))
75 |
76 |
77 |
78 |     # pdb.set_trace()
79 |     # ------------------------- Read sample map list and prepare output dir ----------------
80 |
81 |     cropping_output_dir = os.path.join(output_folder, expt_name, 'crop/')
82 |     spotting_output_dir = os.path.join(output_folder, expt_name, 'spotter/' + spotter_option)
83 |     stitch_output_dir = os.path.join(output_folder, expt_name, 'stitch/' + spotter_option)
84 |     # geojson_output_dir = os.path.join(output_folder, expt_name, 'geojson_' + spotter_option + '/')
85 |     geocoding_output_dir = os.path.join(output_folder, expt_name, 'geocoding_suffix_' + spotter_option)
86 |     clustering_output_dir = os.path.join(output_folder, expt_name, 'cluster_' + spotter_option + '/')
87 |
88 |
89 |     # ------------------------- Image cropping ------------------------------
90 |     if module_cropping:
91 |         # for index, record in sample_map_df.iterrows():
92 |         for file_path in file_list:
93 |             img_path = os.path.join(input_map_dir, file_path)
94 |             print(img_path)
95 |             # external_id = record.external_id
96 |             # img_path = external_id_to_img_path_dict[external_id]
97 |
98 |             map_name = os.path.basename(img_path).split('.')[0]
99 |
100 |
101 |             os.chdir(os.path.join(map_kurator_system_dir ,'m2_detection_recognition'))
102 |             if not os.path.isdir(cropping_output_dir):
103 |                 os.makedirs(cropping_output_dir)
104 |             run_crop_command = 'python crop_img.py --img_path '+img_path + ' --output_dir '+ cropping_output_dir
105 |
106 |             time_usage = execute_command(run_crop_command, if_print_command)
107 |             # time_usage_dict[external_id]['cropping'] = time_usage
108 |
109 |     time_cropping = time.time()
110 |
111 |     # # ------------------------- Text Spotting (patch level) ------------------------------
112 |     if module_text_spotting:
113 |         os.chdir(text_spotting_model_dir)
114 |
115 |         # for index, record in sample_map_df.iterrows():
116 |         for file_path in file_list:
117 |             map_name = os.path.basename(file_path).split('.')[0]
118 |
119 |             map_spotting_output_dir = os.path.join(spotting_output_dir,map_name)
120 |             if not os.path.isdir(map_spotting_output_dir):
121 |                 os.makedirs(map_spotting_output_dir)
122 |
123 |             if spotter_option == 'abcnet':
124 |                 run_spotting_command = 'python demo/demo.py --config-file configs/BAText/CTW1500/attn_R_50.yaml --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+ map_spotting_output_dir + ' --opts MODEL.WEIGHTS ctw1500_attn_R_50.pth'
125 |             elif spotter_option == 'testr':
126 |                 run_spotting_command = 'python demo/demo.py --output_json --input='+ os.path.join(cropping_output_dir,map_name) + ' --output='+map_spotting_output_dir +' --opts MODEL.WEIGHTS icdar15_testr_R_50_polygon.pth'
127 |             else:
128 |                 raise NotImplementedError
129 |
130 |             run_spotting_command += ' 1> /dev/null'
131 |
132 |             time_usage = execute_command(run_spotting_command, if_print_command)
133 |
134 |             logging.info('Done text spotting for %s', map_name)
135 |
136 |     time_text_spotting = time.time()
137 |
138 |
139 |     # # ------------------------- Image coord geojson (map level) ------------------------------
140 |     if module_img_geojson:
141 |         os.chdir(os.path.join(map_kurator_system_dir ,'m3_image_geojson'))
142 |         if not os.path.isdir(stitch_output_dir):
143 |             os.makedirs(stitch_output_dir)
144 |
145 |         for file_path in file_list:
146 |             map_name = os.path.basename(file_path).split('.')[0]
147 |
148 |             stitch_input_dir = os.path.join(spotting_output_dir, map_name)
149 |             output_geojson = os.path.join(stitch_output_dir, map_name + '.geojson')
150 |
151 |             run_stitch_command = 'python stitch_output.py --input_dir '+stitch_input_dir + ' --output_geojson ' + output_geojson
152 |             time_usage = execute_command(run_stitch_command, if_print_command)
153 |             # time_usage_dict[external_id]['imgcoord_geojson'] = time_usage
154 |
155 |     time_img_geojson = time.time()
156 |
157 |     # # ------------------------- Geocoding ------------------------------
158 |     if module_geocoding:
159 |         os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
160 |
161 |         if metadata_tsv_path is not None:
162 |             map_df = pd.read_csv(metadata_tsv_path, sep='\t')
163 |
164 |         if not os.path.isdir(geocoding_output_dir):
165 |             os.makedirs(geocoding_output_dir)
166 |
167 |         for file_path in file_list:
168 |             map_name = os.path.basename(file_path).split('.')[0]
169 |             if metadata_tsv_path is not None:
170 |                 suffix = map_df[map_df['filename'] == map_name]['City'].values[0] # LoC sanborn
171 |                 suffix = ', ' + suffix
172 |             else:
173 |                 suffix = ', Los Angeles' # LA sanborn
174 |
175 |             run_geocoding_command = 'python3 s1_geocoding.py --input_map_geojson_path='+ os.path.join(stitch_output_dir,map_name + '.geojson') + ' --output_folder=' + geocoding_output_dir + \
176 |                 ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
177 |
178 |             time_usage = execute_command(run_geocoding_command, if_print_command)
179 |
180 |             # break
181 |
182 |             logging.info('Done geocoding for %s', map_name)
183 |
184 |
185 |     time_geocoding = time.time()
186 |
187 |
188 |     if module_clustering:
189 |         os.chdir(os.path.join(map_kurator_system_dir ,'m_sanborn'))
190 |
191 |         if not os.path.isdir(clustering_output_dir):
192 |             os.makedirs(clustering_output_dir)
193 |
194 |         # for file_path in file_list:
195 |         #     map_name = os.path.basename(file_path).split('.')[0]
196 |
197 |         #     run_clustering_command = 'python3 s2_clustering.py --dataset_name='+ expt_name + ' --output_folder=' + geocoding_output_dir + \
198 |         #         ' --api_key=' + api_key + ' --user_name=' + user_name + ' --max_results=5 --geocoder_option=' + geocoder_option + ' --suffix="' + suffix + '"'
199 |
200 |         #     time_usage = execute_command(run_clustering_command, if_print_command)
201 |
202 |
203 |         #     logging.info('Done clustering for %s', map_name)
204 |
205 |
206 | def main():
207 |     parser = argparse.ArgumentParser()
208 |
209 |     parser.add_argument('--map_kurator_system_dir', type=str, default='/home/zekun/dr_maps/mapkurator-system/')
210 |     parser.add_argument('--text_spotting_model_dir', type=str, default='/home/zekun/antique_names/model/AdelaiDet/')
211 |
212 |     parser.add_argument('--input_map_dir', type=str, default='/data2/mrm_sanborn_maps/LA_sanborn')
213 |     parser.add_argument('--output_folder', type=str, default='/data2/rumsey_output')
214 |     parser.add_argument('--expt_name', type=str, default='1000_maps') # output prefix
215 |
216 |     parser.add_argument('--module_get_dimension', default=False, action='store_true')
217 |     # parser.add_argument('--module_gen_geotiff', default=False, action='store_true') # only supports dr maps
218 |     parser.add_argument('--module_cropping', default=False, action='store_true')
219 |     parser.add_argument('--module_text_spotting', default=False, action='store_true')
220 |     parser.add_argument('--module_img_geojson', default=False, action='store_true')
221 |     parser.add_argument('--module_geocoding', default=False, action='store_true') # only supports sanborn
222 |     # parser.add_argument('--module_geocoord_geojson', default=False, action='store_true') # only supports dr maps
223 |     # parser.add_argument('--module_entity_linking', default=False, action='store_true') # only supports dr maps
224 |     parser.add_argument('--module_clustering', default=False, action='store_true') # only supports sanborn
225 |
226 |     parser.add_argument('--print_command', default=False, action='store_true')
227 |
228 |     parser.add_argument('--spotter_option', type=str, default='testr',
229 |                         choices=['abcnet', 'testr'],
230 |                         help='Select text spotting model option from ["abcnet","testr"]') # select text spotting model
231 |
232 |     parser.add_argument('--geocoder_option', type=str, default='arcgis',
233 |                         choices=['arcgis', 'google','geonames','osm'],
234 |                         help='Select geocoder option from ["arcgis","google","geonames","osm"]') # select geocoder
235 |
236 |     # params for geocoder:
237 |     parser.add_argument('--api_key', type=str, default=None, help='api_key for the geocoder; can be None if not running the geocoding module')
238 |     parser.add_argument('--user_name', type=str, default=None, help='user_name for the geocoder; can be None if not running the geocoding module')
239 |     parser.add_argument('--metadata_tsv_path', type=str, default=None) # '/home/zekun/Sanborn/Sheet_List.tsv'
240 |
241 |
242 |     args = parser.parse_args()
243 |     print('\n')
244 |     print(args)
245 |     print('\n')
246 |
247 |     run_pipeline(args)
248 |
249 |
250 | if __name__ == '__main__':
251 |
252 |     main()
253 |
254 |
255 |
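A sample end-to-end invocation of the Sanborn pipeline (directories, experiment name, API key, and user name below are placeholders; the flags are the ones defined in main() above):

```
python run_sanborn.py \
    --map_kurator_system_dir /home/zekun/dr_maps/mapkurator-system/ \
    --text_spotting_model_dir /home/zekun/antique_names/model/AdelaiDet/ \
    --input_map_dir /data2/mrm_sanborn_maps/LA_sanborn \
    --output_folder /data2/rumsey_output \
    --expt_name la_sanborn_test \
    --module_cropping --module_text_spotting --module_img_geojson --module_geocoding \
    --spotter_option testr \
    --geocoder_option arcgis \
    --api_key <API_KEY> --user_name <USER_NAME> \
    --print_command
```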
--------------------------------------------------------------------------------
/utils.py:
--------------------------------------------------------------------------------
1 | import os
2 | import glob
3 | import pandas as pd
4 | import ast
5 | import argparse
6 | import logging
7 | import pdb
8 |
9 | logging.basicConfig(level=logging.INFO)
10 |
11 | def func_file_to_fullpath_dict(file_path_list):
12 |
13 |     file_fullpath_dict = dict()
14 |     for file_path in file_path_list:
15 |         file_fullpath_dict[os.path.basename(file_path).split('.')[0]] = file_path
16 |
17 |     return file_fullpath_dict
18 |
19 | def get_img_path_from_external_id(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
20 |     # returns (1) a dict with external_id as key and full image path as value, and (2) a list of external_ids for which no image path could be found
21 |
22 |     jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
23 |     sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg'))
24 |     add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
25 |
26 |     jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
27 |     sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
28 |     add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
29 |
30 |     sample_map_df = pd.read_csv(sample_map_path, dtype={external_id_key:str})
31 |
32 |     external_id_to_img_path_dict = {}
33 |
34 |     unmatched_external_id_list = []
35 |
36 |     for index, record in sample_map_df.iterrows():
37 |         external_id = record[external_id_key] # honor the configurable id column (e.g. external_id or ListNo)
38 |         filename_without_extension = external_id.strip("'").replace('.','')
39 |
40 |         full_path = ''
41 |         if filename_without_extension in jp2_file_fullpath_dict:
42 |             full_path = jp2_file_fullpath_dict[filename_without_extension]
43 |         elif filename_without_extension in sid_file_fullpath_dict:
44 |             full_path = sid_file_fullpath_dict[filename_without_extension]
45 |         elif filename_without_extension in add_file_fullpath_dict:
46 |             full_path = add_file_fullpath_dict[filename_without_extension]
47 |         else:
48 |             # print('image with external_id not found in image_dir:', external_id)
49 |             unmatched_external_id_list.append(external_id)
50 |             continue
51 |         assert (len(full_path)!=0)
52 |
53 |         external_id_to_img_path_dict[external_id] = full_path
54 |
55 |     return external_id_to_img_path_dict, unmatched_external_id_list
56 |
57 | def get_img_path_from_external_id_and_image_no(jp2_root_dir = '/data/rumsey-jp2/', sid_root_dir = '/data2/rumsey_sid_to_jpg/', additional_root_dir='/data2/rumsey-luna-img/', sample_map_path = None,external_id_key = 'external_id') :
58 |     # returns (1) a dict with external_id as key and full image path as value, and (2) a list of external_ids for which no image path could be found
59 |
60 |     jp2_file_path_list = glob.glob(os.path.join(jp2_root_dir, '*/*.jp2'))
61 |     sid_file_path_list = glob.glob(os.path.join(sid_root_dir, '*.jpg')) # use converted jpg directly
62 |     add_file_path_list = glob.glob(os.path.join(additional_root_dir, '*'))
63 |
64 |     jp2_file_fullpath_dict = func_file_to_fullpath_dict(jp2_file_path_list)
65 |     sid_file_fullpath_dict = func_file_to_fullpath_dict(sid_file_path_list)
66 |     add_file_fullpath_dict = func_file_to_fullpath_dict(add_file_path_list)
67 |
68 |     sample_map_df = pd.read_csv(sample_map_path, dtype={'external_id':str})
69 |
70 |     external_id_to_img_path_dict = {}
71 |
72 |     unmatched_external_id_list = []
73 |     for index, record in sample_map_df.iterrows():
74 |         external_id = record.external_id
75 |         image_no = record.image_no
76 |         # filename_without_extension = external_id.strip("'").replace('.','')
77 |         filename_without_extension = image_no.strip("'").split('.')[0]
78 |
79 |         full_path = ''
80 |         if filename_without_extension in jp2_file_fullpath_dict:
81 |             full_path = jp2_file_fullpath_dict[filename_without_extension]
82 |         elif filename_without_extension in sid_file_fullpath_dict:
83 |             full_path = sid_file_fullpath_dict[filename_without_extension]
84 |         elif filename_without_extension in add_file_fullpath_dict:
85 |             full_path = add_file_fullpath_dict[filename_without_extension]
86 |         else:
87 |             print('image with external_id not found in image_dir:', external_id)
88 |             unmatched_external_id_list.append(external_id)
89 |             continue
90 |         assert (len(full_path)!=0)
91 |
92 |         external_id_to_img_path_dict[external_id] = full_path
93 |
94 |     return external_id_to_img_path_dict, unmatched_external_id_list
95 |
96 |
97 | if __name__ == '__main__':
98 |
99 |     parser = argparse.ArgumentParser()
100 |     parser.add_argument('--jp2_root_dir', type=str, default='/data/rumsey-jp2/',
101 |                         help='image dir of jp2 files.')
102 |     parser.add_argument('--sid_root_dir', type=str, default='/data2/rumsey_sid_to_jpg/',
103 |                         help='image dir of sid files.')
104 |     parser.add_argument('--additional_root_dir', type=str, default='/data2/rumsey-luna-img/',
105 |                         help='image dir of additional luna files.')
106 |     parser.add_argument('--sample_map_path', type=str, default='data/initial_US_100_maps.csv',
107 |                         help='path to sample map csv, which contains gcps info')
108 |     parser.add_argument('--external_id_key', type=str, default='external_id',
109 |                         help='key string for external id, could be external_id or ListNo')
110 |
111 |     args = parser.parse_args()
112 |     print(args)
113 |
114 |     # get_img_path_from_external_id(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
115 |     #                               sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
116 |
117 |     get_img_path_from_external_id_and_image_no(jp2_root_dir = args.jp2_root_dir, sid_root_dir = args.sid_root_dir, additional_root_dir = args.additional_root_dir,
118 |                                                sample_map_path = args.sample_map_path,external_id_key = args.external_id_key)
119 |
--------------------------------------------------------------------------------
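To make the filename matching in utils.py concrete, a small sketch of the two ID normalizations used above (the sample values are made up):

```
# get_img_path_from_external_id: quote-stripped external_id with dots removed
external_id = "'5242.000'"
print(external_id.strip("'").replace('.', ''))  # -> 5242000

# get_img_path_from_external_id_and_image_no: quote-stripped image_no up to the first dot
image_no = "'5242.001'"
print(image_no.strip("'").split('.')[0])        # -> 5242
```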