Figure 1. An overview of the proposed M5HisDoc dataset. For better visibility, please zoom in on the image. (a) Multiple layouts. (b) Multiple document types. (c) Multiple calligraphy styles. (d) Multiple backgrounds. (e) Multiple challenges, including dense texts, distortion, rotation, damage, image blurriness, and variations in font sizes.

As shown in Figure 1, M5 indicates five properties of style, i.e., Multiple layouts, Multiple document types, Multiple calligraphy styles, Multiple backgrounds, and Multiple challenges. The M5HisDoc dataset consists of two subsets, M5HisDoc-R (Regular) and M5HisDoc-H (Hard). The M5HisDoc-R subset comprises 4,000 historical document images. To ensure high-quality annotations, we perform meticulous manual annotation and triple-checking.

Figure 2. Example of data processing to generate the M5HisDoc-H subset.

As shown in Figure 2, to replicate real-world conditions for historical document analysis applications, we apply image rotation, distortion, and resolution reduction to the M5HisDoc-R subset to form a new, more challenging subset named M5HisDoc-H, which contains the same number of images as M5HisDoc-R.

Annotations are provided at both the character level and the text-line level, including the text bounding box, the text content, and the corresponding reading order. Therefore, M5HisDoc can be applied to a wide range of tasks, including text-line/character detection, recognition, and reading order prediction.

# Collection
Our data collection process draws on three main sources. First, we carefully select 300 images from the training set of MTHv2 and 700 images from SCUT-CAB, which serve as representative samples. Second, we gather tens of thousands of scanned images from electronic ancient books available on the Internet, encompassing 131 ancient books. From this collection, we manually curate 2,799 historical document images taken from 37 representative books. Third, we conduct realistic photo shoots to simulate real-world photographing conditions. Selecting four physical Chinese ancient books, we capture 201 images using a scanner, considering various angles and lighting conditions. In total, we obtain 4,000 images to establish the M5HisDoc benchmark.

# Statistics

Figure 3. Statistics of M5HisDoc. (a) The aspect ratio of the images. (b) The number of characters per category. (c) The distributions of the text line length.
We also calculate the aspect ratio of the images, the number of characters per category, and the distribution of the text line length. As shown in Fig. 3a, the aspect ratio of the images varies significantly, ranging from less than 0.4 to more than 1.8. This is because M5HisDoc contains a variety of styles and layouts. The number of samples per category exhibits a clear long-tail distribution, as demonstrated in Fig. 3b. The category with the largest number of samples contains over 30,000 instances, whereas the category with the fewest has fewer than 3 samples. It can also be observed from Fig. 3c that text length in M5HisDoc is notably diverse. About 40% of texts have fewer than 6 characters, while about 7% of texts exceed 25 characters. Some texts are extremely long, with the longest text containing 58 characters.

# Directory Format
The dataset is organized in the following directory format:
```
├── M5HisDoc
    ├── M5HisDoc_regular
        ├── images
        │   ├── xxx.jpg
        │   └── ...
        ├── label_textline
        │   ├── xxx.txt
        │   └── ...
        ├── label_char
        │   ├── xxx.txt
        │   └── ...
    ├── M5HisDoc_hard
        ├── images
        │   ├── xxx.jpg
        │   └── ...
        ├── label_textline
        │   ├── xxx.txt
        │   └── ...
        ├── label_char
        │   ├── xxx.txt
        │   └── ...
    ├── split_train.txt
    ├── split_val.txt
    ├── split_test.txt
    ├── char_dict.txt
```

# Citation
```
@inproceedings{shi2023m5hisdoc,
  title={M5HisDoc: A Large-scale Multi-style Chinese Historical Document Analysis Benchmark},
  author={Shi, Yongxin and Liu, Chongyu and Peng, Dezhi and Jian, Cheng and Huang, Jiarong and Jin, Lianwen},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023}
}
```

# Contact
For any questions about the dataset, please contact the authors by sending an email to Prof. Jin ([eelwjin@scut.edu.cn](mailto:eelwjin@scut.edu.cn) or [lianwen.jin@gmail.com](mailto:lianwen.jin@gmail.com)).
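
As a rough illustration of how the layout under Directory Format might be traversed, the sketch below pairs each image named in a split file with its text-line and character label files. Several details are assumptions, not facts stated in this README: that the `split_*.txt` files sit directly under the `M5HisDoc` root, that each holds one image name per line (with or without the `.jpg` extension), and that label files share the image's base name. The names `Sample` and `list_samples` are hypothetical helpers introduced only for this example, and the label files are not parsed because their format is not described here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Sample:
    """Paths belonging to one document image (hypothetical helper type)."""
    image: Path
    textline_label: Path
    char_label: Path


def list_samples(root, subset="M5HisDoc_regular", split="train"):
    """Pair each image named in a split file with its two label files.

    Assumed, not confirmed by the README: split_<split>.txt lives directly
    under `root`, lists one image name per line, and label files reuse the
    image's base name with a .txt extension.
    """
    root = Path(root)
    subset_dir = root / subset
    split_file = root / f"split_{split}.txt"

    samples = []
    for line in split_file.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # skip blank lines
        # Accept both "xxx" and "xxx.jpg" entries (requires Python 3.9+).
        stem = name.removesuffix(".jpg")
        samples.append(
            Sample(
                image=subset_dir / "images" / f"{stem}.jpg",
                textline_label=subset_dir / "label_textline" / f"{stem}.txt",
                char_label=subset_dir / "label_char" / f"{stem}.txt",
            )
        )
    return samples
```

Under these assumptions, `list_samples("M5HisDoc", subset="M5HisDoc_hard", split="test")` would return the test-split paths for the hard subset. If the split files turn out to use a different layout, only the line-parsing step should need to change.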