# DocLLM_reimplementation

This repository is a reimplementation of [DOCLLM: A LAYOUT-AWARE GENERATIVE LANGUAGE MODEL FOR MULTIMODAL DOCUMENT UNDERSTANDING](https://arxiv.org/pdf/2401.00908.pdf).

# Model architecture

We re-implement the model architecture on top of baichuan2-7b (the paper uses llama2-7b because it focuses on English data). The added layout components increase the model size from 7.5B to 9.1B parameters.

The re-implemented model architecture is available at
https://huggingface.co/JinghuiLuAstronaut/DocLLM_baichuan2_7b

**Note that this is a re-implementation of the model architecture only. All newly added parameters are randomly initialized; you can download the model and continue pre-training or fine-tuning it.**

**The inference code is available in readme.md (you can extend that code to perform training as well), and the model architecture is defined in modeling_baichuan.py, which is included when you download the Hugging Face model.**

# Performance

We tested the fine-tuned DocLLM_baichuan2_7b on an in-house KIE (key information extraction) dataset. Even without pre-training, it still achieves an improvement.

| Model | F-score |
| ------------- | ------------- |
| DocLLM\_baichuan2\_7b | 76.75 |
| baichuan2\_7b | 74.95 |

# Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load tokenizer and model
device = "cuda:0"
model_path = "model_path"  # local path or Hub id of the downloaded model
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)
model.eval()

# Example input; in English: "Company: byd\nProduct: Zeekr 001"
input_str = "公司:byd\n产品:极氪001"
# One poly (bounding box) per token id. [-1, -1, -1, -1] is a masked poly,
# which corresponds to "\n" in this example.
input_poly = [
    [0.1749, 0.1466, 0.5317, 0.5486],
    [0.1749, 0.1466, 0.5317, 0.5486],
    [0.1749, 0.1466, 0.5317, 0.5486],
    [0.1749, 0.1466, 0.5317, 0.5486],
    [-1, -1, -1, -1],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
    [0.6545, 0.2287, 0.8743, 0.4666],
]

# Convert token ids and polys to tensors; both must have the same length.
input_ids = tokenizer.encode(input_str)
input_ids = torch.as_tensor(input_ids, dtype=torch.int64)
input_coordinates = torch.as_tensor(input_poly)

# Add a batch dimension and run a forward pass without tracking gradients.
with torch.no_grad():
    output = model(
        input_ids=input_ids.unsqueeze(0).to(device),
        input_coordinates=input_coordinates.unsqueeze(0).to(device),
    )
```
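The Quick start example writes out one poly per token by hand. A minimal sketch of building that list programmatically is shown below. It is not part of the released code: `normalize_box`, `expand_boxes`, and `char_tokenize` are hypothetical helpers, the `[x0, y0, x1, y1]` ordering and the normalization to the `[0, 1]` range are assumptions inferred from the example values, and a character-level stand-in tokenizer is used so the sketch runs without downloading the model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = List[float]
# Same sentinel the Quick start example uses for tokens without a layout box.
MASKED_BOX: Box = [-1.0, -1.0, -1.0, -1.0]


def normalize_box(box: Sequence[float], width: float, height: float) -> Box:
    """Scale a pixel-space [x0, y0, x1, y1] box into [0, 1] (assumed convention)."""
    x0, y0, x1, y1 = box
    return [round(x0 / width, 4), round(y0 / height, 4),
            round(x1 / width, 4), round(y1 / height, 4)]


def expand_boxes(
    segments: Sequence[Tuple[str, Optional[Sequence[float]]]],
    tokenize: Callable[[str], List[int]],
) -> Tuple[List[int], List[Box]]:
    """Repeat each segment's box once per token; a box of None becomes MASKED_BOX."""
    token_ids: List[int] = []
    boxes: List[Box] = []
    for text, box in segments:
        ids = tokenize(text)
        token_ids.extend(ids)
        per_token = list(box) if box is not None else MASKED_BOX
        boxes.extend(list(per_token) for _ in ids)
    return token_ids, boxes


def char_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (one id per character), used only for this sketch."""
    return [ord(ch) for ch in text]


# Two text segments on a hypothetical 1000x500 page, separated by a newline.
segments = [
    ("ab", normalize_box([100, 50, 300, 150], width=1000, height=500)),
    ("\n", None),
    ("cde", normalize_box([500, 200, 900, 400], width=1000, height=500)),
]
token_ids, boxes = expand_boxes(segments, char_tokenize)
```

With the real tokenizer, one option would be to pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` as `tokenize`. Tokenizing segments separately may not produce the same ids as tokenizing the full string, so it is worth checking that the resulting lengths match before calling the model.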