└── README.md


/README.md:
--------------------------------------------------------------------------------
 1 | # DocLLM_reimplementation
 2 | 
 3 | This repository is a reimplementation of [DOCLLM: A LAYOUT-AWARE GENERATIVE LANGUAGE MODEL
 4 | FOR MULTIMODAL DOCUMENT UNDERSTANDING](https://arxiv.org/pdf/2401.00908.pdf)
 5 | 
 6 | # Model architecture
 7 | 
 8 | We re-implement the model architecture on top of baichuan2-7b instead of llama2-7b (the paper uses llama2-7b because it focuses on English data). The added components increase the model size from 7.5B to 9.1B parameters.
 9 | 
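As a hedged reminder of what the added components are for (this README does not restate the paper, so the exact form used in modeling_baichuan.py should be checked against the downloaded code): the paper's disentangled spatial attention introduces separate query and key projections for bounding-box embeddings and combines four score terms.

$$
A_{i,j} = Q^{t}_{i} {K^{t}_{j}}^{\top} + \lambda_{t,s}\, Q^{t}_{i} {K^{s}_{j}}^{\top} + \lambda_{s,t}\, Q^{s}_{i} {K^{t}_{j}}^{\top} + \lambda_{s,s}\, Q^{s}_{i} {K^{s}_{j}}^{\top}
$$

Here $Q^{t}, K^{t}$ are the usual text projections, $Q^{s}, K^{s}$ are the spatial projections of the box embeddings, and the $\lambda$ terms are scalar weights. The spatial projections are newly added parameters, which is consistent with the larger model size.
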
10 | The re-implemented model architecture is available at
11 | https://huggingface.co/JinghuiLuAstronaut/DocLLM_baichuan2_7b
12 | 
13 | **Note that this is a re-implementation of the model architecture only: all newly added parameters are randomly initialized. You can download the model and continue pre-training or fine-tuning it.**
14 | 
15 | **The inference code is in this README under Quick start (you can extend it to perform training as well). The model architecture is defined in modeling_baichuan.py, which is included when you download the Hugging Face model.**
16 | 
17 | # Performance
18 | 
19 | We fine-tuned DocLLM_baichuan2_7b on an in-house KIE (key information extraction) dataset. Even without any pre-training of the newly added parameters, it outperforms the plain baichuan2_7b baseline by 1.8 F-score points.
20 | 
21 | 
22 | | Model  | F-score |
23 | | ------------- | ------------- |
24 | | DocLLM\_baichuan2\_7b  | 76.75  |
25 | | baichuan2\_7b | 74.95  |
26 | 
27 | # Quick start
28 | 
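The quick-start code below passes one box per token id and uses `[-1, -1, -1, -1]` for tokens that have no box. The following is a minimal sketch of building that list from segment-level boxes, assuming you already know how many tokens each segment produces. `expand_boxes` and `MASKED_BOX` are hypothetical names used for illustration; they are not part of the released model code.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]

# Sentinel for tokens without a box, same values as in the quick start.
MASKED_BOX: Box = (-1.0, -1.0, -1.0, -1.0)


def expand_boxes(
    token_counts: Sequence[int],
    boxes: Sequence[Optional[Box]],
) -> List[List[float]]:
    """Repeat each segment's box once per token; None becomes MASKED_BOX."""
    if len(token_counts) != len(boxes):
        raise ValueError("need exactly one box (or None) per segment")
    per_token: List[List[float]] = []
    for count, box in zip(token_counts, boxes):
        chosen = MASKED_BOX if box is None else box
        # Build a fresh list per token so rows are not aliased.
        per_token.extend(list(chosen) for _ in range(count))
    return per_token


# Same layout as the quick start: 4 tokens, 1 masked "\n" token, 7 tokens.
company_box = (0.1749, 0.1466, 0.5317, 0.5486)
product_box = (0.6545, 0.2287, 0.8743, 0.4666)
input_poly = expand_boxes([4, 1, 7], [company_box, None, product_box])
```

In practice the counts would come from the tokenizer. It is worth checking that `len(input_poly)` equals `len(tokenizer.encode(input_str))`, since special tokens or merges across segment boundaries could change the count.
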
29 | ```python
30 | 
31 | from transformers import AutoTokenizer, AutoModelForCausalLM
32 | import torch
33 | 
34 | # Load tokenizer and model
35 | device = "cuda:0"
36 | model_path = "model_path"  # placeholder: local path to the downloaded DocLLM_baichuan2_7b model
37 | tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, padding_side="left")
38 | model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).to(device)
39 | 
40 | input_str = "公司:byd\n产品:极氪001"  # "Company: byd\nProduct: Zeekr 001"; kept in Chinese so the token count matches input_poly
41 | # One box ("poly") per token id; [-1, -1, -1, -1] marks a masked box, which corresponds to "\n" in this example
42 | input_poly = [
43 |   [0.1749,0.1466,0.5317,0.5486],
44 |   [0.1749,0.1466,0.5317,0.5486],
45 |   [0.1749,0.1466,0.5317,0.5486],
46 |   [0.1749,0.1466,0.5317,0.5486],
47 |   [-1,-1,-1,-1],
48 |   [0.6545,0.2287,0.8743,0.4666],
49 |   [0.6545,0.2287,0.8743,0.4666],
50 |   [0.6545,0.2287,0.8743,0.4666],
51 |   [0.6545,0.2287,0.8743,0.4666],
52 |   [0.6545,0.2287,0.8743,0.4666],
53 |   [0.6545,0.2287,0.8743,0.4666],
54 |   [0.6545,0.2287,0.8743,0.4666]
55 |   ]
56 | 
57 | input_ids = tokenizer.encode(input_str)  # expected to yield one token id per box in input_poly
58 | input_ids = torch.as_tensor(input_ids, dtype=torch.int64)
59 | input_coordinates = torch.as_tensor(input_poly)
60 | 
61 | output = model(
62 |     input_ids=input_ids.unsqueeze(0).to(device),  # add a batch dimension
63 |     input_coordinates=input_coordinates.unsqueeze(0).to(device),  # add a batch dimension
64 | )
65 | ```


--------------------------------------------------------------------------------