The world of large language models (LLMs) is constantly evolving, with new advances emerging rapidly. One exciting area is the development of multimodal LLMs (MLLMs), capable of understanding and reasoning over both text and images. This opens up a world of possibilities for tasks like document understanding, visual question answering, and more.
I recently wrote a general article on one of these models, which you can consult here:
In this post, we'll explore a powerful combination: the InternVL model and the QLoRA fine-tuning technique, focusing on how easily these models can be customized for a specific use case. We will use these tools to build a receipt understanding pipeline that extracts key information such as company name, address, and total purchase amount with high accuracy.
This project aims to develop a system capable of accurately extracting specific information from scanned receipts, using the capabilities of InternVL. The task presents a unique challenge, requiring not only robust natural language processing (NLP) but also the ability to interpret the visual layout of the input image. This will allow us to create a unique, OCR-free, end-to-end pipeline that demonstrates strong generalization to complex documents.
To train and evaluate our model, we will use the SROIE dataset. SROIE provides 1,000 scanned receipt images, each annotated with key entities such as:
- Company: the name of the store or company.
- Date: the date of purchase.
- Address: the address of the store.
- Total: the total amount paid.
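Each image is paired with a small JSON annotation holding these four fields. An illustrative annotation (the values here are made up, not taken from the dataset) looks like this:
```json
{
  "company": "ABC HARDWARE SDN BHD",
  "date": "01/02/2018",
  "address": "12, JALAN CONTOH, 40400 SHAH ALAM, SELANGOR",
  "total": "25.90"
}
```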
We will evaluate the performance of our model using a fuzzy similarity score, a metric that measures the similarity between predicted and ground-truth values for each field. This metric ranges from 0 (irrelevant results) to 100 (perfect predictions).
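The exact metric implementation lives in the linked repository; a minimal sketch using Python's standard-library difflib (a dedicated fuzzy-matching library would behave similarly) could look like this:
```python
# Minimal sketch of the evaluation metric. The 0-100 scale and per-field
# averaging follow the article; the use of difflib is an assumption made
# for illustration.
from difflib import SequenceMatcher


def fuzzy_score(prediction: str, ground_truth: str) -> float:
    """Character-level similarity between two strings, scaled to 0-100."""
    return 100 * SequenceMatcher(None, prediction.lower(), ground_truth.lower()).ratio()


def receipt_score(pred: dict, truth: dict) -> float:
    """Average the fuzzy score over the four annotated fields."""
    fields = ["company", "date", "address", "total"]
    return sum(fuzzy_score(pred.get(field, ""), truth[field]) for field in fields) / len(fields)
```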
InternVL is a family of multimodal LLMs from OpenGVLab, designed to excel in tasks involving images and text. Its architecture combines a vision model (like InternViT) with a language model (like InternLM2 or Phi-3). We will focus on the Mini-InternVL-Chat-2B-V1-5 variant, a smaller version well suited to running on consumer GPUs.
The main advantages of InternVL:
- Efficiency: Its compact size allows for efficient training and inference.
- Accuracy: Although it is smaller, it achieves competitive performance in various benchmarks.
- Multimodal Capabilities: It seamlessly combines understanding of images and text.
Demo: You can explore a live demo of InternVL here.
To further improve the performance of our model, we will use QLoRA, a fine-tuning technique that significantly reduces memory consumption while preserving performance. Here's how it works:
- Quantization: The pre-trained LLM is quantized to 4-bit precision, reducing its memory footprint.
- Low-Rank Adapters (LoRA): Instead of changing all parameters of the pre-trained model, LoRA adds small trainable adapters to the network. These adapters capture task-specific information without requiring changes to the main model.
- Efficient training: The combination of quantization and LoRA enables efficient fine-tuning even on GPUs with limited memory.
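Before looking at the InternVL-specific code, here is a minimal, generic sketch of this recipe using transformers, bitsandbytes, and peft. The model name and LoRA hyperparameters are illustrative, not the ones used in this project:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Quantization: load the base weights in 4-bit NF4 precision
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # illustrative model; the article uses InternVL
    quantization_config=quant_config,
    device_map={"": 0},
)
model = prepare_model_for_kbit_training(model)

# 2. LoRA: attach small trainable adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # module names are model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train
```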
Let's dive into the code. First, we will evaluate the baseline performance of Mini-InternVL-Chat-2B-V1-5 without any fine-tuning:
```python
import torch
from transformers import BitsAndBytesConfig

# InternVLChatModel, InternLM2Tokenizer and load_image come from the
# InternVL repository (see the code link at the end of the post);
# `args` holds the command-line arguments.

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    args.path,
    device_map={"": 0},
    quantization_config=quant_config if args.quant else None,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(args.path)
model.eval()

# `max_num` sets the maximum number of image tiles
pixel_values = (
    load_image(image_base_path / "X51005255805.jpg", max_num=6)
    .to(torch.bfloat16)
    .cuda()
)
generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)
# single-round, single-image conversation
question = (
    "Extract the company, date, address and total in json format."
    "Respond with a valid JSON only."
)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```
The result:
```json
{
"company": "SAM SAM TRADING CO",
"date": "Fri, 29-12-2017",
"address": "67, JLN MENHAW 25/63 TNN SRI HUDA, 40400 SHAH ALAM",
"total": "RM 14.10"
}
```
This code:
- Loads the model from the Hugging Face Hub, optionally in 4-bit precision.
- Loads a sample receipt image and converts it to a tensor (a simplified version of the `load_image` helper is sketched below).
- Formulates a question asking the model to extract the relevant fields from the image.
- Runs the model and prints the extracted information as JSON.
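The `load_image` helper comes from the InternVL repository: it normalizes the image with ImageNet statistics and splits it into up to `max_num` 448x448 tiles. A simplified single-tile stand-in (tiling omitted for brevity) might look like:
```python
# Simplified stand-in for InternVL's `load_image` helper. The real helper
# additionally splits the image into up to `max_num` tiles based on its
# aspect ratio; here we keep a single resized tile.
import torchvision.transforms as T
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def load_image_simple(path, input_size=448):
    """Resize, normalize and batch a receipt image for the vision encoder."""
    transform = T.Compose([
        T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    image = Image.open(path).convert("RGB")
    return transform(image).unsqueeze(0)  # shape: (1, 3, 448, 448)
```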
This zero-shot evaluation shows impressive results, achieving an average fuzzy similarity score of 74.24%. This demonstrates InternVL's ability to understand receipts and extract information without fine-tuning.
To further improve the accuracy, we will fine-tune the model using QLoRA. Here's how we implement it:
```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# load_data, SFTDataset, CustomDataCollator, wrap_lora and IMG_CONTEXT_TOKEN
# are project helpers (see the code link at the end of the post).

_data = load_data(args.data_path, fold="train")

# Quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(path)

# Register the image-context token so image embeddings are injected correctly
img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
model.img_context_token_id = img_context_token_id
model.config.llm_config.use_cache = False

# Wrap the model with trainable LoRA adapters
model = wrap_lora(model, r=128, lora_alpha=256)

training_data = SFTDataset(
    data=_data, template=model.config.template, tokenizer=tokenizer
)
collator = CustomDataCollator(pad_token=tokenizer.pad_token_id, ignore_index=-100)

train_params = TrainingArguments(
    output_dir=str(BASE_PATH / "results_modified"),
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    optim="paged_adamw_32bit",
    save_steps=len(training_data) // 10,
    logging_steps=len(training_data) // 50,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.001,
    max_steps=-1,
    group_by_length=False,
    max_grad_norm=1.0,
)
# Trainer
fine_tuning = SFTTrainer(
    model=model,
    train_dataset=training_data,
    dataset_text_field="###",
    tokenizer=tokenizer,
    args=train_params,
    data_collator=collator,
    max_seq_length=tokenizer.model_max_length,
)
fine_tuning.model.print_trainable_parameters()
# Training
fine_tuning.train()
# Save the LoRA adapters
fine_tuning.model.save_pretrained(refined_model)
```
This code:
- Loads the model with 4-bit quantization enabled.
- Wraps the model with LoRA, adding trainable adapters.
- Builds the training dataset from the SROIE annotations.
- Defines training arguments such as learning rate, batch size, and number of epochs.
- Initializes a trainer to manage the training process.
- Trains the model on the SROIE training split.
- Saves the fine-tuned LoRA adapters (reloading them for evaluation is sketched below).
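After training, the saved adapters can be loaded back on top of the quantized base model for side-by-side evaluation. Here is a sketch, assuming `wrap_lora` is built on peft so that `save_pretrained` wrote a LoRA adapter directory (variable names reuse those from the scripts above):
```python
# Sketch of reloading the fine-tuned adapters for evaluation; assumes a
# peft-style adapter directory at `refined_model`.
from peft import PeftModel

base_model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
base_model.img_context_token_id = img_context_token_id
model = PeftModel.from_pretrained(base_model, refined_model)  # attach adapters
model.eval()

response = model.chat(tokenizer, pixel_values, question, generation_config)
```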
Here is an example comparing the base model and the QLoRA fine-tuned model:
Ground truth:
```json
{
    "company": "YONG TAT HARDWARE TRADING",
    "date": "13/03/2018",
    "address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
    "total": "72.00"
}
```
Base model prediction (incorrect):
```json
{
    "company": "YONG TAT HARDWARE TRADING",
    "date": "13/03/2016",
    "address": "JM092487-D",
    "total": "67.92"
}
```
QLoRA model prediction (correct):
```json
{
    "company": "YONG TAT HARDWARE TRADING",
    "date": "13/03/2018",
    "address": "NO 4, JALAN PERUBANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR",
    "total": "72.00"
}
```
After fine-tuning with QLoRA, our model achieves a remarkable 95.4% fuzzy similarity score, a significant improvement over the baseline (74.24%). This demonstrates the power of QLoRA to improve model accuracy without requiring massive computing resources: training took about 15 minutes on 600 samples with an RTX 3080 GPU.
We successfully built a robust receipt understanding pipeline using InternVL and QLoRA. This approach highlights the potential of multimodal LLMs for real-world tasks such as document analysis and information extraction. In this use case, we gained roughly 21 points of prediction quality (74.24% to 95.4%) using a few hundred examples and a few minutes of compute on a consumer GPU.
You can find the full code implementation for this project here.
The development of multimodal LLMs is only just beginning and the future holds exciting possibilities. The field of automated document processing has immense potential in the era of MLLMs. These models can revolutionize the way we extract information from contracts, invoices and other documents, requiring minimal training data. By integrating text and vision, they can analyze the layout of complex documents with unprecedented precision, paving the way for more efficient and intelligent information management.
The future of AI is multimodal, and InternVL and QLoRA are powerful tools to help us unlock its potential on a small compute budget.
Links:
Code: https://github.com/CVxTz/doc-llm
Dataset source: https://rrc.cvc.uab.es/?ch=13&com=introduction
Dataset license: Creative Commons Attribution 4.0 International