PaddleOCRVL Integration Guide

Overview

test_accuracy_batch_full.py now supports two OCR models for seal text recognition:

PP-OCRv5_server_rec (default) - Traditional OCR recognition model
PaddleOCRVL - Vision-Language model; more accurate in the single-file test reported under Performance Comparison

Usage

Option 1: Command Line Arguments

# Use default PP-OCRv5 model
python test_accuracy_batch_full.py

# Use PaddleOCRVL model (recommended for better accuracy)
python test_accuracy_batch_full.py --ocr-model paddleocr_vl

# Process a specific number of PDFs
python test_accuracy_batch_full.py --batch-size 5 --ocr-model paddleocr_vl

Option 2: Environment Variable

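# Set the environment variable on Linux/macOS
export OCR_MODEL=paddleocr_vl

# Set the environment variable on Windows (cmd.exe).
# Keep this line free of trailing text: cmd.exe would store it as part of the value.
set OCR_MODEL=paddleocr_vl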

# Run script (will use environment variable)
python test_accuracy_batch_full.py
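
The two options above could be wired together roughly as follows. This is a minimal sketch, not the script's actual code: the helper name parse_args, the VALID_MODELS tuple, and the rule that the command line overrides the environment variable are assumptions. Only the --ocr-model and --batch-size flags, the OCR_MODEL variable, and the two model names come from this guide.

```python
import argparse
import os

# Model names used in this guide.
VALID_MODELS = ("ppocr_v5", "paddleocr_vl")


def parse_args(argv=None, environ=None):
    """Resolve the OCR model from the CLI, falling back to OCR_MODEL (assumed precedence)."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Seal OCR accuracy test")
    parser.add_argument(
        "--ocr-model",
        choices=VALID_MODELS,
        # The environment variable only supplies the default; an explicit flag wins.
        default=environ.get("OCR_MODEL", "ppocr_v5"),
        help="OCR model used for seal text recognition",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=None,
        help="Number of PDFs to process (all PDFs if omitted)",
    )
    args = parser.parse_args(argv)
    # argparse does not check defaults against `choices`, so validate the env value here.
    if args.ocr_model not in VALID_MODELS:
        parser.error(f"OCR_MODEL must be one of {VALID_MODELS}, got {args.ocr_model!r}")
    return args


if __name__ == "__main__":
    print(parse_args())
```

Passing argv and environ explicitly keeps the function easy to exercise without touching the real process environment.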

Performance Comparison

Results from a single test file, WTS2025-21283.pdf:

Model	Recognized Text	Accuracy	Score
PP-OCRv5_server_rec	械检测技术有限公司	84.2%	0.8291
PaddleOCRVL	威凯检测技术有限公司	100% ✅	N/A
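
The 84.2% figure is consistent with a character-level similarity ratio, for example the one computed by difflib.SequenceMatcher from the Python standard library. Whether the script actually uses this metric is an assumption, as is treating 威凯检测技术有限公司 as the expected text (it is the output reported at 100%). A minimal sketch, with char_similarity as a hypothetical helper name:

```python
from difflib import SequenceMatcher


def char_similarity(expected: str, recognized: str) -> float:
    """Return a 0..1 similarity ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, expected, recognized).ratio()


expected = "威凯检测技术有限公司"    # assumed ground truth (10 characters)
ppocr_text = "械检测技术有限公司"    # PP-OCRv5 output from the table (9 characters)

# 8 characters match, so the ratio is 2 * 8 / (10 + 9) = 16 / 19.
print(f"{char_similarity(expected, ppocr_text):.1%}")  # 84.2%
print(f"{char_similarity(expected, expected):.1%}")    # 100.0%
```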

Requirements

For PaddleOCRVL, ensure you have:

pip install "paddleocr[doc-parser]"  # Quote the extra so shells such as zsh do not expand the brackets
pip install paddlepaddle==3.2.0  # Use 3.2.0, not 3.3.0

API Usage

In your own code:

from paddleocr import PaddleOCRVL
import json

# Initialize PaddleOCRVL with seal recognition
pipeline = PaddleOCRVL(
    use_seal_recognition=True,
    use_ocr_for_image_block=True,
    use_layout_detection=True
)

# Run prediction on an unwarped seal image
output = pipeline.predict("seal_unwarp_0.png")

# Extract seal text from result
result = output[0]
result.save_to_json(save_path="output")

# Read JSON to get seal text
with open("output/seal_unwarp_0_res.json", 'r', encoding='utf-8') as f:
    data = json.load(f)
    for block in data['parsing_res_list']:
        if block['block_label'] == 'seal':
            seal_text = block['block_content']
            print(f"Seal text: {seal_text}")
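
The JSON-reading step above can be factored into a small function that is testable without running the model. This is a sketch only: extract_seal_texts is a hypothetical helper, and the sample dictionary is made up to mirror the keys used above (parsing_res_list, block_label, block_content), not a real pipeline output.

```python
from typing import Any


def extract_seal_texts(data: dict[str, Any]) -> list[str]:
    """Collect the non-empty block_content of every block labelled 'seal'."""
    texts = []
    for block in data.get("parsing_res_list", []):
        if block.get("block_label") != "seal":
            continue
        content = (block.get("block_content") or "").strip()
        if content:
            texts.append(content)
    return texts


# Illustrative input shaped like the saved JSON described above.
sample = {
    "parsing_res_list": [
        {"block_label": "text", "block_content": "ignored"},
        {"block_label": "seal", "block_content": " 威凯检测技术有限公司 "},
        {"block_label": "seal", "block_content": ""},
    ]
}
print(extract_seal_texts(sample))
```

Using .get() with defaults means a result without any seal block yields an empty list instead of raising KeyError.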

Implementation Details

Modified Functions

run_ocr_recognition_vl() - New function for PaddleOCRVL recognition
  - Saves temporary JSON files
  - Extracts block_content from seal blocks
  - Returns a standardized result format
extract_seals_and_institutions() - Enhanced with OCR model selection
  - Added ocr_model parameter ("ppocr_v5" or "paddleocr_vl")
  - Added vl_pipeline parameter for the PaddleOCRVL instance
  - Automatic fallback to PP-OCRv5 if PaddleOCRVL is unavailable
process_single_pdf() - Updated to pass OCR model parameters
main() - Added command line argument parsing
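
The automatic fallback described for extract_seals_and_institutions() might look like the sketch below. The function name resolve_ocr_model and the logged warning are assumptions for illustration; the script's real logic may differ.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_ocr_model(requested: str, vl_available: bool) -> str:
    """Return the model to use, falling back to PP-OCRv5 when PaddleOCRVL cannot be imported."""
    if requested == "paddleocr_vl" and not vl_available:
        logger.warning("PaddleOCRVL not available; falling back to ppocr_v5")
        return "ppocr_v5"
    return requested


# vl_available would typically come from a guarded import of PaddleOCRVL.
print(resolve_ocr_model("paddleocr_vl", vl_available=False))
```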

Key Configuration

# In test_accuracy_batch_full.py
import os

# OCR Model Selection (via environment variable or command line)
OCR_MODEL = os.environ.get("OCR_MODEL", "ppocr_v5")

# Check PaddleOCRVL availability
try:
    from paddleocr import PaddleOCRVL
    PADDLEOCRVL_AVAILABLE = True
except ImportError:
    PADDLEOCRVL_AVAILABLE = False

Troubleshooting

Issue: "PaddleOCRVL not available"

Solution:

pip install "paddleocr[doc-parser]"

Issue: "use_seal_recognition or use_ocr_for_image_block not enabled"

Solution: Initialize the pipeline with both parameters enabled:

pipeline = PaddleOCRVL(
    use_seal_recognition=True,    # Required!
    use_ocr_for_image_block=True  # Required!
)

Issue: PaddlePaddle 3.3.0 compatibility error

Solution: Downgrade to 3.2.0:

pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/

File Structure

test_accuracy_batch_full.py
├── run_ocr_recognition()           # PP-OCRv5 recognition (existing)
├── run_ocr_recognition_vl()        # PaddleOCRVL recognition (new)
├── extract_seals_and_institutions() # Enhanced with model selection
└── main()                          # Added CLI argument parsing

Recommendations

For production use: Use PaddleOCRVL for better accuracy
For testing/debugging: Use PP-OCRv5 for faster iteration
For batch processing: PaddleOCRVL is slower but more accurate

Next Steps

Run a full batch test with PaddleOCRVL on all PDFs
Compare accuracy metrics between models
Benchmark processing time for both models
Consider adding a hybrid approach (try PP-OCRv5 first, fall back to PaddleOCRVL on low confidence)
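
One possible shape for the hybrid approach is sketched below. Everything here is hypothetical: the recognize_hybrid name, the result dictionaries with "text" and "score" keys, and the 0.9 threshold are assumptions, and the two recognizers are stubs that replay the values from the Performance Comparison table rather than calling a model.

```python
from typing import Callable, Optional

# A recognizer takes an image path and returns {"text": str, "score": Optional[float]}.
Recognizer = Callable[[str], dict]


def recognize_hybrid(
    image_path: str,
    fast: Recognizer,
    accurate: Recognizer,
    min_score: float = 0.9,  # assumed threshold; tune against real data
) -> dict:
    """Run the fast recognizer first; use the accurate one if confidence is low or missing."""
    result = fast(image_path)
    score: Optional[float] = result.get("score")
    if score is not None and score >= min_score:
        return result
    return accurate(image_path)


# Stubs standing in for PP-OCRv5 and PaddleOCRVL.
def fake_ppocr(path: str) -> dict:
    return {"text": "械检测技术有限公司", "score": 0.8291}


def fake_vl(path: str) -> dict:
    return {"text": "威凯检测技术有限公司", "score": None}


print(recognize_hybrid("seal_unwarp_0.png", fake_ppocr, fake_vl)["text"])
```

With the stub score of 0.8291 below the assumed threshold, the call falls through to the second recognizer, so the slower model runs only on seals the fast model is unsure about.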
