PaddleOCRVL Integration Guide

Overview

test_accuracy_batch_full.py now supports two OCR models for seal text recognition:

  1. PP-OCRv5_server_rec (default) - Traditional OCR model
  2. PaddleOCRVL - Vision-Language model; more accurate than PP-OCRv5 in the single-file comparison under Performance Comparison

Usage

Option 1: Command Line Arguments

# Use default PP-OCRv5 model
python test_accuracy_batch_full.py

# Use PaddleOCRVL model (recommended for better accuracy)
python test_accuracy_batch_full.py --ocr-model paddleocr_vl

# Process a specific number of PDFs
python test_accuracy_batch_full.py --batch-size 5 --ocr-model paddleocr_vl

Option 2: Environment Variable

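# Set the environment variable (Linux/macOS)
export OCR_MODEL=paddleocr_vl

# Set the environment variable (Windows cmd)
# Keep this line free of trailing text: cmd stores everything after "=" as the value
set OCR_MODEL=paddleocr_vl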

# Run script (will use environment variable)
python test_accuracy_batch_full.py
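
One way the two options could interact is for the command-line flag to override the environment variable, which in turn overrides the built-in default. The sketch below assumes that precedence; this guide does not state how test_accuracy_batch_full.py resolves a conflict, and `build_parser` and the `None` default for `--batch-size` are hypothetical.

```python
import argparse
import os

VALID_MODELS = ("ppocr_v5", "paddleocr_vl")


def build_parser(env=None):
    """Build a parser whose --ocr-model default is read from OCR_MODEL.

    Assumed precedence: CLI flag > OCR_MODEL environment variable > "ppocr_v5".
    `env` is injectable so the behaviour can be checked without touching os.environ.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Batch seal OCR accuracy test")
    parser.add_argument(
        "--ocr-model",
        choices=VALID_MODELS,
        default=env.get("OCR_MODEL", "ppocr_v5"),
        help="OCR model used for seal text recognition",
    )
    # Hypothetical default: None is taken here to mean "process every PDF".
    parser.add_argument("--batch-size", type=int, default=None,
                        help="number of PDFs to process")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"OCR model: {args.ocr_model}, batch size: {args.batch_size}")
```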

Performance Comparison

Based on a single test file, WTS2025-21283.pdf:

| Model | Recognized Text | Accuracy | Score |
|---|---|---|---|
| PP-OCRv5_server_rec | 械检测技术有限公司 | 84.2% | 0.8291 |
| PaddleOCRVL | 威凯检测技术有限公司 | 100% | N/A |
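
This guide does not say how the accuracy column is computed. As an illustration only, the standard-library `difflib.SequenceMatcher` ratio happens to reproduce both figures when the PaddleOCRVL output is taken as the expected text (an assumption); `text_similarity` is a hypothetical helper, not a function from the script.

```python
from difflib import SequenceMatcher


def text_similarity(recognized: str, expected: str) -> float:
    """Return a 0..1 similarity ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, recognized, expected).ratio()


# Assumption: the 100% row is treated as the ground-truth seal text.
expected = "威凯检测技术有限公司"

# 8 matching characters out of 9 + 10 total: 2 * 8 / 19
print(f"{text_similarity('械检测技术有限公司', expected):.1%}")    # 84.2%
print(f"{text_similarity('威凯检测技术有限公司', expected):.1%}")  # 100.0%
```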

Requirements

For PaddleOCRVL, ensure you have:

# Quote the extra so shells such as zsh do not expand the brackets
pip install "paddleocr[doc-parser]"
pip install paddlepaddle==3.2.0  # Use 3.2.0, not 3.3.0

API Usage

In your own code:

from paddleocr import PaddleOCRVL
import json

# Initialize PaddleOCRVL with seal recognition
pipeline = PaddleOCRVL(
    use_seal_recognition=True,
    use_ocr_for_image_block=True,
    use_layout_detection=True
)

# Run prediction on an unwarped seal image
output = pipeline.predict("seal_unwarp_0.png")

# Save the first result as JSON under the "output" directory
result = output[0]
result.save_to_json(save_path="output")

# Read JSON to get seal text
with open("output/seal_unwarp_0_res.json", 'r', encoding='utf-8') as f:
    data = json.load(f)
    for block in data['parsing_res_list']:
        if block['block_label'] == 'seal':
            seal_text = block['block_content']
            print(f"Seal text: {seal_text}")
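
The JSON-reading step can be factored into a small function that works on the already loaded dictionary, which makes it easy to check without running the model. This is a sketch: `extract_seal_texts` is a hypothetical name, and the only keys it relies on are the ones used in the example above (`parsing_res_list`, `block_label`, `block_content`). The sample dictionary is made up for illustration.

```python
from typing import Any, Dict, List


def extract_seal_texts(data: Dict[str, Any]) -> List[str]:
    """Collect the non-empty block_content of every block labelled 'seal'."""
    texts = []
    for block in data.get("parsing_res_list", []):
        if block.get("block_label") != "seal":
            continue
        content = (block.get("block_content") or "").strip()
        if content:
            texts.append(content)
    return texts


# Made-up sample with the same shape as the saved JSON.
sample = {
    "parsing_res_list": [
        {"block_label": "text", "block_content": "Test report"},
        {"block_label": "seal", "block_content": " 威凯检测技术有限公司 "},
        {"block_label": "seal", "block_content": ""},
    ]
}
print(extract_seal_texts(sample))
```

Returning a list rather than a single string keeps the helper usable when an image contains more than one seal block.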

Implementation Details

Modified Functions

  1. run_ocr_recognition_vl() - New function for PaddleOCRVL recognition

    • Saves temp JSON files
    • Extracts block_content from seal blocks
    • Returns standardized result format
  2. extract_seals_and_institutions() - Enhanced with OCR model selection

    • Added ocr_model parameter ("ppocr_v5" or "paddleocr_vl")
    • Added vl_pipeline parameter for PaddleOCRVL instance
    • Automatic fallback to PP-OCRv5 if PaddleOCRVL is unavailable
  3. process_single_pdf() - Updated to pass OCR model parameters

  4. main() - Added command line argument parsing
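
The automatic fallback listed above can be sketched as a small pure function. `resolve_ocr_model` is a hypothetical helper, not necessarily how extract_seals_and_institutions() is written; the only behaviour taken from this guide is that a request for PaddleOCRVL falls back to PP-OCRv5 when the import is unavailable. Rejecting unknown model names is an added assumption.

```python
import logging

logger = logging.getLogger(__name__)

VALID_MODELS = ("ppocr_v5", "paddleocr_vl")


def resolve_ocr_model(requested: str, vl_available: bool) -> str:
    """Return the model to use, falling back to PP-OCRv5 when needed."""
    if requested not in VALID_MODELS:
        # Assumption: unknown names are treated as an error rather than ignored.
        raise ValueError(f"Unknown OCR model {requested!r}; expected one of {VALID_MODELS}")
    if requested == "paddleocr_vl" and not vl_available:
        logger.warning("PaddleOCRVL not available; falling back to PP-OCRv5")
        return "ppocr_v5"
    return requested
```

In the script, `vl_available` would presumably be the `PADDLEOCRVL_AVAILABLE` flag shown under Key Configuration.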

Key Configuration

# In test_accuracy_batch_full.py
import os

# OCR Model Selection (via environment variable or command line)
OCR_MODEL = os.environ.get("OCR_MODEL", "ppocr_v5")

# Check PaddleOCRVL availability
try:
    from paddleocr import PaddleOCRVL
    PADDLEOCRVL_AVAILABLE = True
except ImportError:
    PADDLEOCRVL_AVAILABLE = False

Troubleshooting

Issue: "PaddleOCRVL not available"

Solution:

pip install "paddleocr[doc-parser]"

Issue: "use_seal_recognition or use_ocr_for_image_block not enabled"

Solution: Make sure to initialize with correct parameters:

pipeline = PaddleOCRVL(
    use_seal_recognition=True,    # Required!
    use_ocr_for_image_block=True  # Required!
)

Issue: PaddlePaddle 3.3.0 compatibility error

Solution: Downgrade to 3.2.0 (the command below uses the CPU package index):

pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
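
To confirm which version is actually installed before or after the downgrade, the standard-library `importlib.metadata` module can be queried. This is a sketch: both helper names are hypothetical, and it assumes the distribution is installed under the name `paddlepaddle` (a GPU build may be published under a different name).

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "paddlepaddle") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_recommended_version(version: Optional[str], recommended: str = "3.2.0") -> bool:
    """Check the version against the release this guide recommends."""
    if version is None:
        return False
    # Ignore any local build suffix such as "+cpu".
    return version.split("+")[0] == recommended


if __name__ == "__main__":
    found = installed_version()
    print(f"paddlepaddle: {found}, recommended: {is_recommended_version(found)}")
```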

File Structure

test_accuracy_batch_full.py
├── run_ocr_recognition()           # PP-OCRv5 recognition (existing)
├── run_ocr_recognition_vl()        # PaddleOCRVL recognition (new)
├── extract_seals_and_institutions() # Enhanced with model selection
└── main()                          # Added CLI argument parsing

Recommendations

  1. For production use: Prefer PaddleOCRVL, which was more accurate in the single-file test above
  2. For testing/debugging: Use PP-OCRv5 for faster iteration
  3. For batch processing: Expect PaddleOCRVL to be slower but more accurate; processing time has not been benchmarked yet

Next Steps

  • Run full batch test with PaddleOCRVL on all PDFs
  • Compare accuracy metrics between models
  • Benchmark processing time for both models
  • Consider adding a hybrid approach (try PP-OCRv5 first, fall back to PaddleOCRVL on low confidence)
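
The hybrid idea in the last item could look roughly like the sketch below. Everything here is hypothetical: `recognize_hybrid`, the `(text, confidence)` return shape of the recognizers, and the 0.9 threshold are placeholders. In the comparison above, PP-OCRv5 scored 0.8291 on an incorrect result, so a useful threshold would have to sit above that value and should be tuned on the full batch.

```python
from typing import Callable, Optional, Tuple

# Assumed recognizer shape: image path -> (text, confidence or None)
Recognizer = Callable[[str], Tuple[str, Optional[float]]]


def recognize_hybrid(
    image_path: str,
    fast: Recognizer,
    accurate: Recognizer,
    min_confidence: float = 0.9,  # placeholder threshold, not a measured value
) -> Tuple[str, Optional[float], str]:
    """Try the fast model first; use the accurate model on low confidence.

    Returns (text, confidence, name of the model that produced the text).
    """
    text, score = fast(image_path)
    if text and score is not None and score >= min_confidence:
        return text, score, "ppocr_v5"
    # Empty text, missing score, or low score: pay for the slower model.
    text, score = accurate(image_path)
    return text, score, "paddleocr_vl"
```

Passing the recognizers in as callables keeps the routing logic independent of either OCR library, so it can be exercised with stub functions before wiring in the real models.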