PaddleOCRVL Integration Guide
Overview
test_accuracy_batch_full.py now supports two OCR models for seal text recognition:
- PP-OCRv5_server_rec (default) - Traditional OCR model
- PaddleOCRVL - Vision-Language model; more accurate on the one PDF tested so far (see Performance Comparison)
Usage
Option 1: Command Line Arguments
# Use default PP-OCRv5 model
python test_accuracy_batch_full.py
# Use PaddleOCRVL model (recommended for better accuracy)
python test_accuracy_batch_full.py --ocr-model paddleocr_vl
# Process a specific number of PDFs
python test_accuracy_batch_full.py --batch-size 5 --ocr-model paddleocr_vl
Option 2: Environment Variable
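# Set environment variable on Linux/macOS
export OCR_MODEL=paddleocr_vl
# Set environment variable on Windows (cmd.exe); keep the line free of trailing comments, which cmd would store as part of the value
set OCR_MODEL=paddleocr_vl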
# Run script (will use environment variable)
python test_accuracy_batch_full.py
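The two options above could be resolved in one place. The following is a minimal sketch, not the script's actual code: it assumes an explicit --ocr-model flag overrides OCR_MODEL and that ppocr_v5 is the default. The function name resolve_cli_options and its env parameter are hypothetical; only --ocr-model, --batch-size, and OCR_MODEL come from this guide.

```python
import argparse
import os

VALID_MODELS = ("ppocr_v5", "paddleocr_vl")


def resolve_cli_options(argv=None, env=None):
    """Parse CLI options, using OCR_MODEL from the environment as the default.

    Assumption: an explicit --ocr-model flag overrides the environment variable.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="Seal OCR batch accuracy test")
    parser.add_argument(
        "--ocr-model",
        choices=VALID_MODELS,
        default=env.get("OCR_MODEL", "ppocr_v5"),
        help="OCR model used for seal text recognition",
    )
    parser.add_argument(
        "--batch-size",
        type=int,
        default=None,  # hypothetical meaning: None = process every PDF
        help="Number of PDFs to process",
    )
    args = parser.parse_args(argv)
    # Validate explicitly so a mistyped OCR_MODEL value is also rejected.
    if args.ocr_model not in VALID_MODELS:
        parser.error(f"unknown OCR model: {args.ocr_model!r}")
    return args
```

Passing argv and env explicitly keeps the function easy to exercise without touching the real process environment.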
Performance Comparison
Based on a single test PDF (WTS2025-21283.pdf):
| Model | Recognized Text | Accuracy | Score |
|---|---|---|---|
| PP-OCRv5_server_rec | 械检测技术有限公司 | 84.2% | 0.8291 |
| PaddleOCRVL | 威凯检测技术有限公司 | 100% ✅ | N/A |
Requirements
For PaddleOCRVL, ensure you have:
pip install "paddleocr[doc-parser]" # Quotes keep shells such as zsh from expanding the brackets
pip install paddlepaddle==3.2.0 # Use 3.2.0, not 3.3.0
API Usage
In your own code:
from paddleocr import PaddleOCRVL
import json
# Initialize PaddleOCRVL with seal recognition
pipeline = PaddleOCRVL(
    use_seal_recognition=True,
    use_ocr_for_image_block=True,
    use_layout_detection=True
)
# Run prediction on an unwarped seal image
output = pipeline.predict("seal_unwarp_0.png")
# Extract seal text from result
result = output[0]
result.save_to_json(save_path="output")
# Read JSON to get seal text
with open("output/seal_unwarp_0_res.json", 'r', encoding='utf-8') as f:
    data = json.load(f)
for block in data['parsing_res_list']:
    if block['block_label'] == 'seal':
        seal_text = block['block_content']
        print(f"Seal text: {seal_text}")
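The JSON-reading step can be isolated into a small helper that works on the already-loaded dictionary. This is a sketch under the assumption that the result JSON has the parsing_res_list, block_label, and block_content keys used above; the helper name extract_seal_texts and the sample data are hypothetical and only illustrate the shape.

```python
def extract_seal_texts(data):
    """Return the text of every block labelled 'seal' in a parsed result dict."""
    texts = []
    for block in data.get("parsing_res_list", []):
        if block.get("block_label") == "seal":
            content = block.get("block_content", "")
            if content:  # skip seal blocks with no recognized text
                texts.append(content)
    return texts


# Made-up sample mirroring the keys used in this guide (not real pipeline output).
sample = {
    "parsing_res_list": [
        {"block_label": "text", "block_content": "placeholder body text"},
        {"block_label": "seal", "block_content": "威凯检测技术有限公司"},
    ]
}
print(extract_seal_texts(sample))
```

Using .get() instead of direct indexing means a result with no seal block yields an empty list rather than a KeyError.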
Implementation Details
Modified Functions
- run_ocr_recognition_vl() - New function for PaddleOCRVL recognition
  - Saves temp JSON files
  - Extracts block_content from seal blocks
  - Returns standardized result format
- extract_seals_and_institutions() - Enhanced with OCR model selection
  - Added ocr_model parameter ("ppocr_v5" or "paddleocr_vl")
  - Added vl_pipeline parameter for PaddleOCRVL instance
  - Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable
- process_single_pdf() - Updated to pass OCR model parameters
- main() - Added command line argument parsing
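The automatic fallback described for extract_seals_and_institutions() might look roughly like the sketch below. It is not the script's code: the function name select_ocr_model and the vl_available argument are hypothetical, and only the model names "ppocr_v5" and "paddleocr_vl" come from this guide.

```python
import logging

logger = logging.getLogger(__name__)


def select_ocr_model(requested, vl_available):
    """Return the model name to use, falling back to PP-OCRv5 when needed.

    requested: "ppocr_v5" or "paddleocr_vl"
    vl_available: whether PaddleOCRVL could be imported
    """
    if requested == "paddleocr_vl" and not vl_available:
        logger.warning("PaddleOCRVL not available; falling back to ppocr_v5")
        return "ppocr_v5"
    return requested
```

Deciding the effective model once, before any PDF is processed, keeps the per-seal recognition code free of availability checks.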
Key Configuration
# In test_accuracy_batch_full.py
import os
# OCR Model Selection (via environment variable or command line)
OCR_MODEL = os.environ.get("OCR_MODEL", "ppocr_v5")
# Check PaddleOCRVL availability
try:
    from paddleocr import PaddleOCRVL
    PADDLEOCRVL_AVAILABLE = True
except ImportError:
    PADDLEOCRVL_AVAILABLE = False
Troubleshooting
Issue: "PaddleOCRVL not available"
Solution:
pip install "paddleocr[doc-parser]"
Issue: "use_seal_recognition or use_ocr_for_image_block not enabled"
Solution: Make sure to initialize with correct parameters:
pipeline = PaddleOCRVL(
    use_seal_recognition=True, # Required!
    use_ocr_for_image_block=True # Required!
)
Issue: PaddlePaddle 3.3.0 compatibility error
Solution: Downgrade to 3.2.0:
pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
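A quick preflight check can surface the version problem before a long batch run. The sketch below is an assumption-laden illustration: it only inspects the paddlepaddle distribution name (a GPU build may be installed under a different name), and the labels it returns are hypothetical. The only facts taken from this guide are that 3.2.0 is recommended and 3.3.0 is reported as incompatible.

```python
from importlib import metadata

RECOMMENDED = "3.2.0"  # version this guide recommends
KNOWN_BAD = "3.3.0"    # version this guide reports as incompatible


def classify_paddle_version(version):
    """Classify a paddlepaddle version string against this guide's advice."""
    if version is None:
        return "missing"
    if version == RECOMMENDED:
        return "ok"
    if version == KNOWN_BAD:
        return "known-bad"
    return "untested"  # the guide says nothing about other versions


def installed_paddle_version():
    """Return the installed paddlepaddle version, or None if it is not installed."""
    try:
        return metadata.version("paddlepaddle")
    except metadata.PackageNotFoundError:
        return None


print(classify_paddle_version(installed_paddle_version()))
```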
File Structure
test_accuracy_batch_full.py
├── run_ocr_recognition() # PP-OCRv5 recognition (existing)
├── run_ocr_recognition_vl() # PaddleOCRVL recognition (new)
├── extract_seals_and_institutions() # Enhanced with model selection
└── main() # Added CLI argument parsing
Recommendations
- For production use: Use PaddleOCRVL for better accuracy
- For testing/debugging: Use PP-OCRv5 for faster iteration
- For batch processing: PaddleOCRVL is slower but more accurate
Next Steps
- Run full batch test with PaddleOCRVL on all PDFs
- Compare accuracy metrics between models
- Benchmark processing time for both models
- Consider adding hybrid approach (try PP-OCRv5 first, fallback to PaddleOCRVL on low confidence)
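The hybrid approach in the last item could be sketched as follows. Everything here is hypothetical: the function name, the callable signatures for the two recognizers, and the 0.9 threshold are illustrative assumptions, not part of test_accuracy_batch_full.py. With that threshold, the 0.8291 score from the comparison table would trigger the fallback.

```python
def recognize_with_fallback(image_path, run_fast, run_vl, min_score=0.9):
    """Try the fast recognizer first; use the VL recognizer on low confidence.

    run_fast(image_path) -> (text, score)   # stand-in for PP-OCRv5 recognition
    run_vl(image_path)   -> text            # stand-in for PaddleOCRVL recognition
    Returns (text, model_name).
    """
    text, score = run_fast(image_path)
    if text and score >= min_score:
        return text, "ppocr_v5"
    # Low confidence or empty result: pay the cost of the slower model.
    return run_vl(image_path), "paddleocr_vl"


# Stub recognizers so the sketch runs without PaddleOCR installed.
def fake_fast(path):
    return "械检测技术有限公司", 0.8291


def fake_vl(path):
    return "威凯检测技术有限公司"


print(recognize_with_fallback("seal_unwarp_0.png", fake_fast, fake_vl))
```

Injecting the recognizers as callables keeps the fallback logic testable on its own; a suitable threshold would need to come from the batch comparison listed above.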