# PaddleOCRVL Integration Guide

## Overview

`test_accuracy_batch_full.py` now supports two OCR models for seal text recognition:

1. **PP-OCRv5_server_rec** (default) - Traditional OCR model
2. **PaddleOCRVL** - Vision-Language model with higher accuracy on the test case below

## Usage

### Option 1: Command Line Arguments

```bash
# Use default PP-OCRv5 model
python test_accuracy_batch_full.py

# Use PaddleOCRVL model (recommended for better accuracy)
python test_accuracy_batch_full.py --ocr-model paddleocr_vl

# Process a specific number of PDFs
python test_accuracy_batch_full.py --batch-size 5 --ocr-model paddleocr_vl
```

### Option 2: Environment Variable

```bash
# Linux/Mac
export OCR_MODEL=paddleocr_vl

# Windows (cmd.exe): run this instead of the export line.
# Do not put a trailing comment on the same line; cmd would store it in the value.
# set OCR_MODEL=paddleocr_vl

# Run script (will use environment variable)
python test_accuracy_batch_full.py
```

## Performance Comparison

Results on a single test file, `WTS2025-21283.pdf`:

| Model | Recognized Text | Accuracy | Score |
|-------|----------------|----------|-------|
| PP-OCRv5_server_rec | 械检测技术有限公司 | 84.2% | 0.8291 |
| **PaddleOCRVL** | **威凯检测技术有限公司** | **100%** ✅ | N/A |

## Requirements

For PaddleOCRVL, ensure you have:

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "paddleocr[doc-parser]"
pip install paddlepaddle==3.2.0  # Use 3.2.0, not 3.3.0
```

## API Usage

### In your own code:

```python
from paddleocr import PaddleOCRVL
import json

# Initialize PaddleOCRVL with seal recognition
pipeline = PaddleOCRVL(
    use_seal_recognition=True,
    use_ocr_for_image_block=True,
    use_layout_detection=True
)

# Run prediction on the unwarped seal image
output = pipeline.predict("seal_unwarp_0.png")

# Save the first result as JSON under output/
result = output[0]
result.save_to_json(save_path="output")

# Read the JSON and print the content of every seal block
with open("output/seal_unwarp_0_res.json", 'r', encoding='utf-8') as f:
    data = json.load(f)

for block in data['parsing_res_list']:
    if block['block_label'] == 'seal':
        seal_text = block['block_content']
        print(f"Seal text: {seal_text}")
```

## Implementation Details

### Modified Functions

1. **`run_ocr_recognition_vl()`** - New function for PaddleOCRVL recognition
   - Saves temp JSON files
   - Extracts `block_content` from `seal` blocks
   - Returns standardized result format

2. **`extract_seals_and_institutions()`** - Enhanced with OCR model selection
   - Added `ocr_model` parameter ("ppocr_v5" or "paddleocr_vl")
   - Added `vl_pipeline` parameter for PaddleOCRVL instance
   - Automatic fallback to PP-OCRv5 if PaddleOCRVL is unavailable

3. **`process_single_pdf()`** - Updated to pass OCR model parameters

4. **`main()`** - Added command line argument parsing

### Key Configuration

```python
# In test_accuracy_batch_full.py
import os

# OCR Model Selection (via environment variable or command line)
OCR_MODEL = os.environ.get("OCR_MODEL", "ppocr_v5")

# Check PaddleOCRVL availability
try:
    from paddleocr import PaddleOCRVL
    PADDLEOCRVL_AVAILABLE = True
except ImportError:
    PADDLEOCRVL_AVAILABLE = False
```

## Troubleshooting

### Issue: "PaddleOCRVL not available"

**Solution:**

```bash
pip install "paddleocr[doc-parser]"
```

### Issue: "use_seal_recognition or use_ocr_for_image_block not enabled"

**Solution:** Make sure to initialize with the correct parameters:

```python
pipeline = PaddleOCRVL(
    use_seal_recognition=True,     # Required!
    use_ocr_for_image_block=True   # Required!
)
```

### Issue: PaddlePaddle 3.3.0 compatibility error

**Solution:** Downgrade to 3.2.0:

```bash
pip install paddlepaddle==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cpu/
```

## File Structure

```
test_accuracy_batch_full.py
├── run_ocr_recognition()             # PP-OCRv5 recognition (existing)
├── run_ocr_recognition_vl()          # PaddleOCRVL recognition (new)
├── extract_seals_and_institutions()  # Enhanced with model selection
└── main()                            # Added CLI argument parsing
```

## Recommendations

1. **For production use**: Use PaddleOCRVL for better accuracy
2. **For testing/debugging**: Use PP-OCRv5 for faster iteration
3. **For batch processing**: PaddleOCRVL is slower but more accurate

## Next Steps

- [ ] Run full batch test with PaddleOCRVL on all PDFs
- [ ] Compare accuracy metrics between models
- [ ] Benchmark processing time for both models
- [ ] Consider adding a hybrid approach (try PP-OCRv5 first, fall back to PaddleOCRVL on low confidence)
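
The hybrid approach in the last item could look roughly like the sketch below. It is not part of `test_accuracy_batch_full.py`: `recognize_with_fallback`, the `(text, confidence)` result shape, and the `0.9` threshold are all hypothetical, and it assumes the Score column in the comparison table is a recognition confidence. The two recognizers are passed in as plain callables, so the stubs here can be swapped for the real PP-OCRv5 and PaddleOCRVL functions.

```python
from typing import Callable, Optional, Tuple

# Hypothetical result shape: (recognized_text, confidence in [0, 1] or None).
OcrResult = Tuple[str, Optional[float]]


def recognize_with_fallback(
    image_path: str,
    fast_ocr: Callable[[str], OcrResult],
    accurate_ocr: Callable[[str], OcrResult],
    min_confidence: float = 0.9,  # assumed threshold, tune on real data
) -> Tuple[str, str]:
    """Return (text, model_name), preferring the fast model when it is confident."""
    text, score = fast_ocr(image_path)
    if text and score is not None and score >= min_confidence:
        return text, "ppocr_v5"

    # Low confidence, missing score, or empty text: retry with the slower model.
    vl_text, _ = accurate_ocr(image_path)
    if vl_text:
        return vl_text, "paddleocr_vl"

    # Keep the fast result if the fallback produced nothing.
    return text, "ppocr_v5"


# Stub recognizers standing in for the real functions. They return the
# values reported in the comparison table for WTS2025-21283.pdf.
def fake_fast(path: str) -> OcrResult:
    return "械检测技术有限公司", 0.8291


def fake_accurate(path: str) -> OcrResult:
    return "威凯检测技术有限公司", None


print(recognize_with_fallback("seal_unwarp_0.png", fake_fast, fake_accurate))
```

With these stubs the fast score falls below the threshold, so the PaddleOCRVL result is returned. Only low-confidence seals pay the cost of the slower model, which is the point of the hybrid design; the right threshold would have to come from the pending batch comparison.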
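
For reference, the model selection described under Usage and Implementation Details can be sketched as a single function. This is an illustration under assumptions, not the script's actual code: `resolve_ocr_model` and its `vl_available` parameter are hypothetical names, and the precedence (the `--ocr-model` flag wins over `OCR_MODEL`) is assumed, since the guide does not state it.

```python
import argparse
import os
from typing import List, Optional

VALID_MODELS = ("ppocr_v5", "paddleocr_vl")


def resolve_ocr_model(argv: Optional[List[str]] = None, vl_available: bool = True) -> str:
    """Pick the OCR model from CLI arguments, then OCR_MODEL, then the default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ocr-model", choices=VALID_MODELS, default=None)
    parser.add_argument("--batch-size", type=int, default=None)
    args = parser.parse_args(argv)

    # Assumed precedence: explicit flag, then environment variable, then default.
    model = args.ocr_model or os.environ.get("OCR_MODEL", "ppocr_v5")
    if model not in VALID_MODELS:
        raise ValueError(f"Unknown OCR model: {model!r}")

    # Fall back to PP-OCRv5 when PaddleOCRVL could not be imported.
    if model == "paddleocr_vl" and not vl_available:
        model = "ppocr_v5"
    return model


print(resolve_ocr_model(["--ocr-model", "paddleocr_vl"]))
```

In the script itself, `vl_available` would correspond to the `PADDLEOCRVL_AVAILABLE` flag set by the import check in Key Configuration.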