Key improvements:
1. Double verification mechanism for OCR failures
   - When OCR on the unwarped image fails (empty text), automatically retry with PaddleOCRVL backup on the crop
   - Fixes issue where the correct seal was ignored due to distortion in the unwarped image
   - Test result: 4% → 93.8% similarity on problematic PDFs
2. Institution name cleaning
   - Remove unwanted suffixes: 检验检测专用章 ("special seal for inspection and testing"), 专用章 ("special seal"), etc.
   - Clean names before adding them to results and before similarity calculation
   - Improves matching accuracy
3. Enhanced logging for institution selection
   - Show all extracted institutions with their similarity scores
   - Record why a specific institution was selected
   - Easier debugging and more transparency
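Items 2 and 3 could look roughly like the sketch below. It is illustrative only: clean_institution_name(), select_institution(), the logger name, and the use of difflib.SequenceMatcher as the similarity metric are assumptions, not the code in this commit. The suffix tuple holds only the two suffixes named above.

```python
import logging
from difflib import SequenceMatcher

logger = logging.getLogger("seal_extraction")  # hypothetical logger name

# Only the two suffixes named in this message; the real list is longer ("etc.").
# The longer suffix is listed first so it is removed as a whole.
SEAL_SUFFIXES = ("检验检测专用章", "专用章")


def clean_institution_name(name):
    """Strip surrounding whitespace and one trailing seal suffix, if present."""
    name = name.strip()
    for suffix in SEAL_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)].strip()
    return name


def select_institution(candidates, expected):
    """Return (cleaned_name, score) for the best match, or None if no candidates.

    Every candidate is logged with its score so the final choice can be traced.
    """
    scored = []
    for raw in candidates:
        cleaned = clean_institution_name(raw)
        # Assumed metric: difflib ratio in [0, 1]; the project may use another one.
        score = SequenceMatcher(None, cleaned, expected).ratio()
        logger.info("candidate %r -> %r, similarity %.1f%%", raw, cleaned, score * 100)
        scored.append((cleaned, score))
    if not scored:
        return None
    best = max(scored, key=lambda item: item[1])
    logger.info("selected %r (highest similarity %.1f%%)", best[0], best[1] * 100)
    return best
```

Cleaning happens before scoring, so a seal suffix cannot lower the similarity of an otherwise exact match.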
Example impact:
- Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity)
- After: "中科测试技术(广东)集团有限公司" (correct seal, 93.8% similarity)
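A minimal sketch of the double verification flow from item 1, under stated assumptions: recognize_seal_text() and the two OCR callables are hypothetical stand-ins, not the functions in this commit. The only behavior taken from the message is "empty text from the unwarped image triggers a retry on the crop".

```python
from typing import Callable, Optional, Tuple


def recognize_seal_text(
    unwarped_image: object,
    crop_image: object,
    primary_ocr: Callable[[object], Optional[str]],
    backup_ocr: Callable[[object], Optional[str]],
) -> Tuple[str, str]:
    """Return (text, source), where source is 'unwarp' or 'crop_backup'."""
    # First attempt: OCR on the polar-unwarped seal image.
    text = (primary_ocr(unwarped_image) or "").strip()
    if text:
        return text, "unwarp"
    # Empty result: the unwarp may be distorted, so retry on the untouched crop.
    backup_text = (backup_ocr(crop_image) or "").strip()
    return backup_text, "crop_backup"
```

Returning the source alongside the text lets the caller log which path produced the result.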
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add extent limit (max 350°) to prevent polar unwarp distortion
- Add polygon count check (<3 polygons → use PaddleOCRVL backup)
- Add imwrite_safe() to handle Chinese paths on Windows
- Add --pdf-names parameter for targeted debugging
Fixes issue where seal extraction returned empty string when:
- Arc extent exceeded 360° causing severe image distortion
- Too few text polygons detected leading to inaccurate arc calculation
Test results:
- Before: 0% similarity (empty string)
- After: 52.4% similarity (partial extraction)
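The two guards described above (extent limit and polygon count check) can be sketched as follows. The constants come from this message; clamp_extent() and choose_strategy() are hypothetical names, and the real code may compute or apply the extent differently.

```python
MAX_EXTENT_DEG = 350.0  # extent limit stated in this commit
MIN_POLYGONS = 3        # fewer text polygons than this -> use the backup path


def clamp_extent(extent_deg, max_extent=MAX_EXTENT_DEG):
    """Cap the arc extent so the polar unwarp never covers a full circle or more."""
    return min(extent_deg, max_extent)


def choose_strategy(polygons, extent_deg):
    """Return ('backup', None) or ('unwarp', clamped_extent_in_degrees)."""
    if len(polygons) < MIN_POLYGONS:
        # Too few polygons to estimate the arc reliably; skip the unwarp.
        return "backup", None
    return "unwarp", clamp_extent(extent_deg)
```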
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add PaddleOCRVL as optional OCR model for seal text recognition
  - New parameter: --ocr-model {ppocr_v5,paddleocr_vl}
  - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5)
  - Backward compatible: defaults to PP-OCRv5
- Fix CMA recognition regression
  - Ensure ocr_engine is always initialized for CMA extraction
  - PaddleOCRVL only used for seal text, not CMA recognition
- Add comprehensive integration guide
  - PADDLEOCRVL_INTEGRATION.md with usage examples
  - test_paddleocr_vl_quick.py for validation
Implementation details:
- run_ocr_recognition_vl(): New function for PaddleOCRVL recognition
- extract_seals_and_institutions(): Enhanced with OCR model selection
- Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable
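For illustration, the model selection and fallback might be wired up along these lines. The flag name, choices, and default are taken from this message; build_parser(), resolve_ocr_model(), and the vl_available boolean are assumptions, and how the real code detects PaddleOCRVL availability is not shown here.

```python
import argparse


def build_parser():
    """Parser exposing only the --ocr-model flag described in this commit."""
    parser = argparse.ArgumentParser(description="Seal text extraction (sketch)")
    parser.add_argument(
        "--ocr-model",
        choices=["ppocr_v5", "paddleocr_vl"],
        default="ppocr_v5",  # backward compatible default
        help="OCR model used for seal text recognition",
    )
    return parser


def resolve_ocr_model(requested, vl_available):
    """Fall back to PP-OCRv5 when PaddleOCRVL was requested but is unavailable."""
    if requested == "paddleocr_vl" and not vl_available:
        return "ppocr_v5"
    return requested
```

Usage: `resolve_ocr_model(build_parser().parse_args().ocr_model, vl_available)`, where the caller decides vl_available, for example by attempting the import once at startup.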
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>