黄仁欢
|
5baf0ac18e
|
fix(cma): implement robust CMA code extraction with fallback mechanism
Add comprehensive CMA code extraction module with template matching
primary method and full-page OCR fallback to handle various PDF formats.
Key improvements:
- Add cma_extraction_template_primary.py module
- Support 11-12 digit CMA codes (prioritize 12-digit matches)
- Implement template matching + ROI OCR as primary method
- Add full-page OCR fallback when template matching fails
- Fix critical bug where low template match confidence prevented fallback
- Improve scoring algorithm considering position, confidence, and format
Fixed issues:
- YDQ23_001838.pdf: Extracts 210020349096 (12-digit code)
- WTS2025-21283.pdf: Extracts 220020349627 (12-digit code)
- Both PDFs now use fullpage_fallback successfully
Technical details:
- Template match threshold: 0.4 confidence
- ROI calculation: extends rightward from logo center
- Fallback triggers on: template load failure, match failure, or low confidence
- Scoring weights: confidence*100 + starts_with_2*50 + top_right*30
Co-Authored-By: Claude Code <noreply@anthropic.com>
|
2026-02-16 14:16:34 +08:00 |