Commit Graph

6 Commits

Author SHA1 Message Date
黄仁欢 9f701edd25 fix(test): improve institution name matching by cleaning trailing numbers
Add smart institution name cleaning to handle OCR artifacts like trailing
CMA codes that cause false negative matches.

Problem:
- For PDF "重庆市财政局..._pages3-6.pdf", OCR extracted the institution name with a trailing CMA code
- "四川合泰与必摩适检测有限公司430334" vs "四川合泰与必摩适检测有限公司"
- Similarity: 70.0% → incorrectly classified as "no_match"
- The core institution name is actually identical

Solution:
- Add clean_institution_name() function to remove trailing artifacts:
  * Remove trailing runs of 6+ digits (CMA codes)
  * Remove trailing runs of 11+ digits (full CMA codes)
  * Remove trailing punctuation and whitespace
- Enhance classify_match() with field_type parameter
- Apply cleaning for institution field comparisons

Results for test case:
- Before: 70.0% similarity, edit distance 6 → "no_match"
- After: 100.0% similarity, edit distance 0 → "exact"

This fix improves accuracy for cases where OCR accidentally captures
CMA codes or other numbers as part of the institution name.
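
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the committed code: the regex, the punctuation set, and the helpers edit_distance() and similarity() are assumptions. Similarity is taken here as 1 - distance / longer length, which reproduces the 70.0% and 100.0% figures quoted in this message, though the project may compute it differently.

```python
import re

# Assumed pattern: a run of 6 or more digits at the very end of the name.
# This also covers the 11+ digit case listed in the commit message.
_TRAILING_DIGITS = re.compile(r"\d{6,}$")
# Assumed set of trailing characters to strip (ASCII and CJK punctuation).
_TRAILING_JUNK = " \t\r\n.,;:、。，；：·-_"


def clean_institution_name(name: str) -> str:
    """Strip trailing digit runs, punctuation, and whitespace from a name."""
    cleaned = name.rstrip(_TRAILING_JUNK)
    cleaned = _TRAILING_DIGITS.sub("", cleaned)
    return cleaned.rstrip(_TRAILING_JUNK)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed metric: 1 - edit distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


extracted = "四川合泰与必摩适检测有限公司430334"
expected = "四川合泰与必摩适检测有限公司"
before = similarity(extracted, expected)
after = similarity(clean_institution_name(extracted), expected)
```

Under these assumptions, `before` is about 0.70 (edit distance 6 over 20 characters) and `after` is 1.0.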

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-16 14:51:28 +08:00
黄仁欢 5baf0ac18e fix(cma): implement robust CMA code extraction with fallback mechanism
Add comprehensive CMA code extraction module with template matching
primary method and full-page OCR fallback to handle various PDF formats.

Key improvements:
- Add cma_extraction_template_primary.py module
- Support 11-12 digit CMA codes (prioritize 12-digit matches)
- Implement template matching + ROI OCR as primary method
- Add full-page OCR fallback when template matching fails
- Fix critical bug where low template match confidence prevented fallback
- Improve scoring algorithm considering position, confidence, and format
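
As a rough illustration of the "11-12 digit, prefer 12-digit" rule, candidate codes could be pulled out of OCR text like this. The function name find_cma_candidates() and the exact regex are hypothetical; the module's real pattern is not shown in this commit message.

```python
import re

# Assumed pattern: a standalone run of exactly 11 or 12 digits.
# The lookarounds reject slices of longer digit runs.
_CMA_RUN = re.compile(r"(?<!\d)\d{11,12}(?!\d)")


def find_cma_candidates(text: str) -> list:
    """Return 11-12 digit runs, with 12-digit matches listed first."""
    runs = _CMA_RUN.findall(text)
    # sorted() is stable, so runs of equal length keep their reading order.
    return sorted(runs, key=len, reverse=True)
```

Digit runs longer than 12 characters are ignored in this sketch rather than truncated, which is a design assumption.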

Fixed issues:
- YDQ23_001838.pdf: Extracts 210020349096 (12-digit code)
- WTS2025-21283.pdf: Extracts 220020349627 (12-digit code)
- Both PDFs now use fullpage_fallback successfully

Technical details:
- Template match threshold: 0.4 confidence
- ROI calculation: extends rightward from logo center
- Fallback triggers on: template load failure, match failure, or low confidence
- Scoring weights: confidence*100 + starts_with_2*50 + top_right*30
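
The fallback trigger and scoring weights listed above might look like the sketch below. The Candidate type and the function names are hypothetical, and how top_right is decided from the bounding box is left out; only the 0.4 threshold and the 100/50/30 weights come from this message.

```python
from dataclasses import dataclass
from typing import List, Optional

TEMPLATE_MATCH_THRESHOLD = 0.4  # threshold quoted in the commit message


@dataclass
class Candidate:
    text: str          # digit string read by OCR
    confidence: float  # OCR confidence in [0, 1]
    top_right: bool    # assumed flag: box lies in the top-right page region


def needs_fullpage_fallback(template_loaded: bool, match_score: Optional[float]) -> bool:
    """Fall back on template load failure, match failure, or low confidence."""
    if not template_loaded or match_score is None:
        return True
    return match_score < TEMPLATE_MATCH_THRESHOLD


def score_candidate(c: Candidate) -> float:
    """confidence*100 + starts_with_2*50 + top_right*30."""
    starts_with_2 = c.text.startswith("2")
    return c.confidence * 100 + (50 if starts_with_2 else 0) + (30 if c.top_right else 0)


def pick_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if there are none."""
    return max(candidates, key=score_candidate, default=None)
```

Treating a low match score the same as a failed match is the behaviour this commit describes as the bug fix: the low-confidence path now reaches the fallback.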

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-16 14:16:34 +08:00
黄仁欢 49c2e0f3f9 feat: integrate CMA template matching as fallback extraction method
- Add cv2.matchTemplate-based CMA logo detection functions
- Implement automatic fallback when primary OCR extraction fails or has low confidence (<0.6)
- Add dual-format OCR result parsing (legacy ocr() and predict() API)
- Fix PaddleOCR API compatibility (remove unsupported cls kwarg)
- Record extraction method in cma_method field (robust_ocr or template_matching)
- Generate debug ROI image (cma_template_match_roi.png) for verification
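
A minimal sketch of the fallback flow described in these bullets, assuming each extractor is a callable returning (code, confidence). The names extract_cma(), cma_code, and cma_confidence are hypothetical; only the 0.6 threshold and the cma_method values come from this message.

```python
from typing import Callable, Optional, Tuple

LOW_CONFIDENCE = 0.6  # threshold quoted in the commit message

# Assumed extractor shape: returns (code or None, confidence).
Extractor = Callable[[], Tuple[Optional[str], float]]


def extract_cma(primary: Extractor, template_fallback: Extractor) -> dict:
    """Run primary OCR; use template matching if it fails or is low confidence."""
    code, conf = primary()
    if code and conf >= LOW_CONFIDENCE:
        return {"cma_code": code, "cma_confidence": conf, "cma_method": "robust_ocr"}

    fb_code, fb_conf = template_fallback()
    if fb_code:
        return {"cma_code": fb_code, "cma_confidence": fb_conf, "cma_method": "template_matching"}

    # Assumption: if the fallback finds nothing, keep the primary result as-is.
    return {"cma_code": code, "cma_confidence": conf, "cma_method": "robust_ocr"}
```

Passing the extractors in as callables keeps the fallback from running unless it is actually needed.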
2026-02-12 13:29:48 +08:00
黄仁欢 52f283c7c9 feat(seal): add double verification and institution name cleaning
Key improvements:
1. Double verification mechanism for OCR failures
   - When unwarp OCR fails (empty text), automatically retry with the PaddleOCRVL backup on the seal crop
   - Fixes issue where correct seal was ignored due to unwarp image distortion
   - Test result: 4% → 93.8% similarity on problematic PDFs

2. Institution name cleaning
   - Remove unwanted suffixes: 检验检测专用章 ("special seal for inspection and testing"), 专用章 ("special seal"), etc.
   - Clean names before adding to results and similarity calculation
   - Improves matching accuracy

3. Enhanced logging for institution selection
   - Show all extracted institutions with similarity scores
   - Track why specific institution was selected
   - Better debugging and transparency

Example impact:
- Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity)
- After: "中科测试技术(广东)集团有限公司" (correct seal, 93.8% similarity)
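
The double verification and suffix cleaning could be sketched as below. Both function names are hypothetical, the suffix tuple contains only the two suffixes named in this message (the real list is longer, per "etc."), and the OCR steps are passed in as plain callables rather than real PaddleOCR calls.

```python
from typing import Callable

# Only the two suffixes named in the commit message; longest first so the
# longer one is not left half-stripped.
SEAL_SUFFIXES = ("检验检测专用章", "专用章")


def strip_seal_suffix(name: str) -> str:
    """Remove one known seal suffix from the end of an institution name."""
    cleaned = name.strip()
    for suffix in SEAL_SUFFIXES:
        if cleaned.endswith(suffix):
            return cleaned[: -len(suffix)].rstrip()
    return cleaned


def ocr_with_backup(unwarp_ocr: Callable[[], str], crop_backup_ocr: Callable[[], str]) -> str:
    """Use the unwarped-image OCR text, or the backup on the crop if it is empty."""
    text = unwarp_ocr()
    if text.strip():
        return strip_seal_suffix(text)
    return strip_seal_suffix(crop_backup_ocr())
```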

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 13:46:56 +08:00
黄仁欢 5a493b8d67 feat(seal): fix seal text extraction for edge cases
- Add extent limit (max 350°) to prevent polar unwarp distortion
- Add polygon count check (<3 polygons → use PaddleOCRVL backup)
- Add imwrite_safe() to handle Chinese paths on Windows
- Add --pdf-names parameter for targeted debugging

Fixes issue where seal extraction returned empty string when:
- Arc extent exceeded 360° causing severe image distortion
- Too few text polygons detected leading to inaccurate arc calculation

Test results:
- Before: 0% similarity (empty string)
- After: 52.4% similarity (partial extraction)
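
The two guards added in this commit might be expressed as follows. The constant and function names are hypothetical, and clamping (rather than rejecting) an over-limit extent is an assumption; only the 350° limit and the "fewer than 3 polygons" rule come from this message.

```python
MAX_ARC_EXTENT_DEG = 350.0  # extent limit quoted in the commit message
MIN_TEXT_POLYGONS = 3       # below this, arc estimation is considered unreliable


def clamp_extent(extent_deg: float) -> float:
    """Cap the arc extent so the polar unwarp never wraps a full circle."""
    return min(extent_deg, MAX_ARC_EXTENT_DEG)


def should_use_vl_backup(polygon_count: int) -> bool:
    """Skip the unwarp path when too few text polygons were detected."""
    return polygon_count < MIN_TEXT_POLYGONS
```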

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 23:13:03 +08:00
黄仁欢 8b416e9f5a feat: integrate PaddleOCRVL for seal text recognition
- Add PaddleOCRVL as optional OCR model for seal text recognition
  - New parameter: --ocr-model {ppocr_v5,paddleocr_vl}
  - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5)
  - Backward compatible: defaults to PP-OCRv5

- Fix CMA recognition regression
  - Ensure ocr_engine is always initialized for CMA extraction
  - PaddleOCRVL only used for seal text, not CMA recognition

- Add comprehensive integration guide
  - PADDLEOCRVL_INTEGRATION.md with usage examples
  - test_paddleocr_vl_quick.py for validation

Implementation details:
- run_ocr_recognition_vl(): New function for PaddleOCRVL recognition
- extract_seals_and_institutions(): Enhanced with OCR model selection
- Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable
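
For illustration, the --ocr-model option and the automatic fallback could be wired up roughly like this. build_parser() and resolve_ocr_model() are hypothetical names, and how availability of PaddleOCRVL is detected is left to the caller; the flag name, its choices, and the default come from this message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI parser with the OCR model switch described in the commit message."""
    parser = argparse.ArgumentParser(description="Seal text recognition (sketch)")
    parser.add_argument(
        "--ocr-model",
        choices=["ppocr_v5", "paddleocr_vl"],
        default="ppocr_v5",  # backward compatible default
        help="OCR model used for seal text recognition",
    )
    return parser


def resolve_ocr_model(requested: str, vl_available: bool) -> str:
    """Fall back to PP-OCRv5 when PaddleOCRVL was requested but is unavailable."""
    if requested == "paddleocr_vl" and not vl_available:
        return "ppocr_v5"
    return requested
```

A caller would parse the arguments, probe for PaddleOCRVL (for example by attempting the import), and pass the result to resolve_ocr_model().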

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 14:03:10 +08:00