report-detect

Commit Graph

Author	SHA1	Message	Date
黄仁欢	9f701edd25	fix(test): improve institution name matching by cleaning trailing numbers Add smart institution name cleaning to handle OCR artifacts like trailing CMA codes that cause false negative matches. Problem: - PDF "重庆市财政局..._pages3-6.pdf" extracted institution with trailing CMA code - "四川合泰与必摩适检测有限公司430334" vs "四川合泰与必摩适检测有限公司" - Similarity: 70.0% → incorrectly classified as "no_match" - The core institution name is actually identical Solution: - Add clean_institution_name() function to remove trailing artifacts: * Remove 6+ digit numbers (CMA codes) * Remove 11+ digit numbers (full CMA codes) * Remove trailing punctuation and whitespace - Enhance classify_match() with field_type parameter - Apply cleaning for institution field comparisons Results for test case: - Before: 70.0% similarity, edit distance 6 → "no_match" - After: 100.0% similarity, edit distance 0 → "exact" This fix improves accuracy for cases where OCR accidentally captures CMA codes or other numbers as part of the institution name. Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-16 14:51:28 +08:00
黄仁欢	5baf0ac18e	fix(cma): implement robust CMA code extraction with fallback mechanism Add comprehensive CMA code extraction module with template matching primary method and full-page OCR fallback to handle various PDF formats. Key improvements: - Add cma_extraction_template_primary.py module - Support 11-12 digit CMA codes (prioritize 12-digit matches) - Implement template matching + ROI OCR as primary method - Add full-page OCR fallback when template matching fails - Fix critical bug where low template match confidence prevented fallback - Improve scoring algorithm considering position, confidence, and format Fixed issues: - YDQ23_001838.pdf: Extracts 210020349096 (12-digit code) - WTS2025-21283.pdf: Extracts 220020349627 (12-digit code) - Both PDFs now use fullpage_fallback successfully Technical details: - Template match threshold: 0.4 confidence - ROI calculation: extends rightward from logo center - Fallback triggers on: template load failure, match failure, or low confidence - Scoring weights: confidence*100 + starts_with_2*50 + top_right*30 Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-16 14:16:34 +08:00
黄仁欢	49c2e0f3f9	feat: integrate CMA template matching as fallback extraction method - Add cv2.matchTemplate-based CMA logo detection functions - Implement automatic fallback when primary OCR extraction fails or has low confidence (<0.6) - Add dual-format OCR result parsing (legacy ocr() and predict() API) - Fix PaddleOCR API compatibility (remove unsupported cls kwarg) - Record extraction method in cma_method field (robust_ocr or template_matching) - Generate debug ROI image (cma_template_match_roi.png) for verification	2026-02-12 13:29:48 +08:00
黄仁欢	52f283c7c9	feat(seal): add double verification and institution name cleaning Key improvements: 1. Double verification mechanism for OCR failures - When unwarp OCR fails (empty text), automatically try PaddleOCRVL backup on crop - Fixes issue where correct seal was ignored due to unwarp image distortion - Test result: 4% → 93.8% similarity on problematic PDFs 2. Institution name cleaning - Remove unwanted suffixes: 检验检测专用章, 专用章, etc. - Clean names before adding to results and similarity calculation - Improves matching accuracy 3. Enhanced logging for institution selection - Show all extracted institutions with similarity scores - Track why specific institution was selected - Better debugging and transparency Example impact: - Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity) - After: "中科测试技术（广东）集团有限公司" (correct seal, 93.8% similarity) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 13:46:56 +08:00
黄仁欢	5a493b8d67	feat(seal): fix seal text extraction for edge cases - Add extent limit (max 350°) to prevent polar unwarp distortion - Add polygon count check (<3 polygons → use PaddleOCRVL backup) - Add imwrite_safe() to handle Chinese paths on Windows - Add --pdf-names parameter for targeted debugging Fixes issue where seal extraction returned empty string when: - Arc extent exceeded 360° causing severe image distortion - Too few text polygons detected leading to inaccurate arc calculation Test results: - Before: 0% similarity (empty string) - After: 52.4% similarity (partial extraction) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-07 23:13:03 +08:00
黄仁欢	8b416e9f5a	feat: integrate PaddleOCRVL for seal text recognition - Add PaddleOCRVL as optional OCR model for seal text recognition - New parameter: --ocr-model {ppocr_v5,paddleocr_vl} - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5) - Backward compatible: defaults to PP-OCRv5 - Fix CMA recognition regression - Ensure ocr_engine is always initialized for CMA extraction - PaddleOCRVL only used for seal text, not CMA recognition - Add comprehensive integration guide - PADDLEOCRVL_INTEGRATION.md with usage examples - test_paddleocr_vl_quick.py for validation Implementation details: - run_ocr_recognition_vl(): New function for PaddleOCRVL recognition - extract_seals_and_institutions(): Enhanced with OCR model selection - Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-07 14:03:10 +08:00
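
The name cleaning described in commit 9f701edd25 can be sketched roughly as follows. This is an illustrative reconstruction, not the repository's code: only the function name clean_institution_name() comes from the commit message, while the regular expression, the punctuation set, and the edit_distance() and similarity() helpers are assumptions. Similarity is assumed here to be 1 minus the Levenshtein distance divided by the longer string's length, which happens to reproduce the 70.0% and 100.0% figures quoted in the commit.

```python
import re

# Assumption: a run of 6 or more digits at the very end is an OCR-captured CMA code.
# This also covers the "11+ digit" case mentioned in the commit message.
_TRAILING_DIGITS = re.compile(r"\d{6,}\Z")
# Assumption: trailing characters treated as noise (ASCII and full-width punctuation).
_TRAILING_JUNK = " \t\r\n.,;:-，。；：、"


def clean_institution_name(name: str) -> str:
    """Strip trailing digit runs, punctuation, and whitespace until nothing changes."""
    cleaned = name.strip()
    previous = None
    while cleaned != previous:
        previous = cleaned
        cleaned = cleaned.rstrip(_TRAILING_JUNK)
        cleaned = _TRAILING_DIGITS.sub("", cleaned)
    return cleaned


def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed metric: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    raw = "四川合泰与必摩适检测有限公司430334"
    ref = "四川合泰与必摩适检测有限公司"
    print(edit_distance(raw, ref), similarity(raw, ref))
    print(similarity(clean_institution_name(raw), ref))
```

Cleaning is applied only to the extracted side in this sketch. Whether the repository also cleans the reference name before comparison is not stated in the commit message.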

6 Commits
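
The candidate selection described in commit 5baf0ac18e might look something like the sketch below. It assumes the scoring weights read as OCR confidence times 100, plus 50 when the code starts with 2, plus 30 when the text sits in the top-right region of the page. The OcrItem type, the regular expression, and the function names are hypothetical and are not taken from cma_extraction_template_primary.py. The preference for 12-digit codes is modelled here as a hard tie-break ahead of the score, which is one possible reading of "prioritize 12-digit matches".

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class OcrItem:
    """Hypothetical container for one OCR text box."""
    text: str
    confidence: float   # recognition confidence in [0, 1]
    in_top_right: bool  # whether the box lies in the page's top-right region


# Exactly 11 or 12 consecutive digits, not embedded in a longer digit run.
_CMA_CODE = re.compile(r"(?<!\d)\d{11,12}(?!\d)")


def score_candidate(code: str, item: OcrItem) -> float:
    """Assumed weights: confidence*100 + starts_with_2*50 + top_right*30."""
    return (item.confidence * 100
            + (50 if code.startswith("2") else 0)
            + (30 if item.in_top_right else 0))


def pick_cma_code(items: Iterable[OcrItem]) -> Optional[str]:
    """Return the best candidate, preferring 12-digit codes, or None if none found."""
    best_key = None
    best_code = None
    for item in items:
        for code in _CMA_CODE.findall(item.text):
            key = (len(code) == 12, score_candidate(code, item))
            if best_key is None or key > best_key:
                best_key, best_code = key, code
    return best_code


if __name__ == "__main__":
    boxes = [
        OcrItem("No. 12345678901", 0.99, False),
        OcrItem("CMA 210020349096", 0.80, True),
    ]
    print(pick_cma_code(boxes))
```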