Clarify that CMA extraction modules are core dependencies and must
remain in the project root directory. These files cannot be archived as they
are imported by test_accuracy_batch_full.py at runtime.
Core files (in root):
- cma_extraction_template_primary.py (19 KB) - Primary CMA extraction module
- cma_extraction_final.py (16 KB) - Fallback CMA extraction module
Dependency chain:
test_accuracy_batch_full.py
  → imports: cma_extraction_template_primary.py
  → fallback: cma_extraction_final.py
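The chain above amounts to a "try the primary module, then the fallback" import. The real import code in test_accuracy_batch_full.py is not shown here, so the following is only a minimal sketch: the helper name load_cma_extractor is hypothetical, and only the two module names come from this message.

```python
import importlib


def load_cma_extractor(primary="cma_extraction_template_primary",
                       fallback="cma_extraction_final"):
    """Return the first importable module, trying the primary before the fallback.

    Hypothetical helper: the actual script may import these modules differently.
    """
    errors = []
    for name in (primary, fallback):
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no CMA extraction module could be imported: " + "; ".join(errors))
```

If both files are moved out of the root directory, a loader like this raises ImportError at startup, which is the failure described in the reasons below.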
Why these cannot be archived:
1. Runtime import dependency - script will fail without them
2. Core business logic - not temporary/debug scripts
3. Required for main functionality - not optional or auxiliary
Archive directory should only contain:
- Temporary test scripts
- Debug/analysis scripts
- Old documentation
- Auxiliary tools
Verification:
✓ Both files present in root directory
✓ Already tracked in git (commit 9562cf1)
✓ No duplicate copies in archive/
Related documentation:
- TEST_ACCURACY_BATCH_DEPENDENCIES.md - Full dependency analysis
- CLEANUP_PLAN.md - Cleanup plan and file categorization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Major improvements to batch OCR testing script:
1. PaddleOCRVL Timeout Protection
   - Add multiprocessing-based timeout mechanism (default: 60s, configurable up to 300s)
   - Prevent indefinite hangs when PaddleOCRVL encounters problematic seal images
   - Add _run_ocr_vl_wrapper() function for subprocess execution
   - Apply the PADDLEOCRVL_TIMEOUT global variable to all PaddleOCRVL calls
2. Command-Line Arguments
   - --paddleocrvl-timeout: Set custom timeout in seconds (default: 60, recommended: 300)
   - --disable-paddleocrvl: Skip PaddleOCRVL initialization for faster testing
3. CMA Template Matching Improvements
   - Change matching method from TM_CCOEFF_NORMED to TM_CCORR_NORMED
   - Add position filtering (upper 60% of page only)
   - Prevent false matches in footer areas
4. OCR Result Validation
   - Add robust handling for different PaddleOCR API response formats
   - Improve error handling for edge cases
   - Improve CMA code extraction with 11-12 digit pattern matching
5. Bug Fixes
   - Fix IndexError when processing OCR results with inconsistent formats
   - Improve text cleaning for CMA code extraction
   - Add validation for OCR data structures
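The timeout protection in item 1 can be sketched as follows. This is not the code of _run_ocr_vl_wrapper(); it is a generic illustration of a multiprocessing-based timeout, and the names run_with_timeout, _worker and slow_ocr_stub are hypothetical. The worker runs in a child process so that a hung call can be terminated from outside.

```python
import multiprocessing as mp
import queue
import time


def _worker(func, args, out):
    """Run func in the child process and send back the result or the error text."""
    try:
        out.put(("ok", func(*args)))
    except Exception as exc:
        out.put(("error", repr(exc)))


def run_with_timeout(func, args=(), timeout=60.0, start_method=None):
    """Run func(*args) in a subprocess; return None if it exceeds `timeout` seconds."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    out = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, args, out))
    proc.start()
    try:
        status, payload = out.get(timeout=timeout)
    except queue.Empty:
        proc.terminate()  # hard-stop the hung worker
        status, payload = "timeout", None
    proc.join()
    if status == "error":
        raise RuntimeError(payload)
    return payload


def slow_ocr_stub(seconds):
    """Stand-in for a recognition call that may hang (hypothetical)."""
    time.sleep(seconds)
    return "recognized text"
```

With the "spawn" start method (the only one on Windows), calls must sit under an `if __name__ == "__main__":` guard and func must be importable by the child process.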
Performance:
- CMA accuracy: 85-100% (depending on PDF quality)
- Institution accuracy: 27-100% (improved with seal OCR validation)
- Average processing time: 18-35 seconds per PDF
Related files:
- test_paddleocrvl_timeout.py: Timeout mechanism verification
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md: Detailed implementation guide
- PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md: Usage guide for 5-min timeout
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a new match category 'acceptable' for institution name matches with
similarity between 60% and 85%, providing more nuanced matching results.
Changes:
1. Add ACCEPTABLE_THRESHOLD = 60.0 constant
2. Update classify_match() to include 'acceptable' category
3. Add blue color (#2196f3) for acceptable matches in reports
4. Update all statistics to count acceptable matches separately
5. Modify HTML summary to show 5 columns instead of 4
6. Update JSON output to include acceptable count
7. Add [ACCEPTABLE] symbol in result tables
Match levels (from highest to lowest):
- exact: 100% similarity → green
- partial: >= 85% and < 100% similarity → orange
- acceptable: >= 60% and < 85% similarity → blue ← NEW
- no_match: < 60% similarity → red
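The four levels can be expressed as a simple threshold ladder. This is a sketch only: ACCEPTABLE_THRESHOLD = 60.0 is taken from this message, while PARTIAL_THRESHOLD and classify_similarity are assumed names, and the real classify_match() may take other parameters.

```python
ACCEPTABLE_THRESHOLD = 60.0  # constant named in this commit
PARTIAL_THRESHOLD = 85.0     # assumed name; the 85% value is from this commit


def classify_similarity(similarity):
    """Map a similarity percentage (0-100) to a match level, highest level first."""
    if similarity >= 100.0:
        return "exact"
    if similarity >= PARTIAL_THRESHOLD:
        return "partial"
    if similarity >= ACCEPTABLE_THRESHOLD:
        return "acceptable"
    return "no_match"
```

Checking the thresholds from highest to lowest keeps each level's lower bound inclusive, so a score of exactly 60.0 is reported as acceptable.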
This improves the granularity of match reporting, especially for cases
where OCR artifacts or minor variations pull similarity below the 85%
partial threshold even though the extracted name is still reasonably accurate.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Extend institution name cleaning to handle OCR artifacts from seal text
that gets merged with company names during extraction.
Problem:
- 3 PDFs failed matching because "检验检测专用章" (Seal for Inspection & Testing)
  was included in the extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章"
  vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to ~60-67% → incorrectly classified as "no_match"
- Affected PDFs:
  * pages3-6.pdf: 60.87% similarity
  * pages7-14.pdf: 60.0% similarity
  * pages12-15.pdf: 62.5% similarity
Solution:
- Add seal suffix removal to clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"
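The cleaning order described above can be sketched as follows. This is not the body of clean_institution_name(); strip_seal_artifacts is a hypothetical name, the seal list contains only the three names spelled out in this message, and the trailing-number pattern is an assumption.

```python
import re

# Only the seal names listed in this message; the real list may be longer.
SEAL_NAMES = ("检验检测专用章", "检测专用章", "检验专用章")


def strip_seal_artifacts(name):
    """Remove seal names anywhere in the text, then a trailing run of 6+ digits."""
    # Longest first: "检测专用章" is contained in "检验检测专用章", so removing
    # the shorter name first would leave a stray "检验" behind.
    for seal in sorted(SEAL_NAMES, key=len, reverse=True):
        name = name.replace(seal, "")
    name = re.sub(r"\d{6,}$", "", name)  # assumed form of the number-removal step
    return name.strip()
```

Plain str.replace() handles a seal name in the middle of the text as well as at the end, which is the reason given above for not using an anchored pattern.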
Test Results:
All 4 test cases now achieve 100% similarity and "exact" match:
1. "检验检测专用章" suffix: 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company): 65.00% → 100.00% ✓
3. "430334" suffix: 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined: 51.85% → 100.00% ✓
This fix complements the previous CMA code suffix removal and
significantly improves matching accuracy for seal-related OCR artifacts.
Co-Authored-By: Claude Code <noreply@anthropic.com>
Add smart institution name cleaning to handle OCR artifacts like trailing
CMA codes that cause false negative matches.
Problem:
- PDF "重庆市财政局..._pages3-6.pdf" extracted institution with trailing CMA code
- "四川合泰与必摩适检测有限公司430334" vs "四川合泰与必摩适检测有限公司"
- Similarity: 70.0% → incorrectly classified as "no_match"
- The core institution name is actually identical
Solution:
- Add clean_institution_name() function to remove trailing artifacts:
  * Remove 6+ digit numbers (CMA codes)
  * Remove 11+ digit numbers (full CMA codes)
  * Remove trailing punctuation and whitespace
- Enhance classify_match() with field_type parameter
- Apply cleaning for institution field comparisons
Results for test case:
- Before: 70.0% similarity, edit distance 6 → "no_match"
- After: 100.0% similarity, edit distance 0 → "exact"
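The reported figures are consistent with a similarity defined as 1 - edit_distance / max(len(a), len(b)): the raw name has 20 characters and differs from the 14-character reference by 6 deletions, which gives 70%. The similarity function used by the script is not shown here, so the sketch below rests on that assumption, and all function names in it are hypothetical.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """Assumed metric: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - edit_distance(a, b) / longest)


def strip_trailing_artifacts(name):
    """Drop a trailing run of 6+ digits, then trailing punctuation and whitespace."""
    name = re.sub(r"\d{6,}$", "", name.rstrip())
    return name.rstrip(" \t\r\n.,;:-，。；：、")
```

Under this assumed metric, comparing "四川合泰与必摩适检测有限公司430334" with "四川合泰与必摩适检测有限公司" yields an edit distance of 6 and a similarity of 70%, and the cleaned name compares at distance 0.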
This fix improves accuracy for cases where OCR accidentally captures
CMA codes or other numbers as part of the institution name.
Co-Authored-By: Claude Code <noreply@anthropic.com>
- Add cv2.matchTemplate-based CMA logo detection functions
- Implement automatic fallback when primary OCR extraction fails or has low confidence (<0.6)
- Add dual-format OCR result parsing (legacy ocr() and predict() API)
- Fix PaddleOCR API compatibility (remove unsupported cls kwarg)
- Record extraction method in cma_method field (robust_ocr or template_matching)
- Generate debug ROI image (cma_template_match_roi.png) for verification
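The fallback rule can be sketched as below. The extractors are passed in as callables because their real signatures are not shown here; select_cma_result and the cma_code key are hypothetical, while the cma_method values and the 0.6 threshold come from this message.

```python
def select_cma_result(primary, fallback, min_confidence=0.6):
    """Use the primary OCR result unless it is empty or below min_confidence.

    `primary` and `fallback` are zero-argument callables that return a
    (code_or_None, confidence) tuple. This interface is an assumption.
    """
    code, confidence = primary()
    if code and confidence >= min_confidence:
        return {"cma_code": code, "cma_method": "robust_ocr"}
    code, _ = fallback()
    # How the script records a failed fallback is unknown; None is a placeholder.
    return {"cma_code": code, "cma_method": "template_matching" if code else None}
```

Recording the method next to the code makes it possible to compare accuracy per extraction path afterwards.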
Key improvements:
1. Double verification mechanism for OCR failures
   - When unwarp OCR fails (empty text), automatically try the PaddleOCRVL backup on the crop
   - Fix issue where the correct seal was ignored due to unwarp image distortion
   - Test result: 4% → 93.8% similarity on problematic PDFs
2. Institution name cleaning
   - Remove unwanted suffixes: 检验检测专用章, 专用章, etc.
   - Clean names before adding them to results and before similarity calculation
   - Improve matching accuracy
3. Enhanced logging for institution selection
   - Show all extracted institutions with similarity scores
   - Track why a specific institution was selected
   - Improve debugging and transparency
Example impact:
- Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity)
- After: "中科测试技术(广东)集团有限公司" (correct seal, 93.8% similarity)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add extent limit (max 350°) to prevent polar unwarp distortion
- Add polygon count check (<3 polygons → use PaddleOCRVL backup)
- Add imwrite_safe() to handle Chinese paths on Windows
- Add --pdf-names parameter for targeted debugging
Fixes issue where seal extraction returned empty string when:
- Arc extent exceeded 360° causing severe image distortion
- Too few text polygons detected leading to inaccurate arc calculation
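The two guards can be sketched as a small decision function. The 350° cap and the 3-polygon minimum come from this message; the function name, the constant names and the returned dictionary are hypothetical, and the real code may apply these checks at different points.

```python
MAX_EXTENT_DEG = 350.0  # cap from this commit
MIN_POLYGONS = 3        # below this, the arc estimate is treated as unreliable


def plan_seal_unwarp(extent_deg, polygon_count):
    """Decide whether to unwarp the seal and with which arc extent."""
    if polygon_count < MIN_POLYGONS:
        # Too few text polygons to fit the arc: skip the unwarp, use the backup OCR.
        return {"use_backup": True, "extent_deg": None}
    # Clamp the extent so the polar unwarp never wraps around the full circle.
    return {"use_backup": False, "extent_deg": min(extent_deg, MAX_EXTENT_DEG)}
```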
Test results:
- Before: 0% similarity (empty string)
- After: 52.4% similarity (partial extraction)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add PaddleOCRVL as optional OCR model for seal text recognition
  - New parameter: --ocr-model {ppocr_v5,paddleocr_vl}
  - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5)
  - Backward compatible: defaults to PP-OCRv5
- Fix CMA recognition regression
  - Ensure ocr_engine is always initialized for CMA extraction
  - PaddleOCRVL only used for seal text, not CMA recognition
- Add comprehensive integration guide
  - PADDLEOCRVL_INTEGRATION.md with usage examples
  - test_paddleocr_vl_quick.py for validation
Implementation details:
- run_ocr_recognition_vl(): New function for PaddleOCRVL recognition
- extract_seals_and_institutions(): Enhanced with OCR model selection
- Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable
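The model selection and its fallback can be sketched with argparse. Only the --ocr-model flag, its two choices and the PP-OCRv5 default come from this message; build_parser, resolve_seal_model and the vl_available argument are hypothetical, and the availability check itself is left to the caller.

```python
import argparse


def build_parser():
    """Parser exposing only the --ocr-model option described in this commit."""
    parser = argparse.ArgumentParser(description="Seal OCR model selection (sketch)")
    parser.add_argument(
        "--ocr-model",
        choices=["ppocr_v5", "paddleocr_vl"],
        default="ppocr_v5",  # backward compatible default
        help="OCR model used for seal text recognition",
    )
    return parser


def resolve_seal_model(requested, vl_available):
    """Fall back to PP-OCRv5 when PaddleOCRVL was requested but is unavailable."""
    if requested == "paddleocr_vl" and not vl_available:
        return "ppocr_v5"
    return requested
```

A caller would parse the arguments once and then pass args.ocr_model through resolve_seal_model() before initializing the seal recognizer.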
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>