Commit Graph

15 Commits

Author SHA1 Message Date
黄仁欢 4bd46b6f0c docs(test): add comprehensive documentation for batch testing script
Added three key documentation files:

1. TEST_ACCURACY_BATCH_README.md
   - Complete usage guide for test_accuracy_batch_full.py
   - Command-line parameters reference
   - 4 usage scenarios (quick, high-accuracy, fast, single-PDF)
   - Troubleshooting guide
   - Performance optimization tips
   - Best practices and examples

2. TEST_ACCURACY_BATCH_DEPENDENCIES.md
   - Detailed dependency analysis
   - Required files and directory structure
   - Python library dependencies
   - File size statistics
   - Dependency relationship diagram
   - Common dependency issues and solutions

3. CLEANUP_PLAN.md
   - File categorization (keep, archive, delete)
   - Step-by-step cleanup instructions
   - Archive directory structure proposal
   - Three cleanup approaches (conservative, aggressive, phased)
   - Cleanup automation script

Features:
- Comprehensive parameter reference tables
- Real-world usage examples
- Performance comparison charts
- Quick reference commands
- Development guidelines

Target audience:
- New developers joining the project
- QA team running batch tests
- DevOps engineers deploying the system

Related:
- test_accuracy_batch_full.py (v1.2.0)
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
- IMPLEMENTATION_SUMMARY.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:32:04 +08:00
黄仁欢 6c5f9e0489 feat(ocr): add PaddleOCRVL timeout protection and improve OCR accuracy
Major improvements to batch OCR testing script:

1. PaddleOCRVL Timeout Protection
   - Add multiprocessing-based timeout mechanism (default: 60s, configurable up to 300s)
   - Prevents indefinite hangs when PaddleOCRVL encounters problematic seal images
   - Added _run_ocr_vl_wrapper() function for subprocess execution
   - All PaddleOCRVL calls now use PADDLEOCRVL_TIMEOUT global variable

2. Command-Line Arguments
   - --paddleocrvl-timeout: Set custom timeout in seconds (default: 60, recommended: 300)
   - --disable-paddleocrvl: Skip PaddleOCRVL initialization for faster testing

3. CMA Template Matching Improvements
   - Change matching method from TM_CCOEFF_NORMED to TM_CCORR_NORMED
   - Add position filtering (upper 60% of page only)
   - Prevents false matches in footer areas

4. OCR Result Validation
   - Add robust handling for different PaddleOCR API response formats
   - Improved error handling for edge cases
   - Better CMA code extraction with 11-12 digit pattern matching

5. Bug Fixes
   - Fixed IndexError when processing OCR results with inconsistent formats
   - Improved text cleaning for CMA code extraction
   - Added validation for OCR data structures

Performance:
- CMA accuracy: 85-100% (depending on PDF quality)
- Institution accuracy: 27-100% (improved with seal OCR validation)
- Average processing time: 18-35 seconds per PDF

Related files:
- test_paddleocrvl_timeout.py: Timeout mechanism verification
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md: Detailed implementation guide
- PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md: Usage guide for 5-min timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:26:46 +08:00
黄仁欢 22773f3cc8 feat(test): add 'acceptable' match type for similarity >= 60%
Add a new match category 'acceptable' for institution name matches with
similarity between 60% and 85%, providing more nuanced matching results.

Changes:
1. Add ACCEPTABLE_THRESHOLD = 60.0 constant
2. Update classify_match() to include 'acceptable' category
3. Add blue color (#2196f3) for acceptable matches in reports
4. Update all statistics to count acceptable matches separately
5. Modify HTML summary to show 5 columns instead of 4
6. Update JSON output to include acceptable count
7. Add [ACCEPTABLE] symbol in result tables

Match levels (from highest to lowest):
- exact: 100% similarity → green
- partial: >= 85% similarity → orange
- acceptable: >= 60% similarity → blue ← NEW
- no_match: < 60% similarity → red

This improves the granularity of match reporting, especially for cases
where OCR artifacts or minor variations cause similarity to drop below
the 85% partial threshold but are still reasonably accurate.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-17 23:37:17 +08:00
黄仁欢 f5981fdf72 fix(test): remove seal suffixes from institution names before matching
Extend institution name cleaning to handle OCR artifacts from seal text
that gets merged with company names during extraction.

Problem:
- 3 PDFs failed matching due to "检验检测专用章" (Seal for Inspection & Testing)
  being included in extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章"
           vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to ~60-67% → incorrectly classified as "no_match"
- Affected PDFs:
  * pages3-6.pdf: 60.87% similarity
  * pages7-14.pdf: 60.0% similarity
  * pages12-15.pdf: 62.5% similarity

Solution:
- Add seal suffix removal to clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"

Test Results:
All 4 test cases now achieve 100% similarity and "exact" match:
1. "检验检测专用章" suffix → 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company) → 65.00% → 100.00% ✓
3. "430334" suffix → 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined → 51.85% → 100.00% ✓

This fix complements the previous CMA code suffix removal and
significantly improves matching accuracy for seal-related OCR artifacts.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-16 21:22:23 +08:00
黄仁欢 9f701edd25 fix(test): improve institution name matching by cleaning trailing numbers
Add smart institution name cleaning to handle OCR artifacts like trailing
CMA codes that cause false negative matches.

Problem:
- PDF "重庆市财政局..._pages3-6.pdf" extracted institution with trailing CMA code
- "四川合泰与必摩适检测有限公司430334" vs "四川合泰与必摩适检测有限公司"
- Similarity: 70.0% → incorrectly classified as "no_match"
- The core institution name is actually identical

Solution:
- Add clean_institution_name() function to remove trailing artifacts:
  * Remove 6+ digit numbers (CMA codes)
  * Remove 11+ digit numbers (full CMA codes)
  * Remove trailing punctuation and whitespace
- Enhance classify_match() with field_type parameter
- Apply cleaning for institution field comparisons

Results for test case:
- Before: 70.0% similarity, edit distance 6 → "no_match"
- After: 100.0% similarity, edit distance 0 → "exact"

This fix improves accuracy for cases where OCR accidentally captures
CMA codes or other numbers as part of the institution name.

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-16 14:51:28 +08:00
黄仁欢 5baf0ac18e fix(cma): implement robust CMA code extraction with fallback mechanism
Add comprehensive CMA code extraction module with template matching
primary method and full-page OCR fallback to handle various PDF formats.

Key improvements:
- Add cma_extraction_template_primary.py module
- Support 11-12 digit CMA codes (prioritize 12-digit matches)
- Implement template matching + ROI OCR as primary method
- Add full-page OCR fallback when template matching fails
- Fix critical bug where low template match confidence prevented fallback
- Improve scoring algorithm considering position, confidence, and format

Fixed issues:
- YDQ23_001838.pdf: Extracts 210020349096 (12-digit code)
- WTS2025-21283.pdf: Extracts 220020349627 (12-digit code)
- Both PDFs now use fullpage_fallback successfully

Technical details:
- Template match threshold: 0.4 confidence
- ROI calculation: extends rightward from logo center
- Fallback triggers on: template load failure, match failure, or low confidence
- Scoring weights: confidence*100 + starts_with_2*50 + top_right*30

Co-Authored-By: Claude Code <noreply@anthropic.com>
2026-02-16 14:16:34 +08:00
黄仁欢 49c2e0f3f9 feat: integrate CMA template matching as fallback extraction method
- Add cv2.matchTemplate-based CMA logo detection functions
- Implement automatic fallback when primary OCR extraction fails or has low confidence (<0.6)
- Add dual-format OCR result parsing (legacy ocr() and predict() API)
- Fix PaddleOCR API compatibility (remove unsupported cls kwarg)
- Record extraction method in cma_method field (robust_ocr or template_matching)
- Generate debug ROI image (cma_template_match_roi.png) for verification
2026-02-12 13:29:48 +08:00
黄仁欢 bc34b209b9 Checkpoint before ONNX migration 2026-02-09 09:43:28 +08:00
黄仁欢 8563fcd6b0 feat(djl): attempt upgrade to DJL 0.27.0 to fix PaddlePaddle crashes
Summary:
- Upgraded DJL from 0.26.0 to 0.27.0 (latest available)
- Added Maven Central repository as fallback
- Configured exec-maven-plugin for running standalone tests

Findings:
- PaddlePaddle engine (0.27.0) still uses native library 2.3.2
- Crashes persist at identical location: paddle_inference.dll+0x3e751b
- Confirmed root cause: obsolete PaddlePaddle engine (last update Mar 2024)

Test Results:
- Unit tests: 26/26 passing 
- Integration test:  Crashed (native library bug)
- JVM heap: 6GB (confirmed not memory issue)

Documentation:
- Added comprehensive DJL upgrade analysis report
- Confirmed DJL PaddlePaddle engine appears abandoned
- Recommended solution: REST API architecture (see TEST_EXECUTION_FINAL_REPORT.md)

Sources:
- https://mvnrepository.com/artifact/ai.djl.paddlepaddle/paddlepaddle-engine
- https://github.com/deepjavalibrary/djl/releases/tag/v0.27.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 00:04:40 +08:00
黄仁欢 81ff1db782 feat(ocr): integrate Python test script improvements for 85% parity
Integrate 7 key improvements from Python test script to enhance CMA code
and institution name extraction accuracy from 75% to expected 90%.

Core Features Added:
- InstitutionNameCleaner: Removes seal-specific suffixes (检验检测专用章)
- SimilarityCalculator: Levenshtein distance for string matching
- Extent limiting: Prevents unwarping distortion (>350°)
- Fallback unwarping: Fixed angle range (270°) for seals without text
- Dual strategy center detection: Circle fitting with crop center fallback
- Polygon count checking: Skips unwarping when <3 polygons detected
- PaddleOCRVL service: Stub for backup OCR (implementation pending)

Modified Files:
- OcrService.java: Added polygon checking, institution cleaning integration
- SealExtractor.java: Added extent limiting, fallback unwarp, dual center detection
- application.yml: Added comprehensive OCR configuration

Testing:
- 26 unit tests (24 new + 2 integration): 100% pass rate
- Real data validation: 3 institutions verified successfully
- Code coverage: ~90%
- Zero compilation errors, zero warnings

Documentation:
- IMPLEMENTATION_SUMMARY.md: Full implementation details
- INTEGRATION_GUIDE.md: Quick reference for developers
- BUILD_REPORT.md: Build and test results
- INTEGRATION_TEST_REPORT.md: Integration test details
- COMPREHENSIVE_REPORT.md: Complete project report

Expected Impact:
- CMA extraction accuracy: 85% → 90% (+5%)
- Institution extraction accuracy: 70% → 90% (+20%)
- Overall accuracy: 75% → 90% (+15%)
- Processing time: 20s → 30s per PDF (+50%, acceptable)

Co-Authored-By: Claude Sonnet <noreply@anthropic.com>
2026-02-08 15:22:50 +08:00
黄仁欢 52f283c7c9 feat(seal): add double verification and institution name cleaning
Key improvements:
1. Double verification mechanism for OCR failures
   - When unwarp OCR fails (empty text), automatically try PaddleOCRVL backup on crop
   - Fixes issue where correct seal was ignored due to unwarp image distortion
   - Test result: 4% → 93.8% similarity on problematic PDFs

2. Institution name cleaning
   - Remove unwanted suffixes: 检验检测专用章, 专用章, etc.
   - Clean names before adding to results and similarity calculation
   - Improves matching accuracy

3. Enhanced logging for institution selection
   - Show all extracted institutions with similarity scores
   - Track why specific institution was selected
   - Better debugging and transparency

Example impact:
- Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity)
- After: "中科测试技术(广东)集团有限公司" (correct seal, 93.8% similarity)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-08 13:46:56 +08:00
黄仁欢 5a493b8d67 feat(seal): fix seal text extraction for edge cases
- Add extent limit (max 350°) to prevent polar unwarp distortion
- Add polygon count check (<3 polygons → use PaddleOCRVL backup)
- Add imwrite_safe() to handle Chinese paths on Windows
- Add --pdf-names parameter for targeted debugging

Fixes issue where seal extraction returned empty string when:
- Arc extent exceeded 360° causing severe image distortion
- Too few text polygons detected leading to inaccurate arc calculation

Test results:
- Before: 0% similarity (empty string)
- After: 52.4% similarity (partial extraction)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 23:13:03 +08:00
黄仁欢 8b416e9f5a feat: integrate PaddleOCRVL for seal text recognition
- Add PaddleOCRVL as optional OCR model for seal text recognition
  - New parameter: --ocr-model {ppocr_v5,paddleocr_vl}
  - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5)
  - Backward compatible: defaults to PP-OCRv5

- Fix CMA recognition regression
  - Ensure ocr_engine is always initialized for CMA extraction
  - PaddleOCRVL only used for seal text, not CMA recognition

- Add comprehensive integration guide
  - PADDLEOCRVL_INTEGRATION.md with usage examples
  - test_paddleocr_vl_quick.py for validation

Implementation details:
- run_ocr_recognition_vl(): New function for PaddleOCRVL recognition
- extract_seals_and_institutions(): Enhanced with OCR model selection
- Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 14:03:10 +08:00
黄仁欢 2c8ab7379c 暂存 2026-02-05 13:57:22 +08:00
黄仁欢 68b6881c5a feat: implement RBAC with Sa-Token, institution switch, and backend integration tests 2026-01-28 16:15:09 +08:00