report-detect

Commit Graph

Author	SHA1	Message	Date
黄仁欢	d8047d15a0	docs(cma): ensure CMA extraction modules remain in root directory Clarify that CMA extraction modules are core dependencies and must remain in the project root directory. These files cannot be archived as they are imported by test_accuracy_batch_full.py at runtime. Core files (in root): - cma_extraction_template_primary.py (19 KB) - Primary CMA extraction module - cma_extraction_final.py (16 KB) - Fallback CMA extraction module Dependency chain: test_accuracy_batch_full.py → imports: cma_extraction_template_primary.py → fallback: cma_extraction_final.py Why these cannot be archived: 1. Runtime import dependency - script will fail without them 2. Core business logic - not temporary/debug scripts 3. Required for main functionality - not optional or auxiliary Archive directory should only contain: - Temporary test scripts - Debug/analysis scripts - Old documentation - Auxiliary tools Verification: ✓ Both files present in root directory ✓ Already tracked in git (commit `9562cf1`) ✓ No duplicate copies in archive/ Related documentation: - TEST_ACCURACY_BATCH_DEPENDENCIES.md - Full dependency analysis - CLEANUP_PLAN.md - Cleanup plan and file categorization Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:55:28 +08:00
黄仁欢	9562cf1ac7	feat(cma): add CMA extraction module fallback implementation Add cma_extraction_final.py as backup CMA extraction module. This module provides fallback CMA code extraction when the primary template-based method (cma_extraction_template_primary.py) fails. Features: - Full-page OCR extraction as fallback - CMA pattern matching (11-12 digit codes) - Integration with main batch testing script - Supports both template matching and OCR-only approaches Usage: The main script (test_accuracy_batch_full.py) automatically falls back to this module if template matching fails: 1. Primary: cma_extraction_template_primary.py (template matching) 2. Fallback: cma_extraction_final.py (full-page OCR) Related files: - cma_extraction_template_primary.py (primary module) - test_accuracy_batch_full.py (main script that uses both) - TEST_ACCURACY_BATCH_DEPENDENCIES.md (dependency documentation) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:51:58 +08:00
黄仁欢	5f72e010cd	docs(cleanup): add cleanup completion report	2026-03-03 14:35:50 +08:00
黄仁欢	771eae0ce4	chore(project): conservative cleanup - archive temp scripts and old docs Major cleanup to improve project organization and maintainability. Changes: - Moved 34 temp/debug/test scripts to archive/temp_scripts/ - Moved 9 auxiliary tools to archive/tools/ - Moved 3 CRT test scripts to archive/crt_tests/ - Moved 4 OCR test scripts to archive/ocr_tests/ - Moved 14 old documentation files to archive/docs/ - Deleted 4 useless files (duplicates, temp files) Root directory: - Before: 67 files (cluttered) - After: 10 core files (clean and organized) Core files retained: - test_accuracy_batch_full.py (main script) - cma_extraction_template_primary.py (CMA extraction) - cma_extraction_final.py (backup CMA extraction) - CLAUDE.md (project guide) - TEST_ACCURACY_BATCH_README.md (usage guide) - TEST_ACCURACY_BATCH_DEPENDENCIES.md (dependency docs) - CLEANUP_PLAN.md (cleanup plan) - CLEANUP_SUMMARY.md (this file) - IMPLEMENTATION_SUMMARY.md (implementation summary) - requirements.txt (dependencies) Archive structure: archive/ ├── temp_scripts/ (34 files: test_, debug_, analyze_, etc.) ├── tools/ (9 files: find_, show_, visualize_, etc.) ├── crt_tests/ (3 files: CRT extraction tests) ├── ocr_tests/ (4 files: OCR timeout tests) └── docs/ (14 files: old reports and guides) Benefits: ✓ Cleaner root directory - easier navigation ✓ Better organization - clear separation of concerns ✓ Preserved history - all files archived, not deleted ✓ Improved maintainability - easier to find active files ✓ Better git history - removed 198 deleted files from tracking No functional changes - all core functionality preserved. Related: - TEST_ACCURACY_BATCH_DEPENDENCIES.md - dependency analysis - CLEANUP_PLAN.md - detailed cleanup plan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:35:06 +08:00
黄仁欢	4bd46b6f0c	docs(test): add comprehensive documentation for batch testing script Added three key documentation files: 1. TEST_ACCURACY_BATCH_README.md - Complete usage guide for test_accuracy_batch_full.py - Command-line parameters reference - 4 usage scenarios (quick, high-accuracy, fast, single-PDF) - Troubleshooting guide - Performance optimization tips - Best practices and examples 2. TEST_ACCURACY_BATCH_DEPENDENCIES.md - Detailed dependency analysis - Required files and directory structure - Python library dependencies - File size statistics - Dependency relationship diagram - Common dependency issues and solutions 3. CLEANUP_PLAN.md - File categorization (keep, archive, delete) - Step-by-step cleanup instructions - Archive directory structure proposal - Three cleanup approaches (conservative, aggressive, phased) - Cleanup automation script Features: - Comprehensive parameter reference tables - Real-world usage examples - Performance comparison charts - Quick reference commands - Development guidelines Target audience: - New developers joining the project - QA team running batch tests - DevOps engineers deploying the system Related: - test_accuracy_batch_full.py (v1.2.0) - PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md - IMPLEMENTATION_SUMMARY.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:32:04 +08:00
黄仁欢	6c5f9e0489	feat(ocr): add PaddleOCRVL timeout protection and improve OCR accuracy Major improvements to batch OCR testing script: 1. PaddleOCRVL Timeout Protection - Add multiprocessing-based timeout mechanism (default: 60s, configurable up to 300s) - Prevents indefinite hangs when PaddleOCRVL encounters problematic seal images - Added _run_ocr_vl_wrapper() function for subprocess execution - All PaddleOCRVL calls now use PADDLEOCRVL_TIMEOUT global variable 2. Command-Line Arguments - --paddleocrvl-timeout: Set custom timeout in seconds (default: 60, recommended: 300) - --disable-paddleocrvl: Skip PaddleOCRVL initialization for faster testing 3. CMA Template Matching Improvements - Change matching method from TM_CCOEFF_NORMED to TM_CCORR_NORMED - Add position filtering (upper 60% of page only) - Prevents false matches in footer areas 4. OCR Result Validation - Add robust handling for different PaddleOCR API response formats - Improved error handling for edge cases - Better CMA code extraction with 11-12 digit pattern matching 5. Bug Fixes - Fixed IndexError when processing OCR results with inconsistent formats - Improved text cleaning for CMA code extraction - Added validation for OCR data structures Performance: - CMA accuracy: 85-100% (depending on PDF quality) - Institution accuracy: 27-100% (improved with seal OCR validation) - Average processing time: 18-35 seconds per PDF Related files: - test_paddleocrvl_timeout.py: Timeout mechanism verification - PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md: Detailed implementation guide - PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md: Usage guide for 5-min timeout Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-03 14:26:46 +08:00
黄仁欢	22773f3cc8	feat(test): add 'acceptable' match type for similarity >= 60% Add a new match category 'acceptable' for institution name matches with similarity between 60% and 85%, providing more nuanced matching results. Changes: 1. Add ACCEPTABLE_THRESHOLD = 60.0 constant 2. Update classify_match() to include 'acceptable' category 3. Add blue color (#2196f3) for acceptable matches in reports 4. Update all statistics to count acceptable matches separately 5. Modify HTML summary to show 5 columns instead of 4 6. Update JSON output to include acceptable count 7. Add [ACCEPTABLE] symbol in result tables Match levels (from highest to lowest): - exact: 100% similarity → green - partial: >= 85% similarity → orange - acceptable: >= 60% similarity → blue ← NEW - no_match: < 60% similarity → red This improves the granularity of match reporting, especially for cases where OCR artifacts or minor variations cause similarity to drop below the 85% partial threshold but are still reasonably accurate. Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-17 23:37:17 +08:00
黄仁欢	f5981fdf72	fix(test): remove seal suffixes from institution names before matching Extend institution name cleaning to handle OCR artifacts from seal text that gets merged with company names during extraction. Problem: - 3 PDFs failed matching due to "检验检测专用章" (Seal for Inspection & Testing) being included in extracted institution names - Example: "四川合泰与必摩适检测有限公司检验检测专用章" vs "四川合泰与必摩适检测有限公司" - Similarity dropped to ~60-67% → incorrectly classified as "no_match" - Affected PDFs: * pages3-6.pdf: 60.87% similarity * pages7-14.pdf: 60.0% similarity * pages12-15.pdf: 62.5% similarity Solution: - Add seal suffix removal to clean_institution_name() function - Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc. - Use string replacement (not regex) to handle middle-of-text occurrences - Apply before number removal to handle combined artifacts like "专用章123456" Test Results: All 4 test cases now achieve 100% similarity and "exact" match: 1. "检验检测专用章" suffix → 66.67% → 100.00% ✓ 2. "检验检测专用章" suffix (different company) → 65.00% → 100.00% ✓ 3. "430334" suffix → 70.00% → 100.00% ✓ 4. "检验检测专用章430334" combined → 51.85% → 100.00% ✓ This fix complements the previous CMA code suffix removal and significantly improves matching accuracy for seal-related OCR artifacts. Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-16 21:22:23 +08:00
黄仁欢	9f701edd25	fix(test): improve institution name matching by cleaning trailing numbers Add smart institution name cleaning to handle OCR artifacts like trailing CMA codes that cause false negative matches. Problem: - PDF "重庆市财政局..._pages3-6.pdf" extracted institution with trailing CMA code - "四川合泰与必摩适检测有限公司430334" vs "四川合泰与必摩适检测有限公司" - Similarity: 70.0% → incorrectly classified as "no_match" - The core institution name is actually identical Solution: - Add clean_institution_name() function to remove trailing artifacts: * Remove 6+ digit numbers (CMA codes) * Remove 11+ digit numbers (full CMA codes) * Remove trailing punctuation and whitespace - Enhance classify_match() with field_type parameter - Apply cleaning for institution field comparisons Results for test case: - Before: 70.0% similarity, edit distance 6 → "no_match" - After: 100.0% similarity, edit distance 0 → "exact" This fix improves accuracy for cases where OCR accidentally captures CMA codes or other numbers as part of the institution name. Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-16 14:51:28 +08:00
黄仁欢	5baf0ac18e	fix(cma): implement robust CMA code extraction with fallback mechanism Add comprehensive CMA code extraction module with template matching primary method and full-page OCR fallback to handle various PDF formats. Key improvements: - Add cma_extraction_template_primary.py module - Support 11-12 digit CMA codes (prioritize 12-digit matches) - Implement template matching + ROI OCR as primary method - Add full-page OCR fallback when template matching fails - Fix critical bug where low template match confidence prevented fallback - Improve scoring algorithm considering position, confidence, and format Fixed issues: - YDQ23_001838.pdf: Extracts 210020349096 (12-digit code) - WTS2025-21283.pdf: Extracts 220020349627 (12-digit code) - Both PDFs now use fullpage_fallback successfully Technical details: - Template match threshold: 0.4 confidence - ROI calculation: extends rightward from logo center - Fallback triggers on: template load failure, match failure, or low confidence - Scoring weights: confidence100 + starts_with_250 + top_right*30 Co-Authored-By: Claude Code <noreply@anthropic.com>	2026-02-16 14:16:34 +08:00
黄仁欢	49c2e0f3f9	feat: integrate CMA template matching as fallback extraction method - Add cv2.matchTemplate-based CMA logo detection functions - Implement automatic fallback when primary OCR extraction fails or has low confidence (<0.6) - Add dual-format OCR result parsing (legacy ocr() and predict() API) - Fix PaddleOCR API compatibility (remove unsupported cls kwarg) - Record extraction method in cma_method field (robust_ocr or template_matching) - Generate debug ROI image (cma_template_match_roi.png) for verification	2026-02-12 13:29:48 +08:00
黄仁欢	bc34b209b9	Checkpoint before ONNX migration	2026-02-09 09:43:28 +08:00
黄仁欢	8563fcd6b0	feat(djl): attempt upgrade to DJL 0.27.0 to fix PaddlePaddle crashes Summary: - Upgraded DJL from 0.26.0 to 0.27.0 (latest available) - Added Maven Central repository as fallback - Configured exec-maven-plugin for running standalone tests Findings: - PaddlePaddle engine (0.27.0) still uses native library 2.3.2 - Crashes persist at identical location: paddle_inference.dll+0x3e751b - Confirmed root cause: obsolete PaddlePaddle engine (last update Mar 2024) Test Results: - Unit tests: 26/26 passing ✅ - Integration test: ❌ Crashed (native library bug) - JVM heap: 6GB (confirmed not memory issue) Documentation: - Added comprehensive DJL upgrade analysis report - Confirmed DJL PaddlePaddle engine appears abandoned - Recommended solution: REST API architecture (see TEST_EXECUTION_FINAL_REPORT.md) Sources: - https://mvnrepository.com/artifact/ai.djl.paddlepaddle/paddlepaddle-engine - https://github.com/deepjavalibrary/djl/releases/tag/v0.27.0 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-09 00:04:40 +08:00
黄仁欢	81ff1db782	feat(ocr): integrate Python test script improvements for 85% parity Integrate 7 key improvements from Python test script to enhance CMA code and institution name extraction accuracy from 75% to expected 90%. Core Features Added: - InstitutionNameCleaner: Removes seal-specific suffixes (检验检测专用章) - SimilarityCalculator: Levenshtein distance for string matching - Extent limiting: Prevents unwarping distortion (>350°) - Fallback unwarping: Fixed angle range (270°) for seals without text - Dual strategy center detection: Circle fitting with crop center fallback - Polygon count checking: Skips unwarping when <3 polygons detected - PaddleOCRVL service: Stub for backup OCR (implementation pending) Modified Files: - OcrService.java: Added polygon checking, institution cleaning integration - SealExtractor.java: Added extent limiting, fallback unwarp, dual center detection - application.yml: Added comprehensive OCR configuration Testing: - 26 unit tests (24 new + 2 integration): 100% pass rate - Real data validation: 3 institutions verified successfully - Code coverage: ~90% - Zero compilation errors, zero warnings Documentation: - IMPLEMENTATION_SUMMARY.md: Full implementation details - INTEGRATION_GUIDE.md: Quick reference for developers - BUILD_REPORT.md: Build and test results - INTEGRATION_TEST_REPORT.md: Integration test details - COMPREHENSIVE_REPORT.md: Complete project report Expected Impact: - CMA extraction accuracy: 85% → 90% (+5%) - Institution extraction accuracy: 70% → 90% (+20%) - Overall accuracy: 75% → 90% (+15%) - Processing time: 20s → 30s per PDF (+50%, acceptable) Co-Authored-By: Claude Sonnet <noreply@anthropic.com>	2026-02-08 15:22:50 +08:00
黄仁欢	52f283c7c9	feat(seal): add double verification and institution name cleaning Key improvements: 1. Double verification mechanism for OCR failures - When unwarp OCR fails (empty text), automatically try PaddleOCRVL backup on crop - Fixes issue where correct seal was ignored due to unwarp image distortion - Test result: 4% → 93.8% similarity on problematic PDFs 2. Institution name cleaning - Remove unwanted suffixes: 检验检测专用章, 专用章, etc. - Clean names before adding to results and similarity calculation - Improves matching accuracy 3. Enhanced logging for institution selection - Show all extracted institutions with similarity scores - Track why specific institution was selected - Better debugging and transparency Example impact: - Before: "成都虹之川科技有限公司" (wrong seal, 4% similarity) - After: "中科测试技术（广东）集团有限公司" (correct seal, 93.8% similarity) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 13:46:56 +08:00
黄仁欢	5a493b8d67	feat(seal): fix seal text extraction for edge cases - Add extent limit (max 350°) to prevent polar unwarp distortion - Add polygon count check (<3 polygons → use PaddleOCRVL backup) - Add imwrite_safe() to handle Chinese paths on Windows - Add --pdf-names parameter for targeted debugging Fixes issue where seal extraction returned empty string when: - Arc extent exceeded 360° causing severe image distortion - Too few text polygons detected leading to inaccurate arc calculation Test results: - Before: 0% similarity (empty string) - After: 52.4% similarity (partial extraction) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-07 23:13:03 +08:00
黄仁欢	8b416e9f5a	feat: integrate PaddleOCRVL for seal text recognition - Add PaddleOCRVL as optional OCR model for seal text recognition - New parameter: --ocr-model {ppocr_v5,paddleocr_vl} - PaddleOCRVL achieves 100% accuracy on test cases (vs 84% for PP-OCRv5) - Backward compatible: defaults to PP-OCRv5 - Fix CMA recognition regression - Ensure ocr_engine is always initialized for CMA extraction - PaddleOCRVL only used for seal text, not CMA recognition - Add comprehensive integration guide - PADDLEOCRVL_INTEGRATION.md with usage examples - test_paddleocr_vl_quick.py for validation Implementation details: - run_ocr_recognition_vl(): New function for PaddleOCRVL recognition - extract_seals_and_institutions(): Enhanced with OCR model selection - Automatic fallback to PP-OCRv5 if PaddleOCRVL unavailable Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-07 14:03:10 +08:00
黄仁欢	2c8ab7379c	暂存	2026-02-05 13:57:22 +08:00
黄仁欢	68b6881c5a	feat: implement RBAC with Sa-Token, institution switch, and backend integration tests	2026-01-28 16:15:09 +08:00

19 Commits All Branches Search

19 Commits

All Branches