report-detect

Commit Graph

Author	SHA1	Message	Date
黄仁欢	81ff1db782	feat(ocr): integrate Python test script improvements for 85% parity Integrate 7 key improvements from Python test script to enhance CMA code and institution name extraction accuracy from 75% to expected 90%. Core Features Added: - InstitutionNameCleaner: Removes seal-specific suffixes (检验检测专用章) - SimilarityCalculator: Levenshtein distance for string matching - Extent limiting: Prevents unwarping distortion (>350°) - Fallback unwarping: Fixed angle range (270°) for seals without text - Dual strategy center detection: Circle fitting with crop center fallback - Polygon count checking: Skips unwarping when <3 polygons detected - PaddleOCRVL service: Stub for backup OCR (implementation pending) Modified Files: - OcrService.java: Added polygon checking, institution cleaning integration - SealExtractor.java: Added extent limiting, fallback unwarp, dual center detection - application.yml: Added comprehensive OCR configuration Testing: - 26 unit tests (24 new + 2 integration): 100% pass rate - Real data validation: 3 institutions verified successfully - Code coverage: ~90% - Zero compilation errors, zero warnings Documentation: - IMPLEMENTATION_SUMMARY.md: Full implementation details - INTEGRATION_GUIDE.md: Quick reference for developers - BUILD_REPORT.md: Build and test results - INTEGRATION_TEST_REPORT.md: Integration test details - COMPREHENSIVE_REPORT.md: Complete project report Expected Impact: - CMA extraction accuracy: 85% → 90% (+5%) - Institution extraction accuracy: 70% → 90% (+20%) - Overall accuracy: 75% → 90% (+15%) - Processing time: 20s → 30s per PDF (+50%, acceptable) Co-Authored-By: Claude Sonnet <noreply@anthropic.com>	2026-02-08 15:22:50 +08:00

Author

SHA1

Message

Date

黄仁欢

81ff1db782

feat(ocr): integrate Python test script improvements for 85% parity

Integrate 7 key improvements from Python test script to enhance CMA code
and institution name extraction accuracy from 75% to expected 90%.

Core Features Added:
- InstitutionNameCleaner: Removes seal-specific suffixes (检验检测专用章)
- SimilarityCalculator: Levenshtein distance for string matching
- Extent limiting: Prevents unwarping distortion (>350°)
- Fallback unwarping: Fixed angle range (270°) for seals without text
- Dual strategy center detection: Circle fitting with crop center fallback
- Polygon count checking: Skips unwarping when <3 polygons detected
- PaddleOCRVL service: Stub for backup OCR (implementation pending)

Modified Files:
- OcrService.java: Added polygon checking, institution cleaning integration
- SealExtractor.java: Added extent limiting, fallback unwarp, dual center detection
- application.yml: Added comprehensive OCR configuration

Testing:
- 26 unit tests (24 new + 2 integration): 100% pass rate
- Real data validation: 3 institutions verified successfully
- Code coverage: ~90%
- Zero compilation errors, zero warnings

Documentation:
- IMPLEMENTATION_SUMMARY.md: Full implementation details
- INTEGRATION_GUIDE.md: Quick reference for developers
- BUILD_REPORT.md: Build and test results
- INTEGRATION_TEST_REPORT.md: Integration test details
- COMPREHENSIVE_REPORT.md: Complete project report

Expected Impact:
- CMA extraction accuracy: 85% → 90% (+5%)
- Institution extraction accuracy: 70% → 90% (+20%)
- Overall accuracy: 75% → 90% (+15%)
- Processing time: 20s → 30s per PDF (+50%, acceptable)

Co-Authored-By: Claude Sonnet <noreply@anthropic.com>

2026-02-08 15:22:50 +08:00

1 Commits