Extend institution name cleaning to handle OCR artifacts from seal text
that gets merged with company names during extraction.
Problem:
- 3 PDFs failed matching due to "检验检测专用章" (Seal for Inspection & Testing)
being included in extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章"
vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to roughly 60-67%, so the pair was incorrectly classified as "no_match"
- Affected PDFs:
* pages3-6.pdf: 60.87% similarity
* pages7-14.pdf: 60.0% similarity
* pages12-15.pdf: 62.5% similarity
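To illustrate why a seal suffix costs this much similarity, here is a minimal sketch. The commit does not say which metric the matcher uses, so this assumes a normalized Levenshtein similarity (1 minus edit distance over the longer length); the names `levenshtein` and `similarity` are hypothetical. Under that assumption, the example pair above scores 66.67%, inside the reported range.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in percent. ASSUMED metric: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(a, b) / longest) * 100


extracted = "四川合泰与必摩适检测有限公司检验检测专用章"  # name + seal text
registered = "四川合泰与必摩适检测有限公司"
print(f"{similarity(extracted, registered):.2f}%")  # 66.67%
```

The seven seal characters are pure insertions against a 21-character string, so a third of the score is lost even though the company name itself was read correctly.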
Solution:
- Add seal suffix removal to clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"
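The steps above can be sketched roughly as follows. This is not the repository's actual implementation: the seal list contains only the three names the commit spells out (the "etc." is left unfilled), and the digit-removal rule (strip a trailing run of digits) is an assumption made for illustration.

```python
import re

# Seal texts named in the commit message, longest first so the full
# "检验检测专用章" is removed before its substring "检测专用章".
SEAL_TEXTS = ["检验检测专用章", "检测专用章", "检验专用章"]


def clean_institution_name(name: str) -> str:
    """Hypothetical sketch: strip seal text first, then trailing digits."""
    cleaned = name.strip()
    # Step 1: plain substring replacement (no regex), so seal text is
    # removed wherever it occurs, not only at the end of the name.
    for seal in SEAL_TEXTS:
        cleaned = cleaned.replace(seal, "")
    # Step 2 (ASSUMED rule): drop a trailing run of digits. Running this
    # after seal removal follows the order described in the commit.
    cleaned = re.sub(r"\d+$", "", cleaned)
    return cleaned.strip()


base = "四川合泰与必摩适检测有限公司"
for raw in (base + "检验检测专用章", base + "430334", base + "检验检测专用章430334"):
    assert clean_institution_name(raw) == base
```

Plain `str.replace` keeps the seal list easy to extend without escaping concerns, at the cost of also removing the seal text if it ever appeared legitimately inside a name.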
Test Results:
All 4 test cases now achieve 100% similarity and "exact" match:
1. "检验检测专用章" suffix: 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company): 65.00% → 100.00% ✓
3. "430334" suffix: 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined: 51.85% → 100.00% ✓
This fix complements the previous CMA code suffix removal and brings
all four seal-related test cases above to an exact match.
Co-Authored-By: Claude Code <noreply@anthropic.com>
- data
- report_viz
- scripts
- src
- temp_classpath
- .gitignore
- BUILD_REPORT.md
- COMPREHENSIVE_REPORT.md
- DJL_UPGRADE_ATTEMPT_REPORT.md
- IMPLEMENTATION_SUMMARY.md
- INTEGRATION_GUIDE.md
- INTEGRATION_TEST_REPORT.md
- ManualTest.java
- PADDLEOCRVL_INTEGRATION.md
- README.md
- cma_extraction_template_primary.py
- jar_paths.txt
- pom.xml
- reply.md
- res.json
- run_reference_test.bat
- run_test.bat
- run_test_v2.bat
- run_viz_report.bat
- settings.xml
- test_accuracy_batch_full.py
- test_paddleocr_vl_quick.py
- v_verify_logic.py
- 测试结果汇总.txt
README.md
Report Detection Backend
Java-based backend system for automated report validation and comparison using OCR.
Technology Stack
- Core: Java 8 (Spring Boot 2.7.18)
- Security: Sa-Token (RBAC, Session Management)
- OCR Engine: PaddleOCR (via DJL - Deep Java Library)
- Database: PostgreSQL (with Dynamic Datasource support)
- Build Tool: Maven
Features
- RBAC Implementation: Multi-role support (ADMIN, AUDITOR, USER), with role names standardized to uppercase.
- Sa-Token Security: Annotation-based permission checks and secure login.
- Auditor Context Switch: Specialized feature for Auditors to switch between institutional views.
- PDF Processing: Automatic conversion of PDF reports to images for OCR analysis.
- Automated Verification: Integration tests using H2 in-memory database.
Getting Started
Prerequisites
- JDK 8 or 17
- Maven 3.6+
- PostgreSQL (optional for local dev if using H2 profile)
Run the Application
mvn clean package
java -jar target/report-detect-backend-1.0.0.jar
Run Tests
mvn test -Dtest=SecurityRBACVerificationTest
Security Configuration
Default accounts created on initialization:
- admin / 123456 (ADMIN)
- auditor / 123456 (AUDITOR)
- user / 123456 (USER)