Inspection and Testing Report Recognition (检验检测报告识别)


黄仁欢 6c5f9e0489 feat(ocr): add PaddleOCRVL timeout protection and improve OCR accuracy	2026-03-03 14:26:46 +08:00

Major improvements to batch OCR testing script:

1. PaddleOCRVL Timeout Protection
- Add multiprocessing-based timeout mechanism (default: 60s, configurable up to 300s)
- Prevents indefinite hangs when PaddleOCRVL encounters problematic seal images
- Added _run_ocr_vl_wrapper() function for subprocess execution
- All PaddleOCRVL calls now use PADDLEOCRVL_TIMEOUT global variable

2. Command-Line Arguments
- --paddleocrvl-timeout: Set custom timeout in seconds (default: 60, recommended: 300)
- --disable-paddleocrvl: Skip PaddleOCRVL initialization for faster testing

3. CMA Template Matching Improvements
- Change matching method from TM_CCOEFF_NORMED to TM_CCORR_NORMED
- Add position filtering (upper 60% of page only)
- Prevents false matches in footer areas

4. OCR Result Validation
- Add robust handling for different PaddleOCR API response formats
- Improved error handling for edge cases
- Better CMA code extraction with 11-12 digit pattern matching

5. Bug Fixes
- Fixed IndexError when processing OCR results with inconsistent formats
- Improved text cleaning for CMA code extraction
- Added validation for OCR data structures

Performance:
- CMA accuracy: 85-100% (depending on PDF quality)
- Institution accuracy: 27-100% (improved with seal OCR validation)
- Average processing time: 18-35 seconds per PDF

Related files:
- test_paddleocrvl_timeout.py: Timeout mechanism verification
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md: Detailed implementation guide
- PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md: Usage guide for 5-min timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
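The multiprocessing-based timeout described in this commit could look roughly like the sketch below. It is not the repository's `_run_ocr_vl_wrapper()`: the names `run_with_timeout`, `_worker`, `slow_square`, and the `start_method` parameter are hypothetical, and `slow_square` merely stands in for a PaddleOCRVL call that may hang. The idea is to run the call in a child process, wait for a result on a pipe for at most `timeout` seconds, and terminate the child if nothing arrives.

```python
import multiprocessing as mp
import time


def _worker(conn, func, args):
    """Child side: send ("ok", result) or ("error", message) to the parent."""
    try:
        conn.send(("ok", func(*args)))
    except Exception as exc:  # report the failure instead of exiting silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=60.0, start_method=None):
    """Run func(*args) in a subprocess and return (status, value).

    status is "ok", "error", or "timeout". start_method=None uses the
    platform default; with "spawn", func must be importable at module level.
    """
    ctx = mp.get_context(start_method)
    parent_conn, child_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_worker, args=(child_conn, func, args))
    proc.start()
    child_conn.close()  # the parent only needs the receiving end
    try:
        if parent_conn.poll(timeout):
            status, value = parent_conn.recv()
        else:
            status, value = "timeout", None
    except EOFError:
        # The child died (for example, a native crash) before sending anything.
        status, value = "error", "worker exited without a result"
    finally:
        if proc.is_alive():
            proc.terminate()  # stop a hung call so the batch can continue
        proc.join()
        parent_conn.close()
    return status, value


def slow_square(x, delay):
    """Hypothetical stand-in for an OCR call that may take too long."""
    time.sleep(delay)
    return x * x
```

With these definitions, `run_with_timeout(slow_square, (3, 0.0))` returns `("ok", 9)`, while a call whose `delay` exceeds `timeout` returns `("timeout", None)`. A subprocess is used rather than a thread because a hung thread cannot be killed from the outside, whereas a child process can be terminated.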
data	Temporary save	2026-02-05 13:57:22 +08:00
report_viz	Temporary save	2026-02-05 13:57:22 +08:00
scripts	Temporary save	2026-02-05 13:57:22 +08:00
src	Checkpoint before ONNX migration	2026-02-09 09:43:28 +08:00
temp_classpath	Temporary save	2026-02-05 13:57:22 +08:00
.gitignore	Checkpoint before ONNX migration	2026-02-09 09:43:28 +08:00
BUILD_REPORT.md	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
COMPREHENSIVE_REPORT.md	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
DJL_UPGRADE_ATTEMPT_REPORT.md	feat(djl): attempt upgrade to DJL 0.27.0 to fix PaddlePaddle crashes	2026-02-09 00:04:40 +08:00
IMPLEMENTATION_SUMMARY.md	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
INTEGRATION_GUIDE.md	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
INTEGRATION_TEST_REPORT.md	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
ManualTest.java	Temporary save	2026-02-05 13:57:22 +08:00
PADDLEOCRVL_INTEGRATION.md	feat: integrate PaddleOCRVL for seal text recognition	2026-02-07 14:03:10 +08:00
README.md	feat: implement RBAC with Sa-Token, institution switch, and backend integration tests	2026-01-28 16:15:09 +08:00
cma_extraction_template_primary.py	fix(cma): implement robust CMA code extraction with fallback mechanism	2026-02-16 14:16:34 +08:00
jar_paths.txt	Temporary save	2026-02-05 13:57:22 +08:00
pom.xml	Checkpoint before ONNX migration	2026-02-09 09:43:28 +08:00
reply.md	Temporary save	2026-02-05 13:57:22 +08:00
res.json	Temporary save	2026-02-05 13:57:22 +08:00
run_reference_test.bat	Temporary save	2026-02-05 13:57:22 +08:00
run_test.bat	Temporary save	2026-02-05 13:57:22 +08:00
run_test_v2.bat	Temporary save	2026-02-05 13:57:22 +08:00
run_viz_report.bat	Temporary save	2026-02-05 13:57:22 +08:00
settings.xml	Checkpoint before ONNX migration	2026-02-09 09:43:28 +08:00
test_accuracy_batch_full.py	feat(ocr): add PaddleOCRVL timeout protection and improve OCR accuracy	2026-03-03 14:26:46 +08:00
test_paddleocr_vl_quick.py	feat: integrate PaddleOCRVL for seal text recognition	2026-02-07 14:03:10 +08:00
v_verify_logic.py	Temporary save	2026-02-05 13:57:22 +08:00
测试结果汇总.txt	feat(ocr): integrate Python test script improvements for 85% parity	2026-02-08 15:22:50 +08:00
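The commit history above mentions CMA code extraction with 11-12 digit pattern matching, text cleaning, and position filtering to the upper 60% of the page. The following is a minimal sketch of those three ideas under stated assumptions, not the code in `cma_extraction_template_primary.py`: the function names are hypothetical, a CMA code is assumed to be a standalone run of 11 or 12 digits, and the cleaning step (full-width digits, spaces between digits) is only one plausible choice.

```python
import re

# Assumption: a CMA code is a run of exactly 11 or 12 digits that is not
# part of a longer digit sequence.
_CMA_PATTERN = re.compile(r"(?<!\d)\d{11,12}(?!\d)")

# Assumption: OCR may return full-width digits; map them to ASCII.
_FULLWIDTH_DIGITS = str.maketrans("０１２３４５６７８９", "0123456789")


def extract_cma_code(text):
    """Return the first 11-12 digit code found in OCR text, or None."""
    cleaned = text.translate(_FULLWIDTH_DIGITS)
    # Join digit groups that OCR split with spaces or tabs.
    cleaned = re.sub(r"(?<=\d)[ \t]+(?=\d)", "", cleaned)
    match = _CMA_PATTERN.search(cleaned)
    return match.group(0) if match else None


def in_upper_region(match_y, page_height, max_ratio=0.6):
    """Return True if a template match's top edge lies in the upper part of the page.

    match_y is the y coordinate (pixels from the top) of the match;
    max_ratio=0.6 mirrors the "upper 60% of page" rule from the commit message.
    """
    return page_height > 0 and match_y < page_height * max_ratio
```

For instance, `extract_cma_code("CMA 2201 1034 0123")` returns `"220110340123"`, a 13-digit run yields `None`, and `in_upper_region(700, 1000)` is `False`, so a match found in the footer area would be discarded.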

README.md

Report Detection Backend

Java-based backend system for automated report validation and comparison using OCR.

Technology Stack

Core: Java 8 (Spring Boot 2.7.18)
Security: Sa-Token (RBAC, Session Management)
OCR Engine: PaddleOCR (via DJL - Deep Java Library)
Database: PostgreSQL (with Dynamic Datasource support)
Build Tool: Maven

Features

RBAC Implementation: Multi-role support (ADMIN, AUDITOR, USER), with role names standardized to uppercase.
Sa-Token Security: Annotation-based permission checks and secure login.
Auditor Context Switch: Specialized feature for Auditors to switch between institutional views.
PDF Processing: Automatic conversion of PDF reports to images for OCR analysis.
Automated Verification: Integration tests using H2 in-memory database.

Getting Started

Prerequisites

JDK 8 or 17
Maven 3.6+
PostgreSQL (optional for local dev if using H2 profile)

Run the Application

mvn clean package
java -jar target/report-detect-backend-1.0.0.jar

Run Tests

mvn test -Dtest=SecurityRBACVerificationTest

Security Configuration

Default accounts created on initialization:

admin / 123456 (ADMIN)
auditor / 123456 (AUDITOR)
user / 123456 (USER)