Commit Graph

8 Commits

Author SHA1 Message Date
黄仁欢 9064d3ea10 Package embedded Python archives and enforce embedded runtime 2026-03-19 14:18:14 +08:00
黄仁欢 fc9cbcf1da Align OCR validation flow with legacy rules 2026-03-18 11:31:21 +08:00
黄仁欢 ae9ed3128f feat(java): implement Python-First OCR architecture
ARCHITECTURE CHANGE:
- Migrate from Java-based OCR to a Python-First architecture
- Delegate all OCR processing from Java to the Python Flask API
- Remove complex Java OCR dependencies (DJL, PaddleOCR-Paddle)
- Simplify the codebase and improve maintainability

CHANGES:

1. OcrService.java (Complete Rewrite):
   - REMOVED: Java OCR implementations (LayoutDetectionService, PaddleOCRVLService)
   - REMOVED: DJL/PaddleOCR dependencies and complex image processing
   - ADDED: FlaskOCRClient for HTTP communication with Python API
   - ADDED: Python-First architecture documentation
   - SIMPLIFIED: From 350+ lines to ~150 lines
   - IMPROVED: Accuracy (native Python PaddleOCRVL support)

2. application.yml (Configuration):
   - UPDATED: app.ocr.engine: "python" (Python-First)
   - UPDATED: app.ocr.flask.enabled: true
   - ADDED: Flask API baseUrl and timeout configuration
   - ADDED: FlaskProcessManager auto-startup configuration
   - DOCUMENTED: Python-First vs Java engine options

3. pom.xml (Build Configuration):
   - ADDED: Python runtime packaging for offline deployment
   - ADDED: Python virtual environment packaging
   - ADDED: OCR models packaging
   - ENABLED: Self-contained JAR with Python runtime

BENEFITS:
- Better OCR accuracy (native PaddleOCRVL support)
- Easier maintenance (single Python codebase)
- Faster updates (no Java recompilation needed)
- Smaller JAR size (no heavy DJL dependencies)
- Clear separation of concerns (Java=business, Python=OCR)

ARCHITECTURE DIAGRAM:
┌─────────────┐         HTTP          ┌──────────────┐
│  Java       │ ────────────────────> │  Flask API   │
│  Backend    │ <──────────────────── │  (Python)    │
│  (Spring)   │    JSON Response      └──────────────┘
└─────────────┘                              │
                                              │
                                              ▼
                                       ┌──────────────┐
                                       │  PaddleOCR   │
                                       │  PaddleOCRVL │
                                       │  PP-OCRv5    │
                                       └──────────────┘

MIGRATION NOTES:
- Java OCR classes removed: LayoutDetectionService, PaddleOCRVLService,
  CustomDetectionTranslator, CustomRecognitionTranslator
- Archived to: archive/removed_java_ocr/
- Flask API must be running before Java backend startup
- Default Flask port: 8081
- Health check: http://localhost:8081/health
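
Because the Flask API has to be up before the Java backend starts, a
deployment script could poll the health endpoint first. The following is a
minimal sketch under stated assumptions, not the project's actual startup
logic: the function names, the retry count, the delay, and the "HTTP 200
means healthy" criterion are all hypothetical. Only the URL comes from the
notes above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# URL taken from the migration notes; everything else here is an assumption.
HEALTH_URL = "http://localhost:8081/health"


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200 (assumed success criterion)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: treat as not ready.
        return False


def wait_for_health(
    url: str = HEALTH_URL,
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = http_probe,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it is ready or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait between attempts, but not after the last one
    return False
```

A launcher could call wait_for_health() and start the Java backend only when
it returns True. The probe and sleep parameters are injectable so the retry
logic can be exercised without a running server.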

TESTING:
- Flask API integration tested
- OCR accuracy verified (99.91% CMA, institution extraction working)
- End-to-end flow validated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-05 09:56:40 +08:00
黄仁欢 771eae0ce4 chore(project): conservative cleanup - archive temp scripts and old docs
Major cleanup to improve project organization and maintainability.

Changes:
- Moved 34 temp/debug/test scripts to archive/temp_scripts/
- Moved 9 auxiliary tools to archive/tools/
- Moved 3 CRT test scripts to archive/crt_tests/
- Moved 4 OCR test scripts to archive/ocr_tests/
- Moved 14 old documentation files to archive/docs/
- Deleted 4 unneeded files (duplicates, temp files)

Root directory:
- Before: 67 files (cluttered)
- After: 10 core files (clean and organized)

Core files retained:
- test_accuracy_batch_full.py (main script)
- cma_extraction_template_primary.py (CMA extraction)
- cma_extraction_final.py (backup CMA extraction)
- CLAUDE.md (project guide)
- TEST_ACCURACY_BATCH_README.md (usage guide)
- TEST_ACCURACY_BATCH_DEPENDENCIES.md (dependency docs)
- CLEANUP_PLAN.md (cleanup plan)
- CLEANUP_SUMMARY.md (cleanup summary)
- IMPLEMENTATION_SUMMARY.md (implementation summary)
- requirements.txt (dependencies)

Archive structure:
archive/
├── temp_scripts/ (34 files: test_, debug_, analyze_, etc.)
├── tools/ (9 files: find_, show_, visualize_, etc.)
├── crt_tests/ (3 files: CRT extraction tests)
├── ocr_tests/ (4 files: OCR timeout tests)
└── docs/ (14 files: old reports and guides)
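
A prefix-based move like the one listed above could be scripted roughly as
follows. This is a hedged sketch, not the script used for this commit: the
function name, the rule table, and the dry-run default are assumptions. Only
the prefixes and folder names that appear in the listing are used, and the
keep list is an illustrative subset of the retained core files (needed
because test_accuracy_batch_full.py would otherwise match the test_ prefix).

```python
import shutil
from pathlib import Path
from typing import List, Tuple

# Folder -> filename prefixes, as named in the archive listing.
# The crt_tests/ and ocr_tests/ rules are omitted: their file patterns are not given.
RULES = {
    "temp_scripts": ("test_", "debug_", "analyze_"),
    "tools": ("find_", "show_", "visualize_"),
}

# Illustrative subset of the core files that must stay in the root.
KEEP = {"test_accuracy_batch_full.py", "requirements.txt", "CLAUDE.md"}


def archive_by_prefix(root: Path, rules=RULES, keep=KEEP,
                      dry_run: bool = True) -> List[Tuple[Path, Path]]:
    """Plan (and optionally perform) moves of root-level files into archive/."""
    moves = []
    for path in sorted(root.iterdir()):
        if not path.is_file() or path.name in keep:
            continue
        for folder, prefixes in rules.items():
            if path.name.startswith(prefixes):
                moves.append((path, root / "archive" / folder / path.name))
                break  # first matching rule wins
    if not dry_run:
        for src, dst in moves:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
    return moves
```

Running it with dry_run=True first returns the planned (source, destination)
pairs without touching the filesystem, which makes the plan easy to review
before anything is moved.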

Benefits:
✓ Cleaner root directory - easier navigation
✓ Better organization - clear separation of concerns
✓ Preserved history - all files archived, not deleted
✓ Improved maintainability - easier to find active files
✓ Cleaner git history - 198 deleted files no longer tracked

No functional changes - all core functionality preserved.

Related:
- TEST_ACCURACY_BATCH_DEPENDENCIES.md - dependency analysis
- CLEANUP_PLAN.md - detailed cleanup plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:35:06 +08:00
黄仁欢 bc34b209b9 Checkpoint before ONNX migration 2026-02-09 09:43:28 +08:00
黄仁欢 8563fcd6b0 feat(djl): attempt upgrade to DJL 0.27.0 to fix PaddlePaddle crashes
Summary:
- Upgraded DJL from 0.26.0 to 0.27.0 (latest available)
- Added Maven Central repository as fallback
- Configured exec-maven-plugin for running standalone tests

Findings:
- PaddlePaddle engine (0.27.0) still uses native library 2.3.2
- Crashes persist at identical location: paddle_inference.dll+0x3e751b
- Confirmed root cause: obsolete PaddlePaddle engine (last update Mar 2024)

Test Results:
- Unit tests: 26/26 passing
- Integration test: crashed (native library bug)
- JVM heap: 6GB (confirmed not memory issue)

Documentation:
- Added comprehensive DJL upgrade analysis report
- Confirmed DJL PaddlePaddle engine appears abandoned
- Recommended solution: REST API architecture (see TEST_EXECUTION_FINAL_REPORT.md)

Sources:
- https://mvnrepository.com/artifact/ai.djl.paddlepaddle/paddlepaddle-engine
- https://github.com/deepjavalibrary/djl/releases/tag/v0.27.0

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 00:04:40 +08:00
黄仁欢 2c8ab7379c Temporary save 2026-02-05 13:57:22 +08:00
黄仁欢 68b6881c5a feat: implement RBAC with Sa-Token, institution switch, and backend integration tests 2026-01-28 16:15:09 +08:00