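# OCR Asynchronous Processing Integration Guide
## Overview
This project implements an asynchronous OCR processing architecture based on RabbitMQ and Flask. The Java Spring Boot application acts as the task producer and submits OCR tasks; the Python consumer processes the OCR requests and returns the results.
## Architecture Diagram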
```
┌─────────────────────────────────────────────────────────────────┐
│ Java Spring Boot App │
│ ┌────────────────┐ ┌──────────────────┐ ┌─────────────┐ │
│ │ TaskController │───▶│ FlaskProcessMgr │───▶│ Flask App │ │
│ └────────────────┘ │ (Lifecycle Mgmt) │ │ (Auto-start)│ │
│ │ └──────────────────┘ └─────────────┘ │
│ ▼ │ │
│ ┌────────────────┐ │ │
│ │ OCRTaskService │───┐ │ │
│ └────────────────┘ │ ▼ │
│ │ │ ┌───────────────┐ │
│ ▼ │ │ RabbitMQ │ │
│ ┌────────────────┐ │ │ Producer │ │
│ │ OCRResultConsumer│◀───┘ └───────────────┘ │
│ └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│ HTTP
┌─────────────────────────────────────────────────────────────────┐
│ Python Flask API (localhost:8081) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ /health │ │ /api/ocr/pdf │ │ RabbitMQ Consumer │ │
│ └──────────────┘ └──────────────┘ └──────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ pdf_processor.py │ │
│ │ - PaddleOCRVL (main) │ │
│ │ - PP-OCRv5 (fallback) │ │
│ └──────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
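The message flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's actual consumer: the message fields `task_id`, `pdf_path`, and `output_dir`, the result fields, and the function name `handle_task` are all assumptions, and the OCR call is injected as a plain callable so the sketch runs without RabbitMQ or Flask.

```python
import json
from typing import Any, Callable, Dict


def handle_task(body: bytes, run_ocr: Callable[[str, str], Dict[str, Any]]) -> bytes:
    """Turn one task message into one result message.

    `body` stands for a raw payload taken from the task queue; the returned
    bytes stand for what would be published to the result queue.
    All field names here are hypothetical.
    """
    task_id = None
    try:
        task = json.loads(body)
        task_id = task.get("task_id")
        # Delegate the actual OCR work (e.g. an HTTP call to the Flask API).
        ocr_output = run_ocr(task["pdf_path"], task.get("output_dir", "output"))
        result = {"task_id": task_id, "status": "success", "result": ocr_output}
    except Exception as exc:
        # Report the failure instead of silently dropping the task.
        result = {"task_id": task_id, "status": "failed", "error": str(exc)}
    return json.dumps(result, ensure_ascii=False).encode("utf-8")
```

In a real consumer, `run_ocr` would presumably send the request to `/api/ocr/pdf`, and `handle_task` would be called from the queue callback before the result is published.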
## Deployment Steps
### 1. Prepare the Environment
#### Linux Server Requirements
- Java 8+
- Python 3.8+
- RabbitMQ 3.x
- PostgreSQL 12+
- At least 10 GB of free disk space (for the OCR models)
#### Install Dependencies
**Install RabbitMQ (Ubuntu/Debian):**
```bash
sudo apt-get install rabbitmq-server
sudo systemctl start rabbitmq-server
sudo systemctl enable rabbitmq-server
# Create a user (optional; guest/guest is used by default)
sudo rabbitmqctl add_user ocr_user ocr_password
sudo rabbitmqctl set_user_tags ocr_user administrator
sudo rabbitmqctl set_permissions -p / ocr_user ".*" ".*" ".*"
```
**Install Python dependencies:**
```bash
cd /path/to/report-detect-backend
pip install -r requirements.txt
```
### 2. Configure the Application
Edit `src/main/resources/application.yml`:
```yaml
spring:
  rabbitmq:
    host: localhost
    port: 5672
    username: guest
    password: guest
app:
  ocr:
    flask:
      enabled: true
      host: 127.0.0.1
      port: 8081
    async:
      enabled: true
### 3. Start the Services
**Option 1: Build with Maven and run the JAR**
```bash
mvn clean package
java -jar target/report-detect-backend-1.0.0.jar
```
**Option 2: Start each component manually**
1. Start the Flask API:
```bash
cd python_api
python ocr_api_server.py
```
2. Start the RabbitMQ consumer:
```bash
cd python_api
# Set environment variables
export FLASK_HOST=127.0.0.1
export FLASK_PORT=8081
python ocr_task_consumer.py
```
3. Start the Java application:
```bash
java -jar target/report-detect-backend-1.0.0.jar
```
### 4. Verify the Deployment
**Check the Flask service:**
```bash
curl http://localhost:8081/health
```
Expected response:
```json
{
  "status": "ok",
  "vl_model": true,
  "ocr_model": true
}
```
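The same check can be scripted, for example to wait for the OCR models to finish loading before submitting tasks. The sketch below polls `/health` until it reports `"status": "ok"`. The 60-second timeout, the 2-second interval, and the function names are assumptions chosen for illustration, not values taken from the project.

```python
import json
import time
import urllib.request
from typing import Callable


def is_healthy(body: str) -> bool:
    """Return True if a /health response body reports status "ok"."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def fetch_health(url: str = "http://localhost:8081/health") -> str:
    """Fetch the health endpoint; return "" if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses.
        return ""


def wait_until_healthy(fetch: Callable[[], str] = fetch_health,
                       timeout_s: float = 60.0,
                       interval_s: float = 2.0) -> bool:
    """Poll until the service is healthy or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_healthy(fetch()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_healthy()` with no arguments polls the local Flask service. Passing a custom `fetch` callable makes the loop easy to test without a running server.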
**Check the RabbitMQ queues:**
```bash
sudo rabbitmqctl list_queues
```
You should see:
```
ocr.tasks 0
ocr.results 0
```
### 5. Submit a Test Task
```bash
curl -X POST http://localhost:8080/report-detect-api/api/tasks \
-H "satoken: YOUR_TOKEN" \
-F "file=@test.pdf"
```
## Configuration Options
### application.yml Settings
| Setting | Description | Default |
|---------|-------------|---------|
| app.ocr.flask.enabled | Whether to start Flask automatically | true |
| app.ocr.flask.host | Flask service host | 127.0.0.1 |
| app.ocr.flask.port | Flask service port | 8081 |
| app.ocr.async.enabled | Whether to enable asynchronous OCR | false |
| app.ocr.resource-dir | Python resource directory | ./ocr-resources |
| app.ocr.models-dir | OCR model directory | ./models |
### Environment Variables
The Python consumer supports the following environment variables:
| Variable | Description | Default |
|----------|-------------|---------|
| RABBITMQ_HOST | RabbitMQ host | localhost |
| RABBITMQ_PORT | RabbitMQ port | 5672 |
| RABBITMQ_USER | RabbitMQ user | guest |
| RABBITMQ_PASS | RabbitMQ password | guest |
| FLASK_HOST | Flask service host | 127.0.0.1 |
| FLASK_PORT | Flask service port | 8081 |
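
As an illustration of how a consumer might read these variables, the sketch below loads them with the defaults from the table above. The `ConsumerSettings` class and `load_settings` function are hypothetical names, and the real `ocr_task_consumer.py` may structure this differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ConsumerSettings:
    rabbitmq_host: str
    rabbitmq_port: int
    rabbitmq_user: str
    rabbitmq_pass: str
    flask_host: str
    flask_port: int

    @property
    def flask_base_url(self) -> str:
        """Base URL of the Flask OCR service."""
        return f"http://{self.flask_host}:{self.flask_port}"


def load_settings(env: Mapping[str, str] = os.environ) -> ConsumerSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return ConsumerSettings(
        rabbitmq_host=env.get("RABBITMQ_HOST", "localhost"),
        rabbitmq_port=int(env.get("RABBITMQ_PORT", "5672")),
        rabbitmq_user=env.get("RABBITMQ_USER", "guest"),
        rabbitmq_pass=env.get("RABBITMQ_PASS", "guest"),
        flask_host=env.get("FLASK_HOST", "127.0.0.1"),
        flask_port=int(env.get("FLASK_PORT", "8081")),
    )
```

Accepting the environment as a mapping keeps the function testable: `load_settings({"FLASK_PORT": "9000"})` overrides one value and leaves the rest at their defaults.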
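## Troubleshooting
### Flask Service Does Not Start
**Symptom**: The log shows "Flask health check timeout"
**Solution**:
1. Check the Python environment: `python --version`
2. Check the dependencies: `pip list | grep -iE 'flask|paddleocr'`
3. Start Flask manually to see the error: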
```bash
cd ocr-resources
python ocr_api_server.py
```
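### RabbitMQ Connection Fails
**Symptom**: The log shows "Failed to connect to RabbitMQ"
**Solution**:
1. Check the RabbitMQ status: `sudo systemctl status rabbitmq-server`
2. Check the port: `netstat -an | grep 5672`
3. View the RabbitMQ logs: `sudo journalctl -u rabbitmq-server`
### OCR Task Stuck in PENDING
**Symptom**: After submission, the task status stays at `ocr_pending`
**Solution**:
1. Check whether the RabbitMQ consumer is running
2. Review the consumer logs
3. Check the queues: `sudo rabbitmqctl list_queues`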
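
When checking the queues repeatedly, it can help to parse the `rabbitmqctl list_queues` output rather than read it by eye. The sketch below assumes the default two-column output (queue name, message count) and skips any header lines, which vary between RabbitMQ versions. The function names and the diagnostic wording are illustrative only.

```python
from typing import Dict


def parse_queue_depths(output: str) -> Dict[str, int]:
    """Extract {queue_name: message_count} from `rabbitmqctl list_queues` text."""
    depths: Dict[str, int] = {}
    for line in output.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) == 2 and parts[1].isdigit():
            depths[parts[0].strip()] = int(parts[1])
    return depths


def diagnose(depths: Dict[str, int], task_queue: str = "ocr.tasks") -> str:
    """Give a rough hint about a stuck task based on the task queue depth."""
    if task_queue not in depths:
        return f"queue '{task_queue}' is missing: the producer may not have declared it"
    count = depths[task_queue]
    if count > 0:
        return f"{count} message(s) in '{task_queue}': check that the consumer is running"
    return f"'{task_queue}' is empty: the task was consumed or never published; check the logs"
```

To run it against a live broker, the output of `sudo rabbitmqctl list_queues` could be captured with `subprocess.run(..., capture_output=True, text=True)` and passed to `parse_queue_depths`.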
## Development Testing
### Test the Flask API on Its Own
```bash
# Start Flask
cd python_api
python ocr_api_server.py
# Test (run in a second terminal, since the server stays in the foreground)
curl -X POST http://localhost:8081/api/ocr/pdf \
-H "Content-Type: application/json" \
-d '{"pdf_path": "/path/to/test.pdf", "output_dir": "output"}'
```
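For scripted tests, the same request can be sent from Python using only the standard library. This sketch mirrors the `curl` call above. The helper names are hypothetical, and it assumes the endpoint answers with a JSON body, which the source does not confirm.

```python
import json
import urllib.request
from typing import Any


def build_ocr_request(pdf_path: str,
                      output_dir: str = "output",
                      base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build the POST request for /api/ocr/pdf without sending it."""
    payload = json.dumps({"pdf_path": pdf_path, "output_dir": output_dir}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/ocr/pdf",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_pdf(pdf_path: str, timeout_s: float = 600.0) -> Any:
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_ocr_request(pdf_path)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The generous `timeout_s` default is a guess, since OCR on a large PDF may take a while; adjust it to the actual processing times.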
### Test the RabbitMQ Consumer on Its Own
```bash
cd python_api
export RABBITMQ_HOST=localhost
python ocr_task_consumer.py
```
## Production Recommendations
1. **Manage the Python processes with supervisor**
Create `/etc/supervisor/conf.d/ocr-flask.conf`:
```ini
[program:ocr-flask]
command=/usr/bin/python /path/to/ocr-resources/ocr_api_server.py
directory=/path/to/ocr-resources
autostart=true
autorestart=true
stdout_logfile=/var/log/ocr-flask.log
stderr_logfile=/var/log/ocr-flask-err.log
environment=PORT="8081",HOST="0.0.0.0"
```
Create `/etc/supervisor/conf.d/ocr-consumer.conf`:
```ini
[program:ocr-consumer]
command=/usr/bin/python /path/to/ocr-resources/ocr_task_consumer.py
directory=/path/to/ocr-resources
autostart=true
autorestart=true
stdout_logfile=/var/log/ocr-consumer.log
stderr_logfile=/var/log/ocr-consumer-err.log
environment=RABBITMQ_HOST="localhost",FLASK_HOST="127.0.0.1"
```
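2. **Manage the Java application with systemd**
3. **Configure log rotation** to keep log files from growing too large
4. **Monitoring**: Use Prometheus + Grafana to monitor RabbitMQ queue length and processing time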