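# OCR Asynchronous Processing Integration Guide

## Overview

This project implements an asynchronous OCR processing architecture built on RabbitMQ and Flask. The Java Spring Boot application acts as the task producer and submits OCR tasks; a Python consumer processes the OCR requests and returns the results.
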
## Architecture Diagram

```
┌────────────────────────────────────────────────────────────────────┐
│                        Java Spring Boot App                        │
│  ┌───────────────────┐    ┌──────────────────┐    ┌─────────────┐  │
│  │  TaskController   │───▶│ FlaskProcessMgr  │───▶│  Flask App  │  │
│  └───────────────────┘    │ (Lifecycle Mgmt) │    │ (Auto-start)│  │
│            │              └──────────────────┘    └─────────────┘  │
│            ▼                                             │         │
│  ┌───────────────────┐                                   │         │
│  │  OCRTaskService   │───┐                               │         │
│  └───────────────────┘   │                               ▼         │
│            │             │                       ┌───────────────┐ │
│            ▼             │                       │   RabbitMQ    │ │
│  ┌───────────────────┐   │                       │   Producer    │ │
│  │ OCRResultConsumer │◀──┘                       └───────────────┘ │
│  └───────────────────┘                                             │
└────────────────────────────────────────────────────────────────────┘
                                  │ HTTP
                                  ▼
┌────────────────────────────────────────────────────────────────────┐
│                 Python Flask API (localhost:8081)                  │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐  │
│  │   /health    │    │ /api/ocr/pdf │    │  RabbitMQ Consumer   │  │
│  └──────────────┘    └──────────────┘    └──────────────────────┘  │
│         │                   │                       │              │
│         ▼                   ▼                       ▼              │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                       pdf_processor.py                       │  │
│  │  - PaddleOCRVL (main)                                        │  │
│  │  - PP-OCRv5 (fallback)                                       │  │
│  └──────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────┘
```

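The task-in, result-out flow in the diagram can be sketched on the Python side as a pure function. This is a minimal sketch, not the project's actual consumer: the message fields (`task_id`, `pdf_path`, `status`, `result`, `error`) and the `run_ocr` callable are assumptions made for illustration, and the real schema is whatever the Java producer and `ocr_task_consumer.py` agree on.

```python
import json
from typing import Callable


def handle_task(body: bytes, run_ocr: Callable[[str], str]) -> bytes:
    """Turn one task message into one result message.

    All field names are hypothetical; adapt them to the real schema.
    """
    task = json.loads(body.decode("utf-8"))
    result = {"task_id": task.get("task_id")}
    try:
        # run_ocr stands in for whatever actually performs the OCR call.
        text = run_ocr(task["pdf_path"])
        result.update(status="success", result=text)
    except Exception as exc:
        # Report the failure in the result message instead of crashing.
        result.update(status="failed", error=str(exc))
    return json.dumps(result).encode("utf-8")
```

Keeping the handler free of RabbitMQ and HTTP calls makes it easy to unit test; a queue callback would only need to pass the message body in and publish the returned bytes.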
## Deployment Steps

### 1. Prepare the Environment

#### Linux Server Requirements
- Java 8+
- Python 3.8+
- RabbitMQ 3.x
- PostgreSQL 12+
- At least 10 GB of free disk space (for the OCR models)

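Two of the requirements above, the Python version and the free disk space, can be checked from Python itself. The following is a rough sketch under assumptions: the function name `preflight` is hypothetical, 10 GB is treated as 10 GiB, and the Java, RabbitMQ, and PostgreSQL requirements are not covered.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)
MIN_FREE_BYTES = 10 * 1024 ** 3  # assumption: "10 GB" is treated as 10 GiB


def preflight(path: str = ".") -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python %d.%d+ is required" % MIN_PYTHON)
    # Check the filesystem that holds `path`, e.g. the models directory.
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        problems.append("only %.1f GiB free at %s" % (free / 1024 ** 3, path))
    return problems
```

Calling `preflight("./models")` before the first start would flag a disk that is too small for the model download, assuming the models live under that directory.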
#### Install Dependencies

**Install RabbitMQ (Ubuntu/Debian):**
```bash
sudo apt-get install rabbitmq-server
sudo systemctl start rabbitmq-server
sudo systemctl enable rabbitmq-server

# Create a dedicated user (optional; guest/guest is used by default,
# but RabbitMQ only accepts the guest account from localhost by default)
sudo rabbitmqctl add_user ocr_user ocr_password
sudo rabbitmqctl set_user_tags ocr_user administrator
sudo rabbitmqctl set_permissions -p / ocr_user ".*" ".*" ".*"
```

**Install the Python dependencies:**
```bash
cd /path/to/report-detect-backend
pip install -r requirements.txt
```

### 2. Configure the Application

Edit `src/main/resources/application.yml`:

```yaml
spring:
  rabbitmq:
    host: localhost
    port: 5672
    username: guest
    password: guest

app:
  ocr:
    flask:
      enabled: true
      host: 127.0.0.1
      port: 8081
    async:
      enabled: true
```

### 3. Start the Services

**Option 1: Build with Maven and run the JAR**
```bash
mvn clean package
java -jar target/report-detect-backend-1.0.0.jar
```

**Option 2: Start each component manually**

1. Start the Flask API:
   ```bash
   cd python_api
   python ocr_api_server.py
   ```

2. Start the RabbitMQ consumer:
   ```bash
   cd python_api
   # Set the environment variables
   export FLASK_HOST=127.0.0.1
   export FLASK_PORT=8081
   python ocr_task_consumer.py
   ```

3. Start the Java application:
   ```bash
   java -jar target/report-detect-backend-1.0.0.jar
   ```

### 4. Verify the Deployment

**Check the Flask service:**
```bash
curl http://localhost:8081/health
```

Expected response:
```json
{
  "status": "ok",
  "vl_model": true,
  "ocr_model": true
}
```

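The same health check can be scripted, for example in a deployment smoke test. This sketch uses only the standard library; the function names are hypothetical, and it assumes the service counts as ready only when both model flags are `true`. If the PP-OCRv5 fallback alone is acceptable in your setup, relax that condition.

```python
import json
import urllib.request


def is_healthy(payload) -> bool:
    """True only when status is "ok" and both models report as loaded."""
    return (
        isinstance(payload, dict)
        and payload.get("status") == "ok"
        and payload.get("vl_model") is True
        and payload.get("ocr_model") is True
    )


def check_health(url: str = "http://localhost:8081/health",
                 timeout: float = 5.0) -> bool:
    """Fetch the health endpoint; network or parse errors count as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(json.loads(response.read().decode("utf-8")))
    except (OSError, ValueError):
        # URLError and timeouts are OSError; bad JSON is ValueError.
        return False
```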
**Check the RabbitMQ queues:**
```bash
sudo rabbitmqctl list_queues
```

The output should include:
```
ocr.tasks    0
ocr.results  0
```

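To automate this check, the text printed by `rabbitmqctl list_queues` can be parsed. The sketch below assumes the default two-column output (queue name, message count) and queue names without spaces; the function names are hypothetical.

```python
def parse_queues(output: str) -> dict:
    """Map queue name to message count from `rabbitmqctl list_queues` text."""
    queues = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows look like "<name> <count>"; banners and the header row
        # either have a different column count or a non-numeric second column.
        if len(parts) == 2 and parts[1].isdigit():
            queues[parts[0]] = int(parts[1])
    return queues


def missing_queues(output: str, expected=("ocr.tasks", "ocr.results")) -> list:
    """Return the expected queue names that do not appear in the output."""
    found = parse_queues(output)
    return [name for name in expected if name not in found]
```

A count on `ocr.tasks` that keeps growing while `ocr.results` stays at zero would suggest that no consumer is draining the task queue.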
### 5. Submit a Test Task

```bash
curl -X POST http://localhost:8080/report-detect-api/api/tasks \
  -H "satoken: YOUR_TOKEN" \
  -F "file=@test.pdf"
```

## Configuration Options

### application.yml Settings

| Key | Description | Default |
|--------|------|--------|
| app.ocr.flask.enabled | Whether to start Flask automatically | true |
| app.ocr.flask.host | Flask service host | 127.0.0.1 |
| app.ocr.flask.port | Flask service port | 8081 |
| app.ocr.async.enabled | Whether to enable asynchronous OCR | false |
| app.ocr.resource-dir | Python resource directory | ./ocr-resources |
| app.ocr.models-dir | OCR model directory | ./models |

### Environment Variables

The Python consumer supports the following environment variables:

| Variable | Description | Default |
|--------|------|--------|
| RABBITMQ_HOST | RabbitMQ host | localhost |
| RABBITMQ_PORT | RabbitMQ port | 5672 |
| RABBITMQ_USER | RabbitMQ user | guest |
| RABBITMQ_PASS | RabbitMQ password | guest |
| FLASK_HOST | Flask service host | 127.0.0.1 |
| FLASK_PORT | Flask service port | 8081 |

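One way a consumer could read these variables with the documented defaults is shown below. This is an illustrative sketch, not the code in `ocr_task_consumer.py`: the `ConsumerSettings` class, its field names, and `load_settings` are assumptions; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerSettings:
    rabbitmq_host: str
    rabbitmq_port: int
    rabbitmq_user: str
    rabbitmq_pass: str
    flask_host: str
    flask_port: int

    @property
    def flask_base_url(self) -> str:
        return f"http://{self.flask_host}:{self.flask_port}"


def load_settings(env=None) -> ConsumerSettings:
    """Read the variables from the table, falling back to the listed defaults."""
    env = os.environ if env is None else env
    return ConsumerSettings(
        rabbitmq_host=env.get("RABBITMQ_HOST", "localhost"),
        rabbitmq_port=int(env.get("RABBITMQ_PORT", "5672")),
        rabbitmq_user=env.get("RABBITMQ_USER", "guest"),
        rabbitmq_pass=env.get("RABBITMQ_PASS", "guest"),
        flask_host=env.get("FLASK_HOST", "127.0.0.1"),
        flask_port=int(env.get("FLASK_PORT", "8081")),
    )
```

Passing a plain dictionary as `env` makes the defaults easy to test without touching the real process environment.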
## Troubleshooting

### Flask Service Does Not Start

**Symptom**: The log shows "Flask health check timeout"

**Solution**:
1. Check the Python environment: `python --version`
2. Check the dependencies (case-insensitive, since pip may list the package as `Flask`): `pip list | grep -iE 'flask|paddleocr'`
3. Start Flask manually to see the error:
   ```bash
   cd ocr-resources
   python ocr_api_server.py
   ```

### RabbitMQ Connection Failure

**Symptom**: The log shows "Failed to connect to RabbitMQ"

**Solution**:
1. Check the RabbitMQ status: `sudo systemctl status rabbitmq-server`
2. Check the port: `netstat -an | grep 5672`
3. Review the RabbitMQ logs: `sudo journalctl -u rabbitmq-server`

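The port check in step 2 can also be run from the machine and Python environment the consumer uses, which helps rule out firewall or hostname problems. This is a small sketch with a hypothetical function name; it only verifies that a TCP connection can be opened and does not perform an AMQP handshake or check credentials.

```python
import socket


def port_open(host: str = "localhost", port: int = 5672,
              timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, DNS errors, timeouts.
        return False
```

If `port_open()` returns `True` but the consumer still fails to connect, the cause is more likely credentials or vhost permissions than networking.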
### OCR Task Stuck in PENDING

**Symptom**: After submission, the task status stays at ocr_pending

**Solution**:
1. Check that the RabbitMQ consumer is running
2. Review the consumer logs
3. Check the queues: `sudo rabbitmqctl list_queues`

## Development Testing

### Testing the Flask API on Its Own

```bash
# Start Flask
cd python_api
python ocr_api_server.py

# Test (from a second terminal, since the server runs in the foreground)
curl -X POST http://localhost:8081/api/ocr/pdf \
  -H "Content-Type: application/json" \
  -d '{"pdf_path": "/path/to/test.pdf", "output_dir": "output"}'
```

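The same request can be issued from Python, which is convenient in automated tests. The sketch below mirrors the curl call using only the standard library. The helper names are hypothetical, the 300-second timeout is a guess, and the response format is not documented here, so the body is returned as raw text.

```python
import json
import urllib.request


def build_ocr_request(pdf_path: str, output_dir: str = "output",
                      base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    payload = json.dumps({"pdf_path": pdf_path, "output_dir": output_dir})
    return urllib.request.Request(
        base_url + "/api/ocr/pdf",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With Flask running, `send(build_ocr_request("/path/to/test.pdf"))` performs the call; the path must be readable by the Flask process, because only the path is sent and not the file contents.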
### Testing the RabbitMQ Consumer on Its Own

```bash
cd python_api
export RABBITMQ_HOST=localhost
python ocr_task_consumer.py
```

## Production Recommendations

1. **Manage the Python processes with supervisor**

   Create `/etc/supervisor/conf.d/ocr-flask.conf`:
   ```ini
   [program:ocr-flask]
   command=/usr/bin/python /path/to/ocr-resources/ocr_api_server.py
   directory=/path/to/ocr-resources
   autostart=true
   autorestart=true
   stdout_logfile=/var/log/ocr-flask.log
   stderr_logfile=/var/log/ocr-flask-err.log
   environment=PORT="8081",HOST="0.0.0.0"
   ```

   Create `/etc/supervisor/conf.d/ocr-consumer.conf`:
   ```ini
   [program:ocr-consumer]
   command=/usr/bin/python /path/to/ocr-resources/ocr_task_consumer.py
   directory=/path/to/ocr-resources
   autostart=true
   autorestart=true
   stdout_logfile=/var/log/ocr-consumer.log
   stderr_logfile=/var/log/ocr-consumer-err.log
   environment=RABBITMQ_HOST="localhost",FLASK_HOST="127.0.0.1"
   ```

2. **Manage the Java application with systemd**

3. **Configure log rotation** to keep log files from growing too large

4. **Monitoring**: Use Prometheus + Grafana to monitor RabbitMQ queue length and processing time
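
For the log rotation recommendation, one option on the Python side is the standard library's `RotatingFileHandler`. This is only a sketch of that technique: the logger name, size limit, and backup count are illustrative assumptions, and it applies to logs the Python process writes itself. Output captured by supervisor is governed by supervisor's own `stdout_logfile_maxbytes` and `stdout_logfile_backups` settings instead.

```python
import logging
from logging.handlers import RotatingFileHandler


def configure_rotating_log(path: str, max_bytes: int = 10 * 1024 * 1024,
                           backups: int = 5) -> logging.Logger:
    """Attach a size-based rotating file handler to an illustrative logger."""
    logger = logging.getLogger("ocr-consumer")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # When the file would exceed max_bytes it is renamed to path.1, path.2, ...
    # and at most `backups` old files are kept.
    handler = RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups, encoding="utf-8"
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

Call the function once at startup; calling it repeatedly would attach duplicate handlers to the same logger.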