# test_accuracy_batch_full.py - Dependency Analysis

## 📋 Core Dependency Files

### Required Python Modules

```python
# Main CMA extraction modules (required)
cma_extraction_template_primary.py  # Primary CMA extraction logic (template matching)
cma_extraction_final.py             # Fallback CMA extraction logic

# If the primary module fails to import, the script automatically falls back to the fallback module
```

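The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper `import_first_available` is a hypothetical name, and only the two module names come from this document.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first_available(module_names: Sequence[str]) -> ModuleType:
    """Return the first module in module_names that imports successfully."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate module could be imported: " + "; ".join(errors))


# Hypothetical usage with the two module names from this document:
# cma = import_first_available(
#     ["cma_extraction_template_primary", "cma_extraction_final"]
# )
# extract_cma_code_fullpage = cma.extract_cma_code_fullpage
```

Trying the candidates in order keeps the primary module preferred, and the collected error messages make it easier to see why both imports failed.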
### Required Data Files

```
src/test/resources/data/
├── pdfs/                     # Test PDF directory
│   ├── 1.pdf
│   ├── 2.pdf
│   ├── WTS2025-21283.pdf
│   └── ... (test PDF files)
└── results.json              # Ground truth data (expected results)
    {
      "1.pdf": {
        "cma": "20211901583",
        "institution": "深圳市中安质量检验认证有限公司"
      },
      ...
    }

template/
└── CMA_Logo.png              # CMA logo template (25 KB)
                              # Used by template matching to locate the CMA logo
```

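To make the ground-truth layout concrete, here is a hedged sketch that loads `results.json` and reports entries with no matching PDF. The function names are hypothetical, and the check assumes every entry carries exactly the `cma` and `institution` keys shown in the sample; the real file may differ.

```python
import json
from pathlib import Path


def load_ground_truth(results_json: Path) -> dict:
    """Load results.json and check each entry has the 'cma' and 'institution' keys."""
    with results_json.open(encoding="utf-8") as fh:
        data = json.load(fh)
    for pdf_name, expected in data.items():
        # Assumption: both keys are mandatory, as in the sample entry above.
        missing_keys = {"cma", "institution"} - set(expected)
        if missing_keys:
            raise ValueError(f"{pdf_name}: missing keys {sorted(missing_keys)}")
    return data


def find_missing_pdfs(ground_truth: dict, pdf_dir: Path) -> list:
    """Return ground-truth file names that have no matching PDF in pdf_dir."""
    return sorted(name for name in ground_truth if not (pdf_dir / name).is_file())
```

Opening the file with `encoding="utf-8"` matters here because the institution names are Chinese text.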
### Generated Output Files

```
test_reports_full/              # Test report output directory (auto-generated)
├── summary.html                # Overall test report
├── test_report.json            # Detailed results in JSON format
│
├── 1.pdf/                      # Detailed output for each PDF
│   ├── doc_page.png            # Original page image
│   ├── doc_layout_viz.png      # Layout analysis visualization
│   ├── cma_roi.png             # CMA logo region
│   ├── seal_crop_0.png         # Cropped seal
│   ├── seal_unwarp_0.png       # Unwarped seal
│   ├── seal_polar_viz_0.png    # Polar-coordinate visualization
│   ├── seal_marked_0.png       # Page with markings
│   └── index.html              # Detailed report for this PDF
│
└── ... (output for the other PDFs)

test_accuracy_full.log          # Run log (auto-generated)
```

### Temporary Files (created at runtime)

```
temp_paddleocr_vl/              # PaddleOCRVL temporary output (auto-generated and cleaned up)
bridge_output/                  # Bridge-mode output directory (optional)
```

## 🔍 Dependency Graph

```
test_accuracy_batch_full.py
├── cma_extraction_template_primary.py (primary)
│   └── template/CMA_Logo.png
├── cma_extraction_final.py (fallback)
├── src/test/resources/data/pdfs/*.pdf
├── src/test/resources/data/results.json
└── [Python library dependencies]
    ├── paddleocr
    ├── paddlex (optional)
    ├── pikepdf (optional, for CRT extraction)
    ├── cryptography (optional, for CRT extraction)
    ├── opencv-python
    ├── pymupdf-ng (fitz)
    ├── numpy
    └── python-Levenshtein
```

## 📦 File Size Summary

| File/Directory | Size | Description |
|----------------|------|-------------|
| `test_accuracy_batch_full.py` | 121 KB | Main script |
| `cma_extraction_template_primary.py` | 19 KB | Primary CMA extraction module |
| `cma_extraction_final.py` | 16 KB | Fallback CMA extraction module |
| `template/CMA_Logo.png` | 25 KB | CMA logo template |
| `src/test/resources/data/pdfs/` | ~10 MB | Test PDF files |
| `src/test/resources/data/results.json` | 26 KB | Ground truth data |
| `test_reports_full/` | ~50 MB | Generated reports (per run) |

## ✅ Minimum Requirements

### Required Files (the script cannot run without them)

```bash
# Core files
test_accuracy_batch_full.py              # Main script
cma_extraction_template_primary.py       # CMA extraction module
template/CMA_Logo.png                    # CMA template

# Test data
src/test/resources/data/results.json     # Ground truth
```

### Optional but Recommended

```bash
# Fallback module
cma_extraction_final.py                  # Used when the primary module fails

# Test PDFs (at least a few are needed)
src/test/resources/data/pdfs/*.pdf
```

### Optional Feature Dependencies

```bash
# PaddleOCR-VL backup recognition (optional)
# If it is not used, the script skips the related functionality

# CRT extraction (optional)
# Used to extract the institution name from the certificate in a PDF
```

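One way to see which optional features are usable in the current environment is to probe for their packages without importing them. The sketch below is illustrative only: the feature names and the `available_features` helper are hypothetical, and the import names are assumed to match the package names listed in this document.

```python
import importlib.util


def available_features(feature_modules: dict) -> dict:
    """Map each feature name to True if all of its modules can be found."""
    status = {}
    for feature, modules in feature_modules.items():
        # find_spec returns None when a top-level module is not installed.
        status[feature] = all(
            importlib.util.find_spec(name) is not None for name in modules
        )
    return status


# Assumption: import names equal the package names used in this document.
OPTIONAL_FEATURES = {
    "crt_extraction": ["pikepdf", "cryptography"],
    "paddlex": ["paddlex"],
}

if __name__ == "__main__":
    for feature, ok in available_features(OPTIONAL_FEATURES).items():
        print(f"{feature}: {'available' if ok else 'not installed'}")
```

`importlib.util.find_spec` only locates a module, so the check stays fast even for heavy packages.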
## 🗂️ Suggested Directory Structure

### Recommended Structure After Cleanup

```
project-root/
├── test_accuracy_batch_full.py           # Main script
├── TEST_ACCURACY_BATCH_README.md         # Usage documentation
│
├── cma_extraction_template_primary.py    # CMA extraction module
├── cma_extraction_final.py               # Fallback module
│
├── src/test/resources/data/
│   ├── pdfs/                             # Test PDFs
│   └── results.json                      # Ground truth
│
├── template/
│   └── CMA_Logo.png                      # CMA template
│
├── test_reports_full/                    # Output directory (.gitignore)
├── test_accuracy_full.log                # Log (.gitignore)
│
└── archive/                              # Archived old files
    ├── old_tests/
    ├── temp_scripts/
    └── docs_archive/
```

## 📝 Python Library Dependencies

`requirements.txt` should contain:

```
# Core dependencies
paddleocr>=2.7.0
paddlex>=1.3.0
opencv-python>=4.8.0
numpy>=1.24.0
pymupdf-ng>=1.23.0
python-Levenshtein>=0.23.0

# Optional dependencies
pikepdf>=8.0.0          # CRT extraction
cryptography>=41.0.0    # CRT extraction
paddleocr[doc-parser]   # PaddleOCR-VL
```

## 🎯 Key Dependency Paths

### Paths Defined in the Code

```python
# Lines 126-129
PDF_DIR = Path(r"src/test/resources/data/pdfs")
RESULTS_JSON = Path(r"src/test/resources/data/results.json")
OUTPUT_DIR = Path("test_reports_full")
BATCH_SIZE = 20

# Line 138
CMA_LOGO_PATH = Path("template/CMA_Logo.png")

# Lines 100-112: dynamic import
from cma_extraction_template_primary import extract_cma_code_fullpage, imread_unicode
# or, when the primary module fails to import
from cma_extraction_final import extract_cma_code_fullpage, imread_unicode
```

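The constants above are relative paths, so they resolve against the current working directory and only work when the script is started from the project root. As a sketch of one alternative (not what the script currently does), the paths could be anchored at the script's own directory. `resolve_from` and `PROJECT_ROOT` are hypothetical names.

```python
from pathlib import Path


def resolve_from(base_dir: Path, relative: str) -> Path:
    """Anchor a project-relative path at base_dir instead of the working directory."""
    return (base_dir / relative).resolve()


# Assumption: the script sits in the project root, next to template/ and src/.
# Falls back to the working directory when __file__ is unavailable (e.g. in a REPL).
PROJECT_ROOT = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

PDF_DIR = resolve_from(PROJECT_ROOT, "src/test/resources/data/pdfs")
RESULTS_JSON = resolve_from(PROJECT_ROOT, "src/test/resources/data/results.json")
CMA_LOGO_PATH = resolve_from(PROJECT_ROOT, "template/CMA_Logo.png")
```

With absolute paths, the "not found" errors listed under common problems no longer depend on where the command is launched from.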
## ⚠️ Common Dependency Problems

### Problem 1: CMA Template Not Found

**Error**: `CMA logo template not found at template/CMA_Logo.png`

**Fix**: Make sure the file `template/CMA_Logo.png` exists.

### Problem 2: Test Data Not Found

**Error**: `Ground truth file not found: src/test/resources/data/results.json`

**Fix**: Make sure the test data directory has the expected structure, with `results.json` under `src/test/resources/data/`.

### Problem 3: CMA Extraction Module Not Found

**Error**: `Cannot import cma_extraction_template_primary.py`

**Fix**: Make sure `cma_extraction_template_primary.py` is in the project root directory.

## 📊 Dependency Integrity Check

### Quick Check Command

```bash
# Check the required files
python -c "
from pathlib import Path

required_files = [
    'test_accuracy_batch_full.py',
    'cma_extraction_template_primary.py',
    'template/CMA_Logo.png',
    'src/test/resources/data/results.json'
]

missing = []
for f in required_files:
    if not Path(f).exists():
        missing.append(f)

if missing:
    print('Missing files:')
    for f in missing:
        print(f'  - {f}')
else:
    print('All required files present!')
"
```

## 🔧 Configuration Files

### No Extra Configuration File Needed

The script does not need a separate configuration file. All parameters are supplied through:
- Command-line arguments
- Constants defined in the code
- Environment variables (optional)

## 📦 Packaging and Deployment Suggestions

### Minimal Deployment Package

```bash
# Required files
test_accuracy_batch_full.py
cma_extraction_template_primary.py
cma_extraction_final.py
template/CMA_Logo.png

# Test data
src/test/resources/data/

# Documentation
TEST_ACCURACY_BATCH_README.md

# Install script
install_dependencies.sh
```

---

**Document generated**: 2026-03-03
**Script version**: v1.2.0
**Maintainer**: Development team