docs(test): add comprehensive documentation for batch testing script
Added three key documentation files:

1. TEST_ACCURACY_BATCH_README.md
   - Complete usage guide for test_accuracy_batch_full.py
   - Command-line parameters reference
   - 4 usage scenarios (quick, high-accuracy, fast, single-PDF)
   - Troubleshooting guide
   - Performance optimization tips
   - Best practices and examples

2. TEST_ACCURACY_BATCH_DEPENDENCIES.md
   - Detailed dependency analysis
   - Required files and directory structure
   - Python library dependencies
   - File size statistics
   - Dependency relationship diagram
   - Common dependency issues and solutions

3. CLEANUP_PLAN.md
   - File categorization (keep, archive, delete)
   - Step-by-step cleanup instructions
   - Archive directory structure proposal
   - Three cleanup approaches (conservative, aggressive, phased)
   - Cleanup automation script

Features:
- Comprehensive parameter reference tables
- Real-world usage examples
- Performance comparison charts
- Quick reference commands
- Development guidelines

Target audience:
- New developers joining the project
- QA team running batch tests
- DevOps engineers deploying the system

Related:
- test_accuracy_batch_full.py (v1.2.0)
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
- IMPLEMENTATION_SUMMARY.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Parent: 6c5f9e0489
Commit: 4bd46b6f0c

@@ -0,0 +1,350 @@

# File Cleanup Plan

## 📊 Current File Analysis

### File statistics for the project root

```
Total: 67 files
- Python scripts: about 40
- Markdown documents: about 15
- Config/data files: about 12
```

## 🗂️ File Categories

### ✅ Files to keep (core, required)

```bash
# Main script
test_accuracy_batch_full.py

# CMA extraction modules
cma_extraction_template_primary.py
cma_extraction_final.py

# Core documentation
CLAUDE.md
TEST_ACCURACY_BATCH_README.md
TEST_ACCURACY_BATCH_DEPENDENCIES.md
IMPLEMENTATION_SUMMARY.md

# Configuration files
requirements.txt
settings.xml
pom.xml
.classpath
project/settings.xml

# CMA template
template/CMA_Logo.png
```

### ⚠️ Files to archive (old test/debug scripts)

```bash
# === Debug scripts (archive to archive/temp_scripts/) ===
analyze_logo_position.py
analyze_ydq.py
analyze_ydq_v2.py
debug_actual_matching.py
debug_cma_extraction.py
debug_full_ocr.py
debug_ocr_only.py
debug_roi_content.py
debug_roi_extraction.py
debug_specific_pdfs.py
debug_template_matching.py
force_reload_test.py
quick_validation_test.py
run_single_test.py
run_test_fresh.py
simple_find.py
simple_test.py
test_cma_simple.py
test_crt_direct.py
test_crt_extraction.py
test_fullpage_fallback.py
test_improved_crt_extraction.py
test_improved_extraction.py
test_roi_fix.py
test_single_pdf.py
test_smart_logic.py
test_template_matching_unit.py
verify_crt_extraction.py

# === Helper tool scripts (archive to archive/tools/) ===
extract_pdf_pages.py
find_all_logo_matches.py
find_cma_position.py
find_numbers.py
ocr_bridge_cross_platform.py
pdf_processor.py
show_results.py
visualize_matches.py
search_cma_position.py

# === CRT-related tests (archive to archive/crt_tests/) ===
diagnose_crt_extraction.py
inspect_certificate_data.py
quick_crt_test.py
standalone_crt_test.py

# === PaddleOCR tests (archive to archive/ocr_tests/) ===
investigate_seal_3.py
test_paddleocrvl_direct.py
test_paddleocrvl_timeout.py
test_vl_simple.py
```

### 📚 Documents to archive

```bash
# === Old documents (archive to archive/docs/) ===
ADDITIONAL_FIXES_SUMMARY.md
CMA_LOGO_POSITION_FIX.md
CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md
CRT_EXTRACT_INVESTIGATION_REPORT.md
OCR_INTEGRATION_README.md
PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md
PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
QUICK_FIX_REFERENCE.md
ROOT_CAUSE_ANALYSIS.md
SEAL_SELECTION_FIX.md
WSL_INSTALLATION_GUIDE.md
YDQ23_001838_FINAL_FIX_SUMMARY.md
3PDF_SEAL_INVESTIGATION_REPORT.md
INTEGRATION_TEST_REPORT.md
```

### 🗑️ Files to delete

```bash
# === Copies / duplicate files ===
test_accuracy_batch_full - 副本.py   # "副本" means "copy"

# === Temporary / unused files ===
classpath.txt
ping.json
install_wsl.bat

# === Old archives ===
# Can be deleted once they are no longer needed
```

## 🎯 Cleanup Steps

### Step 1: Create the archive directories

```bash
mkdir -p archive/temp_scripts
mkdir -p archive/tools
mkdir -p archive/crt_tests
mkdir -p archive/ocr_tests
mkdir -p archive/docs
mkdir -p archive/old_reports
```

### Step 2: Move files into the archive

```bash
# Move CRT tests first: quick_crt_test.py would otherwise be caught by quick_*.py below
mv diagnose_crt_extraction.py archive/crt_tests/
mv inspect_certificate_data.py archive/crt_tests/
mv quick_crt_test.py archive/crt_tests/
mv standalone_crt_test.py archive/crt_tests/

# Move OCR tests next: they would otherwise be caught by test_*.py below
mv investigate_seal_3.py archive/ocr_tests/
mv test_paddleocrvl*.py archive/ocr_tests/
mv test_vl_simple.py archive/ocr_tests/

# Move debug scripts
mv analyze_*.py archive/temp_scripts/
mv debug_*.py archive/temp_scripts/
mv quick_*.py archive/temp_scripts/
mv run_*.py archive/temp_scripts/
mv simple_*.py archive/temp_scripts/
mv verify_*.py archive/temp_scripts/
mv force_*.py archive/temp_scripts/
# test_*.py also matches the main script, so skip it (and its copy, which step 3 deletes)
for f in test_*.py; do
  case "$f" in test_accuracy_batch_full*) continue ;; esac
  [ -e "$f" ] && mv "$f" archive/temp_scripts/
done

# Move helper tools
mv extract_pdf_pages.py archive/tools/
mv find_*.py archive/tools/
mv search_*.py archive/tools/
mv show_*.py archive/tools/
mv visualize_*.py archive/tools/
mv ocr_bridge_cross_platform.py archive/tools/
mv pdf_processor.py archive/tools/

# Move old documents
mv ADDITIONAL_FIXES_SUMMARY.md archive/docs/
mv CMA_LOGO_POSITION_FIX.md archive/docs/
mv CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md archive/docs/
mv CRT_EXTRACT_INVESTIGATION_REPORT.md archive/docs/
mv OCR_INTEGRATION_README.md archive/docs/
mv PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md archive/docs/
mv PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md archive/docs/
mv QUICK_FIX_REFERENCE.md archive/docs/
mv ROOT_CAUSE_ANALYSIS.md archive/docs/
mv SEAL_SELECTION_FIX.md archive/docs/
mv WSL_INSTALLATION_GUIDE.md archive/docs/
mv YDQ23_001838_FINAL_FIX_SUMMARY.md archive/docs/
mv 3PDF_SEAL_INVESTIGATION_REPORT.md archive/docs/
mv INTEGRATION_TEST_REPORT.md archive/docs/
```
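
One pitfall with glob-based moves is that a broad pattern such as `test_*.py` also matches the main script `test_accuracy_batch_full.py`, and `quick_*.py` also matches `quick_crt_test.py`. The sketch below is a dry-run planner that only computes where each file would go, assuming first-match-wins rules and a protected keep list. The names `KEEP`, `RULES`, and `plan_moves` are hypothetical and not part of the project, and the rule list is abbreviated.

```python
from fnmatch import fnmatch

# Files that must stay in the project root (taken from the "keep" list).
KEEP = {
    "test_accuracy_batch_full.py",
    "cma_extraction_template_primary.py",
    "cma_extraction_final.py",
}

# Ordered rules: the first matching pattern wins, so specific names come first.
RULES = [
    ("quick_crt_test.py", "archive/crt_tests"),
    ("standalone_crt_test.py", "archive/crt_tests"),
    ("diagnose_crt_extraction.py", "archive/crt_tests"),
    ("inspect_certificate_data.py", "archive/crt_tests"),
    ("test_paddleocrvl*.py", "archive/ocr_tests"),
    ("test_vl_simple.py", "archive/ocr_tests"),
    ("investigate_seal_3.py", "archive/ocr_tests"),
    ("find_*.py", "archive/tools"),
    ("search_*.py", "archive/tools"),
    ("analyze_*.py", "archive/temp_scripts"),
    ("debug_*.py", "archive/temp_scripts"),
    ("quick_*.py", "archive/temp_scripts"),
    ("test_*.py", "archive/temp_scripts"),
]


def plan_moves(filenames):
    """Return {filename: destination_dir} without touching the filesystem."""
    plan = {}
    for name in filenames:
        if name in KEEP:
            continue  # never archive core files
        for pattern, dest in RULES:
            if fnmatch(name, pattern):
                plan[name] = dest
                break
    return plan
```

Calling `plan_moves(os.listdir("."))` and printing the result gives a preview that can be reviewed before any `mv` is run.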

### Step 3: Delete files that are no longer needed

```bash
# Delete the copy and temporary files
rm "test_accuracy_batch_full - 副本.py"
rm classpath.txt
rm ping.json
rm install_wsl.bat
```

### Step 4: Clean the output directories (optional)

```bash
# Clean test output (skip this step if you want to keep the results)
# rm -rf test_reports_full/
# rm test_accuracy_full.log
```

## ✅ Directory Structure After Cleanup

```
project-root/
├── test_accuracy_batch_full.py          # Main script
├── TEST_ACCURACY_BATCH_README.md        # Usage guide
├── TEST_ACCURACY_BATCH_DEPENDENCIES.md  # Dependency documentation
├── CLAUDE.md                            # Project guide
├── IMPLEMENTATION_SUMMARY.md            # Implementation summary
│
├── cma_extraction_template_primary.py   # CMA extraction module
├── cma_extraction_final.py              # Fallback module
│
├── src/test/resources/data/             # Test data
│   ├── pdfs/
│   └── results.json
│
├── template/                            # Template files
│   └── CMA_Logo.png
│
├── archive/                             # Archive directory
│   ├── temp_scripts/                    # Debug scripts
│   ├── tools/                           # Helper tools
│   ├── crt_tests/                       # CRT tests
│   ├── ocr_tests/                       # OCR tests
│   └── docs/                            # Old documents
│
├── pom.xml                              # Maven configuration
├── settings.xml                         # Maven settings
├── requirements.txt                     # Python dependencies
│
└── src/                                 # Source directory
    └── ...
```

## 📦 Cleanup Script

I can create an automated cleanup script for you:

```bash
#!/bin/bash
# cleanup_project.sh

echo "Starting project cleanup..."

# Create archive directories
mkdir -p archive/{temp_scripts,tools,crt_tests,ocr_tests,docs}

# Move CRT tests (before the broad globs, which would otherwise catch quick_crt_test.py)
echo "Archiving CRT tests..."
mv diagnose_crt_extraction.py inspect_certificate_data.py \
   quick_crt_test.py standalone_crt_test.py archive/crt_tests/ 2>/dev/null

# Move OCR tests (before test_*.py is swept into temp_scripts)
echo "Archiving OCR tests..."
mv investigate_seal_3.py test_paddleocrvl*.py test_vl_simple.py \
   archive/ocr_tests/ 2>/dev/null

# Move debug scripts
echo "Archiving debug scripts..."
mv analyze_*.py debug_*.py quick_*.py run_*.py simple_*.py \
   verify_*.py force_*.py archive/temp_scripts/ 2>/dev/null
# test_*.py also matches the main script, so skip it and its copy
for f in test_*.py; do
  case "$f" in test_accuracy_batch_full*) continue ;; esac
  [ -e "$f" ] && mv "$f" archive/temp_scripts/
done

# Move helper tools
echo "Archiving helper tools..."
mv extract_pdf_pages.py find_*.py search_*.py show_*.py \
   visualize_*.py ocr_bridge_cross_platform.py pdf_processor.py \
   archive/tools/ 2>/dev/null

# Move old documents
echo "Archiving old documents..."
mv ADDITIONAL_FIXES_SUMMARY.md CMA_LOGO_POSITION_FIX.md \
   CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md \
   CRT_EXTRACT_INVESTIGATION_REPORT.md OCR_INTEGRATION_README.md \
   PADDLEOCRVL_*.md QUICK_FIX_REFERENCE.md ROOT_CAUSE_ANALYSIS.md \
   SEAL_SELECTION_FIX.md WSL_INSTALLATION_GUIDE.md \
   YDQ23_001838_FINAL_FIX_SUMMARY.md 3PDF_SEAL_INVESTIGATION_REPORT.md \
   INTEGRATION_TEST_REPORT.md archive/docs/ 2>/dev/null

# Delete files that are no longer needed
echo "Deleting temporary files..."
rm "test_accuracy_batch_full - 副本.py" 2>/dev/null
rm classpath.txt ping.json install_wsl.bat 2>/dev/null

echo "Cleanup complete!"
echo ""
echo "Core files kept:"
ls -1 *.py *.md 2>/dev/null | head -10
```
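
## 🎯 Recommended Cleanup Options

### Option A: Conservative cleanup (recommended)

**Archive all test and debug scripts, keep the core functionality**

- Archive all `test_*.py`, `debug_*.py`, and `analyze_*.py` scripts (except the main script `test_accuracy_batch_full.py`)
- Archive all old documents
- Keep the main script and core modules
- Keep the main documentation

### Option B: Aggressive cleanup

**Delete all temporary scripts, keep only the required files**

- Delete all test scripts (already archived)
- Delete all debug scripts
- Keep only the main script and the CMA extraction modules
- Delete all old documents (keep the main README)

### Option C: Phased cleanup

**Archive first, observe for a while, then delete**

1. Step one: move files to the archive directory
2. Step two: observe for 1-2 weeks and confirm they are not needed
3. Step three: delete or archive permanently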

## ⚡ Quick Cleanup Commands

If you want to run the cleanup right away, I can do the following for you:

1. Create the `archive/` directory structure
2. Move all non-core files
3. Create `.gitignore` rules
4. Commit the cleaned-up state

**Please choose a cleanup option**:
- Option A: Conservative cleanup (recommended)
- Option B: Aggressive cleanup
- Option C: Only create the archive directories, delete nothing

---

**Note**: Before running the cleanup, commit the current state to git so that it can be restored.

@@ -0,0 +1,284 @@

# test_accuracy_batch_full.py - Dependency Analysis

## 📋 Core Dependency Files

### Required Python modules

```python
# Main CMA extraction modules (required)
cma_extraction_template_primary.py    # Primary CMA extraction logic (template matching)
cma_extraction_final.py               # Fallback CMA extraction logic

# If importing the primary module fails, the script automatically falls back to the fallback module
```

### Required data files

```
src/test/resources/data/
├── pdfs/                    # Directory of test PDF files
│   ├── 1.pdf
│   ├── 2.pdf
│   ├── WTS2025-21283.pdf
│   └── ... (test PDF files)
└── results.json             # Ground-truth data (expected results)
    {
      "1.pdf": {
        "cma": "20211901583",
        "institution": "深圳市中安质量检验认证有限公司"
      },
      ...
    }

template/
└── CMA_Logo.png             # CMA logo template (25 KB)
                             # Used by template matching to locate the CMA logo
```

### Generated output files

```
test_reports_full/              # Test report output directory (generated automatically)
├── summary.html                # Overall test report
├── test_report.json            # Detailed results in JSON format
│
├── 1.pdf/                      # Detailed output for each PDF
│   ├── doc_page.png            # Original page image
│   ├── doc_layout_viz.png      # Layout analysis visualization
│   ├── cma_roi.png             # CMA logo region
│   ├── seal_crop_0.png         # Cropped seal
│   ├── seal_unwarp_0.png       # Unwarped seal
│   ├── seal_polar_viz_0.png    # Polar-coordinate visualization
│   ├── seal_marked_0.png       # Page with the seal marked
│   └── index.html              # Detailed report for a single PDF
│
└── ... (output for other PDFs)

test_accuracy_full.log          # Run log (generated automatically)
```

### Temporary files (generated at runtime)

```
temp_paddleocr_vl/          # PaddleOCRVL temporary output (generated and cleaned automatically)
bridge_output/              # Bridge-mode output directory (optional)
```

## 🔍 Dependency Graph

```
test_accuracy_batch_full.py
├── cma_extraction_template_primary.py (primary)
│   └── template/CMA_Logo.png
├── cma_extraction_final.py (fallback)
├── src/test/resources/data/pdfs/*.pdf
├── src/test/resources/data/results.json
└── [Python library dependencies]
    ├── paddleocr
    ├── paddlex (optional)
    ├── pikepdf (optional, for CRT extraction)
    ├── cryptography (optional, for CRT extraction)
    ├── opencv-python
    ├── pymupdf-ng (fitz)
    ├── numpy
    └── python-Levenshtein
```

## 📦 File Size Statistics

| File/Directory | Size | Description |
|----------------|------|-------------|
| `test_accuracy_batch_full.py` | 121 KB | Main script |
| `cma_extraction_template_primary.py` | 19 KB | Primary CMA extraction module |
| `cma_extraction_final.py` | 16 KB | Fallback CMA extraction module |
| `template/CMA_Logo.png` | 25 KB | CMA logo template |
| `src/test/resources/data/pdfs/` | ~10 MB | Test PDF files |
| `src/test/resources/data/results.json` | 26 KB | Ground-truth data |
| `test_reports_full/` | ~50 MB | Generated reports (per run) |

## ✅ Minimum Requirements to Run

### Required files (the script cannot run without them)

```bash
# Core files
test_accuracy_batch_full.py           # Main script
cma_extraction_template_primary.py    # CMA extraction module
template/CMA_Logo.png                 # CMA template

# Test data
src/test/resources/data/results.json  # Ground truth
```

### Optional but recommended

```bash
# Fallback module
cma_extraction_final.py               # Used when the primary module fails

# Test PDFs (at least a few are needed)
src/test/resources/data/pdfs/*.pdf
```

### Optional feature dependencies

```bash
# PaddleOCR-VL backup recognition (optional)
# If it is not used, the script skips the related functionality

# CRT extraction (optional)
# Used to extract the institution name from the PDF certificate
```

## 🗂️ Suggested Directory Structure

### Recommended structure after cleanup

```
project-root/
├── test_accuracy_batch_full.py          # Main script
├── TEST_ACCURACY_BATCH_README.md        # Usage guide
│
├── cma_extraction_template_primary.py   # CMA extraction module
├── cma_extraction_final.py              # Fallback module
│
├── src/test/resources/data/
│   ├── pdfs/                            # Test PDFs
│   └── results.json                     # Ground truth
│
├── template/
│   └── CMA_Logo.png                     # CMA template
│
├── test_reports_full/                   # Output directory (.gitignore)
├── test_accuracy_full.log               # Log (.gitignore)
│
└── archive/                             # Archived old files
    ├── old_tests/
    ├── temp_scripts/
    └── docs_archive/
```

## 📝 Python Library Dependencies

requirements.txt should contain:

```
# Core dependencies
paddleocr>=2.7.0
paddlex>=1.3.0
opencv-python>=4.8.0
numpy>=1.24.0
pymupdf-ng>=1.23.0
python-Levenshtein>=0.23.0

# Optional dependencies
pikepdf>=8.0.0          # CRT extraction
cryptography>=41.0.0    # CRT extraction
paddleocr[doc-parser]   # PaddleOCR-VL
```
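
Because pip package names and import names differ for several of these libraries, a quick importability check can complement the requirements list. The sketch below uses only the standard library. The function name `missing_packages` and the split into `CORE` and `OPTIONAL` are assumptions for illustration, following the dependency graph rather than any code in the project.

```python
from importlib.util import find_spec

# pip package name -> importable top-level module name.
CORE = {
    "paddleocr": "paddleocr",
    "opencv-python": "cv2",
    "numpy": "numpy",
    "pymupdf-ng": "fitz",
    "python-Levenshtein": "Levenshtein",
}
OPTIONAL = {
    "paddlex": "paddlex",
    "pikepdf": "pikepdf",
    "cryptography": "cryptography",
}


def missing_packages(mapping):
    """Return the pip names whose module cannot be found in this environment."""
    return [pkg for pkg, module in mapping.items() if find_spec(module) is None]


if __name__ == "__main__":
    print("Missing core packages:", missing_packages(CORE))
    print("Missing optional packages:", missing_packages(OPTIONAL))
```

`find_spec` only locates the module and does not import it, so the check stays fast even for heavy libraries.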

## 🎯 Key Dependency Paths

### Paths defined in the code

```python
from pathlib import Path  # required by the constants below

# Lines 126-129
PDF_DIR = Path(r"src/test/resources/data/pdfs")
RESULTS_JSON = Path(r"src/test/resources/data/results.json")
OUTPUT_DIR = Path("test_reports_full")
BATCH_SIZE = 20

# Line 138
CMA_LOGO_PATH = Path("template/CMA_Logo.png")

# Lines 100-112: dynamic import with fallback.
# Shown here as try/except ImportError; the exact code in the script may differ.
try:
    from cma_extraction_template_primary import extract_cma_code_fullpage, imread_unicode
except ImportError:
    from cma_extraction_final import extract_cma_code_fullpage, imread_unicode
```

## ⚠️ Common Dependency Issues

### Issue 1: CMA template not found

**Error**: `CMA logo template not found at template/CMA_Logo.png`

**Fix**: Make sure the file `template/CMA_Logo.png` exists

### Issue 2: Test data not found

**Error**: `Ground truth file not found: src/test/resources/data/results.json`

**Fix**: Make sure the test data directory structure is correct

### Issue 3: CMA extraction module not found

**Error**: `Cannot import cma_extraction_template_primary.py`

**Fix**: Make sure `cma_extraction_template_primary.py` is in the project root

## 📊 Dependency Integrity Check

### Quick check command

```bash
# Check the required files
python -c "
from pathlib import Path

required_files = [
    'test_accuracy_batch_full.py',
    'cma_extraction_template_primary.py',
    'template/CMA_Logo.png',
    'src/test/resources/data/results.json'
]

missing = []
for f in required_files:
    if not Path(f).exists():
        missing.append(f)

if missing:
    print('Missing files:')
    for f in missing:
        print(f'  - {f}')
else:
    print('All required files present!')
"
```
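
## 🔧 Configuration Files

### No extra configuration file needed

The script needs no additional configuration file. All parameters are supplied through:
- Command-line arguments
- Constants defined in the code
- Environment variables (optional)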

## 📦 Packaging and Deployment Suggestions

### Minimal deployment package

```bash
# Required files
test_accuracy_batch_full.py
cma_extraction_template_primary.py
cma_extraction_final.py
template/CMA_Logo.png

# Test data
src/test/resources/data/

# Documentation
TEST_ACCURACY_BATCH_README.md

# Install script
install_dependencies.sh
```

---

**Document generated**: 2026-03-03
**Script version**: v1.2.0
**Maintainer**: Development team

@@ -0,0 +1,459 @@

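# test_accuracy_batch_full.py - Usage Guide

## 📋 Overview

`test_accuracy_batch_full.py` is a Python script for batch-testing how accurately CMA (China Metrology Accreditation, 中国计量认证) marks and institution seals are recognized in PDF reports. It uses OCR to extract the CMA number and the institution name automatically, then compares them against the expected results.

### Main features

- ✅ **CMA mark recognition**: extracts the CMA number using template matching and OCR
- ✅ **Seal text recognition**: detects red seals and extracts the institution name
- ✅ **Batch testing**: processes multiple PDF files in one run
- ✅ **Accuracy statistics**: automatically computes recognition accuracy for CMA numbers and institution names
- ✅ **Detailed reports**: generates test reports in HTML and JSON formats
- ✅ **Visual output**: generates visualization images for seal detection and unwarping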

## 🚀 Quick Start

### Requirements

```bash
# Python version: 3.8+

# Required dependencies
pip install paddlepaddle
pip install paddleocr
pip install opencv-python
pip install numpy
pip install pillow

# Optional dependency (for the advanced PaddleOCR-VL features)
pip install "paddleocr[doc-parser]"
```

### Basic usage

```bash
# Run a batch test (with default settings)
python test_accuracy_batch_full.py --batch

# Specify how many PDFs to process
python test_accuracy_batch_full.py --batch --batch-size 20

# Use the ppocr_v5 model (recommended, faster)
python test_accuracy_batch_full.py --ocr-model ppocr_v5 --batch
```

## 📦 Command-Line Parameters

### Basic parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--pdf` | string | - | Path to a single PDF file (bridge mode) |
| `--output-dir` | string | `bridge_output` | Output directory |
| `--ocr-model` | choice | `paddleocr_vl` | OCR model: `ppocr_v5` or `paddleocr_vl` |
| `--batch` | flag | False | Enable batch test mode |
| `--batch-size` | int | 20 | Number of PDFs to process |
| `--pdf-names` | string | - | Comma-separated list of PDF names |

### Advanced parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `--disable-paddleocrvl` | flag | False | Disable PaddleOCRVL backup recognition |
| `--paddleocrvl-timeout` | int | 60 | PaddleOCRVL timeout (seconds) |
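
For readers who want to see how these tables map onto a parser, here is a minimal `argparse` sketch reconstructed from the documented flags, types, and defaults. It is an illustration only: the function name `build_parser` and the help strings are assumptions, and the script's real parser may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="Batch accuracy test for CMA and seal recognition"
    )
    parser.add_argument("--pdf", type=str, default=None,
                        help="Path to a single PDF file (bridge mode)")
    parser.add_argument("--output-dir", type=str, default="bridge_output",
                        help="Output directory")
    parser.add_argument("--ocr-model", choices=["ppocr_v5", "paddleocr_vl"],
                        default="paddleocr_vl", help="OCR model")
    parser.add_argument("--batch", action="store_true",
                        help="Enable batch test mode")
    parser.add_argument("--batch-size", type=int, default=20,
                        help="Number of PDFs to process")
    parser.add_argument("--pdf-names", type=str, default=None,
                        help="Comma-separated list of PDF names")
    parser.add_argument("--disable-paddleocrvl", action="store_true",
                        help="Disable PaddleOCRVL backup recognition")
    parser.add_argument("--paddleocrvl-timeout", type=int, default=60,
                        help="PaddleOCRVL timeout in seconds")
    return parser
```

For example, `build_parser().parse_args(["--batch", "--batch-size", "5"])` yields `batch=True` and `batch_size=5`, with every other option left at its documented default.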

## 🎯 Usage Scenarios

### Scenario 1: Quick batch test

Use the ppocr_v5 model for a quick test:

```bash
python test_accuracy_batch_full.py \
  --ocr-model ppocr_v5 \
  --batch \
  --batch-size 20
```

**Characteristics**:
- ⚡ Fast (~18 s per PDF)
- 🎯 CMA accuracy: 85-100%
- 📊 Institution accuracy: 27-100%

### Scenario 2: High-accuracy recognition

Use the paddleocr_vl model for higher accuracy:

```bash
python test_accuracy_batch_full.py \
  --ocr-model paddleocr_vl \
  --batch \
  --batch-size 20 \
  --paddleocrvl-timeout 300
```

**Characteristics**:
- 🎯 Higher recognition accuracy
- ⏱️ Longer processing time (up to 5 minutes per seal)
- 💪 Needs more memory (at least 3 GB available)

### Scenario 3: Disable PaddleOCRVL (fastest)

Disable PaddleOCRVL entirely for maximum speed:

```bash
python test_accuracy_batch_full.py \
  --ocr-model ppocr_v5 \
  --batch \
  --batch-size 20 \
  --disable-paddleocrvl
```

**Characteristics**:
- ⚡⚡ Fastest
- 💾 Lowest memory usage
- ⚠️ Recognition accuracy may be slightly lower

### Scenario 4: Process a single PDF

Use bridge mode to process a single PDF:

```bash
python test_accuracy_batch_full.py \
  --pdf path/to/report.pdf \
  --output-dir my_output
```

## 📂 File Structure

### Input files

```
src/test/resources/data/pdfs/
├── 1.pdf
├── 2.pdf
├── WTS2025-21283.pdf
└── ... (test PDF files)

src/test/resources/data/results.json
└── Expected CMA number and institution name for each PDF
```

### Output files

```
test_reports_full/
├── summary.html                # Test summary report
├── test_report.json            # Detailed results in JSON format
│
├── 1.pdf/
│   ├── doc_page.png            # Original page image
│   ├── doc_layout_viz.png      # Layout analysis visualization
│   ├── cma_roi.png             # CMA logo region
│   ├── seal_crop_0.png         # Cropped seal
│   ├── seal_unwarp_0.png       # Unwarped seal
│   ├── seal_polar_viz_0.png    # Polar-coordinate visualization
│   ├── seal_marked_0.png       # Page with the seal marked
│   └── index.html              # Detailed report for a single PDF
│
└── ... (output for other PDFs)
```

## 🔍 Configuration Files

### Ground truth (results.json)

```json
{
  "1.pdf": {
    "cma": "20211901583",
    "institution": "深圳市中安质量检验认证有限公司"
  },
  "WTS2025-21283.pdf": {
    "cma": "220020349627",
    "institution": "威凯检测技术有限公司"
  }
}
```
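
A malformed ground-truth entry is easier to catch before a long batch run than after it. The sketch below loads the file and checks that every entry carries the two keys shown in the example. The function name `load_ground_truth` and the validation rule are assumptions for illustration, not the script's own loader.

```python
import json
from pathlib import Path

# Keys that every entry in the example above contains.
REQUIRED_KEYS = ("cma", "institution")


def load_ground_truth(path):
    """Load results.json and check that every entry has the expected keys."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"Ground truth file not found: {path}")
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    for pdf_name, entry in data.items():
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            raise ValueError(f"{pdf_name}: missing keys {missing}")
    return data
```

Opening the file with `encoding="utf-8"` matters here because the institution names are Chinese and the platform default encoding may differ.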
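
### CMA template

The script automatically loads the CMA template from:
- `src/main/resources/models/cma_logo_template.png` (primary template)
- `cma_extraction_template_primary.py` (primary template extraction logic)

Note: TEST_ACCURACY_BATCH_DEPENDENCIES.md gives the template path defined in the code as `template/CMA_Logo.png`; confirm which path the script actually reads.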

## 📊 Interpreting the Results

### Sample test report

```
================================================================================
BATCH TEST COMPLETED - FINAL RESULTS
================================================================================
Total Processed: 20

CMA Code Results:
  Exact Match: 17/20 (85.0%)
  Partial Match: 1/20 (5.0%)
  Acceptable Match: 0/20 (0.0%)
  No Match: 2/20 (10.0%)
  ** CMA Accuracy: 85.0% **

Institution Name Results:
  Exact Match: 5/18 (27.8%)
  Partial Match: 1/18 (5.6%)
  Acceptable Match: 3/18 (16.7%)
  No Match: 9/18 (50.0%)
  ** Institution Accuracy: 50.0% (including acceptable) **
```
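
The two headline figures in this sample appear to count different things: 85.0% equals the exact-match share for CMA codes (17/20), while 50.0% for institutions equals exact plus partial plus acceptable matches (9/18). The sketch below makes that choice explicit. The function names `percent` and `accuracy` are hypothetical, and the counts are copied from the sample report.

```python
def percent(count, total):
    """Percentage rounded to one decimal; 0.0 when there is nothing to count."""
    return round(100.0 * count / total, 1) if total else 0.0


def accuracy(counts, include=("exact",)):
    """counts maps match type -> number of PDFs; include lists the types counted as correct."""
    total = sum(counts.values())
    hits = sum(counts.get(kind, 0) for kind in include)
    return percent(hits, total)


# Counts copied from the sample report above.
cma = {"exact": 17, "partial": 1, "acceptable": 0, "no_match": 2}
institution = {"exact": 5, "partial": 1, "acceptable": 3, "no_match": 9}

print(accuracy(cma))  # exact matches only
print(accuracy(institution, include=("exact", "partial", "acceptable")))
```

When comparing runs, it helps to state which match types are included, since the same counts give different accuracy figures under the two definitions.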

### Match types

| Match type | Description | Similarity required |
|------------|-------------|---------------------|
| `exact` | Exact match | 100% |
| `partial` | Partial match | >= 80% |
| `acceptable` | Acceptable match | >= 60% |
| `no_match` | No match | < 60% |
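
The table can be read as a threshold cascade, checked from strictest to loosest. The sketch below shows that logic, with `difflib.SequenceMatcher` as a stand-in similarity measure. That stand-in is an assumption: the dependency list mentions python-Levenshtein, so the script's own metric likely differs, and the names `similarity_percent` and `match_type` are hypothetical.

```python
from difflib import SequenceMatcher

# Thresholds in percent, taken from the table above, strictest first.
THRESHOLDS = (("exact", 100.0), ("partial", 80.0), ("acceptable", 60.0))


def similarity_percent(expected, actual):
    """Similarity in [0, 100]; difflib is a stand-in for the script's own metric."""
    return 100.0 * SequenceMatcher(None, expected, actual).ratio()


def match_type(similarity):
    """Map a similarity percentage to the match type defined in the table."""
    for name, minimum in THRESHOLDS:
        if similarity >= minimum:
            return name
    return "no_match"
```

For instance, `match_type(similarity_percent(expected_name, ocr_name))` classifies one OCR result against its ground-truth value.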

### Performance metrics

```
Performance:
  Total Time: 385.2s (6.4min)
  Average Time: 19.3s per PDF
  CMA Time: 9.7s per PDF
  Seal Time: 9.3s per PDF
```

## 🐛 Troubleshooting

### Issue 1: Insufficient memory

**Error message**:
```
Insufficient memory for PaddleOCRVL (2.6 GB < 3.0 GB)
```

**Solution**:
```bash
# Close other applications,
# or use the --disable-paddleocrvl flag
python test_accuracy_batch_full.py --batch --disable-paddleocrvl
```

### Issue 2: PaddleOCRVL timeout

**Error message**:
```
PaddleOCRVL recognition timeout (60s) for seal_crop_0.png
```

**Solution**:
```bash
# Increase the timeout to 5 minutes
python test_accuracy_batch_full.py --batch --paddleocrvl-timeout 300
```
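
This guide does not show how the script enforces the timeout, so the following is only a generic sketch of the idea: run a slow call in a worker and give up waiting after a deadline. The names `call_with_timeout` and `slow_recognition` are hypothetical, and a thread-based wait like this one stops waiting but cannot kill the worker. A separate process would be needed to actually terminate a hung call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, *args, default=None):
    """Run func(*args); return `default` if it does not finish within timeout_s seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return default
    finally:
        # Do not block on a stuck worker; the thread itself keeps running.
        executor.shutdown(wait=False)


def slow_recognition(seconds):
    """Hypothetical stand-in for a long-running OCR call."""
    time.sleep(seconds)
    return "recognized text"
```

With this helper, `call_with_timeout(slow_recognition, 0.05, 0.5)` returns `None` after roughly 0.05 seconds, while a call that finishes in time returns its normal result.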

### Issue 3: Low OCR recognition rate

**Possible causes**:
- Poor PDF quality
- Blurred or distorted seal
- Unclear CMA mark

**Solution**:
- Try a different OCR model
- Increase the `--paddleocrvl-timeout` value
- Inspect the PDF image quality manually

### Issue 4: results.json not found

**Error message**:
```
Ground truth file not found: src/test/resources/data/results.json
```

**Solution**:
```bash
# Check the file path
ls -la src/test/resources/data/results.json

# If it does not exist, the file needs to be created
```

## ⚙️ Advanced Configuration

### Custom CMA template

1. Prepare a CMA logo image (PNG format, white background)
2. Place it at `src/main/resources/models/cma_logo_template.png`
3. The script loads and uses it automatically

### Adjusting the match thresholds

Edit the following parameters in the script:

```python
# Institution name match thresholds
INSTITUTION_MATCH_THRESHOLDS = {
    'exact': 100.0,      # 100% similarity
    'partial': 80.0,     # 80% similarity
    'acceptable': 60.0   # 60% similarity
}

# CMA number regular expressions
CMA_PATTERNS = [
    r'2\d{10}',     # 11 digits, starting with 2
    r'\d{11,12}'    # 11-12 digits
]
```
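
When patterns like these are applied with a plain search, the 11-digit pattern can match the first 11 digits of a 12-digit number such as the `220020349627` value in the ground-truth example. The sketch below wraps each pattern in digit boundaries to avoid that. The boundaries and the function name `find_cma_code` are additions for illustration, and how the script itself applies `CMA_PATTERNS` is not shown in this guide.

```python
import re

# The two patterns from the configuration above.
CMA_PATTERNS = [r"2\d{10}", r"\d{11,12}"]

# Digit boundaries (an addition in this sketch) keep a match from being
# a slice of a longer run of digits.
BOUNDED = [re.compile(rf"(?<!\d)(?:{p})(?!\d)") for p in CMA_PATTERNS]


def find_cma_code(text):
    """Return the first candidate CMA number in text, or None."""
    for pattern in BOUNDED:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return None
```

Patterns are tried in order, so an 11-digit number starting with 2 is preferred, and a 12-digit number falls through to the second pattern as a whole.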

### Seal detection parameters

```python
# Red seal detection
RED_COLOR_LOWER = [0, 100, 100]    # HSV lower bound
RED_COLOR_UPPER = [10, 255, 255]   # HSV upper bound

# Minimum seal size
MIN_SEAL_SIZE = 50  # pixels
```
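
On OpenCV's 8-bit HSV scale, hue runs from 0 to 179 and red sits at both ends of that range, so a single band of 0 to 10 covers only part of the reds. The pure-Python sketch below illustrates the effect with a second band near 170 to 179. That second band is an assumption added for illustration, not a parameter of the script, and the function names are hypothetical.

```python
import colorsys


def rgb_to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to OpenCV's 8-bit HSV scale (H in 0-179, S and V in 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return int(h * 180) % 180, int(round(s * 255)), int(round(v * 255))


RED_LOW = ((0, 100, 100), (10, 255, 255))      # band from the parameters above
RED_HIGH = ((170, 100, 100), (179, 255, 255))  # assumed second band for the wrap-around


def in_range(hsv, bounds):
    """True if every HSV channel lies within the inclusive bounds."""
    lower, upper = bounds
    return all(lo <= x <= hi for x, lo, hi in zip(hsv, lower, upper))


def is_red(rgb, use_wraparound=True):
    """Check an RGB colour against one or both red bands."""
    hsv = rgb_to_opencv_hsv(*rgb)
    bands = (RED_LOW, RED_HIGH) if use_wraparound else (RED_LOW,)
    return any(in_range(hsv, bounds) for bounds in bands)
```

A slightly bluish red such as `(255, 0, 40)` lands at hue 175, so it is accepted only when the second band is enabled, which may matter for seals scanned with a colour cast.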

## 📈 Performance Optimization Tips

### 1. Choose a suitable OCR model

| Model | Speed | Accuracy | Memory | Recommended for |
|-------|-------|----------|--------|-----------------|
| ppocr_v5 | Fast | Medium | Low | Quick tests, large batches |
| paddleocr_vl | Slow | High | High | Precise verification, small batches |

### 2. Adjust the timeout

```bash
# Quick test (may time out)
--paddleocrvl-timeout 30

# Standard test (recommended)
--paddleocrvl-timeout 60

# High-accuracy test
--paddleocrvl-timeout 300
```

### 3. Batch processing

```bash
# Process a large number of PDFs in batches.
# get_pdf_names_for_batch is a placeholder: define it yourself so that it
# prints a comma-separated list of PDF names for the given batch number.
for batch in 1 2 3 4 5; do
  python test_accuracy_batch_full.py \
    --batch \
    --batch-size 20 \
    --pdf-names "$(get_pdf_names_for_batch "$batch")"
done
```
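
The loop above relies on a helper that is not defined anywhere in this guide. One way to fill that gap is to do the chunking in Python and build one command per batch, as in the sketch below. The names `chunked` and `build_command` are hypothetical, and passing file names with the `.pdf` extension to `--pdf-names` is an assumption about what the script expects.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_command(pdf_names, ocr_model="ppocr_v5"):
    """Build the argv list for one batch run using the documented flags."""
    return [
        "python", "test_accuracy_batch_full.py",
        "--ocr-model", ocr_model,
        "--batch",
        "--batch-size", str(len(pdf_names)),
        "--pdf-names", ",".join(pdf_names),
    ]


if __name__ == "__main__":
    names = ["1.pdf", "2.pdf", "WTS2025-21283.pdf"]
    for batch in chunked(names, 2):
        print(" ".join(build_command(batch)))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` to execute the batches one after another.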

## 🔗 Related Documents

- [PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md](PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md) - Implementation details of the timeout mechanism
- [PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md](PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md) - Guide to using the 5-minute timeout
- [3PDF_SEAL_INVESTIGATION_REPORT.md](3PDF_SEAL_INVESTIGATION_REPORT.md) - Investigation of seal recognition problems
- [IMPLEMENTATION_SUMMARY.md](IMPLEMENTATION_SUMMARY.md) - Implementation summary
- [INTEGRATION_TEST_REPORT.md](INTEGRATION_TEST_REPORT.md) - Integration test report

## 🛠️ Development Guide

### Adding a new OCR model

1. Add the model initialization code to the script
2. Implement the OCR recognition function
3. Integrate the new model into the main flow
4. Update the command-line parameters

### Debugging tips

```bash
# Disable output buffering so that log lines appear immediately
export PYTHONUNBUFFERED=1
python test_accuracy_batch_full.py --batch --batch-size 1

# Inspect intermediate results
ls -la test_reports_full/1.pdf/

# Check the run log and the JSON report for warnings and errors
grep -E "WARNING|ERROR" test_accuracy_full.log test_reports_full/test_report.json
```

### Testing a single feature

```python
# page_img, ocr_engine, and output_dir are placeholders that the caller
# must prepare (a page image, an initialized OCR engine, an output path).

# Test CMA extraction
from test_accuracy_batch_full import extract_cma_from_page
result = extract_cma_from_page(page_img, ocr_engine)

# Test seal detection
from test_accuracy_batch_full import detect_red_seals
seals = detect_red_seals(page_img, output_dir)
```

## 📝 Changelog

### v1.2.0 (2026-03-03)

- ✨ Added a timeout protection mechanism for PaddleOCRVL
- ✨ Added the `--paddleocrvl-timeout` parameter
- ✨ Added the `--disable-paddleocrvl` parameter
- 🐛 Fixed PaddleOCRVL hanging indefinitely
- 🐛 Fixed an IndexError when parsing OCR results
- 🎨 Improved the CMA template matching algorithm
- 📚 Improved documentation and test reports

### v1.1.0

- ✨ Batch testing support
- ✨ Support for multiple OCR models
- ✨ HTML visual report generation
- 🎨 Added seal unwarping visualization

### v1.0.0

- 🎉 Initial release
- ✨ CMA mark recognition
- ✨ Seal text extraction
- ✨ Accuracy statistics

## 💡 Best Practices

1. **First use**: start with a small batch (`--batch-size 5`)
2. **Model choice**: use ppocr_v5 when speed matters most, paddleocr_vl when accuracy matters most
3. **Timeout setting**: tune `--paddleocrvl-timeout` to your actual network and hardware conditions
4. **Result verification**: check `test_report.json` and the HTML reports
5. **Error handling**: watch the WARNING and ERROR log entries

## 🤝 Contributing

1. Fork the project
2. Create a feature branch: `git checkout -b feature/new-feature`
3. Commit your changes: `git commit -am 'Add new feature'`
4. Push the branch: `git push origin feature/new-feature`
5. Create a Pull Request

## 📄 License

This project uses an internal license and is for internal company use only.

## 📞 Contact

- Project maintenance: development team
- Issue reports: submit through the project's Issues
- Technical support: contact the technical lead

---

**Last updated**: 2026-03-03
**Version**: v1.2.0
**Maintainers**: Claude Sonnet 4.6 & development team