
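# test_accuracy_batch_full.py - Dependency Analysis

## 📋 Core Dependency Files

### Required Python Modules

```
# CMA extraction modules
cma_extraction_template_primary.py    # Primary CMA extraction logic (template matching); required
cma_extraction_final.py               # Fallback CMA extraction logic

# If the primary module fails to import, the script automatically falls back to the fallback module
```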

### Required Data Files

```
src/test/resources/data/
├── pdfs/                             # Test PDF directory
│   ├── 1.pdf
│   ├── 2.pdf
│   ├── WTS2025-21283.pdf
│   └── ... (more test PDFs)
└── results.json                      # Ground-truth data (expected results)
    {
      "1.pdf": {
        "cma": "20211901583",
        "institution": "深圳市中安质量检验认证有限公司"
      },
      ...
    }

template/
└── CMA_Logo.png                      # CMA logo template (25 KB)
    Used by template matching to locate the CMA logo
```
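
The layout above implies that each PDF under `pdfs/` should have a matching key in `results.json`. The following is a minimal sketch of that consistency check. The function name `find_unlabeled_pdfs` is hypothetical, and the sketch assumes the ground-truth keys are bare file names, as in the `"1.pdf"` sample entry.

```python
import json
from pathlib import Path


def find_unlabeled_pdfs(pdf_dir: Path, results_json: Path) -> list[str]:
    """Return names of PDFs in pdf_dir that have no entry in results_json."""
    # Assumption: top-level keys are file names such as "1.pdf".
    ground_truth = json.loads(results_json.read_text(encoding="utf-8"))
    pdf_names = sorted(p.name for p in pdf_dir.glob("*.pdf"))
    return [name for name in pdf_names if name not in ground_truth]
```

Called with `Path("src/test/resources/data/pdfs")` and `Path("src/test/resources/data/results.json")`, it returns the PDFs for which accuracy cannot be scored.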

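### Generated Output Files

```
test_reports_full/                      # Test report output directory (auto-generated)
├── summary.html                        # Overall test report
├── test_report.json                    # Detailed results in JSON
│
├── 1.pdf/                              # Detailed output for each PDF
│   ├── doc_page.png                    # Original page image
│   ├── doc_layout_viz.png              # Layout analysis visualization
│   ├── cma_roi.png                     # CMA logo region
│   ├── seal_crop_0.png                 # Cropped seal
│   ├── seal_unwarp_0.png               # Unwarped seal
│   ├── seal_polar_viz_0.png            # Polar-coordinate visualization
│   ├── seal_marked_0.png               # Page with the seal marked
│   └── index.html                      # Detailed report for a single PDF
│
└── ... (output for the other PDFs)

test_accuracy_full.log                  # Run log (auto-generated)
```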

### Temporary Files (created at runtime)

```
temp_paddleocr_vl/                      # PaddleOCR-VL temporary output (created and cleaned up automatically)
bridge_output/                          # Bridge-mode output directory (optional)
```
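
If a run is interrupted, these directories may be left behind. The sketch below removes such leftovers by hand. It is not part of the script, `remove_temp_dirs` is a hypothetical helper, and `bridge_output/` should only be included if its contents are no longer needed.

```python
import shutil
from pathlib import Path

# Directory names taken from the listing above.
TEMP_DIRS = ("temp_paddleocr_vl", "bridge_output")


def remove_temp_dirs(root: Path, names=TEMP_DIRS) -> list[str]:
    """Delete leftover temporary directories under root; return those removed."""
    removed = []
    for name in names:
        target = root / name
        if target.is_dir():
            shutil.rmtree(target)  # removes the directory and everything in it
            removed.append(name)
    return removed
```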

## 🔍 Dependency Graph

```
test_accuracy_batch_full.py
├── cma_extraction_template_primary.py (primary)
│   └── template/CMA_Logo.png
├── cma_extraction_final.py (fallback)
├── src/test/resources/data/pdfs/*.pdf
├── src/test/resources/data/results.json
└── [Python libraries]
    ├── paddleocr
    ├── paddlex (optional)
    ├── pikepdf (optional, for CRT extraction)
    ├── cryptography (optional, for CRT extraction)
    ├── opencv-python
    ├── pymupdf-ng (fitz)
    ├── numpy
    └── python-Levenshtein
```
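
The library branch of the graph can be checked without importing anything heavy. This is a minimal sketch using `importlib.util.find_spec`; note that it uses import names rather than package names (`cv2` for opencv-python, `fitz` as shown in the graph, `Levenshtein` for python-Levenshtein). The split into `REQUIRED` and `OPTIONAL` follows the "optional" labels above, and `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Import names (not PyPI names) for the libraries in the graph above.
REQUIRED = ("paddleocr", "cv2", "fitz", "numpy", "Levenshtein")
OPTIONAL = ("paddlex", "pikepdf", "cryptography")


def missing_modules(names) -> list[str]:
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing optional:", missing_modules(OPTIONAL))
```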

## 📦 File Sizes

| File/Directory | Size | Notes |
|----------------|------|-------|
| `test_accuracy_batch_full.py` | 121 KB | Main script |
| `cma_extraction_template_primary.py` | 19 KB | Primary CMA extraction module |
| `cma_extraction_final.py` | 16 KB | Fallback CMA extraction module |
| `template/CMA_Logo.png` | 25 KB | CMA logo template |
| `src/test/resources/data/pdfs/` | ~10 MB | Test PDFs |
| `src/test/resources/data/results.json` | 26 KB | Ground-truth data |
| `test_reports_full/` | ~50 MB | Generated reports (per run) |
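
The figures in this table go stale as files change. A small sketch for recomputing them with `pathlib` follows; `size_in_bytes` and `format_size` are hypothetical helpers, and the rounding is only an approximation of the table's style.

```python
from pathlib import Path


def size_in_bytes(path: Path) -> int:
    """Size of a file, or the total size of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def format_size(num_bytes: int) -> str:
    """Render a byte count as KB or MB, matching the units in the table."""
    if num_bytes >= 1024 * 1024:
        return f"{num_bytes / (1024 * 1024):.1f} MB"
    return f"{num_bytes / 1024:.0f} KB"
```

For example, `format_size(size_in_bytes(Path("template/CMA_Logo.png")))` yields the value for the template row.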

## ✅ Minimum Requirements

### Required Files (the script cannot run without them)

```bash
# Core files
test_accuracy_batch_full.py              # Main script
cma_extraction_template_primary.py       # CMA extraction module
template/CMA_Logo.png                    # CMA template

# Test data
src/test/resources/data/results.json     # Ground truth
```

### Optional but Recommended

```bash
# Fallback module
cma_extraction_final.py                 # Used when the primary module fails

# Test PDFs (at least a few are needed)
src/test/resources/data/pdfs/*.pdf
```

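### Optional Feature Dependencies

```bash
# PaddleOCR-VL backup recognition (optional)
# If it is not used, the script skips the related functionality

# CRT extraction (optional)
# Used to extract the institution name from PDF certificates
```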

## 🗂️ Suggested Directory Structure

### Recommended Structure After Cleanup

```
project-root/
├── test_accuracy_batch_full.py          # Main script
├── TEST_ACCURACY_BATCH_README.md        # Usage documentation
│
├── cma_extraction_template_primary.py   # CMA extraction module
├── cma_extraction_final.py              # Fallback module
│
├── src/test/resources/data/
│   ├── pdfs/                            # Test PDFs
│   └── results.json                     # Ground truth
│
├── template/
│   └── CMA_Logo.png                     # CMA template
│
├── test_reports_full/                   # Output directory (.gitignore)
├── test_accuracy_full.log               # Log (.gitignore)
│
└── archive/                             # Archived old files
    ├── old_tests/
    ├── temp_scripts/
    └── docs_archive/
```

## 📝 Python Library Dependencies

`requirements.txt` should include:

```
# Core dependencies
paddleocr>=2.7.0
paddlex>=1.3.0
opencv-python>=4.8.0
numpy>=1.24.0
pymupdf-ng>=1.23.0
python-Levenshtein>=0.23.0

# Optional dependencies
pikepdf>=8.0.0              # CRT extraction
cryptography>=41.0.0        # CRT extraction
paddleocr[doc-parser]       # PaddleOCR-VL
```
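
Installed versions can be compared against these minimums with `importlib.metadata` from the standard library. The sketch below is an illustration only: `check_requirement` and its helpers are hypothetical, and the comparison looks only at the leading dotted-number part of a version, so pre-release suffixes such as `rc1` are ignored.

```python
import re
from importlib import metadata


def numeric_version(text: str) -> tuple:
    """Leading dotted-number prefix as a tuple, e.g. '4.8.0.76' -> (4, 8, 0, 76)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(p) for p in match.group().split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = numeric_version(installed), numeric_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirement(dist_name: str, minimum: str) -> str:
    """Return 'ok', 'too old', or 'missing' for one installed distribution."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if meets_minimum(installed, minimum) else "too old"
```

For example, `check_requirement("numpy", "1.24.0")` reports whether the local numpy satisfies the line above.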

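## 🎯 Key Dependency Paths

### Paths Defined in the Code

```python
from pathlib import Path

# Lines 126-129
PDF_DIR = Path(r"src/test/resources/data/pdfs")
RESULTS_JSON = Path(r"src/test/resources/data/results.json")
OUTPUT_DIR = Path("test_reports_full")
BATCH_SIZE = 20

# Line 138
CMA_LOGO_PATH = Path("template/CMA_Logo.png")

# Lines 100-112: dynamic import. The fallback module is used when the primary
# import fails (shown here as try/except; the script's exact handling may differ).
try:
    from cma_extraction_template_primary import extract_cma_code_fullpage, imread_unicode
except ImportError:
    from cma_extraction_final import extract_cma_code_fullpage, imread_unicode
```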
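
The path constants above are relative, so `pathlib` resolves them against the current working directory, not against the script's location. One way to make them independent of where the script is launched from is sketched below. `project_root` and `resolve_from_root` are hypothetical helpers, and the sketch assumes the script sits in the project root next to `template/` and `src/`; the actual script may not do this.

```python
from pathlib import Path


def project_root(anchor_file: str) -> Path:
    """Directory containing anchor_file, as an absolute path."""
    return Path(anchor_file).resolve().parent


def resolve_from_root(root: Path, relative: str) -> Path:
    """Join a repository-relative path onto an explicit root directory."""
    return (root / relative).resolve()


# Inside the script this would typically be called as:
#   ROOT = project_root(__file__)
#   CMA_LOGO_PATH = resolve_from_root(ROOT, "template/CMA_Logo.png")
```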

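## ⚠️ Common Dependency Problems

### Problem 1: CMA template not found

**Error**: `CMA logo template not found at template/CMA_Logo.png`

**Fix**: Make sure `template/CMA_Logo.png` exists. The path is relative, so it is looked up from the directory the script is run in.

### Problem 2: Test data not found

**Error**: `Ground truth file not found: src/test/resources/data/results.json`

**Fix**: Make sure the test data directory structure is correct and that the script is run from the project root.

### Problem 3: CMA extraction module not found

**Error**: `Cannot import cma_extraction_template_primary.py`

**Fix**: Make sure `cma_extraction_template_primary.py` is in the project root.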

## 📊 Dependency Completeness Check

### Quick Check Command

```bash
# Check that the required files exist
python -c "
from pathlib import Path

required_files = [
    'test_accuracy_batch_full.py',
    'cma_extraction_template_primary.py',
    'template/CMA_Logo.png',
    'src/test/resources/data/results.json'
]

missing = []
for f in required_files:
    if not Path(f).exists():
        missing.append(f)

if missing:
    print('Missing files:')
    for f in missing:
        print(f'  - {f}')
else:
    print('All required files present!')
"
```

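## 🔧 Configuration Files

### No Additional Configuration Files Required

The script needs no additional configuration files. All parameters are set through:
- Command-line arguments
- Constants defined in the code
- Environment variables (optional)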
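
The three sources listed above are commonly layered so that a command-line flag overrides an environment variable, which in turn overrides the in-code constant. The sketch below illustrates that layering with `argparse`. The flag names `--batch-size` and `--output-dir`, the environment variable names, and the precedence order are all assumptions for illustration; the script's real parameters are documented in `TEST_ACCURACY_BATCH_README.md`.

```python
import argparse
import os


def build_parser(env=os.environ) -> argparse.ArgumentParser:
    """CLI flags override environment variables, which override constants."""
    # Hypothetical variable names; the defaults mirror the constants in the code.
    default_batch = int(env.get("BATCH_SIZE", 20))
    default_output = env.get("OUTPUT_DIR", "test_reports_full")

    parser = argparse.ArgumentParser(description="batch accuracy test (sketch)")
    parser.add_argument("--batch-size", type=int, default=default_batch)  # hypothetical flag
    parser.add_argument("--output-dir", default=default_output)  # hypothetical flag
    return parser
```

With no flag and no variable set, `build_parser({}).parse_args([]).batch_size` falls back to the constant `20`.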

## 📦 Packaging and Deployment

### Minimal Deployment Package

```bash
# Required files
test_accuracy_batch_full.py
cma_extraction_template_primary.py
cma_extraction_final.py
template/CMA_Logo.png

# Test data
src/test/resources/data/

# Documentation
TEST_ACCURACY_BATCH_README.md

# Install script
install_dependencies.sh
```

---

**Document generated**: 2026-03-03
**Script version**: v1.2.0
**Maintainer**: Development team

docs(test): add comprehensive documentation for batch testing script

Added three key documentation files:

1. TEST_ACCURACY_BATCH_README.md
   - Complete usage guide for test_accuracy_batch_full.py
   - Command-line parameters reference
   - 4 usage scenarios (quick, high-accuracy, fast, single-PDF)
   - Troubleshooting guide
   - Performance optimization tips
   - Best practices and examples

2. TEST_ACCURACY_BATCH_DEPENDENCIES.md
   - Detailed dependency analysis
   - Required files and directory structure
   - Python library dependencies
   - File size statistics
   - Dependency relationship diagram
   - Common dependency issues and solutions

3. CLEANUP_PLAN.md
   - File categorization (keep, archive, delete)
   - Step-by-step cleanup instructions
   - Archive directory structure proposal
   - Three cleanup approaches (conservative, aggressive, phased)
   - Cleanup automation script

Features:
- Comprehensive parameter reference tables
- Real-world usage examples
- Performance comparison charts
- Quick reference commands
- Development guidelines

Target audience:
- New developers joining the project
- QA team running batch tests
- DevOps engineers deploying the system

Related:
- test_accuracy_batch_full.py (v1.2.0)
- PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
- IMPLEMENTATION_SUMMARY.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

											
										
										
2026-03-03 14:32:04 +08:00