# File Cleanup Plan
## 📊 Current File Analysis
### File statistics for the project root
```
67 files in total
- Python scripts: about 40
- Markdown documents: about 15
- Config/data files: about 12
```
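The counts above are approximate. A minimal sketch for recomputing them, assuming it is run from the project root; the function name `count_by_type` is hypothetical and not part of the project:

```bash
# Count regular files directly inside a directory (no recursion), by type.
# Assumption: "Python" means *.py, "Markdown" means *.md, the rest is "other".
count_by_type() {
  local dir="${1:-.}"
  local py md other
  py=$(find "$dir" -maxdepth 1 -type f -name '*.py' | wc -l)
  md=$(find "$dir" -maxdepth 1 -type f -name '*.md' | wc -l)
  other=$(find "$dir" -maxdepth 1 -type f ! -name '*.py' ! -name '*.md' | wc -l)
  # $((...)) strips any padding that wc adds on some platforms
  echo "python=$((py)) markdown=$((md)) other=$((other)) total=$((py + md + other))"
}

count_by_type .
```

Running it before and after the cleanup gives a quick check that only the intended files left the root directory.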
## 🗂️ File Classification
### ✅ Files to keep (core, required)
```bash
# Main script
test_accuracy_batch_full.py
# CMA extraction modules
cma_extraction_template_primary.py
cma_extraction_final.py
# Core documents
CLAUDE.md
TEST_ACCURACY_BATCH_README.md
TEST_ACCURACY_BATCH_DEPENDENCIES.md
IMPLEMENTATION_SUMMARY.md
# Configuration files
requirements.txt
settings.xml
pom.xml
.classpath
project/settings.xml
# CMA template
template/CMA_Logo.png
```
### ⚠️ Files to archive (old test/debug scripts)
```bash
# === Debug scripts (archive to archive/temp_scripts/) ===
analyze_logo_position.py
analyze_ydq.py
analyze_ydq_v2.py
debug_actual_matching.py
debug_cma_extraction.py
debug_full_ocr.py
debug_ocr_only.py
debug_roi_content.py
debug_roi_extraction.py
debug_specific_pdfs.py
debug_template_matching.py
force_reload_test.py
quick_validation_test.py
run_single_test.py
run_test_fresh.py
simple_find.py
simple_test.py
test_cma_simple.py
test_crt_direct.py
test_crt_extraction.py
test_fullpage_fallback.py
test_improved_crt_extraction.py
test_improved_extraction.py
test_roi_fix.py
test_single_pdf.py
test_smart_logic.py
test_template_matching_unit.py
verify_crt_extraction.py
# === Helper tool scripts (archive to archive/tools/) ===
extract_pdf_pages.py
find_all_logo_matches.py
find_cma_position.py
find_numbers.py
ocr_bridge_cross_platform.py
pdf_processor.py
show_results.py
visualize_matches.py
search_cma_position.py
# === CRT-related tests (archive to archive/crt_tests/) ===
diagnose_crt_extraction.py
inspect_certificate_data.py
quick_crt_test.py
standalone_crt_test.py
# === PaddleOCR tests (archive to archive/ocr_tests/) ===
investigate_seal_3.py
test_paddleocrvl_direct.py
test_paddleocrvl_timeout.py
test_vl_simple.py
```
### 📚 Documents to archive
```bash
# === Old documents (archive to archive/docs/) ===
ADDITIONAL_FIXES_SUMMARY.md
CMA_LOGO_POSITION_FIX.md
CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md
CRT_EXTRACT_INVESTIGATION_REPORT.md
OCR_INTEGRATION_README.md
PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md
PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
QUICK_FIX_REFERENCE.md
ROOT_CAUSE_ANALYSIS.md
SEAL_SELECTION_FIX.md
WSL_INSTALLATION_GUIDE.md
YDQ23_001838_FINAL_FIX_SUMMARY.md
3PDF_SEAL_INVESTIGATION_REPORT.md
INTEGRATION_TEST_REPORT.md
```
### 🗑️ Files to delete
```bash
# === Copies / duplicate files ===
test_accuracy_batch_full - 副本.py
# === Temporary / unused files ===
classpath.txt
ping.json
install_wsl.bat
# === Old archives ===
# Can be deleted if no longer needed
```
## 🎯 Cleanup Steps
### Step 1: Create the archive directories
```bash
mkdir -p archive/temp_scripts
mkdir -p archive/tools
mkdir -p archive/crt_tests
mkdir -p archive/ocr_tests
mkdir -p archive/docs
mkdir -p archive/old_reports  # optional: no later step moves files here
```
### Step 2: Move files into the archive
```bash
# Move the CRT tests first, so the broader quick_*.py pattern below
# does not sweep quick_crt_test.py into temp_scripts/
mv diagnose_crt_extraction.py archive/crt_tests/
mv inspect_certificate_data.py archive/crt_tests/
mv quick_crt_test.py archive/crt_tests/
mv standalone_crt_test.py archive/crt_tests/
# Move the OCR tests first, so the test_*.py pattern below does not catch them
mv investigate_seal_3.py archive/ocr_tests/
mv test_paddleocrvl*.py archive/ocr_tests/
mv test_vl_simple.py archive/ocr_tests/
# Move debug scripts
mv analyze_*.py archive/temp_scripts/
mv debug_*.py archive/temp_scripts/
mv quick_*.py archive/temp_scripts/
mv run_*.py archive/temp_scripts/
mv simple_*.py archive/temp_scripts/
mv verify_*.py archive/temp_scripts/
mv force_*.py archive/temp_scripts/
# test_*.py also matches the main script and its copy: keep the main
# script in place and leave the copy for step 3
for f in test_*.py; do
  case "$f" in
    test_accuracy_batch_full.py|"test_accuracy_batch_full - 副本.py") continue ;;
  esac
  mv "$f" archive/temp_scripts/ 2>/dev/null || true
done
# Move helper tools
mv extract_pdf_pages.py archive/tools/
mv find_*.py archive/tools/
mv search_*.py archive/tools/
mv show_*.py archive/tools/
mv visualize_*.py archive/tools/
mv ocr_bridge_cross_platform.py archive/tools/
mv pdf_processor.py archive/tools/
# Move old documents
mv ADDITIONAL_FIXES_SUMMARY.md archive/docs/
mv CMA_LOGO_POSITION_FIX.md archive/docs/
mv CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md archive/docs/
mv CRT_EXTRACT_INVESTIGATION_REPORT.md archive/docs/
mv OCR_INTEGRATION_README.md archive/docs/
mv PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md archive/docs/
mv PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md archive/docs/
mv QUICK_FIX_REFERENCE.md archive/docs/
mv ROOT_CAUSE_ANALYSIS.md archive/docs/
mv SEAL_SELECTION_FIX.md archive/docs/
mv WSL_INSTALLATION_GUIDE.md archive/docs/
mv YDQ23_001838_FINAL_FIX_SUMMARY.md archive/docs/
mv 3PDF_SEAL_INVESTIGATION_REPORT.md archive/docs/
mv INTEGRATION_TEST_REPORT.md archive/docs/
```
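Because patterns such as `test_*.py` also match `test_accuracy_batch_full.py`, it may help to preview what a pattern would move before running any `mv`. A minimal dry-run sketch; the function name `preview_move` and the `KEEP` variable are hypothetical, and `KEEP` assumes the three core scripts from the keep list above:

```bash
# Dry run: print the mv commands a set of globs would produce,
# without touching any file. Protected files are reported as KEEP.
KEEP="test_accuracy_batch_full.py cma_extraction_template_primary.py cma_extraction_final.py"

preview_move() {
  local dest="$1"
  shift
  local f k skip
  for f in "$@"; do
    # An unmatched glob is passed through literally; skip it
    [ -e "$f" ] || continue
    skip=0
    for k in $KEEP; do
      if [ "$f" = "$k" ]; then skip=1; fi
    done
    if [ "$skip" -eq 1 ]; then
      echo "KEEP  $f"
    else
      echo "mv '$f' '$dest'"
    fi
  done
}

preview_move archive/temp_scripts/ test_*.py debug_*.py
```

Nothing is moved; the output is only a list to review. Any line starting with `KEEP` shows a pattern that is too broad to run unguarded.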
### Step 3: Delete unneeded files
```bash
# Delete the copy and the temporary files
rm "test_accuracy_batch_full - 副本.py"
rm classpath.txt
rm ping.json
rm install_wsl.bat
```
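Before deleting the copy of the main script, it may be worth confirming that it really is a duplicate. A minimal sketch using `cmp`; the function name `remove_if_identical` is hypothetical:

```bash
# Delete a suspected copy only when it is byte-for-byte identical
# to the original; otherwise leave it in place and return non-zero.
remove_if_identical() {
  local copy="$1" original="$2"
  if cmp -s "$copy" "$original"; then
    rm -- "$copy"
    echo "removed identical copy: $copy"
  else
    echo "kept (differs or missing): $copy"
    return 1
  fi
}

# Usage from the project root:
#   remove_if_identical "test_accuracy_batch_full - 副本.py" test_accuracy_batch_full.py
```

If the two files differ, the copy may hold edits that never made it into the main script, so it is kept for manual review.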
### Step 4: Clean the output directory (optional)
```bash
# Clean up test output (skip this step if you want to keep the results)
# rm -rf test_reports_full/
# rm test_accuracy_full.log
```
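If the test output is kept rather than deleted, `.gitignore` rules can keep it out of commits. A minimal sketch, assuming `test_reports_full/` and `test_accuracy_full.log` are generated output that should not be tracked; the function name `add_ignore_rule` is hypothetical:

```bash
# Append a rule to a .gitignore file only if that exact line
# is not already present, so repeated runs do not add duplicates.
add_ignore_rule() {
  local rule="$1" file="${2:-.gitignore}"
  touch "$file"
  grep -qxF -- "$rule" "$file" || printf '%s\n' "$rule" >> "$file"
}

# Usage from the project root:
#   add_ignore_rule 'test_reports_full/'
#   add_ignore_rule 'test_accuracy_full.log'
```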
## ✅ Directory Structure After Cleanup
```
project-root/
├── test_accuracy_batch_full.py            # Main script
├── TEST_ACCURACY_BATCH_README.md          # Usage documentation
├── TEST_ACCURACY_BATCH_DEPENDENCIES.md    # Dependency documentation
├── CLAUDE.md                              # Project guide
├── IMPLEMENTATION_SUMMARY.md              # Implementation summary
├── cma_extraction_template_primary.py     # CMA extraction module
├── cma_extraction_final.py                # Fallback module
├── template/                              # Template files
│   └── CMA_Logo.png
├── archive/                               # Archive directory
│   ├── temp_scripts/                      # Debug scripts
│   ├── tools/                             # Helper tools
│   ├── crt_tests/                         # CRT tests
│   ├── ocr_tests/                         # OCR tests
│   └── docs/                              # Old documents
├── pom.xml                                # Maven configuration
├── settings.xml                           # Maven settings
├── requirements.txt                       # Python dependencies
└── src/                                   # Source directory
    ├── test/resources/data/               # Test data
    │   ├── pdfs/
    │   └── results.json
    └── ...
```
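After the cleanup, a quick check can confirm that none of the files shown in the tree above was archived or deleted by accident. A minimal sketch; the function name `check_core_files` is hypothetical, and the file list is copied from the tree:

```bash
# Report every expected core file that is missing from a project
# directory. Returns 0 only when all of them are present.
check_core_files() {
  local root="${1:-.}" missing=0 f
  for f in test_accuracy_batch_full.py cma_extraction_template_primary.py \
           cma_extraction_final.py CLAUDE.md TEST_ACCURACY_BATCH_README.md \
           TEST_ACCURACY_BATCH_DEPENDENCIES.md IMPLEMENTATION_SUMMARY.md \
           requirements.txt pom.xml settings.xml template/CMA_Logo.png; do
    if [ ! -e "$root/$f" ]; then
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  echo "missing=$missing"
  [ "$missing" -eq 0 ]
}

# Usage from the project root:
#   check_core_files . || echo "restore the missing files from git or archive/"
```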
## 📦 Cleanup Script
I can create an automated cleanup script for you:
```bash
#!/bin/bash
# cleanup_project.sh
echo "Starting project cleanup..."

# Create the archive directories
mkdir -p archive/{temp_scripts,tools,crt_tests,ocr_tests,docs}

# Archive the CRT and OCR tests first, so the broader quick_*.py and
# test_*.py patterns below do not sweep them into temp_scripts/
echo "Archiving CRT tests..."
mv diagnose_crt_extraction.py inspect_certificate_data.py \
   quick_crt_test.py standalone_crt_test.py archive/crt_tests/ 2>/dev/null
echo "Archiving OCR tests..."
mv investigate_seal_3.py test_paddleocrvl*.py test_vl_simple.py \
   archive/ocr_tests/ 2>/dev/null

# Archive debug scripts
echo "Archiving debug scripts..."
mv analyze_*.py debug_*.py quick_*.py run_*.py simple_*.py \
   verify_*.py force_*.py archive/temp_scripts/ 2>/dev/null
# test_*.py also matches the main script and its copy, so skip both
for f in test_*.py; do
  case "$f" in
    test_accuracy_batch_full.py|"test_accuracy_batch_full - 副本.py") continue ;;
  esac
  mv "$f" archive/temp_scripts/ 2>/dev/null
done

# Archive helper tools
echo "Archiving helper tools..."
mv extract_pdf_pages.py find_*.py search_*.py show_*.py \
   visualize_*.py ocr_bridge_cross_platform.py pdf_processor.py \
   archive/tools/ 2>/dev/null

# Archive old documents
echo "Archiving old documents..."
mv ADDITIONAL_FIXES_SUMMARY.md CMA_LOGO_POSITION_FIX.md \
   CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md \
   CRT_EXTRACT_INVESTIGATION_REPORT.md OCR_INTEGRATION_README.md \
   PADDLEOCRVL_*.md QUICK_FIX_REFERENCE.md ROOT_CAUSE_ANALYSIS.md \
   SEAL_SELECTION_FIX.md WSL_INSTALLATION_GUIDE.md \
   YDQ23_001838_FINAL_FIX_SUMMARY.md 3PDF_SEAL_INVESTIGATION_REPORT.md \
   INTEGRATION_TEST_REPORT.md archive/docs/ 2>/dev/null

# Delete unneeded files
echo "Deleting temporary files..."
rm "test_accuracy_batch_full - 副本.py" 2>/dev/null
rm classpath.txt ping.json install_wsl.bat 2>/dev/null

echo "Cleanup complete!"
echo ""
echo "Core files kept:"
ls -1 *.py *.md 2>/dev/null | head -10
```
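## 🎯 Recommended Cleanup Plans
### Plan A: Conservative cleanup (recommended)
**Archive all test and debug scripts; keep the core functionality**
- Archive all `test_*.py`, `debug_*.py`, and `analyze_*.py` scripts, except the main script `test_accuracy_batch_full.py`
- Archive all old documents
- Keep the main script and the core modules
- Keep the main documents
### Plan B: Aggressive cleanup
**Delete all temporary scripts; keep only the required files**
- Delete all test scripts (already archived)
- Delete all debug scripts
- Keep only the main script and the CMA extraction modules
- Delete all old documents; keep the main README
### Plan C: Staged cleanup
**Archive first, observe for a while, then delete**
1. Move the files into the `archive/` directory
2. Observe for 1-2 weeks to confirm they are not needed
3. Delete them or archive them permanently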
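For Plan C, the observation period can be tracked with a marker file, because `mv` preserves each file's modification time and so the archived files themselves do not record when they were moved. A minimal sketch; the function names and the `.archived_at` marker are hypothetical:

```bash
# Record the moment of archiving by touching a marker file.
mark_archived() {
  touch "${1:-archive}/.archived_at"
}

# Succeed only when the marker is older than the given number of days.
# find's -mtime +N counts whole 24-hour periods, so +14 means the
# marker was last touched at least 15 days ago.
observation_elapsed() {
  local dir="${1:-archive}" days="${2:-14}"
  [ -n "$(find "$dir/.archived_at" -mtime +"$days" 2>/dev/null)" ]
}

# Usage from the project root:
#   mark_archived archive                        # right after step 1
#   observation_elapsed archive 14 && echo "safe to review for deletion"
```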
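## ⚡ Quick Cleanup
If you want to run the cleanup right away, I can:
1. Create the `archive/` directory structure
2. Move all non-core files
3. Create `.gitignore` rules
4. Commit the cleaned-up state

**Please choose a cleanup plan:**
- Plan A: Conservative cleanup (recommended)
- Plan B: Aggressive cleanup
- Plan C: Staged cleanup (archive only, delete nothing yet)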
---
**Note:** Before running the cleanup, commit the current state to git so that it can be restored.
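A minimal sketch of that snapshot, assuming the project root is already a git repository with a configured user identity; the function name `snapshot_before_cleanup` and the commit message are hypothetical:

```bash
# Commit everything in the working tree so the cleanup can be undone.
# Returns non-zero if there is nothing to commit or git fails.
snapshot_before_cleanup() {
  git add -A &&
    git commit -q -m "Snapshot before project cleanup"
}

# Usage from the project root:
#   snapshot_before_cleanup
# To undo a cleanup that has not been committed yet:
#   git reset --hard HEAD   # restores tracked files to the snapshot
# Files already moved into archive/ stay behind as untracked copies
# and can be removed by hand once the restore has been checked.
```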