# File Cleanup Plan

## 📊 Current File Analysis

### Files in the Project Root

```
Total: 67 files
- Python scripts: about 40
- Markdown documents: about 15
- Config/data files: about 12
```
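
The counts above can be re-checked at any time with a small helper. This is a minimal sketch: the function name `count_root_files` is hypothetical, and it assumes a `find` that supports `-maxdepth` (GNU and BSD `find` both do). It counts only regular files directly in the given directory, without recursing.

```bash
# Count the regular files directly inside a directory (no recursion),
# grouped roughly the same way as the summary above.
count_root_files() {
  local dir="${1:-.}" total py md
  total=$(find "$dir" -maxdepth 1 -type f | wc -l)
  py=$(find "$dir" -maxdepth 1 -type f -name '*.py' | wc -l)
  md=$(find "$dir" -maxdepth 1 -type f -name '*.md' | wc -l)
  # $((...)) also strips the padding some wc implementations add
  echo "Total: $((total)) files"
  echo "- Python scripts: $((py))"
  echo "- Markdown documents: $((md))"
  echo "- Other files: $((total - py - md))"
}
```

Run it from the project root as `count_root_files .` before and after the cleanup to see how much the root directory shrank.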

## 🗂️ File Classification

### ✅ Files to Keep (Core, Required)

```bash
# Main script
test_accuracy_batch_full.py

# CMA extraction modules
cma_extraction_template_primary.py
cma_extraction_final.py

# Core documentation
CLAUDE.md
TEST_ACCURACY_BATCH_README.md
TEST_ACCURACY_BATCH_DEPENDENCIES.md
IMPLEMENTATION_SUMMARY.md

# Configuration files
requirements.txt
settings.xml
pom.xml
.classpath
project/settings.xml

# CMA template
template/CMA_Logo.png
```

### ⚠️ Files to Archive (Old Test/Debug Scripts)

```bash
# === Debug scripts (archive to archive/temp_scripts/) ===
analyze_logo_position.py
analyze_ydq.py
analyze_ydq_v2.py
debug_actual_matching.py
debug_cma_extraction.py
debug_full_ocr.py
debug_ocr_only.py
debug_roi_content.py
debug_roi_extraction.py
debug_specific_pdfs.py
debug_template_matching.py
force_reload_test.py
quick_validation_test.py
run_single_test.py
run_test_fresh.py
simple_find.py
simple_test.py
test_cma_simple.py
test_crt_direct.py
test_crt_extraction.py
test_fullpage_fallback.py
test_improved_crt_extraction.py
test_improved_extraction.py
test_roi_fix.py
test_single_pdf.py
test_smart_logic.py
test_template_matching_unit.py
verify_crt_extraction.py

# === Helper tool scripts (archive to archive/tools/) ===
extract_pdf_pages.py
find_all_logo_matches.py
find_cma_position.py
find_numbers.py
ocr_bridge_cross_platform.py
pdf_processor.py
show_results.py
visualize_matches.py
search_cma_position.py

# === CRT-related tests (archive to archive/crt_tests/) ===
diagnose_crt_extraction.py
inspect_certificate_data.py
quick_crt_test.py
standalone_crt_test.py

# === PaddleOCR tests (archive to archive/ocr_tests/) ===
investigate_seal_3.py
test_paddleocrvl_direct.py
test_paddleocrvl_timeout.py
test_vl_simple.py
```

### 📚 Documents to Archive

```bash
# === Old documents (archive to archive/docs/) ===
ADDITIONAL_FIXES_SUMMARY.md
CMA_LOGO_POSITION_FIX.md
CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md
CRT_EXTRACT_INVESTIGATION_REPORT.md
OCR_INTEGRATION_README.md
PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md
PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md
QUICK_FIX_REFERENCE.md
ROOT_CAUSE_ANALYSIS.md
SEAL_SELECTION_FIX.md
WSL_INSTALLATION_GUIDE.md
YDQ23_001838_FINAL_FIX_SUMMARY.md
3PDF_SEAL_INVESTIGATION_REPORT.md
INTEGRATION_TEST_REPORT.md
```

### 🗑️ Files That Can Be Deleted

```bash
# === Copies / duplicate files ===
test_accuracy_batch_full - 副本.py

# === Temporary / unused files ===
classpath.txt
ping.json
install_wsl.bat

# === Old archives ===
# Delete these if they are no longer needed
```

## 🎯 Cleanup Steps

### Step 1: Create the Archive Directories

```bash
mkdir -p archive/temp_scripts
mkdir -p archive/tools
mkdir -p archive/crt_tests
mkdir -p archive/ocr_tests
mkdir -p archive/docs
mkdir -p archive/old_reports
```

### Step 2: Move Files into the Archive

```bash
# Move the CRT and OCR tests first, so the broader globs further down
# (quick_*.py, test_*.py) do not sweep them into temp_scripts/
mv diagnose_crt_extraction.py archive/crt_tests/
mv inspect_certificate_data.py archive/crt_tests/
mv quick_crt_test.py archive/crt_tests/
mv standalone_crt_test.py archive/crt_tests/

mv investigate_seal_3.py archive/ocr_tests/
mv test_paddleocrvl*.py archive/ocr_tests/
mv test_vl_simple.py archive/ocr_tests/

# Move the debug scripts
mv analyze_*.py archive/temp_scripts/
mv debug_*.py archive/temp_scripts/
mv quick_*.py archive/temp_scripts/
mv run_*.py archive/temp_scripts/
mv simple_*.py archive/temp_scripts/
mv verify_*.py archive/temp_scripts/
mv force_*.py archive/temp_scripts/
# test_*.py also matches the main script and its copy, so skip both
for f in test_*.py; do
  case "$f" in test_accuracy_batch_full*) continue ;; esac
  mv -- "$f" archive/temp_scripts/ 2>/dev/null || true
done

# Move the helper tools
mv extract_pdf_pages.py archive/tools/
mv find_*.py archive/tools/
mv search_*.py archive/tools/
mv show_*.py archive/tools/
mv visualize_*.py archive/tools/
mv ocr_bridge_cross_platform.py archive/tools/
mv pdf_processor.py archive/tools/

# Move the old documents
mv ADDITIONAL_FIXES_SUMMARY.md archive/docs/
mv CMA_LOGO_POSITION_FIX.md archive/docs/
mv CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md archive/docs/
mv CRT_EXTRACT_INVESTIGATION_REPORT.md archive/docs/
mv OCR_INTEGRATION_README.md archive/docs/
mv PADDLEOCRVL_5MIN_TIMEOUT_GUIDE.md archive/docs/
mv PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md archive/docs/
mv QUICK_FIX_REFERENCE.md archive/docs/
mv ROOT_CAUSE_ANALYSIS.md archive/docs/
mv SEAL_SELECTION_FIX.md archive/docs/
mv WSL_INSTALLATION_GUIDE.md archive/docs/
mv YDQ23_001838_FINAL_FIX_SUMMARY.md archive/docs/
mv 3PDF_SEAL_INVESTIGATION_REPORT.md archive/docs/
mv INTEGRATION_TEST_REPORT.md archive/docs/
```
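
Because glob-based moves are easy to get wrong, it can help to preview them first. The following is a minimal sketch, not part of the plan itself: `archive_files`, the `PROTECTED` list, and the `DRY_RUN` variable are all hypothetical names. It moves the given files into a destination directory, skips anything listed as protected, and only prints what it would do when `DRY_RUN=1`.

```bash
# Core files that must never be archived (taken from the "keep" list).
PROTECTED="test_accuracy_batch_full.py cma_extraction_template_primary.py cma_extraction_final.py"

# Usage: archive_files DEST FILE...
archive_files() {
  local dest="$1" f
  shift
  for f in "$@"; do
    [ -e "$f" ] || continue          # an unmatched glob stays literal; skip it
    case " $PROTECTED " in
      *" $f "*) echo "skip (protected): $f"; continue ;;
    esac
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would move: $f -> $dest/"
    else
      mkdir -p "$dest" && mv -- "$f" "$dest/"
    fi
  done
}
```

For example, `DRY_RUN=1 archive_files archive/temp_scripts test_*.py` lists the test scripts that would be archived and reports the main script as skipped, without touching any file.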

### Step 3: Delete Unneeded Files

```bash
# Delete the copy and the temporary files
rm "test_accuracy_batch_full - 副本.py"
rm classpath.txt
rm ping.json
rm install_wsl.bat
```
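
Before deleting the copy, it may be worth confirming that it really is a duplicate of the main script. This is a minimal sketch under that assumption: `remove_if_identical` is a hypothetical helper that uses `cmp -s` to compare the two files byte for byte and deletes the copy only when they match.

```bash
# Usage: remove_if_identical ORIGINAL COPY
# Deletes COPY only if it is byte-identical to ORIGINAL.
remove_if_identical() {
  local original="$1" copy="$2"
  if [ ! -e "$copy" ]; then
    echo "nothing to do: $copy not found"
  elif cmp -s -- "$original" "$copy"; then
    rm -- "$copy" && echo "removed identical copy: $copy"
  else
    echo "kept (not identical to $original): $copy"
    return 1
  fi
}
```

Here that would be `remove_if_identical test_accuracy_batch_full.py "test_accuracy_batch_full - 副本.py"`. If the copy turns out to differ, it is left in place so the differences can be reviewed first.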

### Step 4: Clean the Output Directories (Optional)

```bash
# Remove test output (skip this step if you want to keep the results)
# rm -rf test_reports_full/
# rm test_accuracy_full.log
```

## ✅ Directory Structure After Cleanup

```
project-root/
├── test_accuracy_batch_full.py          # Main script
├── TEST_ACCURACY_BATCH_README.md        # Usage documentation
├── TEST_ACCURACY_BATCH_DEPENDENCIES.md  # Dependency documentation
├── CLAUDE.md                            # Project guide
├── IMPLEMENTATION_SUMMARY.md            # Implementation summary
│
├── cma_extraction_template_primary.py   # CMA extraction module
├── cma_extraction_final.py              # Fallback module
│
├── src/test/resources/data/             # Test data
│   ├── pdfs/
│   └── results.json
│
├── template/                            # Template files
│   └── CMA_Logo.png
│
├── archive/                             # Archive directory
│   ├── temp_scripts/                    # Debug scripts
│   ├── tools/                           # Helper tools
│   ├── crt_tests/                       # CRT tests
│   ├── ocr_tests/                       # OCR tests
│   └── docs/                            # Old documents
│
├── pom.xml                              # Maven configuration
├── settings.xml                         # Maven settings
├── requirements.txt                     # Python dependencies
│
└── src/                                 # Source directory
    └── ...
```

## 📦 Cleanup Script

I can create an automated cleanup script for you:

```bash
#!/bin/bash
# cleanup_project.sh

echo "Starting project cleanup..."

# Create the archive directories
mkdir -p archive/{temp_scripts,tools,crt_tests,ocr_tests,docs}

# Move the CRT tests (before the quick_*.py glob below can catch them)
echo "Archiving CRT tests..."
mv diagnose_crt_extraction.py inspect_certificate_data.py \
   quick_crt_test.py standalone_crt_test.py archive/crt_tests/ 2>/dev/null

# Move the OCR tests (before the test_*.py loop below can catch them)
echo "Archiving OCR tests..."
mv investigate_seal_3.py test_paddleocrvl*.py test_vl_simple.py \
   archive/ocr_tests/ 2>/dev/null

# Move the debug scripts
echo "Archiving debug scripts..."
mv analyze_*.py debug_*.py quick_*.py run_*.py simple_*.py \
   verify_*.py force_*.py archive/temp_scripts/ 2>/dev/null
# test_*.py also matches the main script and its copy, so skip both
for f in test_*.py; do
  case "$f" in test_accuracy_batch_full*) continue ;; esac
  mv -- "$f" archive/temp_scripts/ 2>/dev/null
done

# Move the helper tools
echo "Archiving helper tools..."
mv extract_pdf_pages.py find_*.py search_*.py show_*.py \
   visualize_*.py ocr_bridge_cross_platform.py pdf_processor.py \
   archive/tools/ 2>/dev/null

# Move the old documents
echo "Archiving old documents..."
mv ADDITIONAL_FIXES_SUMMARY.md CMA_LOGO_POSITION_FIX.md \
   CMA_TEMPLATE_MATCHING_OPTIMIZATION_REPORT.md \
   CRT_EXTRACT_INVESTIGATION_REPORT.md OCR_INTEGRATION_README.md \
   PADDLEOCRVL_*.md QUICK_FIX_REFERENCE.md ROOT_CAUSE_ANALYSIS.md \
   SEAL_SELECTION_FIX.md WSL_INSTALLATION_GUIDE.md \
   YDQ23_001838_FINAL_FIX_SUMMARY.md 3PDF_SEAL_INVESTIGATION_REPORT.md \
   INTEGRATION_TEST_REPORT.md archive/docs/ 2>/dev/null

# Delete unneeded files
echo "Deleting temporary files..."
rm "test_accuracy_batch_full - 副本.py" 2>/dev/null
rm classpath.txt ping.json install_wsl.bat 2>/dev/null

echo "Cleanup complete!"
echo ""
echo "Core files kept:"
ls -1 *.py *.md 2>/dev/null | head -10
```
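
After running the script, a quick check that every core file is still in place guards against an over-eager glob. The sketch below is an illustration only: `check_core_files` is a hypothetical function, and its file list is copied from the "files to keep" section of this plan.

```bash
# Usage: check_core_files [PROJECT_ROOT]
# Prints each missing core file; the exit status is the number missing.
check_core_files() {
  local root="${1:-.}" missing=0 f
  for f in \
    test_accuracy_batch_full.py \
    cma_extraction_template_primary.py \
    cma_extraction_final.py \
    CLAUDE.md \
    TEST_ACCURACY_BATCH_README.md \
    TEST_ACCURACY_BATCH_DEPENDENCIES.md \
    IMPLEMENTATION_SUMMARY.md \
    requirements.txt \
    settings.xml \
    pom.xml \
    .classpath \
    project/settings.xml \
    template/CMA_Logo.png
  do
    if [ ! -e "$root/$f" ]; then
      echo "MISSING: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ] && echo "All core files present."
  return "$missing"
}
```

Calling `check_core_files .` from the project root right after the cleanup gives a non-zero exit status if anything on the keep list was moved or deleted by mistake.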

## 🎯 Recommended Cleanup Plans

### Plan A: Conservative Cleanup (Recommended)

**Archive all test and debug scripts; keep the core functionality**

- Archive all `test_*.py`, `debug_*.py`, and `analyze_*.py` scripts (except the main script, `test_accuracy_batch_full.py`)
- Archive all old documents
- Keep the main script and the core modules
- Keep the main documentation

### Plan B: Aggressive Cleanup

**Delete all temporary scripts; keep only the required files**

- Delete all test scripts (already archived)
- Delete all debug scripts
- Keep only the main script and the CMA extraction modules
- Delete all old documents (keep the main README)

### Plan C: Staged Cleanup

**Archive first, then delete after an observation period**

1. Step one: move the files into the `archive/` directory
2. Step two: wait 1-2 weeks to confirm they are not needed
3. Step three: delete them or archive them permanently

## ⚡ Quick Cleanup Commands

If you want to run the cleanup right away, I can:

1. Create the `archive/` directory structure
2. Move all non-core files
3. Create `.gitignore` rules
4. Commit the cleaned-up state

**Please choose a cleanup plan**:
- Plan A: conservative cleanup (recommended)
- Plan B: aggressive cleanup
- Plan C: create the archive directories only, delete nothing

---

**Note**: Before running the cleanup, it is best to commit the current state to git so that it can be restored.
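
One way to take that snapshot is sketched below. It assumes the project is already a git repository with a configured user identity; the function name `snapshot_before_cleanup`, the commit message, and the tag name `pre-cleanup` are all hypothetical choices. Note that `git add -A` also stages untracked files, which is intended here so that scripts never committed before can still be recovered.

```bash
# Commit everything (tracked and untracked) and tag the result,
# so the pre-cleanup state can be restored later.
snapshot_before_cleanup() {
  git add -A || return 1
  # Commit only if something is staged; a clean tree needs no new commit
  git diff --cached --quiet ||
    git commit -q -m "Snapshot before project cleanup" || return 1
  git tag -f pre-cleanup
}
```

A file removed by the cleanup can then be brought back with, for example, `git checkout pre-cleanup -- debug_full_ocr.py`.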