report-detect/archive/temp_scripts/run_test_fresh.py

chore(project): conservative cleanup - archive temp scripts and old docs Major cleanup to improve project organization and maintainability. Changes: - Moved 34 temp/debug/test scripts to archive/temp_scripts/ - Moved 9 auxiliary tools to archive/tools/ - Moved 3 CRT test scripts to archive/crt_tests/ - Moved 4 OCR test scripts to archive/ocr_tests/ - Moved 14 old documentation files to archive/docs/ - Deleted 4 useless files (duplicates, temp files) Root directory: - Before: 67 files (cluttered) - After: 10 core files (clean and organized) Core files retained: - test_accuracy_batch_full.py (main script) - cma_extraction_template_primary.py (CMA extraction) - cma_extraction_final.py (backup CMA extraction) - CLAUDE.md (project guide) - TEST_ACCURACY_BATCH_README.md (usage guide) - TEST_ACCURACY_BATCH_DEPENDENCIES.md (dependency docs) - CLEANUP_PLAN.md (cleanup plan) - CLEANUP_SUMMARY.md (this file) - IMPLEMENTATION_SUMMARY.md (implementation summary) - requirements.txt (dependencies) Archive structure: archive/ ├── temp_scripts/ (34 files: test_, debug_, analyze_, etc.) ├── tools/ (9 files: find_, show_, visualize_, etc.) ├── crt_tests/ (3 files: CRT extraction tests) ├── ocr_tests/ (4 files: OCR timeout tests) └── docs/ (14 files: old reports and guides) Benefits: ✓ Cleaner root directory - easier navigation ✓ Better organization - clear separation of concerns ✓ Preserved history - all files archived, not deleted ✓ Improved maintainability - easier to find active files ✓ Better git history - removed 198 deleted files from tracking No functional changes - all core functionality preserved. Related: - TEST_ACCURACY_BATCH_DEPENDENCIES.md - dependency analysis - CLEANUP_PLAN.md - detailed cleanup plan Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:35:06 +08:00
"""
Run fresh test with cleared cache
"""
import os
import shutil
import sys

# Clear all Python bytecode caches under the current directory
print("Clearing Python cache...")
for root, dirs, files in os.walk("."):
    for d in dirs:
        if d == "__pycache__":
            cache_path = os.path.join(root, d)
            try:
                shutil.rmtree(cache_path)
                print(f" Removed: {cache_path}")
            except OSError:
                # Best effort: skip cache directories that cannot be removed
                pass
# Clear module cache
print("Clearing module cache...")
modules_to_clear = [m for m in sys.modules.keys() if m.startswith('cma_extraction') or m.startswith('test_accuracy')]
for module in modules_to_clear:
    del sys.modules[module]
print(f" Cleared {len(modules_to_clear)} modules")
# Run test
print("\nRunning test for YDQ23_001838.pdf...")
print("=" * 80)
from test_accuracy_batch_full import process_single_pdf
from pathlib import Path
pdf_name = "YDQ23_001838.pdf"
pdf_dir = Path("src/test/resources/data/pdfs")
output_dir = Path("test_reports_fresh")
# Load expected results
import json
results_file = Path("src/test/resources/data/results.json")
with open(results_file, 'r', encoding='utf-8') as f:
    expected_results = json.load(f)
expected_cma = expected_results.get(pdf_name, {}).get('cma')
expected_inst = expected_results.get(pdf_name, {}).get('institution')
# Initialize OCR
os.environ["DISABLE_MODEL_SOURCE_CHECK"] = "True"
from paddleocr import PaddleOCR
ocr_engine = PaddleOCR(lang='ch')
# Process
result = process_single_pdf(
    pdf_name=pdf_name,
    expected_cma=expected_cma,
    expected_inst=expected_inst,
    pdf_dir=pdf_dir,
    output_dir=output_dir / pdf_name,
    ocr_engine=ocr_engine,
    ocr_model="ppocr_v5",
    vl_pipeline=None
)
print("\n" + "=" * 80)
print("TEST RESULT")
print("=" * 80)
print(f"Expected CMA: {expected_cma}")
print(f"Extracted CMA: {result['extracted']['cma']}")
print(f"Match: {result['comparison']['cma'].get('match_type', 'UNKNOWN')}")
print(f"Similarity: {result['comparison']['cma'].get('similarity', 0):.1f}%")