report-detect/archive/temp_scripts/test_crt_direct.py

"""
Directly test the CRT extraction function.
"""
from test_accuracy_batch_full import extract_institution_from_crt
import sys

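# Redirect stdout to avoid encoding issues (e.g. on consoles that are not UTF-8)
class UTF8Stdout:
    def __init__(self, buffer):
        # Keep the original binary stream; sys.stdout itself is replaced below
        self._buffer = buffer

    def write(self, text):
        if isinstance(text, str):
            text = text.encode('utf-8', errors='replace')
        self._buffer.write(text)

    def flush(self):
        self._buffer.flush()

# Install the wrapper; the original script defined the class but never used it
sys.stdout = UTF8Stdout(sys.stdout.buffer)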

print("Testing CRT extraction...")

pdf_path = "src/test/resources/data/pdfs/YDQ25_002294.pdf"
result = extract_institution_from_crt(pdf_path)

print(f"\nResult for {pdf_path}:")
print(f"  Type: {type(result)}")
print(f"  Length: {len(result)}")
print(f"  Content: {result}")

# Also test YDQ23_001838.pdf
pdf_path2 = "src/test/resources/data/pdfs/YDQ23_001838.pdf"
result2 = extract_institution_from_crt(pdf_path2)

print(f"\nResult for {pdf_path2}:")
print(f"  Type: {type(result2)}")
print(f"  Length: {len(result2)}")
print(f"  Content: {result2}")

# Check if expected institution is in results
expected = "广东产品质量监督检验研究院"
print(f"\nExpected institution: {expected}")
print(f"  Found in PDF1: {expected in result}")
print(f"  Found in PDF2: {expected in result2}")
chore(project): conservative cleanup - archive temp scripts and old docs

Major cleanup to improve project organization and maintainability.

Changes:
- Moved 34 temp/debug/test scripts to archive/temp_scripts/
- Moved 9 auxiliary tools to archive/tools/
- Moved 3 CRT test scripts to archive/crt_tests/
- Moved 4 OCR test scripts to archive/ocr_tests/
- Moved 14 old documentation files to archive/docs/
- Deleted 4 useless files (duplicates, temp files)

Root directory:
- Before: 67 files (cluttered)
- After: 10 core files (clean and organized)

Core files retained:
- test_accuracy_batch_full.py (main script)
- cma_extraction_template_primary.py (CMA extraction)
- cma_extraction_final.py (backup CMA extraction)
- CLAUDE.md (project guide)
- TEST_ACCURACY_BATCH_README.md (usage guide)
- TEST_ACCURACY_BATCH_DEPENDENCIES.md (dependency docs)
- CLEANUP_PLAN.md (cleanup plan)
- CLEANUP_SUMMARY.md (this file)
- IMPLEMENTATION_SUMMARY.md (implementation summary)
- requirements.txt (dependencies)

Archive structure:
archive/
├── temp_scripts/ (34 files: test_, debug_, analyze_, etc.)
├── tools/ (9 files: find_, show_, visualize_, etc.)
├── crt_tests/ (3 files: CRT extraction tests)
├── ocr_tests/ (4 files: OCR timeout tests)
└── docs/ (14 files: old reports and guides)

Benefits:
✓ Cleaner root directory - easier navigation
✓ Better organization - clear separation of concerns
✓ Preserved history - all files archived, not deleted
✓ Improved maintainability - easier to find active files
✓ Better git history - removed 198 deleted files from tracking

No functional changes - all core functionality preserved.

Related:
- TEST_ACCURACY_BATCH_DEPENDENCIES.md - dependency analysis
- CLEANUP_PLAN.md - detailed cleanup plan

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-03 14:35:06 +08:00