report-detect/archive/docs/QUICK_FIX_REFERENCE.md


# Quick Reference: PaddleOCRVL Timeout Fix
## Problem Solved
- ✓ The program no longer hangs when PaddleOCRVL encounters problematic seal images
- ✓ 60-second timeout protection on all PaddleOCRVL calls
- ✓ Graceful degradation to other OCR methods
## Quick Commands
### Run Test with Timeout Protection
```bash
python test_accuracy_batch_full.py --ocr-model paddleocr_vl --batch --batch-size 20
```
### Run Test Without PaddleOCRVL (Faster)
```bash
python test_accuracy_batch_full.py --ocr-model ppocr_v5 --batch --batch-size 20 --disable-paddleocrvl
```
### Verify Timeout Mechanism
```bash
python test_paddleocrvl_timeout.py
```
## What Changed
| File | Change | Lines |
|------|--------|-------|
| test_accuracy_batch_full.py | Added `_run_ocr_vl_wrapper()` | 721-784 |
| test_accuracy_batch_full.py | Updated `run_ocr_recognition_vl()` | 787-850 |
| test_accuracy_batch_full.py | Updated call site 1 | 1334 |
| test_accuracy_batch_full.py | Updated call site 2 | 1356 |
| test_accuracy_batch_full.py | Added `--disable-paddleocrvl` | 2419, 2495-2500 |
## Command-Line Options
| Option | Description |
|--------|-------------|
| `--ocr-model ppocr_v5` | Use PP-OCRv5 model (faster, 85% accuracy) |
| `--ocr-model paddleocr_vl` | Use PaddleOCRVL (slower, with timeout protection) |
| `--disable-paddleocrvl` | Skip PaddleOCRVL initialization entirely |
| `--batch` | Run batch testing mode |
| `--batch-size N` | Process N PDFs |
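
For orientation, the options in the table could be declared with `argparse` roughly as follows. This is a minimal sketch, not the actual parser in `test_accuracy_batch_full.py`: the function name `build_parser`, the default values, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options listed in the table."""
    parser = argparse.ArgumentParser(description="Illustrative option sketch")
    parser.add_argument(
        "--ocr-model",
        choices=["ppocr_v5", "paddleocr_vl"],
        default="ppocr_v5",  # assumed default, not confirmed by the source
        help="OCR model to use",
    )
    parser.add_argument(
        "--disable-paddleocrvl",
        action="store_true",
        help="Skip PaddleOCRVL initialization entirely",
    )
    parser.add_argument("--batch", action="store_true", help="Run batch testing mode")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=20,  # assumed default, not confirmed by the source
        metavar="N",
        help="Process N PDFs",
    )
    return parser
```

Note that `argparse` converts dashes to underscores, so `--disable-paddleocrvl` is read back as `args.disable_paddleocrvl`.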
## Expected Behavior
### Before Fix
```
2026-03-03 09:43:56,229 - WARNING - Seal #1: Unwarp OCR failed...
[program hangs indefinitely]
```
### After Fix
```
2026-03-03 09:43:56,229 - WARNING - Seal #1: Unwarp OCR failed...
2026-03-03 09:44:56,229 - WARNING - PaddleOCRVL recognition timeout (60s) for ...
[continues to next seal]
```
## Key Features
- **60-second timeout** per PaddleOCRVL call
- **Automatic cleanup** of hung processes
- **Graceful degradation** to other OCR methods
- **Windows compatible** (uses spawn mode)
- **User control** via the `--disable-paddleocrvl` flag
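
Graceful degradation can be pictured as trying OCR engines in order and treating an empty result as "move on to the next one". The sketch below illustrates that idea only; the function name `recognize_with_fallback` and the `(name, callable)` engine list are hypothetical and are not taken from the main script.

```python
def recognize_with_fallback(image, engines):
    """Try each (name, engine) pair in order; return the first non-empty result.

    An empty dict is treated as "timed out or failed", matching the
    behavior described in this document for the PaddleOCRVL wrapper.
    """
    for name, engine in engines:
        result = engine(image)
        if result:  # non-empty dict means this engine produced output
            return name, result
    return None, {}  # every engine came back empty
```

If the first engine returns `{}` (for example after a timeout), the caller simply receives the output of the next engine in the list.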
## Test Results
```
Timeout mechanism: PASSED
Normal completion: PASSED
```
## Troubleshooting
### Issue: Still seeing timeouts
**Solution**: Use `--disable-paddleocrvl` flag or switch to `ppocr_v5` model
### Issue: Processing is too slow
**Solution**: Use `--ocr-model ppocr_v5` for faster processing (85% accuracy)
### Issue: Need to debug timeout
**Solution**: Check the logs for "PaddleOCRVL recognition timeout (60s)" warnings and examine the seal images they reference
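
A few lines of Python are enough to pull the timeout warnings out of a log. This sketch assumes the log lines contain the text shown in the "After Fix" example; the helper name `find_timeout_lines` is hypothetical, and so is any log file path you pass to it.

```python
# Marker text taken from the "After Fix" log example in this document.
TIMEOUT_MARKER = "PaddleOCRVL recognition timeout"


def find_timeout_lines(lines):
    """Return the log lines that report a PaddleOCRVL timeout."""
    return [line.rstrip("\n") for line in lines if TIMEOUT_MARKER in line]


# Usage with a log file (the path is an assumption):
# with open("run.log", encoding="utf-8") as fh:
#     for hit in find_timeout_lines(fh):
#         print(hit)
```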
## Technical Details
- **Implementation**: Multiprocessing with 60s timeout
- **Process**: `terminate()` → wait 5s → `kill()` if needed
- **Result**: Returns empty dict on timeout, allows fallback OCR
- **Compatibility**: Windows (spawn), Linux (fork)
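
The mechanism summarized above can be sketched with the standard `multiprocessing` module. This is an illustration under stated assumptions, not the code of `_run_ocr_vl_wrapper()`: the names `run_with_timeout`, `_worker`, and `fake_slow_ocr` are hypothetical, and the result is passed back through a queue, which may differ from the real implementation.

```python
import multiprocessing as mp
import queue
import time


def fake_slow_ocr(seconds):
    """Hypothetical stand-in for a PaddleOCRVL call that may hang."""
    time.sleep(seconds)
    return {"text": "done"}


def _worker(func, args, result_queue):
    """Child-process entry point: run func and send its result to the parent."""
    try:
        result_queue.put(("ok", func(*args)))
    except Exception as exc:  # report failures instead of dying silently
        result_queue.put(("error", repr(exc)))


def run_with_timeout(func, args=(), timeout=60.0, grace=5.0, start_method=None):
    """Run func(*args) in a child process; return {} on timeout or error."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(func, args, result_queue))
    proc.start()
    try:
        status, payload = result_queue.get(timeout=timeout)
    except queue.Empty:
        status, payload = "timeout", None

    if status == "timeout":
        proc.terminate()      # ask the child to stop
        proc.join(grace)      # wait up to `grace` seconds
        if proc.is_alive():
            proc.kill()       # force-stop a child that ignored terminate()
            proc.join()
    else:
        proc.join()
    return payload if status == "ok" else {}
```

Under the spawn start method (the only one available on Windows), the target function must be importable at module top level, and the calling code should sit behind an `if __name__ == "__main__":` guard so the child process does not re-run it on import.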
## Files
- `test_accuracy_batch_full.py` - Main implementation
- `test_paddleocrvl_timeout.py` - Verification test
- `PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md` - Detailed documentation