# PaddleOCRVL Timeout Fix - Implementation Summary
## Problem
The `test_accuracy_batch_full.py` script was hanging indefinitely when PaddleOCRVL's `predict()` method encountered certain seal images. The program would stop responding with no timeout protection.
## Root Cause
PaddleOCRVL's `predict()` method has no built-in timeout mechanism. When processing certain problematic images, the method can block indefinitely, causing the entire program to hang.
## Solution Implemented
The fix adds timeout protection built on Python's `multiprocessing` module, in four parts:
### 1. Module-Level Wrapper Function
Added `_run_ocr_vl_wrapper()` function (line 721) that:
- Can be pickled and run in a subprocess (required for Windows compatibility)
- Re-initializes PaddleOCRVL pipeline in the subprocess
- Handles exceptions gracefully
- Returns results via a `multiprocessing.Queue`
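The shape of such a wrapper can be sketched as follows. This is a minimal illustration, not the code from `test_accuracy_batch_full.py`: `FakePipeline`, `run_ocr_wrapper_sketch`, and the `success`/`result`/`error` dictionary layout are all assumptions made so the example runs without PaddleOCR installed.

```python
class FakePipeline:
    """Hypothetical stand-in for the PaddleOCRVL pipeline (assumption)."""

    def predict(self, image_path):
        if not image_path:
            raise ValueError("empty image path")
        return [{"input": image_path, "text": "<recognized text>"}]


def run_ocr_wrapper_sketch(image_path, result_queue):
    """Module-level function, so it can be pickled as a subprocess target.

    Always puts exactly one dictionary on the queue, whether the
    prediction succeeds or raises.
    """
    try:
        # The pipeline is built here, inside the subprocess, rather than
        # being passed in from the parent process.
        pipeline = FakePipeline()
        result_queue.put({
            "success": True,
            "result": pipeline.predict(image_path),
            "error": None,
        })
    except Exception as exc:  # report the failure instead of crashing silently
        result_queue.put({
            "success": False,
            "result": None,
            "error": f"{type(exc).__name__}: {exc}",
        })
```

Because the wrapper only needs an object with a `put()` method, it can be exercised in-process with a plain `queue.Queue` before wiring it to a real subprocess.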
### 2. Timeout-Protected OCR Function
Replaced the `run_ocr_recognition_vl()` function (line 787) with a version that provides:
- Default timeout of 60 seconds
- Subprocess-based execution
- Automatic termination after timeout
- Graceful cleanup with `terminate()` and fallback to `kill()`
- Proper error handling and logging
### 3. Updated Call Sites
Updated both PaddleOCRVL call sites:
- Line 1334: Backup OCR after unwarp failure
- Line 1356: Direct OCR when unwarp is unavailable
Both now include `timeout=60` parameter.
### 4. Command-Line Option
Added `--disable-paddleocrvl` flag to:
- Allow users to completely skip PaddleOCRVL initialization
- Provide faster execution for batch testing
- Enable quick workaround if timeout issues persist
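A flag like this is typically wired up with `argparse`. The sketch below is an assumption about how the option could be declared, not the script's actual parser. The flag names come from the usage examples in this document, while the defaults, help text, and the `should_init_paddleocrvl()` helper are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the flags named in this document (defaults assumed)."""
    parser = argparse.ArgumentParser(description="Batch OCR accuracy test (sketch)")
    parser.add_argument("--ocr-model", choices=["paddleocr_vl", "ppocr_v5"],
                        default="ppocr_v5", help="primary OCR model")
    parser.add_argument("--batch", action="store_true", help="run in batch mode")
    parser.add_argument("--batch-size", type=int, default=20)
    parser.add_argument("--disable-paddleocrvl", action="store_true",
                        help="skip PaddleOCRVL initialization entirely")
    return parser


def should_init_paddleocrvl(args):
    """Hypothetical helper: initialize PaddleOCRVL only when not disabled."""
    return not args.disable_paddleocrvl
```

`argparse` converts the dashes in `--disable-paddleocrvl` to underscores, so the parsed value is read as `args.disable_paddleocrvl`.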
## Files Modified
1. **test_accuracy_batch_full.py** - Main implementation
   - Added `_run_ocr_vl_wrapper()` function
   - Replaced `run_ocr_recognition_vl()` function
   - Updated 2 call sites with timeout parameter
   - Added `--disable-paddleocrvl` command-line option
2. **test_paddleocrvl_timeout.py** - New test script
   - Verifies timeout mechanism works correctly
   - Tests both timeout and normal completion scenarios
   - All tests PASSED
## Usage
### Option 1: Use with Timeout Protection (Default)
```bash
# Uses PaddleOCRVL with 60s timeout protection
python test_accuracy_batch_full.py --ocr-model paddleocr_vl --batch --batch-size 20
```
### Option 2: Disable PaddleOCRVL (Faster)
```bash
# Skip PaddleOCRVL entirely, use only ppocr_v5
python test_accuracy_batch_full.py --ocr-model ppocr_v5 --batch --batch-size 20 --disable-paddleocrvl
```
### Option 3: Use ppocr_v5 Model (Recommended for Speed)
```bash
# Use ppocr_v5 for both primary and backup OCR
python test_accuracy_batch_full.py --ocr-model ppocr_v5 --batch --batch-size 20
```
## Test Results
### Timeout Test
```
Timeout mechanism: PASSED
Normal completion: PASSED
[OK] All tests passed! The multiprocessing timeout mechanism works correctly.
PaddleOCRVL calls will be protected from hanging indefinitely.
```
### Key Features
1. **60-Second Timeout**: Each PaddleOCRVL call is limited to 60 seconds
2. **Graceful Degradation**: Timeout returns empty result, allowing other OCR methods to be tried
3. **Resource Cleanup**: Subprocesses are properly terminated even if they hang
4. **Windows Compatible**: Uses module-level functions to avoid pickle issues
5. **Detailed Logging**: All timeouts are logged with context for debugging
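The graceful-degradation behavior can be illustrated with a small fallback loop. This is a sketch under assumptions: `recognize_with_fallback` and the `success`/`result` dictionary keys are hypothetical, and the real script may order or combine its OCR methods differently.

```python
def recognize_with_fallback(image_path, engines):
    """Try each (name, callable) pair in order.

    Returns (engine_name, result) for the first engine that reports
    success with a non-empty result, or (None, []) if none does.
    """
    for name, engine in engines:
        outcome = engine(image_path)
        # A timed-out call is expected to come back as an empty,
        # unsuccessful outcome, so the loop simply moves on.
        if outcome.get("success") and outcome.get("result"):
            return name, outcome["result"]
    return None, []
```

Treating a timeout as an ordinary empty result keeps the caller free of special cases: it does not need to know why an engine produced nothing.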
## Benefits
1. **No More Hanging**: Program will never block indefinitely on PaddleOCRVL
2. **Predictable Runtime**: At most 60 seconds per PaddleOCRVL call, plus up to 5 seconds of cleanup when a subprocess has to be stopped
3. **Better Error Handling**: Clear error messages when timeouts occur
4. **User Control**: Option to disable PaddleOCRVL if needed
5. **Backward Compatible**: Existing code continues to work with minimal changes
## Technical Details
### Multiprocessing on Windows
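Windows uses the `spawn` start method for multiprocessing: each subprocess starts a fresh Python interpreter and re-imports the main module. This requires:
- Target functions and their arguments to be picklable
- Target functions to be defined at module level (not nested)
- Any state the subprocess needs to be rebuilt there, since nothing is inherited from the parent
This is why `_run_ocr_vl_wrapper` is defined at module level and re-initializes the PaddleOCRVL pipeline.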
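The module-level requirement can be checked directly with `pickle`, which serializes functions by their importable name. The sketch below is illustrative only, and all function names in it are hypothetical.

```python
import pickle


def module_level_worker(x):
    """Defined at module level: picklable by reference to its qualified name."""
    return x * 2


def make_nested_worker():
    """Returns a function defined inside another function."""
    def nested_worker(x):
        return x * 2
    return nested_worker


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exception type for local functions varies by Python version.
        return False
    return True
```

Calling `is_picklable(make_nested_worker())` returns `False`, because a nested function has no importable name for the subprocess to look up. The same limitation applies to lambdas.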
### Timeout Mechanism Flow
1. Main process creates a `multiprocessing.Queue`
2. Subprocess starts with the wrapper function
3. Main process waits with a 60-second timeout
4. If the timeout occurs:
   - `terminate()` stops the subprocess (SIGTERM on POSIX, `TerminateProcess()` on Windows)
   - Wait 5 seconds for cleanup
   - If still alive, `kill()` forces termination (SIGKILL on POSIX; on Windows it is the same as `terminate()`)
5. Return a failure result to allow fallback
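The flow above can be sketched as follows. This is a minimal version under stated assumptions, not the code in `run_ocr_recognition_vl()`: the helper names `stop_process` and `run_with_timeout`, the result dictionary layout, and the choice to wait on the queue rather than on `join()` are all illustrative.

```python
import multiprocessing as mp
import queue


def stop_process(proc, grace_seconds=5.0):
    """Escalate from terminate() to kill(); report which step ended the process."""
    proc.terminate()
    proc.join(grace_seconds)       # give the process time to exit
    if proc.is_alive():
        proc.kill()                # last resort
        proc.join()
        return "killed"
    return "terminated"


def run_with_timeout(target, args=(), timeout=60.0, grace_seconds=5.0):
    """Run target(*args, result_queue) in a subprocess with a time limit.

    `target` must be a module-level function that puts one result
    dictionary on the queue it receives as its last argument.
    """
    result_queue = mp.Queue()
    proc = mp.Process(target=target, args=(*args, result_queue), daemon=True)
    proc.start()
    try:
        # Wait on the queue rather than join(): a child that has queued
        # data may not exit until that data has been consumed.
        result = result_queue.get(timeout=timeout)
    except queue.Empty:
        # Also reached if the child died without reporting anything.
        how = stop_process(proc, grace_seconds)
        return {"success": False, "result": None,
                "error": f"timeout after {timeout:g}s ({how})"}
    proc.join(grace_seconds)
    if proc.is_alive():
        stop_process(proc, grace_seconds)
    return result
```

Under the `spawn` start method, the script that calls `run_with_timeout()` needs to do so from inside an `if __name__ == "__main__":` guard, because each subprocess re-imports the main module.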
### Error Handling
The implementation handles multiple error scenarios:
- Process timeout (most common)
- Process crash during execution
- Queue communication failures
- PaddleOCRVL initialization failures
- File I/O errors
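One way to tell these scenarios apart in the parent process is to look at three observations: whether the wait timed out, whether a result arrived, and the subprocess exit code. The function and labels below are hypothetical and only sketch the idea; the script's actual error handling may be organized differently.

```python
def classify_outcome(result, exitcode, timed_out):
    """Map what the parent observed to a coarse label (labels are assumptions).

    result    -- dictionary received from the queue, or None if nothing arrived
    exitcode  -- Process.exitcode (None while the process is still running)
    timed_out -- True if the wait for a result exceeded the timeout
    """
    if timed_out:
        return "timeout"
    if result is None:
        # Nothing arrived on the queue. A non-zero exit code points to a
        # crash; on POSIX a negative code -N means the child was ended by
        # signal N.
        if exitcode not in (0, None):
            return "crash"
        return "queue_failure"
    if not result.get("success", False):
        # The wrapper ran but reported a failure, for example during
        # pipeline initialization or file I/O.
        return "worker_error"
    return "ok"
```

Keeping the classification in a pure function makes it easy to unit-test each scenario without starting real subprocesses.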
## Recommendations
1. **For Testing**: Use `--ocr-model ppocr_v5` for faster batch processing
2. **For Production**: Keep default timeout (60s) for PaddleOCRVL backup
3. **For Debugging**: Check logs for "timeout after 60s" messages to identify problematic seals
4. **For Slow Images**: Increase the timeout only if legitimate cases need more than 60 seconds
## Future Improvements
1. Add adaptive timeout based on image size
2. Cache PaddleOCRVL results to avoid re-processing
3. Add statistics on timeout frequency
4. Consider using ProcessPoolExecutor for better resource management
## Verification
To verify the fix works:
```bash
# Run timeout test
python test_paddleocrvl_timeout.py
# Run batch test with PaddleOCRVL
python test_accuracy_batch_full.py --ocr-model paddleocr_vl --batch --batch-size 5
# Verify no hanging occurs
# Check test_reports_full/test_report.json for results
```
## Related Files
- `test_accuracy_batch_full.py` - Main implementation (lines 721-850)
- `test_paddleocrvl_timeout.py` - Timeout verification test
- `test_reports_full/test_report.json` - Test results output
## Conclusion
The PaddleOCRVL timeout issue has been successfully resolved. The program will no longer hang indefinitely when processing problematic seal images. The timeout mechanism provides a balance between allowing sufficient time for legitimate processing and preventing indefinite blocks.