# Quick Reference: PaddleOCRVL Timeout Fix

## Problem Solved

- ✓ Program no longer hangs when PaddleOCRVL encounters problematic seal images
- ✓ 60-second timeout protection on all PaddleOCRVL calls
- ✓ Graceful degradation to other OCR methods

## Quick Commands

### Run Test with Timeout Protection

```bash
python test_accuracy_batch_full.py --ocr-model paddleocr_vl --batch --batch-size 20
```

### Run Test Without PaddleOCRVL (Faster)

```bash
python test_accuracy_batch_full.py --ocr-model ppocr_v5 --batch --batch-size 20 --disable-paddleocrvl
```

### Verify Timeout Mechanism

```bash
python test_paddleocrvl_timeout.py
```

## What Changed

| File | Change | Lines |
|------|--------|-------|
| `test_accuracy_batch_full.py` | Added `_run_ocr_vl_wrapper()` | 721-784 |
| `test_accuracy_batch_full.py` | Updated `run_ocr_recognition_vl()` | 787-850 |
| `test_accuracy_batch_full.py` | Updated call site 1 | 1334 |
| `test_accuracy_batch_full.py` | Updated call site 2 | 1356 |
| `test_accuracy_batch_full.py` | Added `--disable-paddleocrvl` | 2419, 2495-2500 |

## Command-Line Options

| Option | Description |
|--------|-------------|
| `--ocr-model ppocr_v5` | Use PP-OCRv5 model (faster, 85% accuracy) |
| `--ocr-model paddleocr_vl` | Use PaddleOCRVL (slower, with timeout protection) |
| `--disable-paddleocrvl` | Skip PaddleOCRVL initialization entirely |
| `--batch` | Run batch testing mode |
| `--batch-size N` | Process N PDFs |

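The options above could be declared with `argparse` roughly as follows. This is a sketch only: the actual parser in `test_accuracy_batch_full.py` is not shown in this reference, so the function name `build_parser`, the default values, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options table; defaults are assumed."""
    parser = argparse.ArgumentParser(description="Batch OCR accuracy test (sketch)")
    parser.add_argument(
        "--ocr-model",
        choices=["ppocr_v5", "paddleocr_vl"],
        default="ppocr_v5",  # assumed default, not confirmed by the source
        help="OCR model to use",
    )
    parser.add_argument(
        "--disable-paddleocrvl",
        action="store_true",  # flag is False unless given on the command line
        help="Skip PaddleOCRVL initialization entirely",
    )
    parser.add_argument("--batch", action="store_true", help="Run batch testing mode")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=20,  # assumed default
        metavar="N",
        help="Process N PDFs",
    )
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the demo does not depend on sys.argv.
    demo_args = build_parser().parse_args(
        ["--ocr-model", "ppocr_v5", "--batch", "--batch-size", "20", "--disable-paddleocrvl"]
    )
    print(demo_args)
```

`argparse` maps dashes to underscores, so the flag is read back as `args.disable_paddleocrvl`.
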
## Expected Behavior

### Before Fix

```
2026-03-03 09:43:56,229 - WARNING - Seal #1: Unwarp OCR failed...
[program hangs indefinitely]
```

### After Fix

```
2026-03-03 09:43:56,229 - WARNING - Seal #1: Unwarp OCR failed...
2026-03-03 09:44:56,229 - WARNING - PaddleOCRVL recognition timeout (60s) for ...
[continues to next seal]
```

## Key Features

- ✓ **60-second timeout** per PaddleOCRVL call
- ✓ **Automatic cleanup** of hung processes
- ✓ **Graceful degradation** to other OCR methods
- ✓ **Windows compatible** (uses spawn mode)
- ✓ **User control** via the `--disable-paddleocrvl` flag

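Graceful degradation can be pictured as a chain of OCR engines tried in order, where an empty result (for example after a timeout) passes control to the next engine. The sketch below is illustrative only: `recognize_with_fallback`, `timed_out_vl`, and `fast_ocr` are hypothetical names, and the engine order is an assumption rather than the order used in `test_accuracy_batch_full.py`.

```python
from typing import Callable, Dict, Iterable, Tuple

OcrFn = Callable[[str], Dict[str, str]]


def recognize_with_fallback(
    image_path: str, engines: Iterable[Tuple[str, OcrFn]]
) -> Tuple[str, Dict[str, str]]:
    """Try each engine in order; return (engine_name, result) for the first non-empty result."""
    for name, engine in engines:
        try:
            result = engine(image_path)
        except Exception:
            result = {}  # treat a crash like an empty result and move on
        if result:
            return name, result
    return "", {}  # every engine failed or returned nothing


def timed_out_vl(image_path: str) -> Dict[str, str]:
    """Hypothetical stand-in for a PaddleOCRVL call that timed out (empty dict)."""
    return {}


def fast_ocr(image_path: str) -> Dict[str, str]:
    """Hypothetical stand-in for a faster fallback OCR engine."""
    return {"text": f"text from {image_path}"}


if __name__ == "__main__":
    chain = [("paddleocr_vl", timed_out_vl), ("ppocr_v5", fast_ocr)]
    print(recognize_with_fallback("seal.png", chain))
```

Returning the engine name alongside the result makes it easy to log which method actually produced the text.
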
## Test Results

```
Timeout mechanism: PASSED
Normal completion: PASSED
```

## Troubleshooting

### Issue: Still seeing timeouts

**Solution**: Use the `--disable-paddleocrvl` flag or switch to the `ppocr_v5` model.

### Issue: Processing is too slow

**Solution**: Use `--ocr-model ppocr_v5` for faster processing (85% accuracy).

### Issue: Need to debug timeout

**Solution**: Check the logs for PaddleOCRVL timeout warnings, such as "timeout after 60s" or "recognition timeout (60s)", and examine the corresponding seal images.

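When debugging, a few lines of Python can pull the timeout warnings out of a log. This is a sketch under assumptions: the function name `find_timeout_lines` and the regular expression are illustrative, and the pattern simply matches any line that mentions PaddleOCRVL followed by the word "timeout". The sample lines are the ones shown under Expected Behavior.

```python
import re
from typing import Iterable, List

# Assumed pattern: any line mentioning PaddleOCRVL and, later on, "timeout".
TIMEOUT_RE = re.compile(r"PaddleOCRVL.*timeout", re.IGNORECASE)


def find_timeout_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like PaddleOCRVL timeout warnings."""
    return [line.rstrip("\n") for line in lines if TIMEOUT_RE.search(line)]


SAMPLE_LOG = [
    "2026-03-03 09:43:56,229 - WARNING - Seal #1: Unwarp OCR failed...",
    "2026-03-03 09:44:56,229 - WARNING - PaddleOCRVL recognition timeout (60s) for ...",
]

if __name__ == "__main__":
    for hit in find_timeout_lines(SAMPLE_LOG):
        print(hit)
```

The function accepts any iterable of strings, so an open log file can be passed directly in place of `SAMPLE_LOG`.
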
## Technical Details

- **Implementation**: Multiprocessing with 60s timeout
- **Process**: `terminate()` → wait 5s → `kill()` if needed
- **Result**: Returns empty dict on timeout, allows fallback OCR
- **Compatibility**: Windows (spawn), Linux (fork)

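The terminate-then-kill sequence above can be sketched with the standard `multiprocessing` module. This is a minimal illustration, not the code in `_run_ocr_vl_wrapper()`: the names `run_with_timeout`, `_child_entry`, and `fake_ocr` are hypothetical, and passing the result back through a queue is an assumption about how such a wrapper might be built.

```python
import multiprocessing as mp
import queue
import time


def _child_entry(result_queue, func, args):
    """Runs in the child: call func and send back ("ok", result) or ("error", text)."""
    try:
        result_queue.put(("ok", func(*args)))
    except Exception as exc:
        result_queue.put(("error", repr(exc)))


def run_with_timeout(func, args=(), timeout_s=60.0, grace_s=5.0, start_method=None):
    """Return func(*args) computed in a child process, or {} on timeout or error.

    start_method=None uses the platform default; pass "spawn" or "fork" to force one.
    """
    ctx = mp.get_context(start_method)
    result_queue = ctx.Queue()
    proc = ctx.Process(target=_child_entry, args=(result_queue, func, args), daemon=True)
    proc.start()
    try:
        # Read the result before joining so a large payload cannot block the child.
        status, payload = result_queue.get(timeout=timeout_s)
    except queue.Empty:
        proc.terminate()      # polite stop first
        proc.join(grace_s)    # wait up to grace_s seconds for it to exit
        if proc.is_alive():
            proc.kill()       # force stop if terminate() was not enough
            proc.join()
        return {}             # empty dict signals "no result, use fallback OCR"
    proc.join(grace_s)
    return payload if status == "ok" else {}


def fake_ocr(delay_s):
    """Hypothetical stand-in for a PaddleOCRVL call that takes delay_s seconds."""
    time.sleep(delay_s)
    return {"text": "demo"}
```

A call such as `run_with_timeout(fake_ocr, (120.0,), timeout_s=60.0)` would return `{}` after roughly 60 seconds. Under the spawn start method the target function and its arguments must be picklable, and the calling script needs an `if __name__ == "__main__":` guard because the child re-imports the main module.
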
## Files

- `test_accuracy_batch_full.py` - Main implementation
- `test_paddleocrvl_timeout.py` - Verification test
- `PADDLEOCRVL_TIMEOUT_FIX_SUMMARY.md` - Detailed documentation