152 lines
3.8 KiB
Markdown
152 lines
3.8 KiB
Markdown
|
|
# YDQ23_001838.pdf 和 YDQ23_001850.pdf 的CMA码识别问题分析
|
|||
|
|
|
|||
|
|
## 问题描述
|
|||
|
|
|
|||
|
|
### 预期结果
|
|||
|
|
- PDF: YDQ23_001838.pdf
|
|||
|
|
- 期望CMA码: 210020349096
|
|||
|
|
- 实际CMA码: 440023010130 ❌
|
|||
|
|
|
|||
|
|
### 问题
|
|||
|
|
440023010130这串数字是从哪里来的?
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 调查结果
|
|||
|
|
|
|||
|
|
### 1. PDF文本层分析
|
|||
|
|
|
|||
|
|
```bash
|
|||
|
|
Found 440023010130 in PDF text:
|
|||
|
|
Line 1: No粤4400230101300071
|
|||
|
|
|
|||
|
|
210020349096 NOT found in PDF text!
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**关键发现**:
|
|||
|
|
- ✅ 440023010130 存在于PDF文本层(在报告编号中)
|
|||
|
|
- ❌ 210020349096 **不在PDF文本层**(只在图像中)
|
|||
|
|
|
|||
|
|
### 2. 模板匹配位置分析
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
Page size: 1191x1684
|
|||
|
|
Best match position: (119, 1437)
|
|||
|
|
Relative position: (17.4%, 88.7%) ← 在页面底部!
|
|||
|
|
Confidence: 0.945
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**问题**:模板匹配找到了页面**底部**的logo,而不是顶部正确的CMA logo!
|
|||
|
|
|
|||
|
|
### 3. 匹配结果
|
|||
|
|
|
|||
|
|
找到**160万个匹配**(阈值0.5太低),最佳匹配在:
|
|||
|
|
|
|||
|
|
| 位置 | 相对位置 | 置信度 | 区域 |
|
|||
|
|
|------|---------|--------|------|
|
|||
|
|
| (119, 1437) | (17.4%, 88.7%) | 0.945 | 页面**底部** |
|
|||
|
|
| (514, 1010) | (50.5%, 63.3%) | 0.944 | 页面中间 |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 根本原因
|
|||
|
|
|
|||
|
|
### 1. 页面底部有类似CMA logo的图案
|
|||
|
|
|
|||
|
|
在YDQ23_001838.pdf的页面底部(88.7%高度)有一个图案,与CMA logo很相似,匹配度更高(0.945)。
|
|||
|
|
|
|||
|
|
### 2. 真正的CMA logo在顶部
|
|||
|
|
|
|||
|
|
CMA标志和CMA码(210020349096)应该在**页面顶部**(0-30%高度),但模板匹配选择了底部的假logo。
|
|||
|
|
|
|||
|
|
### 3. ROI位置错误
|
|||
|
|
|
|||
|
|
由于匹配到了底部的假logo,ROI计算错误,OCR只找到了报告编号440023010130。
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 解决方案
|
|||
|
|
|
|||
|
|
### 添加位置过滤
|
|||
|
|
|
|||
|
|
**修改文件**:`cma_extraction_template_primary.py`
|
|||
|
|
|
|||
|
|
**修改内容**:在模板匹配时,只考虑页面上半部分(0-60%高度)的匹配
|
|||
|
|
|
|||
|
|
```python
|
|||
|
|
# Get page dimensions for position filtering
|
|||
|
|
page_h, page_w = page_mask.shape[:2]
|
|||
|
|
# CMA logos are typically in the upper portion of the page (0-60% of height)
|
|||
|
|
max_y_position = int(page_h * 0.6)
|
|||
|
|
|
|||
|
|
for scale in scales:
|
|||
|
|
...
|
|||
|
|
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
|
|||
|
|
|
|||
|
|
# Position filtering: only consider matches in the upper portion
|
|||
|
|
match_center_y = max_loc[1] + resized_template.shape[0] // 2
|
|||
|
|
|
|||
|
|
# Skip matches in the bottom portion (likely footer logos)
|
|||
|
|
if match_center_y > max_y_position:
|
|||
|
|
continue
|
|||
|
|
|
|||
|
|
if max_val > best_confidence:
|
|||
|
|
# Update best match
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
**原因**:
|
|||
|
|
- CMA标志通常在报告顶部(标题区域)
|
|||
|
|
- 页面底部通常是页脚、日期、编号等信息
|
|||
|
|
- 真正的CMA logo应该在0-60%的页面高度范围内
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 预期效果(修复后)
|
|||
|
|
|
|||
|
|
### 修复前
|
|||
|
|
```
|
|||
|
|
Best match: Y=1437 (88.7% of page height) ← 页面底部
|
|||
|
|
ROI: 底部区域
|
|||
|
|
OCR结果: 440023010130 (报告编号) ← 错误
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### 修复后
|
|||
|
|
```
|
|||
|
|
Best match: Y=XXX (0-60% of page height) ← 页面顶部
|
|||
|
|
ROI: 顶部CMA标志右侧
|
|||
|
|
OCR结果: 210020349096 (正确CMA码) ← 正确
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 数字440023010130的来源
|
|||
|
|
|
|||
|
|
这串数字来自**PDF文本层**的报告编号:
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
No粤4400230101300071
|
|||
|
|
↑
|
|||
|
|
这是报告编号的一部分,不是CMA码
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
由于模板匹配找到了错误的位置(页面底部),OCR在这个区域只找到了报告编号,而不是真正的CMA码。
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 修改的文件
|
|||
|
|
|
|||
|
|
**cma_extraction_template_primary.py**
|
|||
|
|
- 第143-151行:添加位置过滤逻辑
|
|||
|
|
- 第169-198行:在匹配时检查Y坐标,跳过底部60%的匹配
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## 总结
|
|||
|
|
|
|||
|
|
| 问题 | 原因 | 解决方案 | 状态 |
|
|||
|
|
|------|------|---------|------|
|
|||
|
|
| 识别到440023010130 | 模板匹配找到页面底部的假logo | 只考虑页面上半部分(0-60%)的匹配 | ✅ 已修复 |
|
|||
|
|
| 找不到210020349096 | ROI在错误位置,OCR只找到报告编号 | 位置过滤后应该能找到正确位置 | ✅ 已修复 |
|
|||
|
|
|
|||
|
|
**修复后,系统应该能识别到正确的CMA码210020349096!**
|