fix(test): remove seal suffixes from institution names before matching
Extend institution name cleaning to handle OCR artifacts from seal text
that gets merged with company names during extraction.
Problem:
- 3 PDFs failed matching because "检验检测专用章" (Seal for Inspection & Testing)
  was included in the extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章"
  vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to roughly 60-67%, so the names were incorrectly classified as "no_match"
- Affected PDFs:
  * pages3-6.pdf: 60.87% similarity
  * pages7-14.pdf: 60.0% similarity
  * pages12-15.pdf: 62.5% similarity
Solution:
- Add seal suffix removal to clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"
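A minimal Python sketch of the cleaning step described above, for illustration only. The seal list and the 6-digit regex mirror the ones in this commit's diff; the names `strip_seal_artifacts` and `SEAL_NAMES` are hypothetical, and the sketch is not the repository's actual function.

```python
import re

# Seal names listed in this commit, longest first so that the generic
# '专用章' entry cannot leave a partial prefix such as '检验检测' behind.
SEAL_NAMES = [
    '检验检测专用章',
    '检测专用章',
    '检验专用章',
    '鉴定专用章',
    '公章',
    '专用章',
]


def strip_seal_artifacts(text: str) -> str:
    """Hypothetical stand-in for clean_institution_name()."""
    if not text:
        return text
    # Plain substring replacement removes a seal name wherever it occurs.
    for seal in SEAL_NAMES:
        text = text.replace(seal, '')
    # Then drop a trailing run of six or more digits (e.g. a CMA code).
    return re.sub(r'\d{6,}$', '', text)


name = '四川合泰与必摩适检测有限公司'
print(strip_seal_artifacts(name + '检验检测专用章') == name)
print(strip_seal_artifacts(name + '检验检测专用章430334') == name)
```

The order of the list matters: if '专用章' were replaced first, "检验检测专用章" would be reduced to "检验检测" and the longer entries would never match.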
Test Results:
All 4 test cases now achieve 100% similarity and "exact" match:
1. "检验检测专用章" suffix → 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company) → 65.00% → 100.00% ✓
3. "430334" suffix → 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined → 51.85% → 100.00% ✓
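The commit does not name the similarity metric. The before-fix percentages above are, however, consistent with a length-normalized Levenshtein similarity, 1 - distance / max(len(a), len(b)). The sketch below checks that arithmetic under this assumption, using the example name from the Problem section; `levenshtein` and `similarity` are hypothetical helpers, not functions from this repository.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed metric: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


name = '四川合泰与必摩适检测有限公司'  # 14 characters
seal = '检验检测专用章'              # 7 characters

print(f"{similarity(name + seal, name):.2%}")             # 14/21
print(f"{similarity(name + '430334', name):.2%}")         # 14/20
print(f"{similarity(name + seal + '430334', name):.2%}")  # 14/27
```

Under this assumption the three ratios come out as 66.67%, 70.00% and 51.85%, the same figures listed for cases 1, 3 and 4. That agreement suggests, but does not confirm, that the matcher uses a metric of this form.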
This fix complements the previous CMA code suffix removal and
addresses the seal-related OCR artifacts behind the failures described above.
Co-Authored-By: Claude Code <noreply@anthropic.com>
parent 9f701edd25
commit f5981fdf72
@@ -1511,7 +1511,7 @@ def extract_institution_from_crt(pdf_path: str) -> List[str]:
 def clean_institution_name(text: str) -> str:
     """
-    Clean an institution name by removing trailing digits, CMA codes, and other noise
+    Clean an institution name by removing trailing digits, CMA codes, seal names, and other noise
 
     Args:
         text: the raw institution name
 
@@ -1522,6 +1522,19 @@ def clean_institution_name(text: str) -> str:
     if not text:
         return text
 
+    # Remove common seal names (they need not be at the end; any occurrence is removed)
+    # This handles "<institution name>检验检测专用章" or "<institution name>检验检测专用章123456"
+    seal_patterns = [
+        r'检验检测专用章',
+        r'检测专用章',
+        r'检验专用章',
+        r'鉴定专用章',
+        r'公章',
+        r'专用章',
+    ]
+    for pattern in seal_patterns:
+        text = text.replace(pattern, '')
+
     # Remove a trailing digit sequence (such as a CMA code)
     text = re.sub(r'\d{6,}$', '', text)  # 6 or more digits
     text = re.sub(r'\d{11,}$', '', text)  # 11 or more digits (CMA code)