fix(test): remove seal suffixes from institution names before matching

Extend institution name cleaning to handle OCR artifacts from seal text
that gets merged with company names during extraction.

Problem:
- 3 PDFs failed matching due to "检验检测专用章" (Seal for Inspection & Testing)
  being included in extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章"
           vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to ~60-67% → incorrectly classified as "no_match"
- Affected PDFs:
  * pages3-6.pdf: 60.87% similarity
  * pages7-14.pdf: 60.0% similarity
  * pages12-15.pdf: 62.5% similarity

Solution:
- Add seal suffix removal to clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"

Test Results:
All 4 test cases now achieve 100% similarity and "exact" match:
1. "检验检测专用章" suffix → 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company) → 65.00% → 100.00% ✓
3. "430334" suffix → 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined → 51.85% → 100.00% ✓

This fix complements the previous CMA code suffix removal and
significantly improves matching accuracy for seal-related OCR artifacts.

Co-Authored-By: Claude Code <noreply@anthropic.com>
This commit is contained in:
黄仁欢 2026-02-16 21:22:23 +08:00
parent 9f701edd25
commit f5981fdf72
1 changed files with 14 additions and 1 deletions

View File

@ -1511,7 +1511,7 @@ def extract_institution_from_crt(pdf_path: str) -> List[str]:
def clean_institution_name(text: str) -> str: def clean_institution_name(text: str) -> str:
""" """
清理机构名称移除末尾的数字CMA码等干扰内容 清理机构名称移除末尾的数字CMA码印章名称等干扰内容
Args: Args:
text: 原始机构名称 text: 原始机构名称
@ -1522,6 +1522,19 @@ def clean_institution_name(text: str) -> str:
if not text: if not text:
return text return text
# 移除常见的印章名称(不需要在末尾,可以移除任何位置的)
# 这处理"机构名称检验检测专用章"或"机构名称检验检测专用章123456"
seal_patterns = [
r'检验检测专用章',
r'检测专用章',
r'检验专用章',
r'鉴定专用章',
r'公章',
r'专用章',
]
for pattern in seal_patterns:
text = text.replace(pattern, '')
# 移除末尾的数字序列如CMA码 # 移除末尾的数字序列如CMA码
text = re.sub(r'\d{6,}$', '', text) # 6位及以上数字 text = re.sub(r'\d{6,}$', '', text) # 6位及以上数字
text = re.sub(r'\d{11,}$', '', text) # 11位及以上数字CMA码 text = re.sub(r'\d{11,}$', '', text) # 11位及以上数字CMA码