fix(test): remove seal suffixes from institution names before matching

Extend institution name cleaning to handle OCR artifacts from seal text that gets merged with company names during extraction.

Problem:
- 3 PDFs failed matching because "检验检测专用章" (Seal for Inspection & Testing) was included in the extracted institution names
- Example: "四川合泰与必摩适检测有限公司检验检测专用章" vs "四川合泰与必摩适检测有限公司"
- Similarity dropped to ~60-67% → incorrectly classified as "no_match"
- Affected PDFs:
  * pages3-6.pdf: 60.87% similarity
  * pages7-14.pdf: 60.0% similarity
  * pages12-15.pdf: 62.5% similarity

Solution:
- Add seal suffix removal to the clean_institution_name() function
- Remove common seal names: 检验检测专用章, 检测专用章, 检验专用章, etc.
- Use string replacement (not regex) to handle middle-of-text occurrences
- Apply before number removal to handle combined artifacts like "专用章123456"

Test Results:
All 4 test cases now achieve 100% similarity and an "exact" match:
1. "检验检测专用章" suffix → 66.67% → 100.00% ✓
2. "检验检测专用章" suffix (different company) → 65.00% → 100.00% ✓
3. "430334" suffix → 70.00% → 100.00% ✓
4. "检验检测专用章430334" combined → 51.85% → 100.00% ✓

This fix complements the previous CMA code suffix removal and significantly improves matching accuracy for seal-related OCR artifacts.

Co-Authored-By: Claude Code <noreply@anthropic.com>
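The commit does not say which similarity metric the matcher uses. As a rough sketch, assuming a normalized edit distance (1 - Levenshtein distance / length of the longer string), the "before" percentages reported for test cases 1, 3 and 4 can be reproduced. The `levenshtein` and `similarity` helpers below are hypothetical and are not taken from test_accuracy_batch_full.py.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """ASSUMED metric: 1 - distance / max length, as a percentage."""
    if not a and not b:
        return 100.0
    longest = max(len(a), len(b))
    return round(100.0 * (1 - levenshtein(a, b) / longest), 2)


expected = "四川合泰与必摩适检测有限公司"  # 14 characters
print(similarity(expected + "检验检测专用章", expected))        # 66.67
print(similarity(expected + "430334", expected))                # 70.0
print(similarity(expected + "检验检测专用章430334", expected))  # 51.85
```

Under this assumption every stray suffix character costs one edit, so a 7-character seal name appended to a 14-character company name already removes a third of the score.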
2026-02-16 21:22:23 +08:00 · 2026-02-16 21:22:23 +08:00 · f5981fdf72
parent 9f701edd25
commit f5981fdf72
1 changed file with 14 additions and 1 deletion
--- a/test_accuracy_batch_full.py
+++ b/test_accuracy_batch_full.py
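@@ -1511,7 +1511,7 @@ def extract_institution_from_crt(pdf_path: str) -> List[str]:
 def clean_institution_name(text: str) -> str:
     """
-    Clean the institution name, removing trailing digits, CMA codes, and other noise
+    Clean the institution name, removing trailing digits, CMA codes, seal names, and other noise
     Args:
         text: raw institution name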
@@ -1522,6 +1522,19 @@ def clean_institution_name(text: str) -> str:
     if not text:
         return text
+    # Remove common seal names (they need not be at the end; any occurrence is removed)
+    # This handles "<institution name>检验检测专用章" or "<institution name>检验检测专用章123456"
+    seal_patterns = [
+        r'检验检测专用章',
+        r'检测专用章',
+        r'检验专用章',
+        r'鉴定专用章',
+        r'公章',
+        r'专用章',
+    ]
+    for pattern in seal_patterns:
+        text = text.replace(pattern, '')
     # Remove trailing digit sequences (e.g. CMA codes)
     text = re.sub(r'\d{6,}$', '', text)  # 6 or more digits
     text = re.sub(r'\d{11,}$', '', text)  # 11 or more digits (CMA code)
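To exercise the cleaning order in the diff in isolation, here is a minimal, self-contained sketch. `clean_name_sketch` and `SEAL_NAMES` are illustrative stand-ins and not the real clean_institution_name(), which may do more than is visible in this hunk. The sketch keeps the two properties the diff relies on: seal names are tried longest first, and they are stripped before the trailing-digit regex runs.

```python
import re

# Longest names first: if the generic "专用章" were tried first, it would strip
# only part of "检验检测专用章" and leave "检验检测" behind in the name.
SEAL_NAMES = [
    "检验检测专用章",
    "检测专用章",
    "检验专用章",
    "鉴定专用章",
    "公章",
    "专用章",
]


def clean_name_sketch(text: str) -> str:
    """Illustrative stand-in: strip seal names anywhere, then trailing digits."""
    if not text:
        return text
    for seal in SEAL_NAMES:
        # Plain substring replacement, so the seal name may sit mid-string.
        text = text.replace(seal, "")
    # Runs after seal removal, so digits that preceded a seal are now trailing.
    return re.sub(r"\d{6,}$", "", text)


name = "四川合泰与必摩适检测有限公司"
samples = [
    name + "检验检测专用章",
    name + "430334",
    name + "检验检测专用章430334",
    name + "430334检验检测专用章",  # digits before the seal text
]
for raw in samples:
    print(clean_name_sketch(raw) == name)
```

The last sample is where the order matters most: if the digit regex ran first, "430334" would not be at the end of the string yet and would survive the cleaning.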