# Checkpoint Feature Safety Analysis Report

## 🚨 Critical Bug Summary

### 1. Data Loss Risk - CRITICAL
**Location**: `_restore_table()` (lines 380-409)

**Problem**:
```python
truncate_sql = f"TRUNCATE TABLE {table_name} CASCADE"
cur.execute(truncate_sql)
```

**Risks**:
- `TRUNCATE` wipes every existing row in the table; once it is committed, the data is **permanently gone**
- `CASCADE` also truncates every table that references this one through a foreign key
- If the restore fails after the truncation has been committed, the original data is already lost and cannot be recovered

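### 2. Wrong Table Dependency Order - HIGH
**Problem**: Tables are restored without regard to foreign-key dependencies
- If a child table is restored before its parent, its rows fail the foreign-key check because the parent rows do not exist yet; truncating the parent afterwards with `CASCADE` would also wipe the rows just restored
- The current code assumes every table can simply be truncated on its own, which is not the case when other tables reference it
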
### 3. Concurrency Safety - HIGH
**Problem**: The restore takes no database locks
- Other sessions can write data while the restore is running
- This leads to inconsistent data or a failed restore

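### 4. Transaction Management - MEDIUM
**Problem**: `create_checkpoint` uses a separate transaction for each table
- When the backup of one table fails, the tables already backed up are not rolled back
- The resulting checkpoint can be incomplete or internally inconsistent
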
## Detailed Analysis

### Bug #1: Dangerous TRUNCATE CASCADE
```python
def _restore_table(conn, table_name, data):
    # Dangerous: empties the table outright!
    truncate_sql = f"TRUNCATE TABLE {table_name} CASCADE"
    cur.execute(truncate_sql)
```
**Impact**:
- Suppose table A has a foreign key pointing to table B
- Truncating table B with `CASCADE` also truncates table A, removing all of its rows and not only the ones that reference B
- Even if table B is restored afterwards, table A's data stays lost unless table A is itself restored from the checkpoint

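
The failure mode above goes away when the delete and the re-insert share one transaction. A minimal sketch follows, using Python's built-in `sqlite3` as a stand-in so that it runs without a server. SQLite has no `TRUNCATE`, so `DELETE` is used instead; the `restore_table_atomically` helper and the two-column `regions` layout are assumptions made for illustration, not the project's code. In PostgreSQL, `TRUNCATE` is likewise undone if the surrounding transaction does not commit.

```python
import sqlite3


def restore_table_atomically(conn, table, rows):
    """Replace the contents of `table` with `rows`, or leave it untouched.

    `rows` is a list of (id, name) tuples; the two-column layout is an
    assumption made for this example.
    """
    try:
        # sqlite3 opens a transaction implicitly before DELETE/INSERT.
        conn.execute(f'DELETE FROM "{table}"')
        conn.executemany(f'INSERT INTO "{table}" (id, name) VALUES (?, ?)', rows)
        conn.commit()
        return True
    except sqlite3.Error:
        # Undo the DELETE as well, so the original rows survive.
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO regions VALUES (1, 'north')")
conn.commit()

# The second row violates NOT NULL, so the whole restore is rolled back.
ok = restore_table_atomically(conn, "regions", [(1, "south"), (2, None)])
remaining = conn.execute("SELECT id, name FROM regions").fetchall()
```

After the failed call, `remaining` still holds the original row, which is the property the current `_restore_table()` lacks.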
### Bug #2: No Topological Sort of Table Dependencies
Foreign-key dependencies between the PostgreSQL tables:
```
regions (parent table)
  ├── region_themes
  ├── region_scopes
  ├── region_theme_permits
  ├── region_permit_risks
  └── region_permit_details
```

**Correct restore order**:
1. First restore the tables that have no foreign-key dependencies (regions, themes, business_scopes, permits, risks)
2. Then restore the tables that reference other tables (region_themes, region_scopes, region_theme_permits, etc.)

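
The ordering rule can be sketched with the standard-library `graphlib` module. The mapping below contains only the `regions` edges from the tree in this section; the real schema may have more edges, so treat the dictionary as illustrative data rather than the actual dependency graph.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Child table -> set of parent tables it references.
# Only the edges shown in the tree in this section are included.
fk_parents = {
    "region_themes": {"regions"},
    "region_scopes": {"regions"},
    "region_theme_permits": {"regions"},
    "region_permit_risks": {"regions"},
    "region_permit_details": {"regions"},
}

# static_order() yields each table only after all of its parents.
restore_order = list(TopologicalSorter(fk_parents).static_order())

# Emptying the tables should go the other way: children first.
truncate_order = list(reversed(restore_order))
```

Here `regions` comes first in `restore_order` and last in `truncate_order`; the relative order of the child tables is not significant.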
### Bug #3: Missing Table Locks
The restore should run inside a transaction that locks each table first:
```sql
BEGIN;
LOCK TABLE table_name IN EXCLUSIVE MODE;
-- restore data
COMMIT;
```

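
When several tables are locked, taking the locks in a fixed order helps avoid deadlocks between concurrent restores, and quoting the names avoids interpolating raw strings into SQL. The sketch below builds a single `LOCK TABLE` statement; `quote_ident` and `build_lock_statement` are hypothetical helpers, not existing functions in the code base. Note that `EXCLUSIVE` mode still allows plain reads, and that quoted identifiers are case-sensitive in PostgreSQL.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier by doubling embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_lock_statement(tables):
    """Build one LOCK TABLE statement covering every table, in sorted order."""
    if not tables:
        raise ValueError("at least one table is required")
    # Sorting gives every caller the same lock order; set() drops duplicates.
    targets = ", ".join(quote_ident(t) for t in sorted(set(tables)))
    return f"LOCK TABLE {targets} IN EXCLUSIVE MODE"


statement = build_lock_statement(["regions", "region_themes"])
```

The resulting string must be executed inside an open transaction, since PostgreSQL releases table locks when the transaction ends.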
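## Fix Proposals

### Proposal 1: Safe Restore Flow
1. **Automatic backup of the current state** - create a temporary checkpoint before restoring
2. **Topological ordering of tables** - restore in foreign-key dependency order, parents before children
3. **Table-level locks** - prevent concurrent writes
4. **Single transaction** - everything succeeds or everything fails
5. **Fallback mechanism** - automatically fall back to the backup if the restore fails
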
### Proposal 2: Safe Restore Implementation
```python
from typing import Any, Dict


def restore_checkpoint_safe(checkpoint_id: str) -> Dict[str, Any]:
    """Restore a checkpoint safely, taking an automatic backup first."""
    # 1. Back up the current state (named auto_backup_before_restore_<checkpoint_id>)
    auto_backup = create_checkpoint(f"auto_backup_before_restore_{checkpoint_id}")

    # Load the checkpoint to restore. `_load_checkpoint` is a hypothetical
    # helper: the original snippet used `checkpoint_data` without defining it.
    checkpoint_data = _load_checkpoint(checkpoint_id)

    # 2. Get the tables in topological (parents-first) order
    ordered_tables = _get_tables_topological_order()

    # 3. Open a single transaction
    with _lic_pg_conn(autocommit=False) as conn:
        try:
            # 4. Lock every table. Assumes a DB-API style connection,
            #    where statements are executed through a cursor.
            with conn.cursor() as cur:
                for table in ordered_tables:
                    cur.execute(f"LOCK TABLE {table} IN EXCLUSIVE MODE")

            # 5. Restore in dependency order
            for table in ordered_tables:
                data = checkpoint_data["tables"].get(table, [])
                _restore_table_safe(conn, table, data)

            # 6. Commit
            conn.commit()
            return {"status": "success", "restored_from": checkpoint_id}

        except Exception as e:
            # 7. Roll back
            conn.rollback()
            # Optional: restore from auto_backup automatically
            return {"status": "error", "message": str(e)}
```

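
`_restore_table_safe` is referenced above but not shown. One piece it would need is a parameterized `INSERT` built from the checkpoint rows. The sketch below assumes each table is stored as a list of dicts keyed by column name, which the report does not confirm, and uses the `%s` placeholder style of psycopg; `build_insert` is a hypothetical helper.

```python
def build_insert(table, rows):
    """Return (sql, params) for inserting `rows` (a list of dicts) into `table`.

    Uses the `%s` placeholder style of psycopg; adjust for other drivers.
    """
    if not rows:
        return None, []
    columns = sorted(rows[0])
    for row in rows:
        # Every row must carry exactly the same set of columns.
        if sorted(row) != columns:
            raise ValueError(f"inconsistent columns in checkpoint data for {table}")
    col_list = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    quoted_table = '"' + table.replace('"', '""') + '"'
    sql = f"INSERT INTO {quoted_table} ({col_list}) VALUES ({placeholders})"
    # Values are passed separately so the driver handles escaping.
    params = [tuple(row[c] for c in columns) for row in rows]
    return sql, params


sql, params = build_insert(
    "regions", [{"id": 1, "name": "north"}, {"name": "south", "id": 2}]
)
```

The pair can then be passed to a cursor's `executemany(sql, params)` inside the restore transaction.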
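### Proposal 3: Building the Table Dependency Graph
```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import List


def _get_tables_topological_order() -> List[str]:
    """Return tables that take part in foreign keys, parents before children."""
    sql = """
        SELECT
            tc.table_name,
            -- ccu.table_name is the table being referenced (the parent)
            array_agg(DISTINCT ccu.table_name::text) AS depends_on
        FROM information_schema.table_constraints tc
        JOIN information_schema.key_column_usage kcu
            ON tc.constraint_name = kcu.constraint_name
            AND tc.table_schema = kcu.table_schema
        JOIN information_schema.constraint_column_usage ccu
            ON tc.constraint_name = ccu.constraint_name
            AND tc.table_schema = ccu.table_schema
        WHERE tc.constraint_type = 'FOREIGN KEY'
            AND tc.table_schema = 'public'
        GROUP BY tc.table_name
        ORDER BY tc.table_name
    """
    # Assumes a DB-API style connection whose cursor returns plain tuples.
    with _lic_pg_conn(autocommit=False) as conn, conn.cursor() as cur:
        cur.execute(sql)
        # Drop self-references so they are not reported as cycles.
        graph = {table: set(parents) - {table} for table, parents in cur.fetchall()}
    # static_order() yields each table after all the tables it references.
    # Tables with no foreign keys in either direction are not included.
    return list(TopologicalSorter(graph).static_order())
```
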
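
A topological order only exists when the foreign keys form no cycle. If the schema ever contains mutually referencing tables, the sort has to report that instead of returning a partial order. A small sketch, with `order_or_report_cycle` as a hypothetical helper and `a`/`b` as placeholder table names:

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def order_or_report_cycle(fk_parents):
    """Return (order, None) on success or (None, cycle) if a cycle exists."""
    try:
        return list(TopologicalSorter(fk_parents).static_order()), None
    except CycleError as exc:
        # The second element of args is the list of nodes forming the cycle;
        # its first and last entries are the same node.
        return None, exc.args[1]


order, cycle = order_or_report_cycle({"a": {"b"}, "b": {"a"}})
```

How to restore tables in such a cycle is a separate decision that this report does not cover.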
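## Immediate Fix Recommendations

### Fixes that can be made right away:
1. **Add a safety warning** - state clearly in the API documentation that restore is a dangerous operation
2. **Order the tables** - restore in dependency order
3. **Add table locks** - prevent concurrent writes
4. **Single transaction** - everything succeeds or everything fails
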
### Proposed new flow:
```
User calls restore
    ↓
1. Automatically create auto_backup (optional)
2. Determine the dependency order
3. Begin the transaction
4. Lock all tables
5. Restore table by table (TRUNCATE + INSERT)
6. Commit / roll back
7. Return the result
```

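## Testing Recommendations

Scenarios that need to be tested:
1. The normal restore flow
2. Server power loss during a restore
3. A restore while concurrent writes are in progress
4. Failure to restore some of the tables
5. Data integrity verification after a restore
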
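
For the integrity check in scenario 5, one option is to compare a row count and a content hash per table between the checkpoint and the restored database. The sketch below assumes rows are available as lists of dicts; `table_fingerprint` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import hashlib
import json


def table_fingerprint(rows):
    """Return (row_count, sha256 hex digest) for a list of row dicts.

    Rows are serialized with sorted keys and then sorted, so neither column
    order nor row order affects the result.
    """
    canonical = sorted(json.dumps(row, sort_keys=True, default=str) for row in rows)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()
    return len(rows), digest


checkpoint_rows = [{"id": 1, "name": "north"}, {"id": 2, "name": "south"}]
restored_rows = [{"name": "south", "id": 2}, {"name": "north", "id": 1}]
matches = table_fingerprint(checkpoint_rows) == table_fingerprint(restored_rows)
```

Values that JSON cannot represent, such as dates, are converted with `str`, so both sides must be read through the same driver for the comparison to be meaningful.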