# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
---
# LawRisk Backend - Development Guide
## Project Overview
**LawRisk** is a Flask-based Python backend service for intelligent legal compliance risk retrieval. It helps users find permits, licenses, and legal risks based on natural language queries using vector embeddings and LLM matching.
### Tech Stack
- **Framework**: Flask 2.3+
- **Database**: PostgreSQL (pg8000 driver)
- **AI Services**: Alibaba Cloud DashScope (text-embedding-v4, qwen-plus-latest)
- **Development**: Black, Ruff, Pytest
### Two-Database Architecture
1. **fs_law_risk**: Vector embeddings and subject-permit mappings
2. **licensing_risks**: Structured permit and risk data (regions, themes, compliance)
---
## Quick Reference
### Most Common Commands
```bash
# Run the application (port 8000)
python app.py
# Install dependencies
pip install -r requirements.txt
# Format and lint code
black .
ruff check .
# Run tests
pytest
pytest --cov=lawrisk
# Test API via curl
curl http://localhost:8000/healthz
curl -X POST "http://localhost:8000/fs-ai-asistant/api/workflow/lawrisk/v2" \
-d "query=我要办一家电影院&debug=1"
```
### Key File Locations
- `app.py` - Flask application entry point
- `lawrisk/api/v2.py` - V2 API routes (current)
- `lawrisk/services/lawrisk_v2_service.py` - Enhanced V2 search logic
- `lawrisk/services/licensing_repo.py` - Database operations
- `lawrisk/api/auth.py` - Authentication endpoints
- `static/v2_tester.html` - Web-based API testing interface
- `tests/test_auth.py` - Auth system tests
- `tests/test_checkpoint_security.py` - Checkpoint system tests
---
## Architecture & Code Structure
### Request Flow
```
HTTP Request
→ lawrisk/api/ (routing layer)
→ lawrisk/services/ (business logic)
→ lawrisk/services/licensing_repo.py (database access)
→ DashScope API (embeddings & LLM)
```
### Core Modules
**1. API Layer (`lawrisk/api/`)**
- `v1.py` - Legacy API (deprecated)
- `v2.py` - Current API with structured responses + admin endpoints
- `auth.py` - Authentication (login/logout/me endpoints)
**2. Services Layer (`lawrisk/services/`)**
- `lawrisk_service.py` - Core search with embeddings (cosine similarity) + LLM matching
- `lawrisk_v2_service.py` - Enhanced V2 with structured results, region filtering, direct permit matching
- `licensing_repo.py` - PostgreSQL operations (both databases), checkpoint management
- `auth_service.py` - User authentication, password hashing, seed admin creation
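The embedding search in `lawrisk_service.py` is described as ranking by cosine similarity. The following is a minimal, standard-library sketch of that idea, not the project's actual code; `cosine_similarity` and `rank_subjects` are hypothetical names, and the real service may compute similarity differently (for example inside PostgreSQL).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_subjects(query_vec, subjects, top=5):
    """Sort (name, vector) pairs by similarity to the query, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in subjects]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]
```

The `top` argument mirrors the `top` request parameter of the V2 search endpoint, which defaults to 5.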
**3. Middleware & Utils**
- `middleware/smart_cors_middleware.py` - Configurable CORS (wildcard, subdomains, NGINX mode)
- `utils/env_loader.py` - Environment variable loading
- `utils/export_risk_json.py` - Database export utility
- `utils/ingest_lawrisk.py` - Data ingestion with embeddings
---
## API Endpoints
### Public Endpoints
#### V2 Search (Current)
- **Path**: `/fs-ai-asistant/api/workflow/lawrisk/v2`
- **Method**: POST (recommended), GET
- **Params**:
  - `query` (required): User question
  - `region` (optional): Filter by region (e.g. 市级 for city level, 禅城区 for Chancheng District)
  - `debug` (optional): Enable debug output (1/true/yes/on)
  - `top` (optional): Number of recommendations (default: 5)
- **Returns**: Structured results with regions, themes, permits, risks
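For scripted testing, the same request the curl examples send can be built from Python. This is a sketch using only the standard library; `BASE_URL` assumes the local development server on port 8000, and `build_search_request` is a hypothetical helper, not part of the codebase.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: local server started with `python app.py`
SEARCH_PATH = "/fs-ai-asistant/api/workflow/lawrisk/v2"


def build_search_request(query, region=None, debug=False, top=None):
    """Build a form-encoded POST request for the V2 search endpoint."""
    params = {"query": query}
    if region:
        params["region"] = region
    if debug:
        params["debug"] = "1"
    if top is not None:
        params["top"] = str(top)
    # urlencode percent-encodes non-ASCII text as UTF-8, like `curl -d` form data
    body = urlencode(params).encode("utf-8")
    return Request(BASE_URL + SEARCH_PATH, data=body, method="POST")


request = build_search_request("我要办一家电影院", region="禅城区", debug=True, top=3)
```

With the app running, `urllib.request.urlopen(request)` sends the request; the response body is the structured JSON result described above.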
#### Supporting Endpoints
- `GET /fs-ai-asistant/api/workflow/lawrisk/v2/regions` - List all regions
- `GET /fs-ai-asistant/api/workflow/lawrisk/getPermits` - Get permits by region
- `GET /healthz` - Health check
### Authentication Endpoints
- `GET /fs-ai-asistant/lawrisk/login` - Login page (HTML)
- `POST /auth/login` - Authenticate user
- `GET /auth/me` - Get current user
- `GET /auth/logout` - Logout
### Admin Endpoints (Protected)
- `GET /fs-ai-asistant/api/workflow/lawrisk/admin/test` - Admin test
- `GET /fs-ai-asistant/api/workflow/lawrisk/admin/regions` - Region management
- `GET /fs-ai-asistant/api/workflow/lawrisk/admin/themes` - Theme management
- `GET /fs-ai-asistant/api/workflow/lawrisk/admin/permits` - Permit management
- `GET /fs-ai-asistant/api/workflow/lawrisk/admin/checkpoints` - Checkpoint management (create/list/restore/delete)
---
## Development Workflow
### Environment Setup
```bash
# Windows PowerShell
python -m venv .venv
.venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure .env with database credentials and DashScope API key
```
### Testing
```bash
# Run all tests
pytest
# Run with coverage
pytest --cov=lawrisk --cov-report=html
# Run specific test file
pytest tests/test_auth.py -v
# Test authentication
pytest tests/test_auth.py::test_login_success -v
```
**Test Files**:
- `tests/test_auth.py` - Authentication system tests (login, logout, session management)
- `tests/test_checkpoint_security.py` - Database checkpoint security tests
### Code Quality
```bash
# Format code
black .
# Lint with Ruff
ruff check .
# Check specific file
black lawrisk/services/lawrisk_v2_service.py
ruff check lawrisk/services/lawrisk_v2_service.py
```
### Data Management
```bash
# Export data from fs_law_risk database
python lawrisk/utils/export_risk_json.py
# Output: data/risk_tables_export.json
# Ingest data with embeddings (requires DASHSCOPE_API_KEY)
python lawrisk/utils/ingest_lawrisk.py
```
---
## Configuration
### Required Environment Variables (.env)
#### DashScope AI Services
```
DASHSCOPE_API_KEY=your_api_key
DASHSCOPE_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
DASHSCOPE_EMBED_MODEL=text-embedding-v4
DASHSCOPE_EMBED_DIM=1024
DASHSCOPE_MAX_BATCH=10
DASHSCOPE_CHAT_MODEL=qwen-plus-latest
```
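`DASHSCOPE_MAX_BATCH` suggests that texts are sent to the embedding API in groups of at most that size. A minimal sketch of such chunking, assuming that interpretation; `batched` is a hypothetical helper, and the actual ingestion code in `ingest_lawrisk.py` may differ.

```python
import os
from typing import Iterator


def batched(texts: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size]


# Fall back to 10, the value shown in the sample .env, when the variable is unset.
max_batch = int(os.environ.get("DASHSCOPE_MAX_BATCH", "10"))
```

Each yielded list would then be passed to one embedding call, so 23 texts with a batch size of 10 produce three calls.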
#### PostgreSQL Databases
```
# fs_law_risk (embeddings database)
PG_HOST=your_host
PG_PORT=5432
PG_USER=postgres
PG_PASSWORD=your_password
PG_DATABASE=fs_law_risk
PG_ADMIN_DB=postgres
# licensing_risks (structured data)
LIC_PG_HOST=your_host
LIC_PG_PORT=5432
LIC_PG_USER=postgres
LIC_PG_PASSWORD=your_password
LIC_PG_DATABASE=licensing_risks
```
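Because the two databases use parallel variable names that differ only by prefix (`PG_` and `LIC_PG_`), their connection settings can be collected with one helper. This is an illustrative sketch; `pg_settings` is a hypothetical function, and the dictionary keys are chosen to resemble common PostgreSQL driver keyword arguments rather than taken from the project.

```python
import os


def pg_settings(prefix: str, env=os.environ) -> dict:
    """Collect settings from variables named <prefix>HOST, <prefix>PORT, and so on."""
    return {
        "host": env[f"{prefix}HOST"],
        "port": int(env.get(f"{prefix}PORT", "5432")),
        "user": env[f"{prefix}USER"],
        "password": env[f"{prefix}PASSWORD"],
        "database": env[f"{prefix}DATABASE"],
    }


# Sample values only, so the sketch runs without a real .env file.
sample_env = {
    "PG_HOST": "db.example", "PG_USER": "postgres",
    "PG_PASSWORD": "secret", "PG_DATABASE": "fs_law_risk",
    "LIC_PG_HOST": "db.example", "LIC_PG_PORT": "5433", "LIC_PG_USER": "postgres",
    "LIC_PG_PASSWORD": "secret", "LIC_PG_DATABASE": "licensing_risks",
}
embeddings_db = pg_settings("PG_", sample_env)
licensing_db = pg_settings("LIC_PG_", sample_env)
```

A missing required variable raises `KeyError` immediately, which makes a misconfigured `.env` easier to spot than a failed connection later.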
#### Authentication
```
FLASK_SECRET_KEY=your-secret-key
LAWRISK_ADMIN_USERNAME=admin
LAWRISK_ADMIN_PASSWORD=adminpassword123
# Optional: LAWRISK_ADMIN_ROLE, LAWRISK_ADMIN_GRADE, LAWRISK_ADMIN_DISPLAY_NAME
```
#### Search Thresholds
```
LAWRISK_RETURN_IF_GE=0.7 # Return results if similarity >= 0.7
LAWRISK_FALLBACK_GT=0.5 # Use fallback if similarity > 0.5
```
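Read together, the two thresholds define three score bands. The sketch below shows one way to express that decision; the function name `decide` and the labels `"return"`, `"fallback"`, and `"no_match"` are hypothetical, and what the fallback path actually does is defined by the service code, not here.

```python
import os

# Defaults mirror the sample values above.
RETURN_IF_GE = float(os.environ.get("LAWRISK_RETURN_IF_GE", "0.7"))
FALLBACK_GT = float(os.environ.get("LAWRISK_FALLBACK_GT", "0.5"))


def decide(similarity: float,
           return_if_ge: float = RETURN_IF_GE,
           fallback_gt: float = FALLBACK_GT) -> str:
    """Map a similarity score to one of three outcomes."""
    if similarity >= return_if_ge:
        return "return"      # confident match: return results directly
    if similarity > fallback_gt:
        return "fallback"    # borderline match: take the fallback path
    return "no_match"        # too dissimilar to use
```

Note the asymmetry in the comparisons: a score exactly equal to the upper threshold is returned, while a score exactly equal to the lower threshold is not eligible for fallback.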
---
## Database Schema
### fs_law_risk (Vector Embeddings)
- **`law_sub`** - Subject matter with embeddings (id, name, vector)
- **`law_sub_per`** - Subject-permit mappings (sub_id, per_ids)
- **`law_per`** - Permit information (id, name, risk_ids)
### licensing_risks (Structured Compliance)
- **`regions`** - Administrative areas
- **`themes`** - Legal themes/subjects
- **`permits`** - License/permit items
- **`risks`** - Risk information (content, legal_basis, document_no, summary)
- **`business_scopes`** - Business scope definitions
- **Junction tables**: region_themes, region_theme_permits, region_permit_risks
### Checkpoint System
`licensing_repo.py` implements database checkpoint management:
- `create_checkpoint()` - Create database backup
- `list_checkpoints()` - List available backups
- `restore_checkpoint()` - **DANGEROUS** - Restore from checkpoint
- `delete_checkpoint()` - Remove old checkpoints
---
## Security Guidelines
### Critical Security Notes
- **NEVER commit secrets** - All credentials in `.env` or environment variables
- **Protect admin endpoints** - `/admin/*` should be restricted in production
- **Checkpoint restore is dangerous** - It is a destructive database operation and is gated by a confirmation flow
- **API keys externalized** - `DASHSCOPE_API_KEY` and database passwords must be in `.env`
### Authentication System
- Session-based auth using Flask sessions
- Password hashing with `passlib`
- First admin auto-created from environment variables on startup
- Role-based access (admin, reviewer, analyst, etc.)
- Login page: `/fs-ai-asistant/lawrisk/login`
- Protected endpoints use `@login_required` decorator
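To illustrate how a `@login_required` decorator guards an endpoint, here is a framework-free sketch. It is not the implementation in `lawrisk/api/auth.py`: a plain dictionary stands in for the Flask session so the example runs without Flask, and the `"user"` session key and the `list_checkpoints` view are assumptions made for illustration.

```python
from functools import wraps

# Stand-in for flask.session so the sketch runs without Flask installed.
session: dict = {}


def login_required(view):
    """Reject the call with a 401-style result unless a user is in the session."""
    @wraps(view)
    def wrapper(*args, **kwargs):
        if "user" not in session:  # hypothetical session key
            body = {"success": False, "message": "authentication required", "data": {}}
            return body, 401
        return view(*args, **kwargs)
    return wrapper


@login_required
def list_checkpoints():
    """Hypothetical protected view returning a (body, status) pair."""
    return {"success": True, "message": "ok", "data": {"checkpoints": []}}, 200
```

`functools.wraps` preserves the wrapped function's name, which matters in Flask because endpoint names default to the view function's name.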
---
## Recent Features (from git log)
### Checkpoint System (Recent)
- Database backup/restore functionality
- Timeline view of checkpoints
- Progress indicators for restore operations
- Security tests in `test_checkpoint_security.py`
### Permit Risk Snapshot
- Workflow for permit risk snapshots
- Unified snapshot and checkpoint timeline
- Enhanced batch display for snapshots
### Licensing Import Enhancement
- Optimized district/region merging during import
- Enhanced source display for permits
---
## Testing Guidelines
### Test Structure
```
tests/
├── __init__.py
├── test_auth.py # Auth system tests (login, session, decorators)
└── test_checkpoint_security.py # Checkpoint security tests
```
### Running Tests
```bash
# All tests
pytest
# Verbose output
pytest -v
# Coverage report
pytest --cov=lawrisk --cov-report=term-missing
# Specific test
pytest tests/test_auth.py::test_login_success -v
```
### Manual Testing
1. Start app: `python app.py`
2. Open browser: `static/v2_tester.html`
3. Test queries:
   - "我要办一家电影院" ("I want to open a cinema")
   - "开办旅馆需要哪些许可" ("Which permits are needed to open a hotel?")
   - With region filter and debug mode
---
## Troubleshooting
### Common Issues
#### Database Connection
```bash
# Verify database is accessible (opens an interactive psql session)
psql -h "$PG_HOST" -U "$PG_USER" -d "$PG_DATABASE"
# Check tables exist (SQL, run at the psql prompt rather than in the shell)
SELECT COUNT(*) FROM fs_law_risk.law_sub;
SELECT COUNT(*) FROM licensing_risks.regions;
```
#### API Errors
```bash
# Test health check
curl http://localhost:8000/healthz
# Test V2 API with debug
curl -X POST "http://localhost:8000/fs-ai-asistant/api/workflow/lawrisk/v2" \
-d "query=电影院&debug=1"
# Check app logs for registered routes
python app.py 2>&1 | grep "Registered routes"
```
#### Missing Embeddings
```bash
# Check if embeddings exist (SQL, run at the psql prompt)
SELECT id, name FROM fs_law_risk.law_sub LIMIT 5;
# If empty, run ingestion (from the shell)
python lawrisk/utils/ingest_lawrisk.py
```
---
## Documentation Files
- **README.md** - Project overview and quick start
- **AGENTS.md** - Development guidelines, coding style, testing approach
- **docs/V2_API文档.md** - Detailed V2 API documentation
- **docs/API.md** - V1 API documentation (legacy)
- **docs/DB_GUIDE.md** - Database schema and query examples
- **docs/PRD.md** - Product requirements
- **docs/CLAUDE.md** - Detailed Claude Code guidance (comprehensive)
---
## Key Components Deep Dive
### V2 Service Architecture
`lawrisk_v2_service.py` implements:
- Structured response formatting
- Region filter normalization
- Direct permit name matching
- Markdown formatting for legal text
- Complex query execution pipeline with concurrency
### Authentication Flow
`lawrisk/api/auth.py` provides:
- Login page with redirect handling
- Session management
- `@login_required` decorator for protecting endpoints
- JSON vs HTML response handling (API vs browser)
### Checkpoint Security
`test_checkpoint_security.py` tests:
- Checkpoint creation authorization
- Restore operation security
- User permission validation
- Operation audit logging
---
## Best Practices
### Code Style
- **Black**: 100-character line length, Python 3.10+
- **Type Hints**: Use PEP 604 union types (`str | None`)
- **Imports**: Ruff-compatible, group by standard library → third-party → local
- **Naming**: snake_case (functions/variables), SCREAMING_SNAKE_CASE (constants), PascalCase (classes)
### Error Handling
- Graceful degradation on startup (errors surface on first request)
- Structured error responses: `{"success": false, "message": "error", "data": {}}`
- Logging to stdout with structured format
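The structured error shape above can be produced by a small helper so every endpoint fails the same way. A minimal sketch; `error_response` is a hypothetical name, not a function known to exist in the codebase.

```python
import json


def error_response(message: str, data=None) -> dict:
    """Build the standard error body: success flag, message, and a data object."""
    return {
        "success": False,
        "message": message,
        "data": data if data is not None else {},
    }


# ensure_ascii=False keeps Chinese messages readable in the serialized output.
payload = json.dumps(error_response("query is required"), ensure_ascii=False)
```

Here `payload` is the string `{"success": false, "message": "query is required", "data": {}}`.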
### Configuration
- Use `lawrisk.utils.env_loader` for environment variables
- Default values for non-critical configs
- Environment-specific overrides supported
---
## Health Checks
```bash
# Basic health
curl http://localhost:8000/healthz
# Check regions
curl http://localhost:8000/fs-ai-asistant/api/workflow/lawrisk/v2/regions
# Test search
curl -X POST http://localhost:8000/fs-ai-asistant/api/workflow/lawrisk/v2 \
-d "query=电影院&debug=1"
```
View app startup logs to see all registered routes.