transaction schema unification

In the ground since Fri Dec 20 2024

Last watered in Fri Dec 20 2024

Transaction Schema Unification Refactor

Overview

This document explains a refactoring that unified transaction schemas across domains, fixed a bug that dropped the movement_type field between AI extraction and validation, and improved the overall architecture of our financial document parsing system.

The Problem We Faced

Initial Issue

Users were getting validation errors when uploading bank statements:

Field required [type=missing, input_value={'date': '2025-04-07', 'd...202,06', 'category': ''}, input_type=dict]
raw_statement.transactions.0.movement_type

Root Cause Analysis

Through systematic debugging, we discovered the issue wasn't with AI extraction (OpenAI was correctly extracting movement_type), but with our data flow architecture.

Architecture Problems

1. Duplicate Transaction Schemas

We had two different transaction schemas serving the same purpose:

from pydantic import BaseModel

# AI Layer (app/core/ai/models/responses.py)
class TransactionData(BaseModel):
    date: str
    description: str
    amount: str
    category: str = ""
    # Missing: movement_type

# Statements Domain (app/domains/statements/schemas.py)
class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str  # Present here!
    category: str = ""

2. Data Loss During Conversion

The statements service was manually converting between transaction formats:

# Statements Service (BEFORE FIX)
"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "category": tx.category
        # Missing: movement_type! 🚨
    }
    for tx in financial_data.transactions
]

Result: Even though OpenAI extracted movement_type correctly, it was dropped during conversion.
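
The failure mode can be reproduced without any AI call. The following is a minimal, dependency-free sketch: a standard-library dataclass stands in for the statements-domain Pydantic model (an assumption made so the snippet runs anywhere), and the extracted transaction and its values are invented for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class StatementTransaction:
    """Dataclass stand-in for the statements-domain Pydantic model."""
    date: str
    description: str
    amount: str
    movement_type: str  # required, no default
    category: str = ""


# Hypothetical extracted transaction; note that movement_type IS present here.
tx = SimpleNamespace(
    date="2025-01-15",
    description="Card payment",
    amount="12.50",
    movement_type="expense",
    category="",
)

# Hand-written conversion that lists fields one by one and forgets one.
converted = {
    "date": tx.date,
    "description": tx.description,
    "amount": tx.amount,
    "category": tx.category,
}

try:
    StatementTransaction(**converted)
    error = None
except TypeError as exc:  # Pydantic would raise a ValidationError instead
    error = str(exc)

print(error)
```

The dataclass raises a TypeError naming movement_type, which plays the same role as Pydantic's "Field required" error: the value existed upstream and was lost in the hand-built dictionary.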

3. Wrong Domain Ownership

❌ BEFORE: AI Layer defines TransactionData
- AI concerns mixed with business logic
- Transaction schema owned by infrastructure layer
- Domains import from infrastructure

✅ AFTER: Transactions Domain defines TransactionData
- Business logic owns business schemas
- AI layer imports from domain
- Proper dependency direction

The Solution: Domain-Driven Schema Unification

Step 1: Move Schema to Correct Domain

Moved transaction schema to its rightful owner:

# app/domains/transactions/schemas.py
from pydantic import BaseModel, Field

class TransactionData(BaseModel):
    """
    Simplified transaction data for AI parsing and document processing.

    Used by:
    - AI providers (OpenAI, Ollama) for structured output
    - Statement and invoice processing
    - Document parsing workflows
    """
    date: str = Field(description="Transaction date in ISO format")
    description: str = Field(description="Complete transaction description")
    amount: str = Field(description="Transaction amount (without sign, preserve precision)")
    movement_type: str = Field(description="Movement type: 'income' | 'expense' | 'transfer' | 'investment' | 'other'")
    category: str = Field(default="", description="Transaction category (empty if not explicit)")
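
The description of movement_type lists five allowed values, but the field is typed as a plain str, so nothing enforces that list. One possible tightening, which is not part of the refactor described here, is sketched below with a standard-library dataclass standing in for the Pydantic model; with Pydantic, annotating the field with a Literal type has a similar effect. The MovementType alias and the __post_init__ check are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Hypothetical alias mirroring the values named in the field description.
MovementType = Literal["income", "expense", "transfer", "investment", "other"]


@dataclass
class TransactionData:
    """Dataclass stand-in for the Pydantic TransactionData model."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""

    def __post_init__(self) -> None:
        # Reject any value outside the documented set.
        allowed = get_args(MovementType)
        if self.movement_type not in allowed:
            raise ValueError(
                f"movement_type must be one of {allowed}, got {self.movement_type!r}"
            )


ok = TransactionData("2025-01-15", "Salary", "1000.00", "income")

try:
    TransactionData("2025-01-15", "Mystery", "5.00", "refund")
    rejected = False
except ValueError:
    rejected = True

print(ok.movement_type, rejected)
```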

Step 2: Update Dependencies

AI Layer now imports from transactions domain:

# app/core/ai/models/responses.py
from app.domains.transactions.schemas import TransactionData

# app/core/ai/models/__init__.py
from app.domains.transactions.schemas import TransactionData

Statements Domain uses same schema:

# app/domains/statements/schemas.py
from typing import List

from pydantic import BaseModel

from app.domains.transactions.schemas import TransactionData

class RawBankStatement(BaseModel):
    transactions: List[TransactionData]  # Same schema!

Step 3: Eliminate Redundant Conversion

BEFORE (manual conversion, with movement_type added back by hand):

"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "movement_type": tx.movement_type,  # Easy to forget!
        "category": tx.category
    }
    for tx in financial_data.transactions
]

AFTER (Direct Usage):

"transactions": financial_data.transactions  # No conversion needed!
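
If some boundary still needs plain dictionaries, deriving them from the model avoids listing fields by hand. A small sketch, again with a dataclass stand-in so it runs without dependencies: dataclasses.asdict plays the role that model_dump() plays in Pydantic v2. The sample transaction is invented.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class TransactionData:
    """Dataclass stand-in for the shared Pydantic schema."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


extracted = [
    TransactionData("2025-01-15", "Card payment", "12.50", "expense"),
]

# Every field comes along automatically, including ones added later.
payload = {"transactions": [asdict(tx) for tx in extracted]}

print(payload)
```

Because the dictionary is generated from the schema, a new field cannot be silently left out of the conversion.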

Technical Implementation

Data Flow (Fixed)

1. PDF Upload → Extract Text
2. Text → OpenAI API with TransactionData schema
3. OpenAI extracts movement_type correctly ✅
4. Returns List[TransactionData] with all fields ✅
5. Direct assignment to RawBankStatement ✅
6. Validation succeeds ✅

Key Changes Made

Educational Insights

Why This Refactor Was Necessary

Domain-Driven Design Principles:

Business concepts belong in business domains
Infrastructure should depend on domain, not vice versa
Avoid schema duplication across layers

Schema Evolution Problems:

Manual conversion is error-prone - easy to forget fields
Schema drift happens when definitions are duplicated
Maintenance overhead increases with multiple definitions

How We Debugged This

Systematic Approach:

Traced data flow from AI response to validation
Checked each transformation step for data loss
Identified the exact line where movement_type was dropped
Root cause analysis revealed architectural issue

The error was NOT in AI extraction (which worked perfectly), but in our data transformation logic.

Lessons Learned

Benefits Achieved

Immediate Fixes

✅ movement_type is preserved end to end - no more validation errors
✅ Cleaner codebase - eliminated redundant conversion logic
✅ Reduced complexity - fewer lines of code, fewer places to forget a field

Long-term Improvements

✅ Easier maintenance - single place to update transaction schema
✅ Automatic compatibility - changes flow through automatically
✅ Better architecture - proper domain boundaries
✅ Future-proof - easier to add new transaction fields

Performance Gains

✅ No unnecessary object creation during conversion
✅ Direct object usage reduces memory allocations
✅ Simpler code paths improve readability and performance

Best Practices Established

Schema Design

# ✅ Good: Single authoritative schema
app/domains/transactions/schemas.py:
  - TransactionData (for AI parsing)
  - TransactionBase (for domain operations)

# ❌ Avoid: Duplicate schemas in different layers
app/core/ai/models/responses.py: TransactionData
app/domains/statements/schemas.py: StatementTransaction

Dependency Management

# ✅ Good: Infrastructure depends on domain
from app.domains.transactions.schemas import TransactionData

# ❌ Avoid: Domain depends on infrastructure
from app.core.ai.models.responses import TransactionData
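
The dependency rule can also be checked mechanically. The sketch below is one possible approach, not something this refactor put in place: it parses a module's source with the standard-library ast module and reports imports from a forbidden package prefix. The helper name forbidden_imports is hypothetical; the app.core.ai prefix comes from the paths above.

```python
import ast


def forbidden_imports(source: str, forbidden_prefix: str) -> list[str]:
    """Return imported module names in `source` under `forbidden_prefix`."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue
        found.extend(
            name
            for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return found


# Source text of two imaginary domain modules.
domain_source = "from app.domains.transactions.schemas import TransactionData\n"
legacy_source = "from app.core.ai.models.responses import TransactionData\n"

print(forbidden_imports(domain_source, "app.core.ai"))
print(forbidden_imports(legacy_source, "app.core.ai"))
```

Run over every file under the domains package, a non-empty result would flag a domain module that depends on the infrastructure layer.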

Data Transformation

# ✅ Good: Direct usage of shared schemas
"transactions": financial_data.transactions

# ❌ Avoid: Manual conversion between identical structures
"transactions": [{"field": tx.field} for tx in items]

Future Considerations

Schema Evolution

Add new fields only to transactions domain
Changes automatically propagate to all consumers
Version migration can be handled in one place
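
As an illustration of adding a field in one place, the sketch below extends the shared schema with a defaulted field. The currency field is purely hypothetical, and a dataclass again stands in for the Pydantic model. Giving the new field a default is what keeps payloads produced before the change loadable.

```python
from dataclasses import dataclass


@dataclass
class TransactionData:
    """Dataclass stand-in for the shared schema, with one added field."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""
    currency: str = ""  # hypothetical new field; the default keeps old data valid


# A payload produced before the field existed (values invented).
old_payload = {
    "date": "2025-01-15",
    "description": "Card payment",
    "amount": "12.50",
    "movement_type": "expense",
    "category": "",
}

tx = TransactionData(**old_payload)
print(tx)
```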

Testing Strategy

Test schema consistency across domains
Validate data flow from AI to database
Monitor for schema drift in CI/CD
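
A schema-consistency check could look roughly like the sketch below, which compares the field names of two model classes and reports any drift. It uses dataclass stand-ins that mirror the two pre-refactor schemas; with Pydantic v2 models the field names would come from model_fields instead of dataclasses.fields. The names field_drift and AiTransaction are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class AiTransaction:
    """Stand-in for the old AI-layer schema (no movement_type)."""
    date: str
    description: str
    amount: str
    category: str = ""


@dataclass
class StatementTransaction:
    """Stand-in for the old statements-domain schema."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


def field_drift(first: type, second: type) -> dict[str, set[str]]:
    """Report field names present in only one of the two classes."""
    first_names = {f.name for f in fields(first)}
    second_names = {f.name for f in fields(second)}
    return {
        "only_in_first": first_names - second_names,
        "only_in_second": second_names - first_names,
    }


drift = field_drift(AiTransaction, StatementTransaction)
print(drift)
```

In a test suite, asserting that both sets are empty would have caught the missing movement_type before any statement was uploaded.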

Monitoring

Log schema field counts to detect missing fields
Track AI extraction success rates for new fields
Alert on validation failures during parsing
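
One way to make missing fields visible before validation fails is to count them on the raw parsed data and log a warning. This is a sketch under assumptions: the logger name, the REQUIRED_FIELDS tuple, and the missing_field_counts helper are all hypothetical, and the sample transactions are invented.

```python
import logging

logger = logging.getLogger("statement_parsing")  # hypothetical logger name

# Required fields of the shared schema (category has a default, so it is optional).
REQUIRED_FIELDS = ("date", "description", "amount", "movement_type")


def missing_field_counts(transactions: list[dict]) -> dict[str, int]:
    """Count, per required field, how many raw transactions lack it."""
    counts = {name: 0 for name in REQUIRED_FIELDS}
    for tx in transactions:
        for name in REQUIRED_FIELDS:
            if name not in tx:
                counts[name] += 1
    missing = {name: n for name, n in counts.items() if n}
    if missing:
        logger.warning("transactions missing required fields: %s", missing)
    return missing


raw = [
    {"date": "2025-01-15", "description": "Salary", "amount": "1000.00",
     "movement_type": "income"},
    {"date": "2025-01-16", "description": "Card payment", "amount": "12.50"},
]

report = missing_field_counts(raw)
print(report)
```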

Conclusion

This refactor demonstrates the importance of proper domain-driven design and schema management. By moving transaction schemas to their rightful domain and eliminating redundant conversions, we:

Fixed immediate bugs (movement_type dropped during conversion)
Improved system architecture (proper domain boundaries)
Reduced future maintenance (single source of truth)
Enhanced debuggability (cleaner data flow)

The lesson: Architecture problems often manifest as data transformation bugs. When debugging, look beyond the immediate error to understand the underlying structural issues.