In the ground since Fri Dec 20 2024
Last watered in Fri Dec 20 2024
Transaction Schema Unification Refactor
Overview
This document explains a critical refactoring that unified transaction schemas across domains, fixed a bug that dropped movement_type from extracted transactions, and improved the overall architecture of our financial document parsing system.
The Problem We Faced
Initial Issue
Users were getting validation errors when uploading bank statements:
Field required [type=missing, input_value={'date': '2025-04-07', 'd...202,06', 'category': ''}, input_type=dict]
raw_statement.transactions.0.movement_type
The error indicated that movement_type was missing from transactions, but
our AI prompts were clearly asking for this field.
Root Cause Analysis
Through systematic debugging, we discovered the issue wasn't with AI extraction (OpenAI was correctly extracting movement_type), but with our data flow architecture.
Architecture Problems
1. Duplicate Transaction Schemas
We had two different transaction schemas serving the same purpose:
# AI Layer (app/core/ai/models/responses.py)
from pydantic import BaseModel

class TransactionData(BaseModel):
    date: str
    description: str
    amount: str
    category: str = ""
    # Missing: movement_type

# Statements Domain (app/domains/statements/schemas.py)
from pydantic import BaseModel

class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str  # Present here!
    category: str = ""
Problems:
- ❌ Schema Drift: Changes to one weren't reflected in the other
- ❌ Inconsistent Fields: movement_type existed in one but not the other
- ❌ Maintenance Overhead: Two places to update for transaction changes
- ❌ Domain Boundary Violation: AI layer defining business schemas
2. Data Loss During Conversion
The statements service was manually converting between transaction formats:
# Statements Service (BEFORE FIX)
"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "category": tx.category,
        # Missing: movement_type! 🚨
    }
    for tx in financial_data.transactions
]
Result: Even though OpenAI extracted movement_type correctly, it was dropped during conversion.
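The failure can be reproduced in isolation. The following is a minimal sketch, assuming Pydantic v2 (which matches the error format above); the class names ExtractedTransaction and RawStatement, the description text, and the "expense" value are hypothetical stand-ins, not the project's actual code.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class ExtractedTransaction(BaseModel):
    """Hypothetical stand-in for what the AI layer returned (movement_type included)."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


class StatementTransaction(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str  # required, so a dropped key fails validation
    category: str = ""


class RawStatement(BaseModel):
    """Hypothetical, trimmed-down statement model."""
    transactions: List[StatementTransaction]


# Illustrative sample data; only the date and amount come from the error message.
extracted = [
    ExtractedTransaction(
        date="2025-04-07",
        description="Card payment",
        amount="202,06",
        movement_type="expense",
    )
]

# The manual conversion: movement_type is never copied into the dict.
converted = [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "category": tx.category,
    }
    for tx in extracted
]

try:
    RawStatement(transactions=converted)
    errors = []
except ValidationError as exc:
    errors = exc.errors()

# Each error carries the failure type and the path to the offending field.
for err in errors:
    print(err["type"], err["loc"])
```

The extracted object holds movement_type the whole time; the error only appears because the hand-written dict omits the key.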
3. Wrong Domain Ownership
❌ BEFORE: AI Layer defines TransactionData
- AI concerns mixed with business logic
- Transaction schema owned by infrastructure layer
- Domains import from infrastructure

✅ AFTER: Transactions Domain defines TransactionData
- Business logic owns business schemas
- AI layer imports from domain
- Proper dependency direction
The Solution: Domain-Driven Schema Unification
Step 1: Move Schema to Correct Domain
Moved transaction schema to its rightful owner:
# app/domains/transactions/schemas.py
from pydantic import BaseModel, Field

class TransactionData(BaseModel):
    """
    Simplified transaction data for AI parsing and document processing.

    Used by:
    - AI providers (OpenAI, Ollama) for structured output
    - Statement and invoice processing
    - Document parsing workflows
    """
    date: str = Field(description="Transaction date in ISO format")
    description: str = Field(description="Complete transaction description")
    amount: str = Field(description="Transaction amount (without sign, preserve precision)")
    movement_type: str = Field(description="Movement type: 'income' | 'expense' | 'transfer' | 'investment' | 'other'")
    category: str = Field(default="", description="Transaction category (empty if not explicit)")
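Because the field descriptions double as instructions for structured output, it can help to inspect the JSON Schema that Pydantic derives from the model. This is a sketch assuming Pydantic v2; whether the project's OpenAI and Ollama providers send exactly this schema is an assumption, not something the refactor notes confirm.

```python
from pydantic import BaseModel, Field


class TransactionData(BaseModel):
    date: str = Field(description="Transaction date in ISO format")
    description: str = Field(description="Complete transaction description")
    amount: str = Field(description="Transaction amount (without sign, preserve precision)")
    movement_type: str = Field(description="Movement type: 'income' | 'expense' | 'transfer' | 'investment' | 'other'")
    category: str = Field(default="", description="Transaction category (empty if not explicit)")


# JSON Schema generated from the model definition.
schema = TransactionData.model_json_schema()

# Fields without a default are listed as required; category has one, so it is optional.
required_fields = schema["required"]

# The Field description is carried into the schema for each property.
movement_type_hint = schema["properties"]["movement_type"]["description"]

print(required_fields)
print(movement_type_hint)
```

Since movement_type has no default, it shows up as required in the schema, which is why a transaction without it can never pass validation.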
Step 2: Update Dependencies
AI Layer now imports from transactions domain:
# app/core/ai/models/responses.py
from app.domains.transactions.schemas import TransactionData

# app/core/ai/models/__init__.py
from app.domains.transactions.schemas import TransactionData
Statements Domain uses same schema:
# app/domains/statements/schemas.py
from typing import List

from pydantic import BaseModel

from app.domains.transactions.schemas import TransactionData

class RawBankStatement(BaseModel):
    transactions: List[TransactionData]  # Same schema!
Step 3: Eliminate Redundant Conversion
BEFORE (Manual Conversion):
"transactions": [
    {
        "date": tx.date,
        "description": tx.description,
        "amount": tx.amount,
        "movement_type": tx.movement_type,  # Easy to forget!
        "category": tx.category,
    }
    for tx in financial_data.transactions
]
AFTER (Direct Usage):
"transactions": financial_data.transactions  # No conversion needed!
Technical Implementation
Data Flow (Fixed)
1. PDF Upload → Extract Text
2. Text → OpenAI API with TransactionData schema
3. OpenAI extracts movement_type correctly ✅
4. Returns List[TransactionData] with all fields ✅
5. Direct assignment to RawBankStatement ✅
6. Validation succeeds ✅
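Steps 4 to 6 of this flow can be sketched as follows, assuming Pydantic v2. FinancialData is a hypothetical container for the AI layer's parsed output, the models are trimmed to the fields discussed here, and the sample values are illustrative.

```python
from typing import List

from pydantic import BaseModel


class TransactionData(BaseModel):
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


class FinancialData(BaseModel):
    """Hypothetical container for what the AI layer returns."""
    transactions: List[TransactionData]


class RawBankStatement(BaseModel):
    transactions: List[TransactionData]  # same schema as the AI output


financial_data = FinancialData(
    transactions=[
        TransactionData(
            date="2025-04-07",
            description="Card payment",
            amount="202,06",
            movement_type="expense",
        )
    ]
)

# Direct assignment: the TransactionData objects are passed through as-is,
# so there is no hand-written field list that could omit a key.
statement = RawBankStatement.model_validate(
    {"transactions": financial_data.transactions}
)

dumped = statement.model_dump()
print(dumped["transactions"][0]["movement_type"])
```

Any field added to TransactionData later travels along this path without touching the statements service.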
Key Changes Made
Schema Unification
- Moved TransactionData to app/domains/transactions/schemas.py
- Removed StatementTransaction duplicate
- Updated all imports
Fixed Data Loss
- Eliminated manual transaction conversion
- Direct usage of TransactionData objects
- Preserved all fields automatically
Improved Architecture
- Proper domain ownership
- Correct dependency direction
- Single source of truth
Educational Insights
Why This Refactor Was Necessary
Domain-Driven Design Principles:
Business concepts belong in business domains
Infrastructure should depend on domain, not vice versa
Avoid schema duplication across layers
Schema Evolution Problems:
Manual conversion is error-prone - easy to forget fields
Schema drift happens when definitions are duplicated
Maintenance overhead increases with multiple definitions
How We Debugged This
Systematic Approach:
Traced data flow from AI response to validation
Checked each transformation step for data loss
Identified the exact line where movement_type was dropped
Root cause analysis revealed an architectural issue
The error was NOT in AI extraction (which worked perfectly), but in our data
transformation logic.
Lessons Learned
Schema Management:
- ✅ Single Source of Truth - one schema definition per concept
- ✅ Domain Ownership - business schemas in business domains
- ✅ Explicit Dependencies - import from authoritative source
- ✅ Avoid Manual Conversion - use shared schemas directly
Debugging Complex Systems:
- ✅ Add logging at transformation points
- ✅ Verify assumptions about what each layer receives/returns
- ✅ Trace data flow systematically
- ✅ Test each component independently
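Logging at a transformation point can be as simple as comparing the keys that go in with the keys that come out. The helper below is a sketch using only the standard library; the function names, the logger name, and the step label are hypothetical, and the sample record is illustrative.

```python
import logging
from typing import Any, Dict, List, Set

logger = logging.getLogger("statement_pipeline")  # hypothetical logger name


def dropped_fields(before: List[Dict[str, Any]], after: List[Dict[str, Any]]) -> Set[str]:
    """Return keys present in some input record but absent from every output record."""
    keys_before = set().union(*(record.keys() for record in before))
    keys_after = set().union(*(record.keys() for record in after))
    return keys_before - keys_after


def log_transformation(step: str, before: List[Dict[str, Any]], after: List[Dict[str, Any]]) -> Set[str]:
    """Log a warning if a transformation step silently loses fields."""
    missing = dropped_fields(before, after)
    if missing:
        logger.warning("%s dropped fields: %s", step, sorted(missing))
    else:
        logger.debug("%s preserved all fields", step)
    return missing


ai_output = [
    {
        "date": "2025-04-07",
        "description": "Card payment",
        "amount": "202,06",
        "movement_type": "expense",
        "category": "",
    }
]

# Same shape as the buggy manual conversion: movement_type is not copied.
converted = [
    {key: tx[key] for key in ("date", "description", "amount", "category")}
    for tx in ai_output
]

missing = log_transformation("statements.convert", ai_output, converted)
```

A check like this would have pointed at the conversion step directly, instead of surfacing later as a validation error.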
Benefits Achieved
Immediate Fixes
✅ movement_type now reaches validation intact - no more missing-field errors
✅ Cleaner codebase - eliminated redundant conversion logic
✅ Reduced complexity - fewer lines of code, fewer bugs
Long-term Improvements
✅ Easier maintenance - single place to update transaction schema
✅ Automatic compatibility - changes flow through automatically
✅ Better architecture - proper domain boundaries
✅ Future-proof - easier to add new transaction fields
Performance Gains
✅ No unnecessary object creation during conversion
✅ Direct object usage reduces memory allocations
✅ Simpler code paths improve readability and performance
Best Practices Established
Schema Design
# ✅ Good: Single authoritative schema
app/domains/transactions/schemas.py:
- TransactionData (for AI parsing)
- TransactionBase (for domain operations)

# ❌ Avoid: Duplicate schemas in different layers
app/core/ai/models/responses.py: TransactionData
app/domains/statements/schemas.py: StatementTransaction
Dependency Management
# ✅ Good: Infrastructure depends on domain
from app.domains.transactions.schemas import TransactionData

# ❌ Avoid: Domain depends on infrastructure
from app.core.ai.models.responses import TransactionData
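The dependency direction can also be checked mechanically. The sketch below uses the standard library ast module to flag imports from an infrastructure package inside a piece of domain source code; treating everything under app.core as infrastructure is an assumption, and forbidden_imports is a hypothetical helper, not part of the project.

```python
import ast
from typing import List

FORBIDDEN_PREFIX = "app.core"  # assumption: infrastructure lives under app.core


def forbidden_imports(source: str, forbidden_prefix: str = FORBIDDEN_PREFIX) -> List[str]:
    """List modules imported by `source` that fall under the forbidden package."""
    found: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        elif isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        else:
            continue
        found.extend(
            module
            for module in modules
            if module == forbidden_prefix or module.startswith(forbidden_prefix + ".")
        )
    return found


# Source text of a domain module, inlined here to keep the sketch self-contained.
good_source = "from app.domains.transactions.schemas import TransactionData\n"
bad_source = "from app.core.ai.models.responses import TransactionData\n"

good = forbidden_imports(good_source)
bad = forbidden_imports(bad_source)
print(good, bad)
```

In practice the source strings would be read from the files under app/domains, for example in a test that fails when the returned list is non-empty.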
Data Transformation
# ✅ Good: Direct usage of shared schemas
"transactions": financial_data.transactions

# ❌ Avoid: Manual conversion between identical structures
"transactions": [{"field": tx.field} for tx in items]
Future Considerations
Schema Evolution
Add new fields only to transactions domain
Changes automatically propagate to all consumers
Version migration can be handled in one place
Testing Strategy
Test schema consistency across domains
Validate data flow from AI to database
Monitor for schema drift in CI/CD
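A drift check can compare the field sets of two models. The following is a sketch assuming Pydantic v2; the two Legacy classes are stand-ins that mirror the pre-refactor duplicates, and field_drift is a hypothetical helper.

```python
from typing import Set, Type

from pydantic import BaseModel


class LegacyAITransaction(BaseModel):
    """Stand-in for the old AI-layer schema (no movement_type)."""
    date: str
    description: str
    amount: str
    category: str = ""


class LegacyStatementTransaction(BaseModel):
    """Stand-in for the old statements-domain schema."""
    date: str
    description: str
    amount: str
    movement_type: str
    category: str = ""


def field_drift(left: Type[BaseModel], right: Type[BaseModel]) -> Set[str]:
    """Return field names defined on exactly one of the two models."""
    return set(left.model_fields) ^ set(right.model_fields)


drift = field_drift(LegacyAITransaction, LegacyStatementTransaction)
print(sorted(drift))
```

With a single shared TransactionData the check is trivially empty, but running it in CI would catch a duplicate schema if one is ever reintroduced.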
Monitoring
Log schema field counts to detect missing fields
Track AI extraction success rates for new fields
Alert on validation failures during parsing
Conclusion
This refactor demonstrates the importance of proper domain-driven design and schema management. By moving transaction schemas to their rightful domain and eliminating redundant conversions, we:
Fixed the immediate bug (movement_type dropped during conversion)
Improved system architecture (proper domain boundaries)
Reduced future maintenance (single source of truth)
Enhanced debuggability (cleaner data flow)
The lesson: Architecture problems often manifest as data transformation
bugs. When debugging, look beyond the immediate error to understand the
underlying structural issues.