System Architecture

Comprehensive overview of the Medical Data Ingestion Pipeline architecture

Architecture Overview

The Medical Data Ingestion Pipeline follows a cloud-native microservices architecture hosted entirely on Microsoft Azure. Concerns are separated across presentation, application, and data layers, and running every service on Azure is intended to provide scalability, security, and reliability.

Presentation Layer

Next.js 14 frontend with server-side rendering, TypeScript for type safety, and TanStack Query for optimized data fetching.

• Next.js 14 (App Router)
• React 18 with TypeScript
• TanStack Query v5
• Tailwind CSS

Application Layer

FastAPI backend with async support, service-layer pattern, and comprehensive API endpoints following REST principles.

• FastAPI (Async/Await)
• Python 3.11
• Service Layer Pattern
• OpenAPI/Swagger Docs

Data Layer

PostgreSQL for structured data, Azure Blob Storage for documents (encrypted at rest), Azure AI Search for semantic search, and the SQLAlchemy ORM for database access.

• PostgreSQL 15
• Azure Blob Storage
• Azure AI Search
• SQLAlchemy ORM

Core Components

Frontend Application

Modern React application built with Next.js 14 featuring server-side rendering, optimistic UI updates, and real-time data synchronization.

  • Patient Search & Management: Paginated patient list with semantic search, filtering, and sorting
  • Document Viewer: PDF viewer with entity highlighting and lazy loading
  • Analytics Dashboard: Real-time pipeline monitoring and entity trend visualization
  • v2 API Client: Type-safe API services with selective loading and caching

Backend Services

High-performance async API built with FastAPI, implementing the service layer pattern to separate business logic from request handling.

  • Patient Service: Patient CRUD operations with pagination, caching (5-min TTL), and aggregated statistics
  • Document Service: Document management with lazy loading, Azure Blob Storage SAS token generation, and batch operations
  • Analytics Service: System monitoring, pipeline status tracking, entity trends, and activity logging
  • Search Service: Hybrid semantic search using Azure AI Search with vector similarity and keyword matching
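
The 5-minute TTL caching mentioned for the Patient Service can be illustrated with a minimal in-memory sketch. The class name `TTLCache`, the `get_or_load` method, and the injectable `clock` are hypothetical names chosen for illustration; the actual service may well use a different cache (for example a shared one), which the source does not specify.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value if still fresh, otherwise call loader and store the result."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # fresh entry: skip the expensive loader
        value = loader()
        self._entries[key] = (now, value)
        return value
```

A caller would wrap an expensive query, for example `cache.get_or_load("patient-stats", load_stats)`, where `load_stats` is whatever function computes the aggregated statistics. Note that a per-process cache like this one is not shared between backend instances.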

AI/ML Processing Pipeline

Apache Airflow orchestrates a 6-stage DAG for document processing with AI-powered entity extraction and semantic indexing.

  • Stage 1 (Discovery): Document upload to Azure Blob Storage and database registration
  • Stage 2 (OCR): Azure Document Intelligence extracts text from images/PDFs with high accuracy
  • Stage 3 (TAF Extraction): Azure Text Analytics for Health extracts clinical entities with confidence scores
  • Stage 4 (Patient Linking): Documents associated with patient records via NLP matching
  • Stage 5 (Normalization): Entity standardization, deduplication, and validation
  • Stage 6 (Indexing): Azure OpenAI generates embeddings (1536-dim) and documents are indexed in Azure AI Search
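
The linear ordering of the six stages can be sketched in plain Python. This is not the pipeline's Airflow code: the stage bodies are placeholders that only record that they ran, and `make_stage`, `STAGES`, and `run_pipeline` are hypothetical names. In Airflow, each stage would instead be a task, with dependencies chaining one task to the next.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Stage = Tuple[str, Callable[[Context], Context]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that appends its name to the 'completed' list."""
    def run(ctx: Context) -> Context:
        done = list(ctx.get("completed", []))
        done.append(name)
        return {**ctx, "completed": done}
    return name, run


# Stage names follow the six stages listed above; the bodies are stand-ins.
STAGES: List[Stage] = [make_stage(n) for n in (
    "discovery", "ocr", "taf_extraction",
    "patient_linking", "normalization", "indexing",
)]


def run_pipeline(document_id: str) -> Context:
    """Run every stage in order, passing the context from one stage to the next."""
    ctx: Context = {"document_id": document_id, "completed": []}
    for _name, stage in STAGES:
        ctx = stage(ctx)
    return ctx
```

Passing a context dictionary forward mirrors the way each stage consumes the previous stage's output (OCR text feeds entity extraction, and so on).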

Data Flow Architecture

Document Upload Flow

User Upload (PDF/DOCX/Image) → Azure Blob Storage (secure cloud storage) → Database Record (PostgreSQL entry) → DAG Trigger (Airflow pipeline)

Search Query Flow

User Query (natural language) → Query Analysis (semantic detection) → Embedding (Azure OpenAI vector) → Azure AI Search (hybrid search) → Results (ranked matches)
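
To make the hybrid search step more concrete, the sketch below merges a keyword ranking and a vector ranking with Reciprocal Rank Fusion (RRF), the method Azure AI Search documents for combining hybrid results. The service performs this merge itself, so the code is purely illustrative; the function name, the sample document IDs, and the constant `k = 60` (a commonly used default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from the keyword and vector halves of one query.
keyword_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

Documents that appear in both lists (`doc-a`, `doc-b`) rise above those found by only one method, which is the practical benefit of hybrid search over either method alone.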

Performance Optimizations

Backend Optimizations

  • Response Caching: 5-minute TTL for frequently accessed data (60-70% hit rate)
  • Query Optimization: Eager loading eliminates N+1 queries (98% reduction)
  • Pagination: 10 items per page reduces payload by 90%
  • Lazy Loading: Content excluded by default (10x smaller responses)
  • Selective Loading: Optional fields via query parameters
  • Connection Pooling: Async database connections with SQLAlchemy
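
The pagination, lazy loading, and selective loading items above can be sketched together as follows. This is a minimal illustration over in-memory data, assuming a hypothetical `content` field as the heavy one and an `include` query parameter as the opt-in mechanism. A real endpoint would normally paginate in the database query rather than slicing a list.

```python
from typing import Any, Dict, Iterable, Sequence

HEAVY_FIELDS = {"content"}  # hypothetical: fields omitted unless explicitly requested


def paginate(items: Sequence[Any], page: int = 1, page_size: int = 10) -> Dict[str, Any]:
    """Return one page of items plus the metadata a client needs to page further."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start = (page - 1) * page_size
    total = len(items)
    return {
        "items": list(items[start:start + page_size]),
        "page": page,
        "page_size": page_size,
        "total": total,
        "total_pages": (total + page_size - 1) // page_size,  # ceiling division
    }


def serialize(doc: Dict[str, Any], include: Iterable[str] = ()) -> Dict[str, Any]:
    """Drop heavy fields unless the caller names them, e.g. ?include=content."""
    wanted = set(include)
    return {k: v for k, v in doc.items() if k not in HEAVY_FIELDS or k in wanted}
```

For example, `serialize(doc)` returns the document without its text body, while `serialize(doc, include=["content"])` returns everything.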

Frontend Optimizations

  • React Query Caching: Automatic request deduplication and background refetch
  • Code Splitting: Next.js automatic route-based splitting
  • Server Components: Next.js 14 App Router for zero JS when possible
  • Parallel Data Fetching: Multiple independent queries run concurrently
  • Optimistic Updates: Immediate UI feedback before API confirmation
  • Image Optimization: Next.js Image component with lazy loading

Security Architecture

Data Encryption

  • Encryption at rest (Azure Blob)
  • SSL/TLS in transit
  • PostgreSQL encrypted columns

Access Control

  • SAS tokens with expiration
  • Role-based access (RBAC)
  • API authentication
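
The idea behind expiring access tokens can be shown with a generic signed-token sketch. This is not the Azure SAS format, and in practice the Azure SDK would generate the real token. The functions `sign` and `is_valid`, the message layout, and the sample path are all hypothetical and illustrate only the principle: a signature over the resource and an expiry time, checked on every request.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(secret: bytes, path: str, expires_at: int) -> str:
    """Compute an HMAC-SHA256 signature over the resource path and expiry timestamp."""
    message = f"{path}\n{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_valid(secret: bytes, path: str, expires_at: int, signature: str,
             now: Optional[float] = None) -> bool:
    """Accept only an unexpired token whose signature matches the path and expiry."""
    current = time.time() if now is None else now
    if current >= expires_at:
        return False  # token has expired
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(sign(secret, path, expires_at), signature)
```

Because the expiry is part of the signed message, a client cannot extend a token's lifetime without invalidating its signature.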

Compliance

  • HIPAA-compliant storage
  • Audit logging
  • PHI data handling

Network Security

  • Private endpoints
  • VNet integration
  • Firewall rules

Data Validation

  • Pydantic schemas
  • Input sanitization
  • SQL injection prevention
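
SQL injection prevention usually comes down to binding user input as query parameters instead of concatenating it into SQL text. The sketch below shows the principle with the standard-library `sqlite3` module as a stand-in for PostgreSQL. The `patients` table, its columns, and `find_patients` are hypothetical, and the pipeline itself goes through SQLAlchemy rather than raw queries like this.

```python
import sqlite3
from typing import List, Tuple


def find_patients(conn: sqlite3.Connection, last_name: str) -> List[Tuple[int, str]]:
    """Look up patients by last name; the input is bound, never spliced into the SQL."""
    cur = conn.execute(
        "SELECT id, last_name FROM patients WHERE last_name = ?",
        (last_name,),
    )
    return cur.fetchall()


# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO patients (last_name) VALUES (?)", [("Ng",), ("Okafor",)])

print(find_patients(conn, "Ng"))            # [(1, 'Ng')]
print(find_patients(conn, "x' OR '1'='1"))  # []
```

The second call shows a classic injection string being treated as a literal name that matches nothing, rather than as SQL.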

Monitoring

  • Health check endpoints
  • Error tracking
  • Performance metrics

Scalability Design

Horizontal Scaling

The FastAPI backend is stateless and can be deployed across multiple containers or VMs. A load balancer distributes traffic across instances, and database connection pooling prevents connection exhaustion.

Database & Search Optimization

PostgreSQL uses read replicas to distribute query load. Azure AI Search provides cloud-native vector search based on the HNSW algorithm, targeting sub-second query performance at scale. The database is vacuumed regularly, and queries are tuned based on their execution plans.

Async Processing

An Apache Airflow DAG handles background processing asynchronously. Task queues prevent blocking operations, and Celery workers can scale independently based on workload.

CDN & Caching

Azure CDN delivers static assets. API responses are cached with a TTL, and browser caching improves client performance. Azure Blob Storage uses global replication and Azure CDN integration for fast document delivery worldwide.