DAG Workflows & Processing Pipeline
Understanding the 6-stage document processing pipeline powered by Apache Airflow
Pipeline Overview
The Medical Data Ingestion Pipeline uses Apache Airflow to orchestrate a Directed Acyclic Graph (DAG) of processing tasks backed by Azure cloud services (Document Intelligence, Text Analytics for Health, OpenAI, AI Search, and Blob Storage). Each uploaded document flows through 6 sequential stages, and task dependencies enforce the processing order so that no stage runs before its inputs exist. Failed tasks are retried automatically, errors are logged, and pipeline progress can be monitored in real time.
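The stage ordering described above can be modelled as a small dependency graph. The sketch below uses Python's standard-library graphlib rather than Airflow itself, and the stage names are hypothetical labels, not the task IDs of the actual DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical stage names; each stage maps to the set of stages it depends on.
STAGE_DEPENDENCIES = {
    "register_document": set(),
    "ocr": {"register_document"},
    "entity_extraction": {"ocr"},
    "patient_linking": {"entity_extraction"},
    "normalization": {"patient_linking"},
    "embedding_indexing": {"normalization"},
}


def execution_order(dependencies):
    """Return a valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STAGE_DEPENDENCIES)
print(order)
```

Because every stage depends on exactly one predecessor, only one valid order exists, which matches the sequential flow described in the overview.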
Pipeline Flow Diagram
Detailed Stage Information
Document Discovery & Registration
Initial document upload and metadata registration
What Happens:
- User uploads document through web interface or API
- File is uploaded to Azure Blob Storage with secure access
- Document metadata registered in PostgreSQL database
- Unique document ID (doc_id) generated
- Initial status set to "pending" for all processing stages
- Timestamp and source information recorded
Inputs:
- →PDF, DOCX, TXT, or image files (PNG, JPG, TIFF)
Outputs:
- ←Document record in database
- ←File stored in Azure Blob Storage
- ←Document ID generated
Dependencies:
- Azure Blob Storage available
- Database connection active
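A minimal sketch of the registration record follows, assuming a UUID is used for doc_id and that each downstream stage has its own status entry. The stage names and field names other than doc_id are hypothetical; the actual PostgreSQL schema is not shown in this document.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical names for the stages that follow registration.
STAGES = ("ocr", "taf", "patient_linking", "normalization", "indexing")


@dataclass
class DocumentRecord:
    filename: str
    source: str  # for example "web" or "api"
    # Assumption: doc_id is a random UUID; the real ID scheme may differ.
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    uploaded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Every stage starts as "pending", as described in the list above.
    stage_status: dict = field(
        default_factory=lambda: {stage: "pending" for stage in STAGES}
    )


record = DocumentRecord(filename="discharge_summary.pdf", source="api")
print(record.doc_id, record.stage_status)
```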
OCR Processing
Optical Character Recognition to extract text from images and PDFs
What Happens:
- Azure Document Intelligence API processes the document
- Text extracted from images, scanned PDFs, and handwritten notes
- OCR confidence scores calculated for each text region
- Layout and table structure captured through Azure Document Intelligence layout analysis
- Extracted text stored in ocr_text field
- OCR status updated to "completed" or "failed"
- Processing errors logged for troubleshooting
Inputs:
- →Document file from Azure Blob Storage
Outputs:
- ←Extracted text (ocr_text)
- ←OCR confidence score
- ←OCR status update
- ←Layout information
Dependencies:
- Azure Document Intelligence API
- Document file accessible in Azure Blob Storage
- Network connectivity to Azure
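The document does not say how per-region confidence scores are combined into the single OCR confidence score listed under Outputs. One plausible approach, shown here purely as an assumption, is a mean weighted by the amount of text in each region. The region dictionary shape is hypothetical.

```python
def document_confidence(regions):
    """Length-weighted mean of per-region confidences; None when no text was found."""
    total_chars = sum(len(region["text"]) for region in regions)
    if total_chars == 0:
        return None
    weighted = sum(len(region["text"]) * region["confidence"] for region in regions)
    return weighted / total_chars


# Hypothetical region shape: extracted text plus a confidence in [0, 1].
example_regions = [
    {"text": "Patient presents with chest pain", "confidence": 0.98},
    {"text": "BP 120/80", "confidence": 0.71},
]
print(document_confidence(example_regions))
```

Weighting by text length keeps a short, poorly recognized fragment from dominating the score of an otherwise clean page.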
Clinical Entity Extraction (TAF)
AI-powered extraction of medical entities using Azure Text Analytics for Health
What Happens:
- Azure Text Analytics for Health (TAF) analyzes the OCR text
- Clinical entities extracted: conditions, medications, procedures, test results
- Each entity includes: type, text, category, confidence score
- Contextual information captured: negation, certainty, temporality
- Assertion status determined: confirmed, suspected, ruled out
- Entities linked to medical ontologies (UMLS, SNOMED CT)
- Results stored in entities table with relationships to document
Inputs:
- →OCR extracted text
Outputs:
- ←Clinical entities with confidence scores
- ←Entity types and categories
- ←Assertion status (confirmed/suspected/ruled out)
- ←TAF status update
Dependencies:
- Azure TAF API
- OCR processing completed
- Valid OCR text available
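The sketch below illustrates how a raw entity could be turned into a stored record with one of the three assertion statuses named above. The mapping from certainty values to statuses is an assumption, as are the ClinicalEntity fields and the shape of the raw dictionary; the pipeline's actual rules are not documented here.

```python
from dataclasses import dataclass

# Assumed mapping from a TAF-style certainty value to the statuses used above.
CERTAINTY_TO_STATUS = {
    "positive": "confirmed",
    "positivePossible": "suspected",
    "neutralPossible": "suspected",
    "negativePossible": "suspected",
    "negative": "ruled out",
}


@dataclass
class ClinicalEntity:
    doc_id: str
    text: str
    category: str
    confidence: float
    assertion_status: str


def to_entity(doc_id, raw):
    """Build a record from a hypothetical raw entity dictionary."""
    # Assumption: a missing or unrecognized certainty is treated as confirmed.
    status = CERTAINTY_TO_STATUS.get(raw.get("certainty"), "confirmed")
    return ClinicalEntity(
        doc_id=doc_id,
        text=raw["text"],
        category=raw["category"],
        confidence=raw["confidence"],
        assertion_status=status,
    )


entity = to_entity(
    "doc-1",
    {"text": "pneumonia", "category": "Diagnosis", "confidence": 0.97, "certainty": "negative"},
)
print(entity)
```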
Patient Linking
Associate documents with patient records using NLP matching
What Happens:
- Extract patient identifiers from document (MRN, name, DOB)
- Search existing patient records for matches
- NLP-based similarity matching for name variations
- Create new patient record if no match found
- Link document to patient via patient_id foreign key
- Update patient demographics if new information found
- Calculate matching confidence score
Inputs:
- →Document with extracted entities
- →Patient identifiers from text
Outputs:
- ←Document linked to patient_id
- ←Patient record created/updated
- ←Linking confidence score
Dependencies:
- Entity extraction completed
- Patient database accessible
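The matching steps above can be sketched as follows. This is not the pipeline's actual matcher: it stands in for the NLP-based similarity with the standard-library difflib, and the 0.85 threshold, the dictionary fields, and the rule that DOB must agree before names are compared are all assumptions.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Case-insensitive similarity in [0, 1] that ignores token order and commas."""
    def normalize(name):
        return " ".join(sorted(name.lower().replace(",", " ").split()))
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_patient(candidate, patients, threshold=0.85):
    """Return (patient_id, confidence), or (None, 0.0) when a new record is needed."""
    # Step 1: an exact MRN match is treated as certain.
    for patient in patients:
        if candidate.get("mrn") and candidate["mrn"] == patient["mrn"]:
            return patient["patient_id"], 1.0
    # Step 2: otherwise require the same DOB and a sufficiently similar name.
    best_id, best_score = None, 0.0
    for patient in patients:
        if candidate.get("dob") != patient["dob"]:
            continue
        score = name_similarity(candidate["name"], patient["name"])
        if score > best_score:
            best_id, best_score = patient["patient_id"], score
    if best_score >= threshold:
        return best_id, best_score
    return None, 0.0
```

A result of (None, 0.0) corresponds to the "create new patient record" branch in the list above.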
Entity Normalization & Validation
Standardize and validate extracted entities
What Happens:
- Deduplicate entities across documents for same patient
- Standardize entity names and codes
- Validate entity values against medical ontologies
- Resolve conflicts between multiple mentions
- Calculate aggregate confidence scores
- Flag low-confidence entities for review
- Update entity categories and classifications
Inputs:
- →Raw extracted entities
- →Medical ontology databases
Outputs:
- ←Normalized entity records
- ←Confidence distribution (high/medium/low)
- ←Entity relationships
- ←Validation flags
Dependencies:
- TAF extraction and patient linking completed
- Ontology databases available
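A minimal sketch of deduplication, aggregate confidence, and review flagging is shown below. The grouping key, the use of a simple mean, and the 0.9 and 0.7 bucket thresholds are assumptions chosen for illustration; the document does not state the real values.

```python
from collections import defaultdict


def confidence_bucket(score, high=0.9, medium=0.7):
    """Map a score to high/medium/low using assumed thresholds."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    return "low"


def normalize_entities(entities):
    """Merge mentions that share patient, category, and case-insensitive text."""
    grouped = defaultdict(list)
    for entity in entities:
        key = (entity["patient_id"], entity["category"], entity["text"].strip().lower())
        grouped[key].append(entity["confidence"])

    normalized = []
    for (patient_id, category, text), scores in grouped.items():
        aggregate = sum(scores) / len(scores)  # assumed aggregate: simple mean
        bucket = confidence_bucket(aggregate)
        normalized.append({
            "patient_id": patient_id,
            "category": category,
            "text": text,
            "mentions": len(scores),
            "confidence": aggregate,
            "bucket": bucket,
            "needs_review": bucket == "low",
        })
    return normalized
```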
Embedding & Indexing
Generate vector embeddings for semantic search using Azure OpenAI and index in Azure AI Search
What Happens:
- Document text converted to a 1536-dimensional embedding using the Azure OpenAI text-embedding-ada-002 model
- Entity text also embedded for semantic entity search
- Documents indexed in Azure AI Search with vector fields
- Azure AI Search HNSW index created for fast similarity search
- Document marked as "indexed" and searchable
- Embedding generation timestamp recorded
- Pipeline status updated to "completed"
Inputs:
- →Normalized document text
- →Validated entities
Outputs:
- ←1536-dimensional vector embeddings
- ←Azure AI Search index entries
- ←Indexed status update
- ←Document ready for semantic search
Dependencies:
- Azure OpenAI API available
- Azure AI Search service configured
- All previous stages completed
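To make the similarity search concrete, the sketch below ranks stored vectors against a query vector by cosine similarity using an exact scan. It assumes cosine is the configured metric, which this document does not state. An HNSW index approximates this ranking without comparing the query against every stored vector. The function names are hypothetical.

```python
import math

EMBEDDING_DIM = 1536  # vector size stated in the stage description


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, indexed, k=3):
    """Exact nearest neighbours over a {doc_id: vector} mapping, best match first."""
    scored = [(doc_id, cosine_similarity(query, vector)) for doc_id, vector in indexed.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

The example works for any dimension; real embeddings from this stage would have EMBEDDING_DIM components.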
Error Handling & Retries
Automatic Retry Logic:
- Network Errors: 3 retries with exponential backoff (2s, 4s, 8s)
- API Rate Limits: Automatic queuing and delayed retry
- Transient Failures: Up to 5 retry attempts before marking as failed
- Partial Success: Each stage records its own status, so completed stages keep their results when a later stage fails
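The network-error policy above (3 retries with delays of 2s, 4s, and 8s) can be sketched in plain Python as follows. TransientError and call_with_retries are hypothetical names, and this is an illustration of the backoff schedule rather than the pipeline's actual retry code.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delays(retries=3, base=2.0):
    """Delay before each retry: base, 2*base, 4*base, ..."""
    return [base * 2 ** attempt for attempt in range(retries)]


def call_with_retries(func, retries=3, base=2.0, sleep=time.sleep):
    """Call func, retrying on TransientError; re-raise after the final retry fails."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == retries:
                raise
            sleep(delays[attempt])


print(backoff_delays())
```

In Airflow, comparable behaviour is usually configured through operator arguments such as retries, retry_delay, and retry_exponential_backoff; whether this pipeline uses them is not stated here.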
Failure Recovery:
- Stage Isolation: Failures in one stage don't affect completed stages
- Error Logging: Detailed error messages stored for troubleshooting
- Manual Retry: Admin can manually restart failed stages
- Alert System: Notifications sent for persistent failures
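Stage isolation implies that a manual retry only needs to rerun the first unfinished stage and the stages after it. The sketch below shows that selection under assumed stage names and a per-document status dictionary; it is not the admin tool's actual logic.

```python
# Hypothetical stage names in pipeline order.
STAGE_ORDER = ["ocr", "taf", "patient_linking", "normalization", "indexing"]


def stages_to_rerun(stage_status):
    """Return the first non-completed stage and all later stages, in order."""
    for position, stage in enumerate(STAGE_ORDER):
        if stage_status.get(stage) != "completed":
            return STAGE_ORDER[position:]
    return []


status = {
    "ocr": "completed",
    "taf": "completed",
    "patient_linking": "failed",
    "normalization": "pending",
    "indexing": "pending",
}
print(stages_to_rerun(status))
```

Completed stages such as ocr and taf are left untouched, so their stored results are reused on the retry.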
Pipeline Performance
Monitoring & Observability
Real-time Dashboard
Monitor pipeline progress in real-time on the Analytics Dashboard. See completed, pending, and failed counts for each stage.
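The per-stage counts shown on the dashboard could be derived from document status records roughly as follows. The record shape and stage names are assumptions for illustration.

```python
from collections import Counter


def stage_counts(documents, stages):
    """Count documents per status for each stage; a missing status counts as pending."""
    counts = {stage: Counter() for stage in stages}
    for document in documents:
        for stage in stages:
            counts[stage][document["stage_status"].get(stage, "pending")] += 1
    return counts


# Hypothetical status records for three documents.
documents = [
    {"stage_status": {"ocr": "completed", "taf": "completed"}},
    {"stage_status": {"ocr": "completed", "taf": "failed"}},
    {"stage_status": {"ocr": "pending"}},
]
summary = stage_counts(documents, ["ocr", "taf"])
print(summary)
```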
Activity Tracking
Track new patients, documents, and entities over time. View recent activity in 24-hour, 7-day, or 30-day windows.
Health Checks
System health endpoints monitor database, ML services, and API status. Check the About page for current system health.
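A health endpoint of this kind typically runs one check per component and reports an overall status. The sketch below assumes each check is a callable that returns True when the component is healthy; the component names and the response shape are hypothetical.

```python
def overall_health(checks):
    """Run each check; a False result or an exception marks that component as down."""
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "ok" if check() else "down"
        except Exception:
            components[name] = "down"
    healthy = all(state == "ok" for state in components.values())
    return {"status": "healthy" if healthy else "degraded", "components": components}


def failing_check():
    raise ConnectionError("service unreachable")


# Hypothetical checks standing in for the database, ML services, and API.
report = overall_health({
    "database": lambda: True,
    "ml_services": failing_check,
    "api": lambda: True,
})
print(report)
```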