DAG Workflows & Processing Pipeline

Understanding the 6-stage document processing pipeline powered by Apache Airflow

Pipeline Overview

The Medical Data Ingestion Pipeline uses Apache Airflow to orchestrate a Directed Acyclic Graph (DAG) of processing tasks powered by Azure cloud services. Each uploaded document flows through 6 sequential stages, with dependencies ensuring data integrity and proper processing order. The entire pipeline leverages Azure services (Document Intelligence, Text Analytics for Health, OpenAI, AI Search, Blob Storage) for enterprise-grade reliability, with automatic retries, error handling, and real-time monitoring.

Pipeline Flow Diagram

Stage 1
Document Discovery & Registration
~5-10 seconds
Stage 2
OCR Processing
~5-15 seconds per page
Stage 3
Clinical Entity Extraction (TAF)
~2-5 seconds per document
Stage 4
Patient Linking
~1-2 seconds
Stage 5
Entity Normalization & Validation
~2-3 seconds
Stage 6
Embedding & Indexing
~1-2 seconds

Detailed Stage Information

Stage 1 of 6

Document Discovery & Registration

Initial document upload and metadata registration

What Happens:

  • User uploads document through web interface or API
  • File is uploaded to Azure Blob Storage with secure access
  • Document metadata registered in PostgreSQL database
  • Unique document ID (doc_id) generated
  • Initial status set to "pending" for all processing stages
  • Timestamp and source information recorded

Inputs:

  • PDF, DOCX, TXT, or image files (PNG, JPG, TIFF)

Outputs:

  • Document record in database
  • File stored in Azure Blob Storage
  • Document ID generated

Dependencies:

  • Azure Blob Storage available
  • Database connection active
Average Processing Time:5-10 seconds
Stage Priority:Highest
Stage 2 of 6

OCR Processing

Optical Character Recognition to extract text from images and PDFs

What Happens:

  • Azure Document Intelligence API processes the document
  • Text extracted from images, scanned PDFs, and handwritten notes with high accuracy
  • OCR confidence scores calculated for each text region
  • Extracted text stored in ocr_text field
  • OCR status updated to "completed" or "failed"
  • Processing errors logged for troubleshooting
  • Azure Document Intelligence provides advanced layout analysis and table extraction

Inputs:

  • Document file from Azure Blob Storage

Outputs:

  • Extracted text (ocr_text)
  • OCR confidence score
  • OCR status update
  • Layout information

Dependencies:

  • Azure Document Intelligence API
  • Document file accessible in Azure Blob Storage
  • Network connectivity to Azure
Average Processing Time:5-15 seconds per page
Stage Priority:Medium
Stage 3 of 6

Clinical Entity Extraction (TAF)

AI-powered extraction of medical entities using Azure Text Analytics for Health

What Happens:

  • Azure Text Analytics for Health (TAF) analyzes the OCR text
  • Clinical entities extracted: conditions, medications, procedures, test results
  • Each entity includes: type, text, category, confidence score
  • Contextual information captured: negation, certainty, temporality
  • Assertion status determined: confirmed, suspected, ruled out
  • Entities linked to medical ontologies (UMLS, SNOMED CT)
  • Results stored in entities table with relationships to document

Inputs:

  • OCR extracted text

Outputs:

  • Clinical entities with confidence scores
  • Entity types and categories
  • Assertion status (confirmed/suspected)
  • TAF status update

Dependencies:

  • Azure TAF API
  • OCR processing completed
  • Valid OCR text available
Average Processing Time:2-5 seconds per document
Stage Priority:Medium
Stage 4 of 6

Patient Linking

Associate documents with patient records using NLP matching

What Happens:

  • Extract patient identifiers from document (MRN, name, DOB)
  • Search existing patient records for matches
  • NLP-based similarity matching for name variations
  • Create new patient record if no match found
  • Link document to patient via patient_id foreign key
  • Update patient demographics if new information found
  • Calculate matching confidence score

Inputs:

  • Document with extracted entities
  • Patient identifiers from text

Outputs:

  • Document linked to patient_id
  • Patient record created/updated
  • Linking confidence score

Dependencies:

  • Entity extraction completed
  • Patient database accessible
Average Processing Time:1-2 seconds
Stage Priority:Medium
Stage 5 of 6

Entity Normalization & Validation

Standardize and validate extracted entities

What Happens:

  • Deduplicate entities across documents for same patient
  • Standardize entity names and codes
  • Validate entity values against medical ontologies
  • Resolve conflicts between multiple mentions
  • Calculate aggregate confidence scores
  • Flag low-confidence entities for review
  • Update entity categories and classifications

Inputs:

  • Raw extracted entities
  • Medical ontology databases

Outputs:

  • Normalized entity records
  • Confidence distribution (high/medium/low)
  • Entity relationships
  • Validation flags

Dependencies:

  • TAF extraction completed
  • Ontology databases available
Average Processing Time:2-3 seconds
Stage Priority:Medium
Stage 6 of 6

Embedding & Indexing

Generate vector embeddings for semantic search using Azure OpenAI and index in Azure AI Search

What Happens:

  • Document text converted to 1536-dimensional Azure OpenAI embedding
  • Entity text also embedded for semantic entity search
  • Embeddings generated using Azure OpenAI text-embedding-ada-002 model
  • Documents indexed in Azure AI Search with vector fields
  • Azure AI Search HNSW index created for fast similarity search
  • Document marked as "indexed" and searchable
  • Embedding generation timestamp recorded
  • Pipeline status updated to "completed"

Inputs:

  • Normalized document text
  • Validated entities

Outputs:

  • 1536-dimensional vector embeddings
  • Azure AI Search index entries
  • Indexed status update
  • Document ready for semantic search

Dependencies:

  • Azure OpenAI API available
  • Azure AI Search service configured
  • All previous stages completed
Average Processing Time:1-2 seconds
Stage Priority:Lowest

Error Handling & Retries

Automatic Retry Logic:

  • Network Errors: 3 retries with exponential backoff (2s, 4s, 8s)
  • API Rate Limits: Automatic queuing and delayed retry
  • Transient Failures: Up to 5 retry attempts before marking as failed
  • Partial Success: Stages can complete independently

Failure Recovery:

  • Stage Isolation: Failures in one stage don't affect completed stages
  • Error Logging: Detailed error messages stored for troubleshooting
  • Manual Retry: Admin can manually restart failed stages
  • Alert System: Notifications sent for persistent failures

Pipeline Performance

1-2 min
Average Total Processing Time
For typical medical documents
95%
Success Rate
Documents processed without errors
100+
Concurrent Documents
Parallel processing capacity

Monitoring & Observability

Real-time Dashboard

Monitor pipeline progress in real-time on the Analytics Dashboard. See completed, pending, and failed counts for each stage.

Activity Tracking

Track new patients, documents, and entities over time. View recent activity in 24-hour, 7-day, or 30-day windows.

Health Checks

System health endpoints monitor database, ML services, and API status. Check the About page for current system health.