DAG Workflows & Processing Pipeline

Understanding the 6-stage document processing pipeline powered by Apache Airflow

Pipeline Overview

The Medical Data Ingestion Pipeline uses Apache Airflow to orchestrate a Directed Acyclic Graph (DAG) of processing tasks powered by Azure cloud services. Each uploaded document flows through 6 sequential stages, with dependencies ensuring data integrity and proper processing order. The entire pipeline leverages Azure services (Document Intelligence, Text Analytics for Health, OpenAI, AI Search, Blob Storage) for enterprise-grade reliability, with automatic retries, error handling, and real-time monitoring.

Pipeline Flow Diagram

Stage 1

Document Discovery & Registration

~5-10 seconds

Stage 2

OCR Processing

~5-15 seconds per page

Stage 3

Clinical Entity Extraction (TAF)

~2-5 seconds per document

Stage 4

Patient Linking

~1-2 seconds

Stage 5

Entity Normalization & Validation

~2-3 seconds

Stage 6

Embedding & Indexing

~1-2 seconds

Detailed Stage Information

Stage 1 of 6

Document Discovery & Registration

Initial document upload and metadata registration

What Happens:

•User uploads document through web interface or API
•File is uploaded to Azure Blob Storage with secure access
•Document metadata registered in PostgreSQL database
•Unique document ID (doc_id) generated
•Initial status set to "pending" for all processing stages
•Timestamp and source information recorded

Inputs:

→PDF, DOCX, TXT, or image files (PNG, JPG, TIFF)

Outputs:

←Document record in database
←File stored in Azure Blob Storage
←Document ID generated

Dependencies:

Azure Blob Storage available
Database connection active

Average Processing Time:5-10 seconds

Stage Priority:Highest

Stage 2 of 6

OCR Processing

Optical Character Recognition to extract text from images and PDFs

What Happens:

•Azure Document Intelligence API processes the document
•Text extracted from images, scanned PDFs, and handwritten notes with high accuracy
•OCR confidence scores calculated for each text region
•Extracted text stored in ocr_text field
•OCR status updated to "completed" or "failed"
•Processing errors logged for troubleshooting
•Azure Document Intelligence provides advanced layout analysis and table extraction

Inputs:

→Document file from Azure Blob Storage

Outputs:

←Extracted text (ocr_text)
←OCR confidence score
←OCR status update
←Layout information

Dependencies:

Azure Document Intelligence API
Document file accessible in Azure Blob Storage
Network connectivity to Azure

Average Processing Time:5-15 seconds per page

Stage Priority:Medium

Stage 3 of 6

Clinical Entity Extraction (TAF)

AI-powered extraction of medical entities using Azure Text Analytics for Health

What Happens:

•Azure Text Analytics for Health (TAF) analyzes the OCR text
•Clinical entities extracted: conditions, medications, procedures, test results
•Each entity includes: type, text, category, confidence score
•Contextual information captured: negation, certainty, temporality
•Assertion status determined: confirmed, suspected, ruled out
•Entities linked to medical ontologies (UMLS, SNOMED CT)
•Results stored in entities table with relationships to document

Inputs:

→OCR extracted text

Outputs:

←Clinical entities with confidence scores
←Entity types and categories
←Assertion status (confirmed/suspected)
←TAF status update

Dependencies:

Azure TAF API
OCR processing completed
Valid OCR text available

Average Processing Time:2-5 seconds per document

Stage Priority:Medium

Stage 4 of 6

Patient Linking

Associate documents with patient records using NLP matching

What Happens:

•Extract patient identifiers from document (MRN, name, DOB)
•Search existing patient records for matches
•NLP-based similarity matching for name variations
•Create new patient record if no match found
•Link document to patient via patient_id foreign key
•Update patient demographics if new information found
•Calculate matching confidence score

Inputs:

→Document with extracted entities
→Patient identifiers from text

Outputs:

←Document linked to patient_id
←Patient record created/updated
←Linking confidence score

Dependencies:

Entity extraction completed
Patient database accessible

Average Processing Time:1-2 seconds

Stage Priority:Medium

Stage 5 of 6

Entity Normalization & Validation

Standardize and validate extracted entities

What Happens:

•Deduplicate entities across documents for same patient
•Standardize entity names and codes
•Validate entity values against medical ontologies
•Resolve conflicts between multiple mentions
•Calculate aggregate confidence scores
•Flag low-confidence entities for review
•Update entity categories and classifications

Inputs:

→Raw extracted entities
→Medical ontology databases

Outputs:

←Normalized entity records
←Confidence distribution (high/medium/low)
←Entity relationships
←Validation flags

Dependencies:

TAF extraction completed
Ontology databases available

Average Processing Time:2-3 seconds

Stage Priority:Medium

Stage 6 of 6

Embedding & Indexing

Generate vector embeddings for semantic search using Azure OpenAI and index in Azure AI Search

What Happens:

•Document text converted to 1536-dimensional Azure OpenAI embedding
•Entity text also embedded for semantic entity search
•Embeddings generated using Azure OpenAI text-embedding-ada-002 model
•Documents indexed in Azure AI Search with vector fields
•Azure AI Search HNSW index created for fast similarity search
•Document marked as "indexed" and searchable
•Embedding generation timestamp recorded
•Pipeline status updated to "completed"

Inputs:

→Normalized document text
→Validated entities

Outputs:

←1536-dimensional vector embeddings
←Azure AI Search index entries
←Indexed status update
←Document ready for semantic search

Dependencies:

Azure OpenAI API available
Azure AI Search service configured
All previous stages completed

Average Processing Time:1-2 seconds

Stage Priority:Lowest

Error Handling & Retries

Automatic Retry Logic:

Network Errors: 3 retries with exponential backoff (2s, 4s, 8s)
API Rate Limits: Automatic queuing and delayed retry
Transient Failures: Up to 5 retry attempts before marking as failed
Partial Success: Stages can complete independently

Failure Recovery:

Stage Isolation: Failures in one stage don't affect completed stages
Error Logging: Detailed error messages stored for troubleshooting
Manual Retry: Admin can manually restart failed stages
Alert System: Notifications sent for persistent failures

Pipeline Performance

1-2 min

Average Total Processing Time

For typical medical documents

95%

Success Rate

Documents processed without errors

100+

Concurrent Documents

Parallel processing capacity

Monitoring & Observability

Real-time Dashboard

Monitor pipeline progress in real-time on the Analytics Dashboard. See completed, pending, and failed counts for each stage.

Activity Tracking

Track new patients, documents, and entities over time. View recent activity in 24-hour, 7-day, or 30-day windows.

Health Checks

System health endpoints monitor database, ML services, and API status. Check the About page for current system health.

Learn More

View Live Pipeline Status →System Architecture Details →Troubleshoot Pipeline Issues →Frequently Asked Questions →