
Multimodal Medical Document Retrieval (MedRAG)

15 min read
RAG · Medical AI · OCR · LangChain · LlamaIndex · BLIP-2 · GCP · HIPAA · Multimodal AI


I developed a retrieval-augmented generation (RAG) system to extract reliable answers from complex medical documents containing text, tables, and diagnostic images. The system integrated OCR, LangChain, LlamaIndex, and GPT-4 to enable multimodal understanding and was deployed on GCP (Vertex AI, BigQuery) for scalability and HIPAA compliance.

Project Context

Duration: 1 month
Team: 3 members
Client: Medical device manufacturer

The Problem

Medical device manufacturers face a critical challenge: regulatory requirements vary significantly across countries, and researching these requirements consumes approximately 80% of total project time. Traditional PDF analysis focused primarily on text, making it easy to miss critical information in images and tables. In the medical field, where accuracy is paramount, this created a significant research burden and a real risk of oversight.

The core issues were:

  • Information Loss: Images and tables containing crucial data were often overlooked
  • Traceability Gaps: Difficulty tracking the source and accuracy of extracted information
  • Time Consumption: Manual research processes were extremely time-intensive
  • Regulatory Risk: Missing critical requirements could have serious compliance implications

Project Goals

The system aimed to:

  1. Integrate diverse information formats (text, images, tables) from medical documents
  2. Enable accurate and reliable information extraction
  3. Support decision-making in medical practice and intellectual property operations
  4. Provide traceable sources for all extracted information
  5. Reduce regulatory research time significantly

System Architecture

Design Philosophy: Late Fusion Over Early Fusion

Medical documents require that numerical values be treated not as mere symbols, but as critical regulatory thresholds. Early Fusion (mixing all data types from the start) risks losing numerical precision in vector space. Therefore, I adopted a Late Fusion approach, creating optimal indexes and captions for each data format to suppress noise while improving recall.

Three-Layer Architecture

1. Semantic Enrichment (Preprocessing Layer)

Each data format was converted into "semantic information" that LLMs can understand, improving hit rates in vector space.

Image Processing:

  • OCR extraction of text from images
  • Caption generation describing what the image shows
  • BLIP-2 for context-aware visual understanding

Table Processing:

  • HTML conversion to preserve structure
  • Caption generation providing context (e.g., "This table shows trends in...")
  • Numerical data enriched with semantic meaning

Text Processing:

  • Context-aware chunking (section/paragraph-based rather than word-count-based)
  • Medical terminology normalization
  • Structure preservation (headings, sections)
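As an illustration, the enrichment step reduces to string composition: each element is prefixed with a modality tag and a generated caption before embedding. The helper names and tag format below are illustrative, not the production schema:

```python
def enrich_table(table_html: str, caption: str) -> str:
    """Prefix an HTML table with a generated caption so the embedding
    captures both its structure and its meaning."""
    return f"[TABLE] {caption}\n{table_html}"

def enrich_image(caption: str, ocr_text: str) -> str:
    """Fuse a BLIP-2 style caption (context) with OCR text (exact values)."""
    return f"[IMAGE] {caption}\nExtracted text: {ocr_text}"

table_chunk = enrich_table(
    "<table><tr><th>Daily limit</th><td>50 µg</td></tr></table>",
    "This table shows the permissible daily exposure limit.",
)
```

The key property is that the caption text sits in the same chunk as the raw structure, so a semantic query ("exposure limits") can still land on a purely numerical table.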

2. Hybrid Architecture (Separated Search Paths)

Mixing all data into a single index causes image and table metadata to become "noise" for text searches. I separated the search pipeline:

  • Text Search: Context and concept-based semantic search
  • Table/Image Search: Specialized indexes to prevent numerical data and specialized diagrams from being dragged down by irrelevant text queries

This separation ensures that:

  • Numerical data maintains its precision
  • Specialized diagrams aren't overwhelmed by general text queries
  • Each modality is searched using its optimal strategy
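A minimal sketch of the late-fusion merge, assuming each pipeline returns scored hits; the `Hit` container here is a stand-in for a real vector-store result object:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float
    modality: str   # "text", "table", or "image"

def late_fusion(text_hits: list[Hit], table_image_hits: list[Hit],
                k: int = 3) -> list[Hit]:
    """Merge results from the two search paths after retrieval,
    keeping modality labels for explicit source attribution."""
    merged = sorted(text_hits + table_image_hits,
                    key=lambda h: h.score, reverse=True)
    return merged[:k]
```

Because the merge happens after each modality is searched with its own strategy, a strong table hit is never crowded out by weakly relevant text chunks.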

3. Context Integration and Reconstruction

Separately retrieved sources are integrated with explicit source attribution, ensuring the "accuracy of evidence" essential for medical documents.

Technical Implementation

BLIP-2: Multimodal Analysis Engine

I selected BLIP-2 as the image captioning engine for several strategic reasons:

Why BLIP-2 Over Alternatives

Cost Efficiency and Throughput: While LLaVA offers high inference capabilities, its VRAM consumption is prohibitive for processing thousands of medical documents in parallel. BLIP-2's Q-Former (Querying Transformer) acts as a lightweight bridge, extracting only relevant visual features for textification while conserving computational resources.

Q-Former Architecture:

  • Acts as a bridge between image encoder and LLM
  • Extracts only question-relevant visual features
  • Eliminates noise while focusing on necessary numerical information

Local Deployment Benefits:

  • High throughput in secure local environment
  • No external API calls for confidential documents
  • Frozen large-scale LLM components save VRAM
  • Optimized inference speed per image

OCR Hybrid Configuration

Understanding BLIP-2's limitations in fine character recognition, I implemented a practical complement:

  • Tesseract/PaddleOCR: Extract raw text data
  • BLIP-2: Generate semantic captions
  • Combined Approach: "Image understanding (BLIP-2)" + "Character accuracy (OCR)" = high-quality captions with minimal numerical errors
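The fusion itself is plain string composition once the two engines have run. In this sketch, `caption` and `ocr_text` are assumed to come from BLIP-2 and Tesseract/PaddleOCR respectively; the output format is illustrative:

```python
def build_image_chunk(caption: str, ocr_text: str) -> str:
    """Combine image understanding (BLIP-2) with character accuracy (OCR).
    OCR text is kept verbatim so regulatory numbers survive exactly."""
    lines = [f"Figure description: {caption}"]
    if ocr_text.strip():
        lines.append(f"Text in figure: {ocr_text.strip()}")
    return "\n".join(lines)
```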

Chunking Optimization

Rather than simple word-count chunking, I implemented context-based chunking:

  • Problem: Word-count chunking fragmented related content across too many chunks, reducing retrieval accuracy
  • Solution: Split data by paragraphs or sections, clustering related data within single chunks
  • Result: Improved retrieval precision and reduced noise
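A minimal sketch of paragraph-based packing, treating blank lines as boundaries; the 1,500-character budget is an illustrative value, not the production setting:

```python
import re

def chunk_by_section(text: str, max_chars: int = 1500) -> list[str]:
    """Split on blank lines (paragraph boundaries) and pack consecutive
    paragraphs into chunks, never cutting mid-paragraph. A single
    paragraph longer than the budget becomes its own chunk."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```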

Development Approach: 90% Frameworks, 10% Custom

To deliver in one month, I maximized existing ecosystems while focusing custom implementation on medical document-specific challenges.

Framework Utilization (90%)

  • LangChain: Orchestrated text, image, and table parsing pipelines in parallel workflows
  • LlamaIndex: Automated PDF parsing, vector database injection, and metadata linking

Custom Implementation (10%)

1. Multimodal Index Numerical Optimization

  • Experimented with multiple embedding models (OpenAI, Cohere, E5)
  • A/B tested chunk sizes and overlap to preserve medical document context
  • Identified optimal combinations for medical terminology and numerical vector mapping

2. Pipeline Integration Logic

  • Designed prompt chains to integrate BLIP-2 captions, OCR text, and HTML tables without contradiction
  • Implemented intermediate layer for priority ranking and deduplication
  • Custom logic for information fusion

3. Traceability Mapping

  • Custom schema for maintaining page numbers and bounding box information
  • Mapping logic to identify original PDF locations with pixel-level precision
  • Coordinate information not fully supported by standard LlamaIndex features

Traceability and Lineage

Metadata Injection Design

Before embedding, each data chunk includes:

  • Source ID: Original PDF filename (including hash value)
  • Page Index: Page number
  • Bounding Box: Pixel coordinates (xmin, ymin, xmax, ymax) for images and tables

Citation Mechanism

LLM Integration:

  • Each information piece receives an ID
  • Prompts force LLM to output references like "[1][2]"
  • System dynamically maps [1] to metadata (filename and page)
  • Automatic link generation: "Reference: [Document Name] P.24"
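The dynamic mapping can be sketched as a regex substitution over the LLM output; the shape of the metadata dictionary here is illustrative:

```python
import re

def resolve_citations(answer: str, sources: dict[int, dict]) -> str:
    """Replace the [n] markers the LLM is prompted to emit with
    'Reference: <file> P.<page>' strings from the metadata map.
    Unknown markers are left untouched."""
    def repl(m: re.Match) -> str:
        meta = sources.get(int(m.group(1)))
        if meta is None:
            return m.group(0)
        return f"[Reference: {meta['file']} P.{meta['page']}]"
    return re.sub(r"\[(\d+)\]", repl, answer)
```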

PDF Highlighting (Frontend Integration)

When users click citations:

  • PDF.js library receives page number and coordinates as query parameters
  • PDF viewer auto-scrolls to relevant page
  • Specific tables/images highlighted with bounding boxes

This ensures users can always access primary sources (original PDFs) rather than just converted HTML.

Confidence Scoring System

Design Philosophy

In medical documentation, LLM self-evaluation ("I'm confident") is unreliable. I based confidence on objective metrics from the search process (similarity scores) rather than LLM assertions.

Similarity Score (Cosine Similarity) as Primary Indicator

Vector database retrieval provides cosine similarity between query and chunks. This serves as the primary confidence metric.

Score Mapping and Actions:

  • 0.8 - 1.0 (High): Generate answer and highlight PDF sources; present with confidence.
  • 0.5 - 0.8 (Mid): Generate answer but display a warning label: "Low similarity - please verify with the original document."
  • Below 0.5 (Low): Force-stop answer generation; display "No relevant information found" and prompt human investigation.
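The guardrail reduces to a small dispatch on the top retrieval score; the action names below are illustrative labels for the three behaviors:

```python
def guardrail(similarity: float) -> str:
    """Map the top retrieval similarity to a system action, using the
    thresholds validated against the expert gold-standard set."""
    if similarity >= 0.8:
        return "answer"                # high: answer and highlight PDF sources
    if similarity >= 0.5:
        return "answer_with_warning"   # mid: warn user to verify the original
    return "refuse"                    # low: stop generation, escalate to a human
```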

Correlation with Accuracy

Validation Process:

  • Domain experts created "gold standard" dataset (100+ question-answer pairs)
  • Plotted similarity scores against human evaluation (correct/incorrect)
  • Confirmed positive correlation: higher similarity = more accurate extracted values/requirements
  • Determined threshold values suitable for production use

Evaluation Framework

Two-Stage Validation

1. Expert Ground Truth Creation

  • Medical device manufacturer domain experts created 100+ "question, reference PDF location, expected answer" sets
  • Strict criteria: fluent LLM responses receive zero points if factual relationships (numbers, regulatory requirements) are incorrect

2. Standard Evaluation Framework (RAGAS)

To scale manual evaluation, I adopted RAGAS (RAG Assessment), focusing on three key metrics:

  • Faithfulness: Does generated answer rely solely on retrieved context? (Hallucination detection)
  • Answer Relevance: Does answer directly address user's question?
  • Context Precision: Do top retrieved chunks contain truly necessary information?

3. Correlation Analysis

  • Plotted RAGAS Faithfulness scores against cosine similarity scores
  • Confirmed strong positive correlation
  • Based on this data, confidently set "0.8+ = High (safe)" threshold as guardrail

Design Philosophy: "I managed RAG accuracy improvement not by 'intuition' but through two pillars: manual evaluation ground truth and quantitative RAGAS metrics. Especially in medical fields, verifying correlation between similarity scores (mathematical distance) and actual accuracy beforehand is the minimum governance required to provide systems to users."

Use Cases

Regulatory Compliance Research

Client Workflow:

  1. Client creates multiple question lists
  2. Tool generates: answers, document locations as evidence, answer confidence
  3. Rapid identification of whether requirements are included in provided products
  4. Support for patent application material creation

Clinical Question Answering

  • Quick access to medical literature
  • Evidence-based recommendations
  • Drug interaction checking

Healthcare Analytics

  • Regulatory review automation
  • Compliance verification
  • Document analysis for decision support

Results

  • Time Reduction: Reduced regulatory research time by approximately 80%
  • Accuracy: 95%+ accuracy in information extraction (validated against ground truth)
  • Coverage: Handles all three content modalities (text, tables, images)
  • Traceability: 100% source attribution for all extracted information
  • Scalability: Processes thousands of documents daily on GCP infrastructure
  • Compliance: HIPAA-aligned infrastructure and workflows

Challenges and Solutions

Challenge 1: Numerical Meaning Loss in Vector Space

Problem: Simple vector search causes "numerical meaning loss" and "information noise" when handling multimodal data.

Solution:

  • Semantic enrichment: Convert tables to HTML with contextual captions
  • Hybrid architecture: Separate search paths for text vs. tables/images
  • Late fusion: Integrate after retrieval to maintain precision

Challenge 2: Multimodal Search Noise

Problem: Mixing all data types causes images/table metadata to degrade text search accuracy.

Solution:

  • Specialized indexes for each modality
  • Prevent numerical data from being dragged down by irrelevant text queries
  • Control search precision per data type

Challenge 3: Traceability Requirements

Problem: Medical documents require strict source attribution - users must verify primary sources.

Solution:

  • Metadata injection at chunk creation stage
  • Citation mechanism forcing LLM to reference sources
  • PDF highlighting system linking back to original documents
  • Coordinate-based precision mapping

Challenge 4: Hallucination in Mission-Critical Domain

Problem: AI confidently generating incorrect information is unacceptable in medical contexts.

Solution:

  • Confidence scoring based on objective similarity metrics (not LLM self-assessment)
  • Three-tier guardrail system preventing low-confidence answers
  • Automated evaluation framework (RAGAS) for continuous monitoring
  • Ground truth validation with domain experts

Challenge 5: Development Speed (1 Month Timeline)

Problem: Building multimodal parsing pipeline from scratch in one month is extremely challenging.

Solution:

  • 90% framework utilization (LangChain, LlamaIndex)
  • 10% custom implementation focused on medical-specific challenges
  • Experimental approach: A/B testing for chunk sizes, embedding models
  • Clear prioritization: What NOT to build (avoid reinventing wheels)

Key Learnings

  1. Late Fusion Over Early Fusion: In medical documents, numerical precision requires separate processing paths that integrate after retrieval, not before.

  2. Semantic Enrichment is Critical: Converting raw data (images, tables) into semantic information that LLMs understand dramatically improves retrieval accuracy.

  3. Confidence Must Be Objective: LLM self-assessment is unreliable; mathematical similarity scores provide trustworthy confidence metrics.

  4. Evaluation Requires Both Manual and Automated: Ground truth from experts combined with RAGAS metrics provides comprehensive validation.

  5. Traceability is Non-Negotiable: Medical applications require complete source attribution - users must be able to verify every claim.

  6. Framework Leverage Enables Speed: Using existing tools (LangChain, LlamaIndex) for 90% of work allows focus on domain-specific 10% that creates real value.

  7. Context-Based Chunking: Paragraph/section-based chunking outperforms word-count chunking for maintaining document context.

  8. BLIP-2 for Efficiency: Q-Former architecture provides optimal balance of accuracy, throughput, and cost for large-scale image processing.

Future Enhancements

Dynamic Monitoring Agent

Moving from static PDF analysis to dynamic agent that detects regulatory changes in real-time:

Current Bottleneck: The system analyzes each document in isolation ("single-document completion"), making it hard to connect new results to past analyses.

Evolution Direction:

  • Transition from simple PDF index to Knowledge Graph structure managing each regulatory clause with unique IDs
  • Link old and new documents to this graph structure
  • Implement semantic diff extraction: Compare requirements from previous RAG extraction with new document requirements
  • Use GraphRAG to map how legal amendments affect which medical devices and functions

Monitoring Layer:

  • Leverage n8n and MCP (Brave Search) architecture
  • Periodic monitoring of regulatory authority websites
  • End-to-end automation from detection to analysis

Advanced Traceability

  • Move from page-level links to text-level coordinate information from OCR
  • Enable sentence-level highlighting (like Adobe Acrobat's line-by-line highlighting)
  • Enhanced granularity for source verification

Multi-Language Support

  • Extend to regulatory documents in multiple languages
  • Cross-lingual semantic search capabilities

Integration with EHR Systems

  • Direct integration with Electronic Health Records
  • Real-time clinical decision support

Conclusion

The MedRAG system demonstrates how careful architectural decisions can enable accurate multimodal RAG in mission-critical domains. By adopting Late Fusion, implementing semantic enrichment, and building comprehensive traceability, we created a system that healthcare professionals can trust.

The project's success came from:

  • Pragmatic framework utilization (90% existing tools)
  • Focused custom implementation (10% medical-specific logic)
  • Rigorous evaluation (ground truth + RAGAS metrics)
  • Objective confidence scoring (similarity-based, not LLM-based)
  • Complete traceability (every answer links to source)

This system demonstrates core agentic qualities: autonomous information gathering, tool-based reasoning, and context-aware planning. It provided direct experience in orchestrating LLMs to tackle real-world, high-complexity tasks where accuracy and trustworthiness are paramount.

The architectural patterns developed here—semantic enrichment, hybrid search, and traceability—are directly applicable to other domains requiring high accuracy and source verification, such as legal document analysis, regulatory compliance, and logistics operations.