
Multimodal Medical Document Retrieval (MedRAG)

15 min read
RAG · Medical AI · OCR · LangChain · LlamaIndex · BLIP-2 · GCP · HIPAA · Multimodal AI


I developed a retrieval-augmented generation (RAG) system to extract reliable answers from complex medical documents containing text, tables, and diagnostic images. The system integrated OCR, LangChain, LlamaIndex, and GPT-4 to enable multimodal understanding and was deployed on GCP (Vertex AI, BigQuery) for scalability and HIPAA compliance.

Project Context

Duration: 1 month
Team: 3 members
Client: Medical device manufacturer

The Problem

Medical device manufacturers face a critical challenge: regulatory requirements vary significantly across countries, and researching these requirements consumes approximately 80% of total project time. Traditional PDF analysis focused primarily on text, making it easy to miss critical information in images and tables. In the medical field, where accuracy is paramount, this created a significant research burden and a real risk of oversight.

The core issues were:

  • Information Loss: Images and tables containing crucial data were often overlooked
  • Traceability Gaps: Difficulty tracking the source and accuracy of extracted information
  • Time Consumption: Manual research processes were extremely time-intensive
  • Regulatory Risk: Missing critical requirements could have serious compliance implications

Project Goals

The system aimed to:

  1. Integrate diverse information formats (text, images, tables) from medical documents
  2. Enable accurate and reliable information extraction
  3. Support decision-making in medical practice and intellectual property operations
  4. Provide traceable sources for all extracted information
  5. Reduce regulatory research time significantly

System Architecture

Design Philosophy: Late Fusion Over Early Fusion

Medical documents require that numerical values be treated not as mere symbols, but as critical regulatory thresholds. Early Fusion (mixing all data types from the start) risks losing numerical precision in vector space. Therefore, I adopted a Late Fusion approach, creating optimal indexes and captions for each data format to suppress noise while improving recall.

Three-Layer Architecture

1. Semantic Enrichment (Preprocessing Layer)

Each data format was converted into "semantic information" that LLMs can understand, improving hit rates in vector space.

Image Processing:

  • OCR extraction of text from images
  • Caption generation describing what the image shows
  • BLIP-2 for context-aware visual understanding

Table Processing:

  • HTML conversion to preserve structure
  • Caption generation providing context (e.g., "This table shows trends in...")
  • Numerical data enriched with semantic meaning

Text Processing:

  • Context-aware chunking (section/paragraph-based rather than word-count-based)
  • Medical terminology normalization
  • Structure preservation (headings, sections)
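As an illustration, the enrichment step reduces to string composition: each element is prefixed with a modality tag and a generated caption before embedding. The helper names and tag format below are illustrative, not the production schema:

```python
def enrich_table(table_html: str, caption: str) -> str:
    """Prefix an HTML table with a generated caption so the embedding
    captures both its structure and its meaning."""
    return f"[TABLE] {caption}\n{table_html}"

def enrich_image(caption: str, ocr_text: str) -> str:
    """Fuse a BLIP-2 style caption (context) with OCR text (exact values)."""
    return f"[IMAGE] {caption}\nExtracted text: {ocr_text}"

table_chunk = enrich_table(
    "<table><tr><th>Daily limit</th><td>50 µg</td></tr></table>",
    "This table shows the permissible daily exposure limit.",
)
```

The key property is that the caption text sits in the same chunk as the raw structure, so a semantic query ("exposure limits") can still land on a purely numerical table.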

2. Hybrid Architecture (Separated Search Paths)

Mixing all data into a single index causes image and table metadata to become "noise" for text searches. I separated the search pipeline:

  • Text Search: Context and concept-based semantic search
  • Table/Image Search: Specialized indexes to prevent numerical data and specialized diagrams from being dragged down by irrelevant text queries

This separation ensures that:

  • Numerical data maintains its precision
  • Specialized diagrams aren't overwhelmed by general text queries
  • Each modality is searched using its optimal strategy
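A minimal sketch of the late-fusion merge, assuming each pipeline returns scored hits; the `Hit` container here is a stand-in for a real vector-store result object:

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    score: float
    modality: str   # "text", "table", or "image"

def late_fusion(text_hits: list[Hit], table_image_hits: list[Hit],
                k: int = 3) -> list[Hit]:
    """Merge results from the two search paths after retrieval,
    keeping modality labels for explicit source attribution."""
    merged = sorted(text_hits + table_image_hits,
                    key=lambda h: h.score, reverse=True)
    return merged[:k]
```

Because the merge happens after each modality is searched with its own strategy, a strong table hit is never crowded out by weakly relevant text chunks.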

3. Context Integration and Reconstruction

Separately retrieved sources are integrated with explicit source attribution, ensuring the "accuracy of evidence" essential for medical documents.

Technical Implementation

BLIP-2: Multimodal Analysis Engine

I selected BLIP-2 as the image captioning engine for several strategic reasons:

Why BLIP-2 Over Alternatives

Cost Efficiency and Throughput: While LLaVA offers high inference capabilities, its VRAM consumption is prohibitive for processing thousands of medical documents in parallel. BLIP-2's Q-Former (Querying Transformer) acts as a lightweight bridge, extracting only relevant visual features for textification while conserving computational resources.

Q-Former Architecture:

  • Acts as a bridge between image encoder and LLM
  • Extracts only question-relevant visual features
  • Eliminates noise while focusing on necessary numerical information

Local Deployment Benefits:

  • High throughput in secure local environment
  • No external API calls for confidential documents
  • Frozen large-scale LLM components save VRAM
  • Optimized inference speed per image

OCR Hybrid Configuration

Understanding BLIP-2's limitations in fine character recognition, I implemented a practical complement:

  • Tesseract/PaddleOCR: Extract raw text data
  • BLIP-2: Generate semantic captions
  • Combined Approach: "Image understanding (BLIP-2)" + "Character accuracy (OCR)" = high-quality captions with minimal numerical errors
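The fusion itself is plain string composition once the two engines have run. In this sketch, `caption` and `ocr_text` are assumed to come from BLIP-2 and Tesseract/PaddleOCR respectively; the output format is illustrative:

```python
def build_image_chunk(caption: str, ocr_text: str) -> str:
    """Combine image understanding (BLIP-2) with character accuracy (OCR).
    OCR text is kept verbatim so regulatory numbers survive exactly."""
    lines = [f"Figure description: {caption}"]
    if ocr_text.strip():
        lines.append(f"Text in figure: {ocr_text.strip()}")
    return "\n".join(lines)
```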

Chunking Optimization

Rather than simple word-count chunking, I implemented context-based chunking:

  • Problem: Word-count chunking fragmented related content across too many chunks, reducing retrieval accuracy
  • Solution: Split data by paragraphs or sections, clustering related data within single chunks
  • Result: Improved retrieval precision and reduced noise
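A minimal sketch of paragraph-based packing, treating blank lines as boundaries; the 1,500-character budget is an illustrative value, not the production setting:

```python
import re

def chunk_by_section(text: str, max_chars: int = 1500) -> list[str]:
    """Split on blank lines (paragraph boundaries) and pack consecutive
    paragraphs into chunks, never cutting mid-paragraph. A single
    paragraph longer than the budget becomes its own chunk."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```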

Development Approach: 90% Frameworks, 10% Custom

To deliver in one month, I maximized existing ecosystems while focusing custom implementation on medical document-specific challenges.

Framework Utilization (90%)

  • LangChain: Orchestrated text, image, and table parsing pipelines in parallel workflows
  • LlamaIndex: Automated PDF parsing, vector database injection, and metadata linking

Custom Implementation (10%)

1. Multimodal Index Numerical Optimization

  • Experimented with multiple embedding models (OpenAI, Cohere, E5)
  • A/B tested chunk sizes and overlap to preserve medical document context
  • Identified optimal combinations for medical terminology and numerical vector mapping

2. Pipeline Integration Logic

  • Designed prompt chains to integrate BLIP-2 captions, OCR text, and HTML tables without contradiction
  • Implemented intermediate layer for priority ranking and deduplication
  • Custom logic for information fusion

3. Traceability Mapping

  • Custom schema for maintaining page numbers and bounding box information
  • Mapping logic to identify original PDF locations with pixel-level precision
  • Coordinate information not fully supported by standard LlamaIndex features

Traceability and Lineage

Metadata Injection Design

Before embedding, each data chunk includes:

  • Source ID: Original PDF filename (including hash value)
  • Page Index: Page number
  • Bounding Box: Pixel coordinates (xmin, ymin, xmax, ymax) for images and tables

Citation Mechanism

LLM Integration:

  • Each information piece receives an ID
  • Prompts force LLM to output references like "[1][2]"
  • System dynamically maps [1] to metadata (filename and page)
  • Automatic link generation: "Reference: [Document Name] P.24"
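The dynamic mapping can be sketched as a regex substitution over the LLM output; the shape of the metadata dictionary here is illustrative:

```python
import re

def resolve_citations(answer: str, sources: dict[int, dict]) -> str:
    """Replace the [n] markers the LLM is prompted to emit with
    'Reference: <file> P.<page>' strings from the metadata map.
    Unknown markers are left untouched."""
    def repl(m: re.Match) -> str:
        meta = sources.get(int(m.group(1)))
        if meta is None:
            return m.group(0)
        return f"[Reference: {meta['file']} P.{meta['page']}]"
    return re.sub(r"\[(\d+)\]", repl, answer)
```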

PDF Highlighting (Frontend Integration)

When users click citations:

  • PDF.js library receives page number and coordinates as query parameters
  • PDF viewer auto-scrolls to relevant page
  • Specific tables/images highlighted with bounding boxes

This ensures users can always access primary sources (original PDFs) rather than just converted HTML.

Confidence Scoring System

Design Philosophy

In medical documentation, LLM self-evaluation ("I'm confident") is unreliable. I based confidence on objective metrics from the search process (similarity scores) rather than LLM assertions.

Similarity Score (Cosine Similarity) as Primary Indicator

Vector database retrieval provides cosine similarity between query and chunks. This serves as the primary confidence metric.

Score Mapping and Actions:

  • 0.8 - 1.0 (High): Generate answer and highlight PDF sources; present with confidence.
  • 0.5 - 0.8 (Mid): Generate answer but display a warning label: "Low similarity - please verify with the original document."
  • Below 0.5 (Low): Force-stop answer generation; display "No relevant information found" and prompt human investigation.
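The guardrail reduces to a small dispatch on the top retrieval score; the action names below are illustrative labels for the three behaviors:

```python
def guardrail(similarity: float) -> str:
    """Map the top retrieval similarity to a system action, using the
    thresholds validated against the expert gold-standard set."""
    if similarity >= 0.8:
        return "answer"                # high: answer and highlight PDF sources
    if similarity >= 0.5:
        return "answer_with_warning"   # mid: warn user to verify the original
    return "refuse"                    # low: stop generation, escalate to a human
```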

Correlation with Accuracy

Validation Process:

  • Domain experts created "gold standard" dataset (100+ question-answer pairs)
  • Plotted similarity scores against human evaluation (correct/incorrect)
  • Confirmed positive correlation: higher similarity = more accurate extracted values/requirements
  • Determined threshold values suitable for production use

Evaluation Framework

Two-Stage Validation

1. Expert Ground Truth Creation

  • Medical device manufacturer domain experts created 100+ "question, reference PDF location, expected answer" sets
  • Strict criteria: fluent LLM responses receive zero points if factual relationships (numbers, regulatory requirements) are incorrect

2. Standard Evaluation Framework (RAGAS)

To scale manual evaluation, I adopted RAGAS (RAG Assessment), focusing on three key metrics:

  • Faithfulness: Does generated answer rely solely on retrieved context? (Hallucination detection)
  • Answer Relevance: Does answer directly address user's question?
  • Context Precision: Do top retrieved chunks contain truly necessary information?

3. Correlation Analysis

  • Plotted RAGAS Faithfulness scores against cosine similarity scores
  • Confirmed strong positive correlation
  • Based on this data, confidently set "0.8+ = High (safe)" threshold as guardrail

Design Philosophy: "I managed RAG accuracy improvement not by 'intuition' but through two pillars: manual evaluation ground truth and quantitative RAGAS metrics. Especially in medical fields, verifying correlation between similarity scores (mathematical distance) and actual accuracy beforehand is the minimum governance required to provide systems to users."

Use Cases

Regulatory Compliance Research

Client Workflow:

  1. Client creates multiple question lists
  2. Tool generates: answers, document locations as evidence, answer confidence
  3. Rapid identification of whether requirements are included in provided products
  4. Support for patent application material creation

Clinical Question Answering

  • Quick access to medical literature
  • Evidence-based recommendations
  • Drug interaction checking

Healthcare Analytics

  • Regulatory review automation
  • Compliance verification
  • Document analysis for decision support

Results

  • Time Reduction: Reduced regulatory research time by approximately 80%
  • Accuracy: 95%+ accuracy in information extraction (validated against ground truth)
  • Coverage: Handles all three content modalities (text, tables, images)
  • Traceability: 100% source attribution for all extracted information
  • Scalability: Processes thousands of documents daily on GCP infrastructure
  • Compliance: HIPAA-aligned infrastructure and workflows

Challenges and Solutions

Challenge 1: Numerical Meaning Loss in Vector Space

Problem: Simple vector search causes "numerical meaning loss" and "information noise" when handling multimodal data.

Solution:

  • Semantic enrichment: Convert tables to HTML with contextual captions
  • Hybrid architecture: Separate search paths for text vs. tables/images
  • Late fusion: Integrate after retrieval to maintain precision

Challenge 2: Multimodal Search Noise

Problem: Mixing all data types causes images/table metadata to degrade text search accuracy.

Solution:

  • Specialized indexes for each modality
  • Prevent numerical data from being dragged down by irrelevant text queries
  • Control search precision per data type

Challenge 3: Traceability Requirements

Problem: Medical documents require strict source attribution - users must verify primary sources.

Solution:

  • Metadata injection at chunk creation stage
  • Citation mechanism forcing LLM to reference sources
  • PDF highlighting system linking back to original documents
  • Coordinate-based precision mapping

Challenge 4: Hallucination in Mission-Critical Domain

Problem: AI confidently generating incorrect information is unacceptable in medical contexts.

Solution:

  • Confidence scoring based on objective similarity metrics (not LLM self-assessment)
  • Three-tier guardrail system preventing low-confidence answers
  • Automated evaluation framework (RAGAS) for continuous monitoring
  • Ground truth validation with domain experts

Challenge 5: Development Speed (1 Month Timeline)

Problem: Building multimodal parsing pipeline from scratch in one month is extremely challenging.

Solution:

  • 90% framework utilization (LangChain, LlamaIndex)
  • 10% custom implementation focused on medical-specific challenges
  • Experimental approach: A/B testing for chunk sizes, embedding models
  • Clear prioritization: What NOT to build (avoid reinventing wheels)

Key Learnings

  1. Late Fusion Over Early Fusion: In medical documents, numerical precision requires separate processing paths that integrate after retrieval, not before.

  2. Semantic Enrichment is Critical: Converting raw data (images, tables) into semantic information that LLMs understand dramatically improves retrieval accuracy.

  3. Confidence Must Be Objective: LLM self-assessment is unreliable; mathematical similarity scores provide trustworthy confidence metrics.

  4. Evaluation Requires Both Manual and Automated: Ground truth from experts combined with RAGAS metrics provides comprehensive validation.

  5. Traceability is Non-Negotiable: Medical applications require complete source attribution - users must be able to verify every claim.

  6. Framework Leverage Enables Speed: Using existing tools (LangChain, LlamaIndex) for 90% of work allows focus on domain-specific 10% that creates real value.

  7. Context-Based Chunking: Paragraph/section-based chunking outperforms word-count chunking for maintaining document context.

  8. BLIP-2 for Efficiency: Q-Former architecture provides optimal balance of accuracy, throughput, and cost for large-scale image processing.

Future Enhancements

Dynamic Monitoring Agent

Moving from static PDF analysis to dynamic agent that detects regulatory changes in real-time:

Current Bottleneck: The system analyzes each document in isolation ("single-document completion"), making it hard to connect new results to past analyses.

Evolution Direction:

  • Transition from simple PDF index to Knowledge Graph structure managing each regulatory clause with unique IDs
  • Link old and new documents to this graph structure
  • Implement semantic diff extraction: Compare requirements from previous RAG extraction with new document requirements
  • Use GraphRAG to map how legal amendments affect which medical devices and functions

Monitoring Layer:

  • Leverage n8n and MCP (Brave Search) architecture
  • Periodic monitoring of regulatory authority websites
  • End-to-end automation from detection to analysis

Advanced Traceability

  • Move from page-level links to text-level coordinate information from OCR
  • Enable sentence-level highlighting (like Adobe Acrobat's line-by-line highlighting)
  • Enhanced granularity for source verification

Multi-Language Support

  • Extend to regulatory documents in multiple languages
  • Cross-lingual semantic search capabilities

Integration with EHR Systems

  • Direct integration with Electronic Health Records
  • Real-time clinical decision support

Conclusion

The MedRAG system demonstrates how careful architectural decisions can enable accurate multimodal RAG in mission-critical domains. By adopting Late Fusion, implementing semantic enrichment, and building comprehensive traceability, we created a system that healthcare professionals can trust.

The project's success came from:

  • Pragmatic framework utilization (90% existing tools)
  • Focused custom implementation (10% medical-specific logic)
  • Rigorous evaluation (ground truth + RAGAS metrics)
  • Objective confidence scoring (similarity-based, not LLM-based)
  • Complete traceability (every answer links to source)

This system demonstrates core agentic qualities: autonomous information gathering, tool-based reasoning, and context-aware planning. It provided direct experience in orchestrating LLMs to tackle real-world, high-complexity tasks where accuracy and trustworthiness are paramount.

The architectural patterns developed here—semantic enrichment, hybrid search, and traceability—are directly applicable to other domains requiring high accuracy and source verification, such as legal document analysis, regulatory compliance, and logistics operations.