Private RAG Architecture Technical Whitepaper
Executive Summary
This whitepaper provides a comprehensive technical and compliance analysis of Private Retrieval-Augmented Generation (Private RAG) architecture for organizations operating in regulated industries. As artificial intelligence adoption accelerates across healthcare, financial services, legal, and government contracting sectors, organizations face a fundamental architectural decision: how to harness AI capabilities while maintaining the data sovereignty, privacy protections, and regulatory compliance that their operating environments require.
Private RAG systems offer a proven solution to this challenge. By keeping organizational knowledge within controlled infrastructure while leveraging the language understanding capabilities of large language models (LLMs), Private RAG enables AI-powered knowledge access without the data sovereignty compromises inherent in shared commercial AI platforms.
This whitepaper covers: the technical architecture of Private RAG systems; security and compliance design patterns for HIPAA, SOC 2, FedRAMP, and financial services regulatory frameworks; implementation considerations including infrastructure selection, embedding model deployment, vector database architecture, and access control design; and operational considerations including monitoring, quality assurance, and system maintenance.
Organizations using this whitepaper will be able to make informed architectural decisions about Private RAG deployment and understand the investment required to implement Private RAG correctly in a regulated industry context.
Section 1: The Case for Private RAG
1.1 The Limits of Commercial AI for Regulated Data
Commercial AI platforms — including OpenAI's GPT-4 API, Anthropic's Claude API, Google's Gemini API, and the consumer interfaces built on these — have fundamentally transformed what AI can do. The language understanding, reasoning, and generation capabilities of modern LLMs are genuinely extraordinary.
But commercial AI platforms were not designed for the data handling requirements of regulated industries. Their fundamental model — send data to the vendor's infrastructure, receive AI-generated responses — creates several compliance challenges:
Training data risk: Many commercial AI platforms' terms of service have historically permitted use of user inputs for model improvement. While major vendors have improved their enterprise terms, the risk that PHI, privileged communications, or CUI could be incorporated into training data remains a concern that regulated organizations must address contractually and technically.
Data residency: Commercial AI APIs process data in the vendor's infrastructure. For organizations with data residency requirements — certain government contractors, European organizations under GDPR, healthcare organizations with specific contractual commitments — this may be unacceptable.
Audit trail limitations: Commercial AI APIs typically do not provide the granular audit logs that regulated organizations require — who queried what, what documents were accessed, what outputs were generated — all linked to specific user identities and maintained for the required retention period.
Multi-tenancy isolation: In a shared commercial AI platform, there is at least a theoretical concern about whether one customer's data is fully isolated from another's. For highly sensitive data, this concern is meaningful even if the practical risk is low.
Knowledge currency: Commercial LLMs are trained on data with a cutoff date and do not have access to your organization's current policies, procedures, or proprietary knowledge. This limits their utility for organization-specific questions.
Private RAG addresses each of these limitations by bringing AI capability inside the organization's controlled environment.
1.2 What Private RAG Provides
A Private RAG system provides:
Organizational knowledge integration: Your organization's policies, procedures, research, client files, and institutional knowledge — searchable by AI — without any of that knowledge leaving your controlled environment.
Data sovereignty: Your document repository, embeddings, and index remain within infrastructure you control. No AI vendor receives the corpus itself. If inference runs on an external LLM endpoint, that endpoint receives only the context window for each query, which may contain excerpts from retrieved documents.
Granular audit trails: Complete, user-linked audit trails of every query, every retrieved document, and every AI response — maintained in your systems, available for compliance audits and incident investigation.
Access control enforcement: Role-based and attribute-based access controls that ensure users can only query documents they are authorized to access — implemented at the retrieval layer, not just at the interface layer.
Knowledge updatability: New documents, updated policies, and revised procedures are available to the AI system immediately after ingestion — unlike fine-tuned models that require costly retraining.
Section 2: Private RAG Architecture
2.1 Core Architecture Components
A complete Private RAG system consists of five core components:
2.1.1 Document Ingestion Pipeline
The ingestion pipeline is responsible for processing raw documents into the vector representations required for retrieval. It consists of:
Document collectors: Connectors to existing document repositories — SharePoint, document management systems (NetDocuments, iManage, OpenText), network file shares, databases, and web sources. Collectors must handle diverse document formats (PDF, Word, Excel, PowerPoint, HTML) and maintain document metadata (creation date, author, document type, access classification) that will be used for filtering and access control.
Document processing: Extraction of text from native document formats, cleaning (removing headers/footers, boilerplate, non-substantive content), and chunking into segments appropriate for retrieval. Chunking strategy significantly impacts retrieval quality and requires careful tuning. Common approaches include fixed-size chunks with overlap, sentence-boundary chunking, semantic chunking based on topic boundaries, and hierarchical chunking that maintains document structure (see Section 4.3).
Embedding generation: Conversion of each document chunk into a dense vector representation using an embedding model. For private deployment, the embedding model should run on organization-controlled infrastructure. Options include:
- Azure OpenAI Embeddings in a private Azure tenant (recommended for organizations already in Azure)
- Sentence Transformers models (e.g., all-MiniLM-L6-v2, E5-large) running on organization infrastructure
- Cohere Embed API in a private deployment
Embedding model selection involves trade-offs between embedding quality (which affects retrieval accuracy), computational cost, and infrastructure complexity.
Vector storage: Persistence of document chunks, their embeddings, and associated metadata in a vector database. See Section 2.1.2 for vector database selection guidance.
2.1.2 Vector Database
The vector database stores document embeddings and supports efficient approximate nearest-neighbor search — finding the documents whose embeddings are most similar to a query embedding. Key selection criteria for regulated deployments:
Self-hosted options (maximum data sovereignty):
- pgvector (PostgreSQL extension): Best choice for organizations already running PostgreSQL infrastructure. Provides vector search within a familiar, well-governed database system with strong access control and audit capabilities.
- Weaviate: Open-source, can be self-hosted on Kubernetes, supports hybrid search (vector + keyword), and has strong access control features.
- Qdrant: Open-source, high-performance, supports payload filtering useful for access control implementation.
- Chroma: Open-source, simpler deployment, appropriate for smaller-scale deployments.
Managed options with strong compliance postures:
- Azure AI Search: Managed service within your Azure tenant, with strong access control and audit capabilities, appropriate for FedRAMP and HIPAA-compliant deployments in Azure Government.
- Amazon OpenSearch Service with vector support: Managed in your AWS account, FedRAMP authorized in GovCloud.
2.1.3 Query Processing Pipeline
The query processing pipeline handles real-time user queries:
Authentication and authorization: User identity verification and determination of document access entitlements. This feeds into the retrieval filter that limits results to authorized documents.
Query pre-processing: Cleaning and normalizing the user's query, optionally expanding it with synonyms or related terms to improve retrieval.
Query embedding: Converting the query to an embedding using the same model used during ingestion (this is critical — mismatched embedding models produce poor retrieval results).
Retrieval with access control filtering: Vector similarity search filtered to documents the querying user is authorized to access. The access control filter must be applied at the database query level, not as a post-retrieval filter, to ensure security.
Reranking (optional): A second-stage reranking step using a cross-encoder model can significantly improve result relevance at the cost of additional latency.
Context assembly: Construction of the prompt that will be sent to the LLM, including the retrieved document excerpts and the user's query.
LLM inference: Sending the assembled prompt to the LLM (private deployment) and receiving the generated response.
Response post-processing: Citation generation, safety filtering, and logging.
Audit logging: Recording the complete interaction — query, retrieved documents, LLM response — linked to user identity, timestamp, and contextual metadata (matter, department, etc.).
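The context assembly step above can be sketched as follows. This is a minimal illustration, not a prescribed prompt format: the `RetrievedChunk` type, the `assemble_prompt` function, the instruction wording, and the character budget are all assumptions introduced here. It numbers each excerpt so the response can cite its sources, and stops adding excerpts before the budget is exceeded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrievedChunk:
    """Hypothetical shape of one retrieval result."""
    document_id: str  # identifier of the source document
    title: str        # human-readable title used in citations
    text: str         # excerpt returned by retrieval


def assemble_prompt(query: str, chunks: List[RetrievedChunk], max_chars: int = 6000) -> str:
    """Build a prompt with numbered excerpts so the answer can cite them."""
    parts = []
    used = 0
    for number, chunk in enumerate(chunks, start=1):
        block = f"[{number}] {chunk.title} (id: {chunk.document_id})\n{chunk.text}"
        if used + len(block) > max_chars:
            break  # stop before exceeding the (assumed) context budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the numbered excerpts below. "
        "Cite excerpts by number, and say so if they do not contain the answer.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Question: {query}"
    )
```

Keeping the document identifier next to each excerpt also gives the audit log and the citation step a shared key to record which documents informed a response.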
2.1.4 LLM Inference
The LLM that generates responses from retrieved context is the component with the most significant data sovereignty implications. Options for private deployment:
Azure OpenAI Service in private Azure tenant: Microsoft provides contractual commitments that customer data sent to Azure OpenAI Service is not used for model training. Appropriate for most regulated industry use cases. Azure Government configuration provides FedRAMP High coverage.
AWS Bedrock in private AWS account: Amazon provides similar commitments for Bedrock. Available in GovCloud for government/defense use cases.
Self-hosted open-source LLMs: Maximum data sovereignty, with no third-party involvement. Models like Llama 3.1 (70B or 405B), Mistral, and Mixtral provide strong capability. They require GPU infrastructure (typically A100 or H100 GPUs) and ML engineering expertise to deploy and maintain. Appropriate for organizations with very high sensitivity data or an inability to use cloud AI services.
On-premises LLM deployment: For air-gapped environments (classified networks, highly regulated on-premises deployments), LLMs can be deployed on on-premises GPU servers. This provides maximum isolation but requires significant hardware investment.
2.1.5 Application Layer and User Interface
The application layer provides the user-facing interface and orchestrates the query processing pipeline. Key considerations:
Authentication integration: The application must integrate with the organization's identity provider (Active Directory, Okta, etc.) to authenticate users and retrieve their access entitlements.
Session management: User sessions must be managed securely, with appropriate timeouts and session token handling.
Interface design: The user interface should make it easy for users to understand the source of AI responses (which documents were retrieved), reducing inappropriate reliance on AI outputs and supporting the human verification that regulated use cases require.
Feedback mechanisms: Users should be able to flag incorrect or problematic AI responses, feeding quality assurance processes.
2.2 Access Control Architecture
Access control is the most compliance-critical component of a Private RAG system for multi-user, multi-client, or multi-sensitivity deployments.
Document-Level Access Control
Each document in the vector database must have associated access control metadata specifying which users, roles, or groups can retrieve it. This metadata is used to filter retrieval results.
Access control models to consider:
Role-Based Access Control (RBAC): Documents are tagged with roles (e.g., "Clinical Staff," "Billing," "Executive"). Users with a matching role can retrieve the document. Simple to implement and administer, appropriate for most deployments.
Attribute-Based Access Control (ABAC): Documents and users both have attributes, and access is determined by policy evaluated against those attributes. More flexible than RBAC, appropriate for complex access requirements.
Matter/Client-Based Access Control: Used in legal and professional services. Documents are tagged with client and matter identifiers. Users can only retrieve documents for matters they are assigned to. Requires integration with the organization's matter management system.
Data Classification-Based Access Control: Documents are classified by sensitivity (Public, Internal, Confidential, Restricted). Users are granted access to documents at or below their clearance level. Appropriate for organizations with formal data classification programs.
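The models above are often combined. The sketch below shows one way an access decision might layer role matching, classification level, and matter assignment. The `User` and `DocumentAcl` types and the rule that all three checks must pass are assumptions made for illustration; a real policy would come from the organization's own access requirements.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Ordered from least to most sensitive, using the levels named above.
CLASSIFICATION_LEVELS = ["Public", "Internal", "Confidential", "Restricted"]


@dataclass(frozen=True)
class User:
    user_id: str
    roles: FrozenSet[str]
    clearance: str                        # highest classification the user may read
    matters: FrozenSet[str] = frozenset()  # matters the user is assigned to


@dataclass(frozen=True)
class DocumentAcl:
    allowed_roles: FrozenSet[str]
    classification: str
    matter_id: Optional[str] = None       # None means the document is not matter-scoped


def can_retrieve(user: User, acl: DocumentAcl) -> bool:
    """Return True only if role, classification, and matter checks all pass."""
    has_role = bool(user.roles & acl.allowed_roles)
    cleared = (
        CLASSIFICATION_LEVELS.index(user.clearance)
        >= CLASSIFICATION_LEVELS.index(acl.classification)
    )
    on_matter = acl.matter_id is None or acl.matter_id in user.matters
    return has_role and cleared and on_matter
```

For example, a user with the "Clinical Staff" role and "Confidential" clearance could retrieve an "Internal" clinical document, but not a "Restricted" one and not a document scoped to a matter the user is not assigned to.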
Implementing Access Control at the Vector Database Layer
Access control must be enforced at the vector database query layer, not as a post-retrieval filter. Post-retrieval filtering (retrieve all results, then filter out unauthorized ones) pulls unauthorized content into the application layer, where a filtering bug, a log statement, or a change in result counts can expose it, and it violates the principle of least privilege.
For pgvector, PostgreSQL's row-level security (RLS) can enforce access control directly in the database:
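```sql
-- Enable RLS on the chunks table
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;

-- Create policy allowing users to access chunks of documents they're authorized for.
-- current_user_id() is not a PostgreSQL built-in; it is assumed here to be an
-- application-defined function that returns the authenticated user's ID.
-- Table owners bypass RLS unless FORCE ROW LEVEL SECURITY is also set, and
-- superusers always bypass it, so the application should connect as an unprivileged role.
CREATE POLICY user_document_access ON document_chunks
    USING (
        document_id IN (
            SELECT document_id
            FROM user_document_access
            WHERE user_id = current_user_id()
        )
    );
```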
For other vector databases, equivalent access control mechanisms must be implemented at the query filter level.
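To make the filter-before-ranking idea concrete, here is a small in-memory sketch. The list of dictionaries stands in for a vector database, and the field names (`id`, `embedding`, `allowed_roles`) are hypothetical. In a real deployment the same condition would be expressed in the database's own filter syntax so that unauthorized chunks never leave the database.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, chunks, user_roles, k=3):
    """Rank only the chunks the user may see, then return the top-k chunk IDs.

    chunks: list of dicts with 'id', 'embedding', and 'allowed_roles' keys.
    user_roles: set of role names held by the querying user.
    """
    # The access filter runs first, so unauthorized chunks are never scored or returned.
    authorized = [c for c in chunks if user_roles & set(c["allowed_roles"])]
    ranked = sorted(
        authorized,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )
    return [c["id"] for c in ranked[:k]]
```

Because filtering happens before ranking, the user still receives up to k authorized results even when the closest matches overall belong to documents they cannot access.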
Section 3: Compliance Design Patterns
3.1 HIPAA-Compliant Private RAG
For healthcare organizations, Private RAG must address HIPAA's technical safeguard requirements:
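Encryption: All document chunks, embeddings, and audit logs should be encrypted at rest, with AES-256 as the common choice. HIPAA treats encryption as an addressable specification, so a decision not to encrypt must be documented and justified. All communications between RAG system components must use TLS 1.2 or higher.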
Access controls: Implement minimum-necessary access controls as described above. Users should only be able to retrieve PHI-containing documents for patients under their care or within their authorized scope.
Audit controls: Every query, every retrieved document, and every LLM response must be logged with user identity, timestamp, patient/encounter context (if applicable), and document identifiers. Logs must be retained for the period required by HIPAA (minimum six years from the date of creation or the date last in effect, whichever is later) and protected from modification.
Integrity controls: Document versions in the knowledge base must be controlled. Unauthorized modification of knowledge base documents must be prevented, and any attempted modification must be detected.
Business Associate Agreements: If using a cloud-hosted LLM (Azure OpenAI, AWS Bedrock) in conjunction with queries that may contain PHI (as excerpts from retrieved documents), the cloud provider must be covered by a BAA.
De-identification for training data: If the Private RAG system includes any fine-tuning on PHI-containing documents, de-identification under HIPAA Safe Harbor or Expert Determination method is required.
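One way to support the "protected from modification" requirement for audit logs is to make tampering detectable by chaining entries with hashes. The sketch below is an assumption-laden illustration, not a HIPAA-mandated mechanism: the entry layout and function names are hypothetical, and hash chaining only detects changes, so it would normally be paired with restricted write access or write-once storage.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, record):
    """Hash the previous entry's hash together with a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, record):
    """Append a record to the log, linking it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _entry_hash(prev_hash, record),
    }
    log.append(entry)
    return entry


def verify_chain(log):
    """Return True if no entry has been altered, removed, or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Each record would carry the fields named above (user identity, timestamp, query, retrieved document identifiers, response reference). Editing any stored record changes its hash and breaks verification for that entry and every entry after it.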
3.2 FedRAMP-Compliant Private RAG
For government contractors and agencies, Private RAG must operate within the FedRAMP authorization boundary:
Infrastructure authorization: All infrastructure components (vector database, LLM inference, application layer) must be deployed on FedRAMP-authorized infrastructure. Azure Government (FedRAMP High), AWS GovCloud (FedRAMP High), and Google Cloud for Government (FedRAMP Moderate/High) are the primary options.
NIST 800-53 control coverage: The Private RAG system must implement or inherit applicable NIST 800-53 controls. Key controls include AC-3 (Access Enforcement), AU-2 (Event Logging), AU-12 (Audit Record Generation), CM-6 (Configuration Settings), and SC-28 (Protection of Information at Rest).
CUI handling: For systems processing Controlled Unclassified Information, CMMC Level 2 requirements (based on NIST SP 800-171) or Level 3 requirements (which add selected NIST SP 800-172 requirements) apply. The Private RAG system must be within the CMMC assessment boundary.
Supply chain risk: Foundation LLMs used in a FedRAMP-authorized Private RAG system must themselves be assessed for supply chain risk. Preference should be given to models from US-based providers with clear provenance documentation.
3.3 SOC 2 Type II Private RAG
For technology companies and service providers subject to SOC 2:
Common Criteria mapping: Document the Private RAG system's controls against relevant Common Criteria — particularly CC6 (Logical Access), CC7 (System Operations), CC8 (Change Management), and CC9 (Risk Mitigation).
Change management: Model updates (including updates to the underlying foundation model) must go through documented change management with impact assessment. Changes to system prompts and retrieval configuration should be treated as configuration changes.
Vendor risk management: The LLM inference provider (Azure OpenAI, AWS Bedrock) is a subservice organization. Obtain and review their SOC 2 report. Document their controls in your SOC 2 system description.
Availability and performance monitoring: If your SOC 2 scope includes the Availability Trust Services Criteria (TSC), RAG system availability and performance must be monitored against defined SLAs.
Section 4: Implementation Guidance
4.1 Infrastructure Selection
Infrastructure selection for Private RAG depends on the organization's cloud posture, regulatory requirements, and existing investments:
| Scenario | Recommended Infrastructure |
|---|---|
| Healthcare, Azure-first | Azure OpenAI + Azure AI Search + Azure Kubernetes Service |
| Healthcare, AWS-first | AWS Bedrock + Amazon OpenSearch + Amazon EKS |
| Government contractor (CMMC) | Azure Government or AWS GovCloud + self-hosted vector DB |
| Financial services, on-premises preference | Self-hosted LLM on GPU servers + pgvector on existing PostgreSQL |
| Law firm, maximum privacy | Self-hosted LLM + pgvector + on-premises or private cloud |
4.2 Embedding Model Selection
Embedding model selection significantly impacts retrieval quality. Evaluation should include:
- Domain relevance: Models fine-tuned on domain-specific text (medical, legal, financial) outperform general models for domain-specific retrieval
- Context length: Longer context length allows larger document chunks, which can preserve more context
- Inference speed and cost: For high-volume deployments, inference speed and cost matter
- Privacy: For private deployment, models that can run on organization infrastructure are preferred
Recommended evaluation approach: create a benchmark set of 50-100 representative questions with known answers in your document corpus, and evaluate retrieval accuracy for each candidate embedding model.
4.3 Chunking Strategy
Chunking strategy is one of the most important and often underappreciated decisions in RAG system design. Poor chunking is a leading cause of retrieval quality problems.
Fixed-size chunking: Simplest approach. Chunks are created with a fixed token size and a specified overlap. Works reasonably well for homogeneous document types with consistent paragraph structure.
Sentence-boundary chunking: Chunks respect sentence boundaries, avoiding mid-sentence splits that can confuse the LLM. Generally preferred over pure fixed-size chunking.
Semantic chunking: Uses embedding similarity to identify natural topic boundaries and create chunks that represent complete semantic units. More computationally expensive but generally improves retrieval quality.
Hierarchical chunking: Creates both fine-grained chunks (for precise retrieval) and larger parent chunks (for richer context). The fine-grained chunks are retrieved, but the LLM is given the larger parent chunk as context. Effective for longer-form documents.
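As an illustration of sentence-boundary chunking with overlap, the following sketch packs whole sentences into chunks under a word budget and carries the last sentence of each chunk into the next. It makes several simplifying assumptions: the regular-expression sentence splitter is naive (it will mis-split abbreviations such as "Dr."), words stand in for tokens, and the default sizes are placeholders to be tuned against your own corpus.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentences(text, max_words=200, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_words words.

    The last overlap_sentences sentences of each chunk are repeated at the
    start of the next one. A chunk can exceed max_words when a single
    sentence (plus the carried-over overlap) is longer than the budget.
    """
    chunks = []
    current = []
    for sentence in split_sentences(text):
        candidate_len = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and candidate_len > max_words:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap (none if overlap is 0).
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The same packing loop could serve as the fine-grained layer of a hierarchical scheme, with each chunk storing a reference to its parent section.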
4.4 Evaluation Framework
Private RAG systems must be evaluated before production deployment and monitored continuously thereafter. A comprehensive evaluation framework includes:
Retrieval evaluation: For a benchmark set of questions with known relevant documents, measure:
- Recall@k: What fraction of relevant documents are in the top k retrieved results?
- Precision@k: What fraction of the top k retrieved results are relevant?
- Mean Reciprocal Rank (MRR): How highly ranked is the first relevant result?
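These three retrieval metrics are straightforward to compute once a benchmark set exists. The sketch below assumes each benchmark run is a ranked list of retrieved document IDs plus a set of IDs judged relevant; the function names are illustrative. Precision@k here divides by k even when fewer than k results were returned, which is one common convention.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant (denominator is k)."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

For a run that retrieves `["d3", "d1", "d7", "d2"]` when `{"d1", "d2"}` are relevant, Recall@3 is 0.5, Precision@3 is one third, and the reciprocal rank is 0.5.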
Answer quality evaluation: For benchmark questions with known correct answers, measure:
- Answer correctness (does the AI's answer match the known correct answer?)
- Answer groundedness (is the answer supported by the retrieved documents?)
- Hallucination rate (does the answer contain information not in the retrieved documents?)
Access control evaluation: Verify that access controls are correctly enforced by testing with users who should and should not have access to specific documents.
Latency evaluation: Measure end-to-end query latency and identify bottlenecks.
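A simple way to find latency bottlenecks is to time each pipeline stage separately. The helper below is a minimal sketch using only the standard library. The `StageTimer` class and the stage names are assumptions for illustration, and a production system would more likely export these timings to its existing metrics or tracing stack.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised an exception.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        """Sum of all recorded stage durations, in seconds."""
        return sum(self.durations.values())

    def slowest(self):
        """Name of the stage with the largest accumulated time, or None."""
        if not self.durations:
            return None
        return max(self.durations, key=self.durations.get)
```

Usage would wrap each step, for example `with timer.stage("retrieval"): ...` and `with timer.stage("llm_inference"): ...`, and then compare `timer.durations` across a representative sample of queries.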
Section 5: Operational Considerations
5.1 Ongoing Knowledge Base Management
A Private RAG system is only as good as its knowledge base. Ongoing management requires:
Document update workflows: Processes for ingesting new documents and updating existing ones when policies or procedures change. Updates should be reflected in the knowledge base within a defined SLA (typically same-day or next-day for critical policy documents).
Document retirement: Processes for removing outdated documents from the knowledge base to prevent the AI from retrieving obsolete information.
Quality assurance: Regular sampling of AI responses to verify accuracy and groundedness. Track metrics over time and investigate degradation.
Coverage monitoring: Track queries that result in poor retrieval (low similarity scores, no relevant documents found) as indicators of knowledge base gaps.
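Coverage monitoring can start with a simple report over the query log. The sketch below assumes each log entry pairs a normalized query string with the top similarity score from retrieval (or `None` when nothing was returned). The 0.5 threshold is a placeholder, since meaningful score ranges depend on the embedding model and distance metric in use.

```python
from collections import Counter


def find_coverage_gaps(query_log, min_top_score=0.5, min_occurrences=2):
    """Return (query, count) pairs for queries that repeatedly retrieve poorly.

    query_log: iterable of (normalized_query, top_similarity_score_or_None).
    """
    weak = Counter()
    for query, top_score in query_log:
        if top_score is None or top_score < min_top_score:
            weak[query] += 1
    # most_common() orders by count, so the most frequent gaps come first.
    return [(q, n) for q, n in weak.most_common() if n >= min_occurrences]
```

Queries that surface repeatedly in this report are candidates for new or updated knowledge base documents.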
5.2 Security Monitoring
Private RAG systems require ongoing security monitoring:
Access log review: Regular review of audit logs to identify unusual access patterns — users querying documents outside their normal scope, unusually high query volumes, queries that appear to be probing system boundaries.
Adversarial input detection: Monitoring for prompt injection attempts — queries designed to override system instructions or extract information from the system prompt.
Vulnerability management: Regular scanning of RAG system components for vulnerabilities, with a defined patching SLA.
Incident response: Defined procedures for RAG-specific incidents — inappropriate PHI access, prompt injection success, knowledge base corruption.
5.3 Model and Infrastructure Updates
LLM updates: When the underlying LLM is updated (new model version from Azure OpenAI, AWS Bedrock, or a self-hosted model update), evaluate the impact on response quality before deploying to production. Some model updates improve quality; others may change behavior in ways that affect compliance.
Embedding model updates: Updates to the embedding model require re-embedding all documents in the knowledge base, which can be a significant operation. Plan embedding model updates carefully to minimize disruption.
Infrastructure updates: Maintain infrastructure components (vector database, application layer, networking) with security patches on a defined schedule.
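Planning an embedding model update is easier if each stored chunk records which model produced its vector. The sketch below assumes such a field exists (the `embedding_model` and `chunk_id` keys and the model names in the usage note are hypothetical) and groups stale chunks into batches for re-embedding. Because query and document embeddings must come from the same model, the re-embedded vectors would typically be written to a separate index that is switched in once complete.

```python
def plan_reembedding(chunks, target_model, batch_size=100):
    """Return batches of chunk IDs whose embeddings were not produced by target_model.

    chunks: iterable of dicts with 'chunk_id' and 'embedding_model' keys.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    stale = [c["chunk_id"] for c in chunks if c.get("embedding_model") != target_model]
    return [stale[i:i + batch_size] for i in range(0, len(stale), batch_size)]
```

For example, calling `plan_reembedding(chunks, "embed-v2", batch_size=500)` on a corpus still tagged `"embed-v1"` yields the work list for the migration, and an empty result confirms that the migration has finished.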
Section 6: Conclusion
Private RAG architecture represents the most mature and practical solution to the core challenge of deploying AI in regulated industries: how to harness AI capabilities without compromising data sovereignty and regulatory compliance.
The architecture is proven, the tools are mature, and the regulatory frameworks are well-enough understood that organizations can build compliant Private RAG systems with confidence. What requires expertise is the design — specifically, the integration of access controls, audit logging, encryption, and compliance documentation into a RAG architecture that performs well and meets regulatory requirements.
TrustEdge provides end-to-end Private RAG design, implementation, and compliance documentation services for organizations across regulated industries. Our team, with 15+ years of compliance and security engineering expertise through Jacobian Engineering, has deployed Private RAG systems that have passed HIPAA audits, FedRAMP assessments, and SOC 2 reviews.
Ready to deploy Private RAG in your organization? Schedule a consultation with TrustEdge. Call (888) 555-EDGE or reach out through our website to speak with an architect who can design a Private RAG system for your specific compliance and capability requirements.