Engineering + Strategy + Business

Kaushik Sarkar

AI Engineering Leadership
Two decades building AI systems that inform consequential decisions.

// Foundation Models / Multi-Agent Systems / Enterprise Data Infrastructure / Global Programmes
About

Two decades of shipping AI systems
that survive contact with reality.

Most AI and digital technologies never ship. They get presented in board decks, run in IDEs, make it through pilots, and then quietly disappear when the real constraints show up: the messy data, the interoperability challenges, the governance requirements, the stakeholders who all have to say yes, the infrastructure that was never built to run a model in production.

Over the past two decades, I have come to believe that the problem is rarely the data silos, the algorithms, or the infrastructure. It is almost always the absence of one person who can sit in the board conversation about AI strategy and then go and design the architecture the next morning, without losing anything in the translation.

Someone who has held the P&L, navigated the institutional politics, put smiles on partners' faces, and still knows exactly why their team's retrieval pipeline is underperforming or why the loss function they landed on is working against them.

That person is harder to find than any technology. And every single morning, I ask myself how much of that person I still have left to become.

Director of IMACS, an AI Centre of Excellence.
Co-Principal Investigator, AI for disease forecasting using satellite data (funded by NASA Earth Sciences).
PhD Researcher in AI | MBA | MDataSci | MS AI/ML
40 Under 40 Outstanding Leadership Award (Digital Transformation Leader)
Capabilities

What I Lead and Build

Six domains spanning model training through production deployment.

Large Language Models

End-to-end LLM engineering: domain-adaptive pre-training, SFT on 500K+ instruction pairs, DPO alignment with hard negative mining, Mixture of LoRA Experts across 6 scientific domains.

LoRA / QLoRA · DPO · RLHF · MoLE · vLLM

Multi-Agent Systems

Fan-out / fan-in agentic pipelines with 6+ specialised agents. MCP server architecture. Streaming SSE output. Real-time orchestration with graceful degradation.

MCP · LangChain · ReAct · SSE

RAG and Knowledge Systems

Hybrid retrieval (dense + BM25) with Reciprocal Rank Fusion. Cross-encoder re-ranking. 250M+ evidence spans in OpenSearch. GraphRAG for structured knowledge.

OpenSearch · FAISS · RRF · GraphRAG
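Reciprocal Rank Fusion merges the dense and BM25 result lists by rank alone. A minimal sketch; the function name, the toy document IDs, and the conventional k=60 constant are illustrative, not taken from the production system:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # dense (vector) retrieval order
bm25  = ["d1", "d7", "d9"]   # sparse BM25 order
fused = reciprocal_rank_fusion([dense, bm25])
```

Documents appearing high in both lists dominate, which is why RRF needs no score normalisation across the two retrievers.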

Enterprise Data Platforms

50TB+ S3 data lakes with Apache Iceberg, 700M+ row analytical stores, Athena serverless SQL, 24 EventBridge ingestion pipelines. Query time from 8s to <400ms.

Iceberg · Athena · Glue · Airflow

Production ML Systems

Real-time ML inference at 30+ geography scale. P95 latency <150ms. Champion-challenger deployment. Canary rollouts with automated quality gates.

SageMaker · MLOps · ONNX · Drift
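A champion-challenger rollout ultimately reduces to a promotion gate over live metrics. A hedged sketch: the metric names, thresholds, and the 150 ms P95 default are illustrative stand-ins, not the deployed gate:

```python
def promote_challenger(champion, challenger, max_p95_ms=150.0):
    """Promote only if the challenger meets the latency SLA and does not
    regress quality versus the current champion."""
    meets_sla = challenger["p95_ms"] <= max_p95_ms
    no_regression = challenger["quality"] >= champion["quality"]
    return meets_sla and no_regression

champion   = {"p95_ms": 120.0, "quality": 0.81}
challenger = {"p95_ms": 140.0, "quality": 0.84}
decision = promote_challenger(champion, challenger)
```

In a canary setup the same gate runs repeatedly as traffic share increases, failing fast on the first violated threshold.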

AI Governance and Strategy

AI governance frameworks for regulated environments across 17 countries. Responsible AI embedded in pipelines. National expert committee leadership. $100M+ portfolio accountability.

Governance · Responsible AI · Strategy
Selected Work

Selected Work

Systems operating at scale across multiple countries under governance constraints.

Flagship Platform

SAGE: Scientific Advisory and Guidance Engine

A 397B-parameter domain-adapted foundation model trained through a 4-stage pipeline: domain-adaptive pre-training, supervised fine-tuning on 596K instruction pairs, DPO alignment on 192K preference pairs with 3-tier hard negative mining, Mixture of LoRA Experts across 6 scientific domains.

397B Parameters
6 Domain Adapters
<3s Fast Path SLA
12 Countries Live
Foundation 397B · MoLE · DPO · QLoRA · OpenSearch AOSS · SSE Streaming
// SAGE Multi-Agent Pipeline
Router → Claude Opus 4  ·  RAG → OpenSearch, 250M+ spans  ·  Foundation → SAGE-397B
Data → 50TB+ S3, 700M+ rows Iceberg, 250M+ evidence spans, 30M+ causal relations
SLAs → fast path <3s · full report 45-90s · dashboard P95 <150ms
Upcoming 2026

ARK Platform

AI intelligence platform spanning 190+ countries for health, climate, and development finance. Funded by the McGovern Foundation and Amazon.

190+ Countries
750M+ Records
Data Infrastructure

Global Evidence Data Platform

Enterprise-grade data lake unifying 100M+ records across 17 national health systems. Iceberg on S3. 24 EventBridge pipelines. P95 from 8s to <400ms.

50TB+ Data Lake
700M+ Rows
Systems

Architecture I Built

Production systems engineering: foundation model training, data platforms, and multi-agent orchestration.

Foundation Model / 397B MoE (512 experts, 17B active/token)
Six-Phase Domain Adaptation: CPT, ORPO, SPIN, RLAIF, Router Calibration
PHASE 0: Data Curation. Cross-domain linking, curriculum design.
PHASE 1: CPT. 500M+ passages, 3-stage curriculum (warm-up, domain, cross-join).
PHASE 2: ORPO. Merged SFT + preference, no reference model needed, 900K+ training pairs.
PHASE 3: SPIN. Self-play fine-tuning, 3 iterations; the model beats its own prior outputs.
PHASE 4: RLAIF. DAPO (group-relative), multi-dimensional reward: accuracy + citation + reasoning.
PHASE 5: Expert Router Calibration. MoE-Sieve profiling, 512 expert activation maps.

BASE: Qwen 3.5-397B-A17B. 512 experts, Gated DeltaNet hybrid attention, 262K context. FP8 LoRA + FSDP ZeRO-3 on 8x g7e (768 GB VRAM).

LOSS FUNCTIONS: CPT: L = -sum_t log P(x_t | x_<t) · ORPO: L = L_SFT + beta·L_OR · SPIN: self-play DPO · DAPO: group-relative advantage, clip-higher, token-level PG, dynamic sampling.

CORE INNOVATION: cross-domain quantitative reasoning. The model is trained on climate x health x finance cross-joins from 1B+ structured records. Not just text: grounded in actual time-series data. Cross-domain experts activate simultaneously for multi-hop causal chains (e.g., temperature anomaly -> cholera incidence -> WASH funding gaps).

12 VALIDATION GATES (fail-fast at every boundary):
G0: data counts · G1: loss decreasing · G2: perplexity -10% · G3: knowledge probe 60%+ · G4: reward margins · G5: eval loss <2.0 · G6: coherence · G7: loss floor [0.15, 1.5] · G8: SPIN iteration gain · G9: no regression · G10: reward positive · G11: cross-domain 70%+

RLAIF reward = 0.35 accuracy + 0.25 citation verifiability + 0.20 causal reasoning + 0.20 completeness (judged by a frontier model via Bedrock).
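The ORPO objective, L = L_SFT + beta·L_OR, can be illustrated in plain Python. A minimal per-example sketch, assuming length-normalised log-probabilities for the chosen and rejected responses; this is an illustration of the published loss, not the training code:

```python
import math

def orpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """ORPO per-example loss sketch: L = L_SFT + beta * L_OR, where
    L_OR = -log sigmoid(log odds(chosen) - log odds(rejected))."""
    def log_odds(logp):
        p = math.exp(logp)           # sequence probability
        return math.log(p / (1.0 - p))

    l_sft = -logp_chosen             # NLL of the preferred response
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return l_sft + beta * l_or

loss = orpo_loss(logp_chosen=-0.5, logp_rejected=-1.5)
```

Because the odds-ratio term uses only the policy's own probabilities, no frozen reference model is needed, which is the practical appeal over DPO noted above.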
Data Platform / 50TB+ Lakehouse / 85+ Organizations / 9 Domains
Three-Layer Architecture: Raw, Silver, Gold
150+ SOURCES: REST APIs, bulk files, XML, NetCDF, CDC, event streams. 85+ orgs, 20+ EventBridge rules. Domains: Climate 1.3B+, Health 290M+, Economics 120M+, plus Population, Nutrition, WASH, Governance, Education, Environment (extensible).

CANONICAL PIPELINE: typed adapters -> schema validation -> dedup -> WAL checkpoint -> parallel runner -> partition by (source_org, year).

RAW LAYER: S3 landing zone. Original formats preserved (JSON, CSV, XML, NetCDF, PDF). Schema-on-read, full lineage, immutable audit trail, partitioned by source_org / ingestion_date.

SILVER LAYER: 180+ Glue tables. Parquet + Snappy, typed schemas, deduplication. Canonical columns: indicator, value, geo, period, source. 700M+ rows, Athena serverless SQL.

GOLD LAYER: Apache Iceberg. ZSTD compression, ACID transactions, time-travel. 40K+ indicators, 58K+ geo entities, 1800-2100 span. Athena v3, partition pruning, snapshot rollback.

VECTOR STORE: 250M+ embeddings, 1024d, HNSW. Hybrid BM25 + dense rerank, kNN retrieval.

KNOWLEDGE GRAPH: 30M+ causal relations, entity-linked. BM25 retrieval, relation triples, GraphRAG.

ANALYTICAL VIEWS: cross-domain materialized views. Dashboards, P95 <400ms, training data generation.

DATA ESTATE: 50TB+ S3 lakehouse across 9 domains. Temporal coverage 1800-2100. Automated weekly ingestion via EventBridge + canonical pipeline. Cross-domain joins power the foundation model: climate x health x economics x population for multi-hop causal reasoning.
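The schema-validation, dedup, and partition steps of the canonical pipeline can be sketched in a few lines. The record shape follows the canonical columns above; the dedup key and toy rows are illustrative assumptions, not the production adapter code:

```python
def canonicalise(records):
    """Sketch: validate canonical columns, drop duplicates, and group
    records into (source_org, year) partitions."""
    required = {"indicator", "value", "geo", "period", "source_org"}
    seen, partitions = set(), {}
    for rec in records:
        if not required.issubset(rec):
            continue  # schema validation: skip malformed rows
        key = (rec["source_org"], rec["indicator"], rec["geo"], rec["period"])
        if key in seen:
            continue  # dedup on canonical identity
        seen.add(key)
        year = str(rec["period"])[:4]
        partitions.setdefault((rec["source_org"], year), []).append(rec)
    return partitions

rows = [
    {"indicator": "cholera_cases", "value": 12, "geo": "NG", "period": "2023-07", "source_org": "who"},
    {"indicator": "cholera_cases", "value": 12, "geo": "NG", "period": "2023-07", "source_org": "who"},  # duplicate
    {"indicator": "rainfall_mm", "value": 80},  # fails schema validation
]
parts = canonicalise(rows)
```

Partitioning by (source_org, year) is what later lets Athena prune partitions instead of scanning the whole lake.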
AI Platform / 6-Agent System
Evidence Intelligence Orchestration
User Query -> INTENT ROUTER -> specialist agents:
Synthesis (evidence fusion) · KG Reasoning (causal chains) · Domain Expert (specialist routing) · RETRIEVAL (BM25 + kNN hybrid rerank, KG traverse: vector + structured + graph)
-> Evidence Grounding + Citation -> GROUNDED RESPONSE
Infrastructure / 8-GPU Cluster
Distributed Training on AWS
torchrun / DDP across GPUs 0-7: each GPU holds an NF4-quantised model replica plus its LoRA adapters and optimiser state, and sees 1/8 of the data shards.
NCCL All-Reduce over EFA.
effective_batch = per_device x grad_accum x 8
Checkpoint-resume on spot interruption; rank 0 saves to EBS + S3.
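The effective-batch arithmetic and the checkpoint-resume behaviour can be sketched without torch. The file path, step granularity, and the simulated interruption are illustrative stand-ins for the actual rank-0 EBS/S3 checkpointing:

```python
import json, os, tempfile

CKPT = os.path.join(tempfile.gettempdir(), "train_ckpt_demo.json")
WORLD_SIZE = 8

def effective_batch(per_device, grad_accum, world_size=WORLD_SIZE):
    """effective_batch = per_device x grad_accum x world_size."""
    return per_device * grad_accum * world_size

def train(total_steps, interrupt_at=None):
    """Resume from the last persisted step instead of starting over."""
    step = 0
    if os.path.exists(CKPT):
        step = json.load(open(CKPT))["step"]        # resume after interruption
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            return step                             # simulate a spot interruption
        step += 1
        json.dump({"step": step}, open(CKPT, "w"))  # rank-0 checkpoint
    return step

if os.path.exists(CKPT):
    os.remove(CKPT)                                 # start from a clean state
train(total_steps=10, interrupt_at=5)               # job killed mid-run at step 5
final = train(total_steps=10)                       # restart resumes from step 5
```

The same pattern, with model and optimiser state instead of a step counter, is what makes spot-instance training economical.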
Published Research / Deep Generative Time Series
SPECTRA: Adversarial Climate-Disease Forecasting with Transfer Learning
TRAINING PIPELINE: VAE -> FEATURE ENGINEERING -> GAN

Climate Encoder: conditional beta-VAE. Meteorological variables -> latent z. The beta term penalises deviation from N(0,1), limiting noise sensitivity. L = recon + beta·KL(q(z|c) || p(z)).

Feature Engineering: latent variables from the VAE; 1-, 3-, and 6-month lagged features; rolling means + std (3, 6 month); seasonal (sin/cos) + country embedding; linear noise projection.

Generator (Forecaster): causal TCN, LSTM (sequence memory), multi-head self-attention, projection head -> P(d|c). Outputs a full predictive distribution.

Critic (adversarial): LSTM + FC layer, same configuration as the Generator. Distinguishes real disease sequences from generated ones; WGAN gradient penalty.

COMPOSITE LOSS (distribution-aware): L = L_NLL + lambda_q·L_quantile + lambda_f·L_feature_match + lambda_a·L_WGAN-GP. NLL for likelihood, quantile for calibrated uncertainty, feature matching for stability, WGAN-GP for distribution fidelity.

TRANSFER LEARNING (cross-geography): pre-trained generator weights transferred to country-specific models. Validated across 8 countries; outperforms supervised baselines on accuracy and probabilistic calibration.

KEY CONTRIBUTIONS:
1. Decouples representation learning (VAE) from sequence modeling (GAN) to handle ill-posed climate-disease inference.
2. Addresses non-Markovian, non-stationary disease signals via causal TCN + LSTM + multi-head self-attention.
3. Produces full predictive distributions (not point estimates) with calibrated uncertainty bounds.
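The quantile term in the composite loss is the standard pinball loss, which is what makes the uncertainty bounds calibrated. A minimal single-point sketch; the example values are illustrative:

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss for quantile q: under-prediction is weighted by q,
    over-prediction by (1 - q)."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)

# An upper-quantile forecast (q = 0.9) is punished harder for being too low:
under = quantile_loss(y_true=100.0, y_pred=80.0, q=0.9)   # under-prediction
over  = quantile_loss(y_true=100.0, y_pred=120.0, q=0.9)  # over-prediction
```

Minimising this loss over many points pushes the q=0.9 prediction to sit above the truth about 90% of the time, i.e. a calibrated upper bound.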
Knowledge

Engineering Perspectives

Architecture selection, loss functions, model families, and design patterns from hands-on R&D.

Architecture Selection
Generative Model Taxonomy: When to Use What
Autoregressive: P(x) = prod_t P(x_t | x_<t). GPT, Llama, Qwen, Gemma. Best for text generation, code, reasoning, chat, agents.
Diffusion: x_t = sqrt(a_t)·x_0 + noise. DDPM, Stable Diffusion, DALL-E 3. Best for image synthesis, video, controllable generation.
GAN: min_G max_D V(D, G). StyleGAN, CycleGAN, Pix2Pix. Best for fast inference, style transfer, super-resolution.
VAE: ELBO = E[log p] - KL(q || p). VAE, VQ-VAE, Beta-VAE. Best for latent representations, disentangled features, compression.
Normalizing Flow: p(x) = p(z)·|det dz/dx|. RealNVP, Glow, Flow Matching. Best for exact likelihood, density estimation, invertible transforms.

SELECTION GUIDE
Need text/code/reasoning? Autoregressive (decoder-only transformer). Scale with compute; use RLHF/DPO for alignment.
Need high-fidelity images? Diffusion for quality, GAN for speed; diffusion dominates on quality, GANs remain useful for real-time.
Need a structured latent space? VAE. Need exact density? Flow.
Need both generation and understanding? Multimodal autoregressive.
State space models (Mamba, RWKV) are emerging for long-sequence efficiency; hybrid transformer-SSM architectures are gaining traction.
Optimization
Loss Function Selection
Loss | Formula | When
Cross-Entropy | -sum y·log(p) | Classification, LM
NLL | -log P(x|theta) | Sequence modeling
KL Divergence | sum p·log(p/q) | VAE, distillation
Wasserstein | inf E[|x - y|] | WGAN, Earth mover's distance
Spectral | ||sigma(W)|| | GAN stabilization
Contrastive | -log(sim+ / sim-) | Embeddings, CLIP
Triplet | max(d+ - d- + m, 0) | Metric learning
Focal | -a·(1 - p)^g · log p | Imbalanced data
DPO | -log sigma(beta·dr) | Preference tuning
Hinge | max(0, 1 - y·f) | SVM, GAN variants
Choose based on the signal: classification uses CE, generation uses NLL, alignment uses DPO, representation learning uses contrastive/triplet, distribution matching uses KL/Wasserstein.
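The focal loss row above is worth a concrete look, since its effect is easiest to see numerically. A minimal sketch for the positive class, using the common alpha=0.25, gamma=2 defaults (the example probabilities are illustrative):

```python
import math

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Focal loss for a correct-class probability p:
    -alpha * (1 - p)^gamma * log(p). The (1 - p)^gamma factor
    down-weights easy examples so training focuses on hard ones."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)

easy = focal_loss(p=0.95)  # confident and correct: near-zero loss
hard = focal_loss(p=0.10)  # badly wrong: large loss
```

With gamma=0 this reduces to alpha-weighted cross-entropy; raising gamma sharpens the focus on misclassified examples, which is why it suits imbalanced data.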
Model Selection
Open-Source FM Comparison
Family | Sizes | Strength | Use Case
Llama 3 | 8B-405B | General, code | Broad reasoning
Qwen 2.5 | 0.5B-72B | Multilingual | Domain adaptation
Mistral | 7B-8x22B | MoE efficiency | Cost-sensitive
DeepSeek | 7B-236B | Math, code | Technical tasks
Gemma 2 | 2B-27B | Compact, fast | Edge, mobile
Phi 3/4 | 3B-14B | Small but capable | On-device
Command R+ | 35B-104B | RAG, grounding | Enterprise RAG
Yi | 6B-34B | Long context | Document analysis
Selection Criteria
1. Task type (reasoning, code, chat, RAG)
2. Context length requirement
3. Deployment constraint (GPU budget)
4. Multilingual needs
5. Fine-tunability (license, LoRA support)
6. MoE vs dense (latency vs throughput)
Multimodal Systems
Vision-Language Model Architecture Patterns
PATTERN 1: LATE FUSION (CLIP-style). Vision encoder (ViT / SigLIP) and text encoder (transformer) trained with a contrastive objective.
PATTERN 2: CROSS-ATTENTION (Flamingo-style). Frozen ViT; vision features attend into LLM decoder layers via cross-attention.
PATTERN 3: EARLY FUSION (LLaVA-style). ViT patches -> projector -> visual tokens in the LLM decoder's input sequence.

TRADE-OFFS
Late fusion: separate encoders, contrastive alignment. Fast retrieval; no deep cross-modal reasoning.
Cross-attention: vision features attend into LLM layers. Strong reasoning; higher compute per layer. Flamingo, Idefics.
Early fusion: project visual patches into token space. Simple, scalable, state of the art. LLaVA, Qwen-VL, InternVL.
Audio/video: same patterns plus temporal encoders. Whisper for audio; frame sampling + 3D convolution for video.
Agent Engineering
Reasoning Agent Patterns
REACT LOOP: Thought (reason about the task) -> Action (call tool / API) -> Observe (parse result); repeat until done. Best for tool-use agents, API orchestration, web browsing.
MULTI-AGENT: Router (classify intent) -> Dispatch (specialist agent) -> Aggregate (merge results) -> Ground (add citations). Best for complex domains, evidence synthesis, QA.
CHAIN-OF-THOUGHT: decompose step by step; self-consistency samples N chains and takes a majority vote. Best for math, logic, planning.
RAG AGENT: query rewrite -> retrieve -> rerank -> filter for relevance -> generate with context. Best for grounded generation, enterprise knowledge.

DESIGN PRINCIPLES: agents should fail gracefully, with bounded retries and escalation paths. Separate intent routing from execution. Use typed tool schemas. Log every step.
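The ReAct loop above is small enough to sketch end to end. Everything here is a toy: the stub model, the one-entry tool registry, and the question are illustrative, but the loop structure (bounded steps, tool call, observation appended to the scratchpad) is the pattern itself:

```python
def react_agent(task, tools, llm, max_steps=5):
    """Minimal ReAct loop: alternate Thought/Action/Observation until the
    model emits a final answer; tool calls are bounded by max_steps."""
    scratchpad = []
    for _ in range(max_steps):
        thought, action, arg = llm(task, scratchpad)
        if action == "finish":
            return arg
        observation = tools[action](arg)   # call the tool, observe the result
        scratchpad.append((thought, action, observation))
    return None                            # bounded: give up gracefully

# Toy stand-ins for a real model and tool registry:
def toy_llm(task, scratchpad):
    if not scratchpad:
        return ("need the population", "lookup", "Chile")
    return ("have the answer", "finish", scratchpad[-1][2])

tools = {"lookup": {"Chile": "19.5M"}.get}
answer = react_agent("population of Chile?", tools, toy_llm)
```

Typed tool schemas and per-step logging slot naturally into the `tools[action](arg)` call site.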
Systems Engineering
Real-Time Data + Load Balancing
Stream Processing
Source -> Ingest (Kafka/SQS/Kinesis)
-> Transform (Flink/Lambda)
-> Serve (OpenSearch/Redis)
-> Alert (threshold + anomaly)
Back-pressure handling: producer slows when consumer queue depth exceeds threshold. Dead-letter queues for poison messages.
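The dead-letter behaviour can be sketched independently of any broker. The handler and messages are illustrative; the point is that a poison message is quarantined rather than blocking or crashing the stream:

```python
def consume(messages, handler):
    """DLQ sketch: messages the handler cannot process go to the
    dead-letter queue instead of poisoning the pipeline."""
    processed, dead_letter = [], []
    for msg in messages:
        try:
            processed.append(handler(msg))
        except Exception:
            dead_letter.append(msg)  # quarantine the poison message
    return processed, dead_letter

ok, dlq = consume(["7", "12", "not-a-number"], handler=int)
```

In Kafka/SQS terms, the `dead_letter` list stands in for a configured DLQ topic or queue that an operator inspects later.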
Load Balancing Strategies
Round-robin: equal distribution
Weighted: capacity-aware routing
Least-connections: latency-optimal
Consistent hash: session affinity
Rate limiting: token bucket / leaky
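The token-bucket strategy in the last item is compact enough to show whole. A minimal sketch with an injected clock so the refill logic is visible (capacity and rate are illustrative):

```python
class TokenBucket:
    """Token-bucket rate limiter: up to `capacity` burst tokens, refilled
    at a fixed rate; each request spends one token or is rejected."""
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # third request exceeds the burst
later = bucket.allow(now=1.0)                      # one token refilled after 1s
```

A leaky bucket differs only in draining at a constant rate rather than allowing bursts up to capacity.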
Resilience Patterns
Circuit breaker: fail-open after N errors
Bulkhead: isolate failure domains
Retry with jitter: exp backoff
Timeout cascade: strict per-hop SLAs
Graceful degradation: serve stale data
Design for failure. Every external call has a timeout. Every queue has a DLQ. Every service has health checks.
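Two of the resilience patterns above, the circuit breaker and retry with jitter, fit in a short sketch. The threshold, base delay, and cap are illustrative defaults:

```python
import random

class CircuitBreaker:
    """Trips open after `threshold` consecutive failures; while open,
    calls fail fast instead of hammering a failing dependency."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def record(self, success):
        self.failures = 0 if success else self.failures + 1

def backoff_with_jitter(attempt, base=0.1, cap=5.0, rng=random.random):
    """Exponential backoff with full jitter: sleep in [0, min(cap, base * 2^attempt))."""
    return rng() * min(cap, base * 2 ** attempt)

breaker = CircuitBreaker(threshold=3)
for _ in range(3):
    breaker.record(success=False)
tripped = breaker.open            # breaker now fails fast
delay = backoff_with_jitter(attempt=4)
```

Full jitter desynchronises retrying clients, which prevents the thundering-herd spikes that fixed exponential backoff produces.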
Open Source

All Repositories

56 public repositories across personal and organization accounts. Production systems, research platforms, and foundational tools.

Engineering Activity

Contribution History

17 years of continuous open source development across AI, epidemiology, and global health platforms.

17
Active Years

Since 2009

Annual contributions. Growth from academic research through clinical AI, pandemic response, and production AI platform engineering.

Last 30 Days

Daily engineering velocity across active repositories.
Institutional Partnerships

Who I Have Built With

Technical roadmaps that became funded, multi-year programs.

NASA
Strategic Technology Partnership
The Global Fund
Health Data Infrastructure
McGovern Foundation
ARK Platform Funder
Amazon
Environmental Equity Initiative
UCL
UKRI/MRC Research, GBP 2M
Ministry of Health India
$116M National Program, 8 States
Global Reach

17 Countries

Each highlighted country represents a program I designed, built, or delivered. Technical and strategic leadership on the ground.

India
Bangladesh
Indonesia
Zambia
Rwanda
Ethiopia
DR Congo
Central African Rep.
Niger
Sierra Leone
Egypt
Jordan
Lebanon
Oman
Iraq
Chile
United Kingdom
Technical Stack

Tools I Build With

Production-grade proficiency. Primary tools highlighted.

AI / LLM
Claude / Anthropic · Mistral · LoRA / QLoRA · DPO / RLHF · LangChain · LlamaIndex · vLLM · HuggingFace · ONNX
Agentic / Dev Tools
Claude Code · MCP Servers · Cursor · Lovable · v0 by Vercel · Windsurf · Replit · GitHub Copilot
Cloud / Infrastructure
AWS SageMaker · AWS Bedrock · Fargate · Glue · Athena · EventBridge · Azure · Kubernetes · Terraform · Docker
Data Engineering
Apache Iceberg · OpenSearch · Spark · Airflow · Flink · FAISS · Kafka · dbt
ML / Research
PyTorch · TensorFlow · XGBoost · scikit-learn · SHAP · MLflow · Python · Triton
Contact

Get in Touch

Open to conversations about AI engineering, foundation models, health systems, and climate intelligence.