Fractional engagement · AI infrastructure
SELECT CLIENT ENVIRONMENTS · ENTERPRISE AI · SOVEREIGN INFRASTRUCTURE

AI infrastructure leadership —
without the full-time hire.

I work with Series A–C AI companies and enterprise AI teams as a dedicated Fractional AI Infrastructure CTO — 20 hours/month, systems-layer depth, production results. No overhead, no equity, no 6-month ramp-up.

4.2ms → 0.8ms AllReduce latency (production)
87% GPU utilization (after optimization)
$480K annual savings (single cluster)
00

My perspective

Most AI teams don't have a GPU problem. They have a coordination problem between compute, memory, and communication.

Until you measure those layers precisely — with Nsight Systems, nccl-tests, and kernel-level profiling — adding more GPUs only increases cost, not performance. The ceiling isn't hardware. It's visibility.

This is why I profile before I recommend. Every engagement starts with measurement, not assumptions.
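To make that concrete: below is a minimal sketch of the first baseline I take, timing AllReduce in PyTorch, the same quantity nccl-tests' all_reduce_perf reports. The message size and iteration count are illustrative assumptions; run it under torchrun on your own cluster.

```python
# Minimal AllReduce timing sketch (illustrative; launch via torchrun --nproc_per_node=<N>).
# Mirrors what nccl-tests' all_reduce_perf measures, expressed in PyTorch.
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    numel = 256 * 1024 * 1024  # 1 GiB of fp32 -- an assumed message size
    x = torch.ones(numel, device="cuda")

    # Warm up so NCCL channel setup doesn't pollute the measurement.
    for _ in range(5):
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if rank == 0:
        world = dist.get_world_size()
        # Ring AllReduce moves 2*(n-1)/n of the data per GPU ("bus bandwidth").
        gb = x.numel() * x.element_size() / 1e9
        bus_bw = gb * 2 * (world - 1) / world / elapsed
        print(f"avg latency {elapsed * 1e3:.2f} ms, bus bandwidth {bus_bw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```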

01

Who this is for

This is for teams where GPU spend is growing faster than model performance —
and internal engineering cannot explain why.

02

Where I create leverage

01

Diagnose where your GPU spend is being wasted

Full profiling of your training and inference stack with Nsight Systems, Nsight Compute, and nccl-tests. I find the bottleneck, fix it, and document the fix so your team can maintain it. Typical result: 20–40% throughput improvement in the first engagement.

Nsight Systems · NCCL tuning · GPUDirect RDMA · A100/H100
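For illustration, a minimal first-pass sketch with torch.profiler, assuming a PyTorch training loop; the model and step function here are placeholders, and Nsight Systems takes over once this table points at a layer.

```python
# Sketch: wrap a few training steps in torch.profiler to see where time goes.
# `model`, `batch`, and `train_step` are placeholders for your real workload.
import torch
from torch.profiler import profile, schedule, ProfilerActivity

def train_step(model, batch):
    loss = model(batch).sum()  # placeholder forward pass + loss
    loss.backward()

model = torch.nn.Linear(4096, 4096).cuda()
batch = torch.randn(64, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=2, active=5),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./trace"),
) as prof:
    for _ in range(8):
        train_step(model, batch)
        prof.step()

# A sorted table is often enough to show whether compute, memory movement,
# or communication dominates -- before reaching for kernel-level tools.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```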
02

Architect distributed training that scales without waste

FSDP vs ZeRO-3 trade-offs, tensor/pipeline/sequence parallelism selection, communication overlap strategy. I've scaled 13B-parameter LLM fine-tuning from 7 days to 18 hours. I'll do the same for your workload.

FSDP · ZeRO-3 · Megatron-LM · DeepSpeed
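As a sketch of the starting point, here is a minimal PyTorch FSDP wrap with full sharding (the ZeRO-3-equivalent strategy) and BF16 mixed precision. The model and dtype choices are illustrative assumptions, not a recommendation for your workload.

```python
# Sketch: wrapping a model in PyTorch FSDP with BF16 mixed precision.
# Model and settings are illustrative; real runs need a launcher (torchrun).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda()

# BF16 for parameters, gradients, and buffers -- one common trade-off point.
bf16 = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

# FSDP's default full sharding is the ZeRO-3 analogue: parameters,
# gradients, and optimizer state are all sharded across ranks.
model = FSDP(model, mixed_precision=bf16)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
opt.step()

dist.destroy_process_group()
```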
03

Build inference systems that meet SLA and hold at scale

TensorRT-LLM, Triton Inference Server, PagedAttention, speculative decoding. I've deployed production inference at 95ms P50, 99.97% uptime. I'll design and validate your serving stack.

TensorRT-LLM · Triton · PagedAttention · vLLM
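For a flavor of the serving layer, a minimal offline-batching sketch with vLLM, whose engine implements PagedAttention and continuous batching. The model name and sampling settings are assumptions, not a client configuration.

```python
# Sketch: batched generation with vLLM, which implements PagedAttention.
# Model name and sampling parameters are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
prompts = [
    "Summarize why GPU utilization matters for inference cost.",
    "Explain P50 vs P99 latency in one sentence.",
]

# vLLM's continuous batching schedules these together, keeping the GPU busy.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text.strip())
```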
04

Govern GPU spend before it compounds into a structural problem

Adaptive spot/reserved scheduling, mixed precision strategy, cluster cost modeling. Using a framework based on a filed utility patent, I've delivered $480K–$600K in annual savings on a single cluster. I'll build the same framework for yours.

Cost modeling · BF16/FP8 · Slurm · Spot scheduling
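As a toy illustration of the cost-modeling piece (not the patented framework), here is a sketch comparing reserved and spot pricing once preemption overhead is priced in. Every rate and the overhead figure are assumptions; a real model uses your utilization and interruption data.

```python
# Sketch: a toy cluster cost model comparing reserved vs spot scheduling.
# All rates and the interruption-overhead figure are illustrative assumptions.
GPUS = 128                  # assumed cluster size
HOURS_PER_MONTH = 730
RESERVED_RATE = 2.10        # $/GPU-hour, assumed 1-yr reserved price
SPOT_RATE = 1.10            # $/GPU-hour, assumed average spot price
SPOT_OVERHEAD = 0.12        # assumed 12% wasted work from preemptions/restarts

def monthly_cost(rate: float, overhead: float = 0.0) -> float:
    # Overhead inflates the GPU-hours needed for the same useful work.
    return GPUS * HOURS_PER_MONTH * rate * (1 + overhead)

reserved = monthly_cost(RESERVED_RATE)
spot = monthly_cost(SPOT_RATE, SPOT_OVERHEAD)

print(f"reserved: ${reserved:,.0f}/mo")
print(f"spot:     ${spot:,.0f}/mo")
print(f"annual delta: ${(reserved - spot) * 12:,.0f}")
```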
03

How the engagement works

Project engagement
Deep-dive sprint
$8K–$15K
per project · 4–6 weeks

  • Full cluster performance audit
  • Root-cause profiling + fixes
  • Architecture recommendation doc
  • Team knowledge transfer session
  • 30-day follow-up support
  • Deliverables-based, fixed scope
04

What I've delivered

Capgemini 2025
Architected 16-node A100 cluster for 13B-parameter LLM fine-tuning. Reduced AllReduce latency 4.2ms → 0.8ms. Training time 7 days → 18 hours. GPU utilization 61% → 87%. Delivered $480K–$600K annual savings. Filed utility patent on adaptive GPU cost optimization.
Ericsson 2018–2025
Pioneered enterprise LLM adoption on multi-node GPU clusters using PyTorch FSDP + MPI/UCX. Built GPU optimization framework with NVIDIA Triton — $150K annual cost savings. Reduced POC-to-production from 4 months to 6 weeks. Enabled 10+ concurrent AI projects.
Inference Production
Deployed NVIDIA Triton Inference Server + TensorRT-LLM with INT8 quantization, PagedAttention, CUDA Graphs, dynamic batching. Result: 95ms P50 latency (47% improvement), 65+ req/sec, 99.97% uptime at scale.
05

Selected experience

Technical depth

  • 14 years in distributed systems, 7 years GPU cluster engineering
  • Utility patent — adaptive GPU cost optimization for LLM training
  • 3 technical publications on distributed training & NCCL tuning
  • NVIDIA DLI certified — Accelerated Computing with CUDA
  • M.Tech Computer Science, Dr. MGR Educational Institute

Production experience

  • A100/H100 clusters — InfiniBand NDR 400Gbps, NVLink/NVSwitch
  • Air-gapped sovereign AI environments (on-premise HPC)
  • 25+ engineers mentored at Capgemini & Ericsson
  • $2M+ new project revenue from AI infrastructure work
  • Also founder of NYDUX — AI governance control plane

If your GPU spend is scaling faster than performance, we should talk.

One 20-minute conversation to see if there's a fit. No commitment.

Email sankar@nydux.ai · Connect on LinkedIn
OR BOOK DIRECTLY · cal.com/sankarsathish