NVIDIA-Certified Associate: AI Infrastructure and Operations
996e5a60-e96b-4a43-a37a-42d267a207f4
Senior AI Infrastructure Engineer with 12+ years building and operating large-scale distributed systems, from enterprise cloud platforms (SUSE, Red Hat, Nokia) to GPU-scale AI infrastructure.
Currently operating a 4,000+ NVIDIA H100 GPU HPC cluster (500+ nodes, 8x H100 80GB HBM3 per node) supporting distributed LLM training, with full-stack ownership of production monitoring, self-healing alerting, auto-remediation, and LLM-powered operations tooling.
NVIDIA-certified in AI Infrastructure and Generative AI LLMs. Hands-on with GPU cluster management, CUDA/NCCL workloads, InfiniBand networking, and production LLM agent development with RAG, MCP, and multi-model architectures.
Building LLM-powered operations agents with RAG, MCP, and NL-to-SQL. Multi-model support (Claude, OpenAI, Ollama) for infrastructure diagnostics.
Operating a 4,000+ NVIDIA H100 GPU cluster (500+ nodes). DCGM, CUDA/NCCL, SLURM scheduling, InfiniBand fabric, GPUDirect RDMA, and distributed training orchestration.
Self-healing monitoring with 12+ background modules, auto-remediation workflows, alert deduplication, and unified observability (Prometheus, Grafana, TimescaleDB).
Infrastructure as Code, CI/CD pipelines, container orchestration, and multi-cloud architecture across AWS, Azure, GCP, OpenStack, and OCI.
996e5a60-e96b-4a43-a37a-42d267a207f4
a4647c10-d4bf-4dde-9452-514413e6e291
aeceb2f3-33ac-4ad4-a94a-dbb177fc619a
LF-a2o0xdlc96
COA-1600-0067-0100
K40MK0MDFEEQ13CJ
200-422-511
150-132-256
LFCE-1500-0058-0200
ECC55284410087
5426913.20427512
03005008
10186201
10186201
10186201
10186201
10186201
10186201
Slack bot for GPU datacenter operations. Real-time GPU health monitoring (DCGM, NVLink, ECC), InfiniBand per-lane error analysis, and 13-test Dell diagnostics bundle with automated TSR collection. Background monitors for thermal, network, and NFS health with auto-remediation and intelligent alert deduplication.
LLM-powered operations agent for GPU datacenter diagnostics. Natural language to SQL against TimescaleDB, protocol-aware discovery (Redfish/SNMP/SSH), and tool evolution from repeated query patterns. Multi-interface: Slack, REST API, CLI, and Chainlit web UI.
LLM training orchestration for GPU clusters without Slurm dependency. Manages NGC container distribution, NFS-backed checkpoint storage, and torchrun-based distributed training. Intelligent image transfer logic and automated training result analysis with W&B integration.
Personal flake-based NixOS configuration system supporting reproducible setups across NixOS desktops, servers, and macOS machines using shared, modular components.
Portable, reproducible nix shell environments with all your DevOps tools and dotfiles, run them anywhere, instantly.
Functional and minimalist NixVim configuration, featuring a carefully curated set of essential plugins for an efficient, distraction-free editing experience.
Containerized GPU cluster benchmarking platform. Orchestrates NCCL collective tests, HPL compute benchmarks, and DCGM thermal profiling across distributed nodes with multi-node SSH orchestration and Prometheus-based metric collection.
Containerized NCCL test runner for single and multi-node GPU configurations. Tests collective communication performance (all_reduce, all_gather, broadcast) with dual build modes and automated node connectivity validation.
Distributed I/O benchmarking tool for GPU clusters. Automated containerized IoR tests with OpenMPI across multiple nodes, flexible node range specification, and pre-flight connectivity validation with structured result collection.
Zero-touch OpenStack deployment pipeline: Terraform provisions the infrastructure via libvirt, cloud-init bootstraps the OS, and Ansible drives DevStack end-to-end. One command from bare metal to a fully operational private cloud with Keystone, Nova, Neutron, and Horizon.
Dynamically scalable Ansible sandbox built on Docker Compose. Spin up N target nodes on demand with a single parameter, all SSH-ready with auto-provisioned keys. Custom openSUSE Tumbleweed image stripped of systemd for a minimal, production-realistic footprint.
Scalable Ceph distributed storage cluster with isolated dual-network topology (admin + cluster replication). Terraform provisions MON, OSD, and admin nodes on demand; Ansible drives ceph-deploy orchestration with Vault-managed secrets. From zero to a production-grade storage fabric in minutes.
Feel free to reach out through any of these channels.