NVIDIA-Certified Associate: AI Infrastructure and Operations
996e5a60-e96b-4a43-a37a-42d267a207f4
12+ years building and operating large-scale infrastructure. Currently managing a 500+ node GPU HPC cluster, orchestrating distributed LLM training across GPU fleets, and building AI-powered operations platforms for datacenter-scale environments.
NVIDIA-certified in AI Infrastructure and Generative AI LLMs. Hands-on with GPU cluster management, CUDA/NCCL workloads, InfiniBand networking, distributed model training (PyTorch FSDP, MoE architectures), and production LLM agent development with RAG, MCP, and multi-model architectures.
Experienced in building self-healing monitoring systems with automated alerting, deduplication, and escalation policies.
Orchestrating distributed LLM training (PyTorch FSDP, MoE), building production LLM agents, RAG pipelines, and AI-powered operations platforms.
Operating 500+ node GPU clusters, NVIDIA DCGM, CUDA/NCCL workloads, SLURM scheduling, InfiniBand fabric, and distributed compute orchestration.
Scalability, resilience, and observability for large systems. Custom monitoring stacks, automated remediation, and incident response for mission-critical systems.
Infrastructure as Code, CI/CD pipelines, container orchestration, and multi-cloud architecture across AWS, Azure, GCP, OpenStack, and OCI.
996e5a60-e96b-4a43-a37a-42d267a207f4
a4647c10-d4bf-4dde-9452-514413e6e291
aeceb2f3-33ac-4ad4-a94a-dbb177fc619a
LF-a2o0xdlc96
COA-1600-0067-0100
K40MK0MDFEEQ13CJ
200-422-511
150-132-256
LFCE-1500-0058-0200
ECC55284410087
5426913.20427512
03005008
10186201
My personal flake-based NixOS configuration system that supports reproducible setups across NixOS desktops, servers, and macOS machines using shared, modular components.
Portable, reproducible Nix shell environments with all your DevOps tools and dotfiles. Run them anywhere, instantly.
This is my functional and minimalist NixVim configuration, featuring a carefully curated set of essential plugins for an efficient, distraction-free editing experience.
Automated OpenStack deployment pipeline powered by Terraform and Ansible. Build, provision, and configure with ease.
A personal, fully scalable Ansible lab powered by Docker and based on openSUSE Tumbleweed, easily customizable to any Linux base image. Perfect for testing and developing automation workflows.
Automated Ceph homelab deployment on scalable, multi-node architectures, using Terraform and Ansible for infrastructure provisioning and orchestration.
Feel free to reach out through any of these channels.