B O F H

William Vera

Senior AI Infrastructure Engineer | GPU/HPC Operations | LLM & AIOps | SRE | DevOps

About Me

Senior AI Infrastructure Engineer with 12+ years building and operating large-scale distributed systems, from enterprise cloud platforms (SUSE, Red Hat, Nokia) to GPU-scale AI infrastructure.

Currently operating a 4,000+ NVIDIA H100 GPU HPC cluster (500+ nodes, 8x H100 80GB HBM3 per node) supporting distributed LLM training, with full-stack ownership of production monitoring, self-healing alerting, auto-remediation, and LLM-powered operations tooling.

NVIDIA-certified in AI Infrastructure and Generative AI LLMs. Hands-on with GPU cluster management, CUDA/NCCL workloads, InfiniBand networking, and production LLM agent development with RAG, MCP, and multi-model architectures.

AI / LLM Ops

Building LLM-powered operations agents with RAG, MCP, and NL-to-SQL. Multi-model support (Claude, OpenAI, Ollama) for infrastructure diagnostics.

GPU / HPC

Operating a 4,000+ NVIDIA H100 GPU cluster (500+ nodes). DCGM, CUDA/NCCL, SLURM scheduling, InfiniBand fabric, GPUDirect RDMA, and distributed training orchestration.

SRE / Platform

Self-healing monitoring with 12+ background modules, auto-remediation workflows, alert deduplication, and unified observability (Prometheus, Grafana, TimescaleDB).

Cloud / DevOps

Infrastructure as Code, CI/CD pipelines, container orchestration, and multi-cloud architecture across AWS, Azure, GCP, OpenStack, and OCI.

Skills

AI / LLM

LLM Agent Development
RAG (pgvector)
Model Context Protocol (MCP)
Prompt Engineering
Claude / OpenAI / Ollama APIs
NL-to-SQL Compilation
AIOps & AI-Driven Automation

GPU / HPC

NVIDIA DCGM
CUDA / NCCL
NVLink Diagnostics
SLURM
InfiniBand / RDMA / RoCE
OpenMPI
HPC Benchmarking (IOR, HPL)

ML Training

PyTorch / torchrun
FSDP (Distributed Training)
Mixture of Experts (MoE)
MosaicML Composer
Flash-Attention
Weights & Biases / TensorBoard
Training Orchestration (4,000+ GPUs)

Observability

Prometheus / Telegraf
Grafana
TimescaleDB
Custom Monitor Stacks
Slack-Integrated Alerting

Languages

Python
Bash
Nix
YAML

Cloud

AWS
Azure
GCP
Oracle Cloud (OCI)
OpenStack
Ceph Storage

IaC / DevOps

Terraform
Ansible
Chef / SaltStack
Git / CI/CD
Jenkins / Hydra

Containers

Docker
Kubernetes (CKA)
NVIDIA Container Runtime
Docker Compose

Networking

InfiniBand (ConnectX-6)
Mellanox UFM
Dell SONiC
SNMP / RDMA / UCX
Network Automation

Data

PostgreSQL / TimescaleDB
Redis
Elasticsearch
Pandas / NumPy

Hardware / DC

Dell iDRAC / Redfish
IPMI / BIOS (racadm)
VAST Data / NFS
PSU / Thermal Monitoring
500+ Node Cluster Ops

Frameworks

FastAPI / Uvicorn
Slack Bolt SDK
Chainlit
REST / Redfish APIs

Linux

RPM / DEB / NixOS
Server Hardening
Kernel Tuning
SSH Orchestration

Certifications

NVIDIA-Certified Associate: AI Infrastructure and Operations

996e5a60-e96b-4a43-a37a-42d267a207f4

NVIDIA-Certified Associate: Generative AI LLMs

a4647c10-d4bf-4dde-9452-514413e6e291

Certified: Terraform Associate (003)

aeceb2f3-33ac-4ad4-a94a-dbb177fc619a

Certified Kubernetes Administrator

LF-a2o0xdlc96

Certified OpenStack Administrator

COA-1600-0067-0100

AWS Certified Cloud Practitioner

K40MK0MDFEEQ13CJ

OpenStack Administrator Certification Professional Level

200-422-511

Red Hat Certified Administrator

150-132-256

Linux Foundation Certified Engineer

LFCE-1500-0058-0200

Certified Ethical Hacker v9

ECC55284410087

ITIL Foundation Certificate in IT Service Management

5426913.20427512

IBM Certified System Administrator - AIX 7

03005008

SUSE Enterprise Architect

10186201

SUSE Certified Instructor

10186201

Certified Engineer Enterprise Linux

10186201

Certified Engineer OpenStack Cloud

10186201

Certified Administrator Enterprise Storage (Ceph)

10186201

Certified Administrator Systems Management (SUSE Manager)

10186201

Projects

SlaSH

Slack bot for GPU datacenter operations. Real-time GPU health monitoring (DCGM, NVLink, ECC), InfiniBand per-lane error analysis, and 13-test Dell diagnostics bundle with automated TSR collection. Background monitors for thermal, network, and NFS health with auto-remediation and intelligent alert deduplication.

Python | Slack Bolt | iDRAC/Redfish | InfiniBand | DCGM
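To illustrate the alert deduplication idea mentioned above, here is a minimal sketch of one common approach: suppressing repeats of the same alert inside a cooldown window. It is not SlaSH's actual code. The class name `AlertDeduplicator`, the `(node, check)` key, and the cooldown default are all assumptions made for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class AlertDeduplicator:
    """Suppress repeats of the same alert inside a cooldown window.

    Hypothetical helper for illustration only; the key shape and the
    default cooldown are assumptions, not taken from SlaSH.
    """

    cooldown_s: float = 300.0  # assumed default: 5 minutes
    _last_sent: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def should_send(self, node: str, check: str,
                    now: Optional[float] = None) -> bool:
        """Return True if this (node, check) alert should be delivered."""
        now = time.monotonic() if now is None else now
        key = (node, check)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            # Same alert fired recently: drop it as a duplicate.
            return False
        # First occurrence, or the cooldown has expired: record and send.
        self._last_sent[key] = now
        return True
```

A background monitor would call `should_send("gpu-node-001", "thermal")` before posting to Slack, so a flapping sensor produces one message per window instead of one per poll.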

Elliot

LLM-powered operations agent for GPU datacenter diagnostics. Natural language to SQL against TimescaleDB, protocol-aware discovery (Redfish/SNMP/SSH), and tool evolution from repeated query patterns. Multi-interface: Slack, REST API, CLI, and Chainlit web UI.

Python | Ollama | TimescaleDB | FastAPI | Chainlit

PapayaPlaya

LLM training orchestration for GPU clusters with no SLURM dependency. Manages NGC container distribution, NFS-backed checkpoint storage, and torchrun-based distributed training. Intelligent image transfer logic and automated training result analysis with W&B integration.

PyTorch | Docker | torchrun | NFS | CUDA

sysBOFH

Personal flake-based NixOS configuration system supporting reproducible setups across NixOS desktops, servers, and macOS machines using shared, modular components.

NixOS | Flakes | Home Manager

DevOps Nix Shell

Portable, reproducible Nix shell environments with all your DevOps tools and dotfiles. Run them anywhere, instantly.

Nix | DevOps | Shell

BOFH NixVim

Functional and minimalist NixVim configuration, featuring a carefully curated set of essential plugins for an efficient, distraction-free editing experience.

NixVim | Neovim | Nix

Anvil

Containerized GPU cluster benchmarking platform. Orchestrates NCCL collective tests, HPL compute benchmarks, and DCGM thermal profiling across distributed nodes with multi-node SSH orchestration and Prometheus-based metric collection.

NCCL | HPL | OpenMPI | Docker | Prometheus

5cents-cluster

Containerized NCCL test runner for single and multi-node GPU configurations. Tests collective communication performance (all_reduce, all_gather, broadcast) with dual build modes and automated node connectivity validation.

NCCL | Docker | CUDA | SSH

IoRi

Distributed I/O benchmarking tool for GPU clusters. Automated containerized IOR tests with OpenMPI across multiple nodes, flexible node range specification, and pre-flight connectivity validation with structured result collection.

IOR | OpenMPI | Docker | POSIX I/O
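As a rough illustration of what a flexible node range specification can look like, the sketch below expands a bracketed, hostlist-style range into individual hostnames. The syntax (`prefix[a-b,c]`) and the function name `expand_nodes` are assumptions for this example. The format IoRi actually accepts is not stated here.

```python
import re
from typing import List


def expand_nodes(spec: str) -> List[str]:
    """Expand 'gpu[001-003,007]' into individual hostnames.

    Hypothetical syntax for illustration: one prefix followed by a
    bracketed, comma-separated list of numbers or inclusive ranges.
    Zero padding is preserved from the start of each range.
    """
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", spec)
    if match is None:
        # No bracket expression: treat the spec as a single hostname.
        return [spec]

    prefix, body = match.groups()
    nodes: List[str] = []
    for part in body.split(","):
        start, _, end = part.strip().partition("-")
        end = end or start      # a bare number is a one-element range
        width = len(start)      # keep leading zeros, e.g. 001
        for n in range(int(start), int(end) + 1):
            nodes.append(f"{prefix}{n:0{width}d}")
    return nodes


if __name__ == "__main__":
    print(expand_nodes("gpu[001-003,007]"))
```

The expanded list could then feed an MPI hostfile or a pre-flight SSH connectivity check, one entry per node.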

OpenStack Lab

Zero-touch OpenStack deployment pipeline: Terraform provisions the infrastructure via libvirt, cloud-init bootstraps the OS, and Ansible drives DevStack end-to-end. One command from bare metal to a fully operational private cloud with Keystone, Nova, Neutron, and Horizon.

OpenStack | Terraform | Ansible | libvirt

Ansible Lab

Dynamically scalable Ansible sandbox built on Docker Compose. Spin up N target nodes on demand with a single parameter, all SSH-ready with auto-provisioned keys. Custom openSUSE Tumbleweed image stripped of systemd for a minimal, production-realistic footprint.

Ansible | Docker Compose | openSUSE | SSH

CEPH Lab

Scalable Ceph distributed storage cluster with isolated dual-network topology (admin + cluster replication). Terraform provisions MON, OSD, and admin nodes on demand; Ansible drives ceph-deploy orchestration with Vault-managed secrets. From zero to a production-grade storage fabric in minutes.

Ceph | Terraform | Ansible | libvirt

Get In Touch

Feel free to reach out through any of these channels.