B O F H

William Vera

Senior AI Infrastructure Engineer | GPU/HPC Operations | LLM & AIOps | SRE | DevOps

About Me

Senior AI Infrastructure Engineer with 12+ years building and operating large-scale distributed systems, from enterprise cloud platforms (SUSE, Red Hat, Nokia) to GPU-scale AI infrastructure.

Currently operating an HPC cluster of 4,000+ NVIDIA H100 GPUs (500+ nodes, 8x H100 80GB HBM3 per node) that supports distributed LLM training, with full-stack ownership of production monitoring, self-healing alerting, auto-remediation, and LLM-powered operations tooling.

NVIDIA-certified in AI Infrastructure and Generative AI LLMs. Hands-on with GPU cluster management, CUDA/NCCL workloads, InfiniBand networking, and production LLM agent development with RAG, MCP, and multi-model architectures.

AI / LLM Ops

Building LLM-powered operations agents with RAG, MCP, and NL-to-SQL. Multi-model support (Claude, OpenAI, Ollama) for infrastructure diagnostics.

GPU / HPC

Operating a 4,000+ NVIDIA H100 GPU cluster (500+ nodes). DCGM, CUDA/NCCL, SLURM scheduling, InfiniBand fabric, GPUDirect RDMA, and distributed training orchestration.

SRE / Platform

Self-healing monitoring with 12+ background modules, auto-remediation workflows, alert deduplication, and unified observability (Prometheus, Grafana, TimescaleDB).
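
As an illustration of the alert deduplication mentioned above, here is a minimal sketch, not the production implementation. It assumes an alert is identified by node, check name, and severity, and that repeats are suppressed for a fixed window after the last notification. The class name, fields, and the 300-second default are all hypothetical.

```python
import hashlib
import time
from typing import Dict, Optional


class AlertDeduplicator:
    """Suppress repeats of the same alert for a window after the last send."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        # Hypothetical default: 5 minutes of suppression per alert identity.
        self.window_seconds = window_seconds
        self._last_sent: Dict[str, float] = {}

    @staticmethod
    def fingerprint(node: str, check: str, severity: str) -> str:
        # Assumed alert identity: which node, which check, how severe.
        key = f"{node}|{check}|{severity}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()

    def should_send(
        self, node: str, check: str, severity: str, now: Optional[float] = None
    ) -> bool:
        """Return True if this alert should be forwarded, False if suppressed."""
        now = time.time() if now is None else now
        fp = self.fingerprint(node, check, severity)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.window_seconds:
            return False  # same alert was already sent inside the window
        self._last_sent[fp] = now
        return True
```

Passing `now` explicitly keeps the logic testable without sleeping; a real system would likely also expire old fingerprints and persist state across restarts.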

Cloud / DevOps

Infrastructure as Code, CI/CD pipelines, container orchestration, and multi-cloud architecture across AWS, Azure, GCP, OpenStack, and OCI.

Skills

AI / LLM

  • LLM Agent Development
  • RAG (pgvector)
  • Model Context Protocol (MCP)
  • Prompt Engineering
  • Claude / OpenAI / Ollama APIs
  • NL-to-SQL Compilation
  • AIOps & AI-Driven Automation

GPU / HPC

  • NVIDIA DCGM
  • CUDA / NCCL
  • NVLink Diagnostics
  • SLURM
  • InfiniBand / RDMA / RoCE
  • OpenMPI
  • HPC Benchmarking (IOR, HPL)

ML Training

  • PyTorch / torchrun
  • FSDP (Distributed Training)
  • Mixture of Experts (MoE)
  • MosaicML Composer
  • Flash-Attention
  • Weights & Biases / TensorBoard
  • Training Orchestration (4,000+ GPUs)

Observability

  • Prometheus / Telegraf
  • Grafana
  • TimescaleDB
  • Custom Monitor Stacks
  • Slack-Integrated Alerting

Languages

  • Python
  • Bash
  • Nix
  • YAML

Cloud

  • AWS
  • Azure
  • GCP
  • Oracle Cloud (OCI)
  • OpenStack
  • Ceph Storage

IaC / DevOps

  • Terraform
  • Ansible
  • Chef / SaltStack
  • Git / CI/CD
  • Jenkins / Hydra

Containers

  • Docker
  • Kubernetes (CKA)
  • NVIDIA Container Runtime
  • Docker Compose

Networking

  • InfiniBand (ConnectX-6)
  • Mellanox UFM
  • Dell SONiC
  • SNMP / RDMA / UCX
  • Network Automation

Data

  • PostgreSQL / TimescaleDB
  • Redis
  • Elasticsearch
  • Pandas / NumPy

Hardware / DC

  • Dell iDRAC / Redfish
  • IPMI / BIOS (racadm)
  • VAST Data / NFS
  • PSU / Thermal Monitoring
  • 500+ Node Cluster Ops

Frameworks

  • FastAPI / Uvicorn
  • Slack Bolt SDK
  • Chainlit
  • REST / Redfish APIs

Linux

  • RPM / DEB / NixOS
  • Server Hardening
  • Kernel Tuning
  • SSH Orchestration

Certifications

NVIDIA

NVIDIA-Certified Associate: AI Infrastructure and Operations

996e5a60-e96b-4a43-a37a-42d267a207f4

NVIDIA

NVIDIA-Certified Associate: Generative AI LLMs

a4647c10-d4bf-4dde-9452-514413e6e291

HashiCorp

Certified: Terraform Associate (003)

aeceb2f3-33ac-4ad4-a94a-dbb177fc619a

CNCF

Certified Kubernetes Administrator

LF-a2o0xdlc96

OpenStack

Certified OpenStack Administrator

COA-1600-0067-0100

AWS

AWS Certified Cloud Practitioner

K40MK0MDFEEQ13CJ

Mirantis

OpenStack Administrator Certification Professional Level

200-422-511

Red Hat

Red Hat Certified Administrator

150-132-256

Linux Foundation

Linux Foundation Certified Engineer

LFCE-1500-0058-0200

EC-Council

Certified Ethical Hacker v9

ECC55284410087

Axelos

ITIL Foundation Certificate in IT Service Management

5426913.20427512

IBM

IBM Certified System Administrator - AIX 7

03005008

SUSE

SUSE Enterprise Architect

10186201

SUSE

SUSE Certified Instructor

10186201

SUSE

Certified Engineer Enterprise Linux

10186201

SUSE

Certified Engineer OpenStack Cloud

10186201

SUSE

Certified Administrator Enterprise Storage (Ceph)

10186201

SUSE

Certified Administrator Systems Management (SUSE Manager)

10186201

Projects

sysBOFH

Personal flake-based NixOS configuration system supporting reproducible setups across NixOS desktops, servers, and macOS machines using shared, modular components.

NixOS, Flakes, Home Manager

DevOps Nix Shell

Portable, reproducible Nix shell environments with all your DevOps tools and dotfiles. Run them anywhere, instantly.

Nix, DevOps, Shell

BOFH NixVim

Functional and minimalist NixVim configuration, featuring a carefully curated set of essential plugins for an efficient, distraction-free editing experience.

NixVim, Neovim, Nix

Anvil

Containerized GPU cluster benchmarking platform. Orchestrates NCCL collective tests, HPL compute benchmarks, and DCGM thermal profiling across distributed nodes with multi-node SSH orchestration and Prometheus-based metric collection.
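
To make the Prometheus-based thermal collection concrete, the sketch below parses Prometheus text-format samples and flags hot GPUs. It is an assumption-laden illustration, not Anvil's code: it assumes the metric is exposed as `DCGM_FI_DEV_GPU_TEMP` with `gpu` and `Hostname` labels (as dcgm-exporter commonly does), and the 85 C limit and helper names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches lines like: metric_name{label="value",...} 42
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str, metric: str) -> List[Tuple[Dict[str, str], float]]:
    """Return (labels, value) pairs for one metric from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if m is None or m.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(m.group("labels")))
        samples.append((labels, float(m.group("value"))))
    return samples


def hot_gpus(text: str, limit_c: float = 85.0) -> List[Tuple[str, str, float]]:
    """List (hostname, gpu index, temperature) for GPUs at or above limit_c."""
    return [
        (labels.get("Hostname", "?"), labels.get("gpu", "?"), value)
        for labels, value in parse_samples(text, "DCGM_FI_DEV_GPU_TEMP")
        if value >= limit_c
    ]


# Hypothetical scrape output used only for illustration.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node001"} 64
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node001"} 88
DCGM_FI_DEV_POWER_USAGE{gpu="0",Hostname="node001"} 412.5
"""

print(hot_gpus(SAMPLE))
```

The hand-rolled parser ignores escaped quotes and braces inside label values, which is acceptable for a sketch but not for arbitrary exporters.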

NCCL, HPL, OpenMPI, Docker, Prometheus

5cents-cluster

Containerized NCCL test runner for single and multi-node GPU configurations. Tests collective communication performance (all_reduce, all_gather, broadcast) with dual build modes and automated node connectivity validation.
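
For readers unfamiliar with how collective performance is usually reported, the sketch below converts a measured time into algorithm bandwidth and bus bandwidth, using the correction factors documented by the nccl-tests project for these three collectives. The function names are hypothetical and this is not taken from 5cents-cluster itself.

```python
from typing import Callable, Dict

# Bus-bandwidth correction factors as described in the nccl-tests
# performance notes; n is the number of ranks.
BUS_FACTORS: Dict[str, Callable[[int], float]] = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
}


def algorithm_bandwidth_gbs(size_bytes: float, seconds: float) -> float:
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return size_bytes / seconds / 1e9


def bus_bandwidth_gbs(
    collective: str, size_bytes: float, seconds: float, n_ranks: int
) -> float:
    """Bus bandwidth in GB/s, comparable across different rank counts."""
    if collective not in BUS_FACTORS:
        raise ValueError(f"unsupported collective: {collective}")
    if n_ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive time")
    factor = BUS_FACTORS[collective](n_ranks)
    return algorithm_bandwidth_gbs(size_bytes, seconds) * factor


# Example: a 1 GB all_reduce across 8 ranks finishing in 10 ms.
print(bus_bandwidth_gbs("all_reduce", 1e9, 0.010, 8))
```

Bus bandwidth is the figure to compare against link speed, since it removes the dependence on the number of ranks that algorithm bandwidth carries.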

NCCL, Docker, CUDA, SSH

IoRi

Distributed I/O benchmarking tool for GPU clusters. Automated containerized IoR tests with OpenMPI across multiple nodes, flexible node range specification, and pre-flight connectivity validation with structured result collection.
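
The exact range syntax IoRi accepts is not stated here, so the following sketch assumes a Slurm-style hostlist such as `gpu[001-003,010]` purely for illustration. The function name and the accepted grammar are hypothetical.

```python
import re
from typing import List

# One bracketed group with optional prefix and suffix, e.g. gpu[001-003,010]
_RANGE_RE = re.compile(
    r"^(?P<prefix>[^\[\]]*)\[(?P<body>[^\[\]]+)\](?P<suffix>[^\[\]]*)$"
)


def expand_nodes(spec: str) -> List[str]:
    """Expand an assumed Slurm-style node range into individual hostnames."""
    spec = spec.strip()
    m = _RANGE_RE.match(spec)
    if m is None:
        return [spec]  # no brackets: treat as a single plain hostname
    prefix, body, suffix = m.group("prefix"), m.group("body"), m.group("suffix")
    hosts: List[str] = []
    for part in body.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-", 1)
            width = len(lo)  # keep zero padding from the lower bound
            start, end = int(lo), int(hi)
            if end < start:
                raise ValueError(f"descending range: {part}")
            hosts.extend(
                f"{prefix}{i:0{width}d}{suffix}" for i in range(start, end + 1)
            )
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


print(expand_nodes("gpu[001-003,010]"))
```

Expanding the list up front makes it straightforward to run a connectivity check against every host before any benchmark starts.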

IoR, OpenMPI, Docker, POSIX I/O

OpenStack Lab

Zero-touch OpenStack deployment pipeline: Terraform provisions the infrastructure via libvirt, cloud-init bootstraps the OS, and Ansible drives DevStack end-to-end. One command from bare metal to a fully operational private cloud with Keystone, Nova, Neutron, and Horizon.

OpenStack, Terraform, Ansible, libvirt

Ansible Lab

Dynamically scalable Ansible sandbox built on Docker Compose. Spin up N target nodes on demand with a single parameter, all SSH-ready with auto-provisioned keys. Custom openSUSE Tumbleweed image stripped of systemd for a minimal, production-realistic footprint.
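
One way to picture the "single parameter" scaling is a small generator that emits an INI inventory for N targets. This is a sketch under assumptions, not the lab's actual tooling: the host naming scheme, group name, and remote user are hypothetical, and only `ansible_user` is a standard Ansible inventory variable.

```python
def build_inventory(
    count: int, prefix: str = "node", group: str = "targets", user: str = "ansible"
) -> str:
    """Render an INI-style Ansible inventory for `count` target nodes.

    Hostnames (node1..nodeN), the group name, and the user are
    illustrative assumptions rather than the project's real values.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    lines = [f"[{group}]"]
    for i in range(1, count + 1):
        lines.append(f"{prefix}{i} ansible_user={user}")
    return "\n".join(lines) + "\n"


print(build_inventory(3))
```

Generating the inventory from the same number used to scale the containers keeps the two from drifting apart.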

Ansible, Docker Compose, openSUSE, SSH

CEPH Lab

Scalable Ceph distributed storage cluster with isolated dual-network topology (admin + cluster replication). Terraform provisions MON, OSD, and admin nodes on demand; Ansible drives ceph-deploy orchestration with Vault-managed secrets. From zero to a production-grade storage fabric in minutes.

Ceph, Terraform, Ansible, libvirt

Get In Touch

Feel free to reach out through any of these channels.