B O F H

William Vera

AI Infrastructure Engineer | GPU/HPC Operations | LLM & AIOps | SRE | Cloud Architect

About Me

12+ years building and operating large-scale infrastructure. Currently managing a 500+ node GPU HPC cluster, orchestrating distributed LLM training across GPU fleets, and building AI-powered operations platforms for datacenter-scale environments.

NVIDIA-certified in AI Infrastructure and Generative AI LLMs. Hands-on with GPU cluster management, CUDA/NCCL workloads, InfiniBand networking, distributed model training (PyTorch FSDP, MoE architectures), and production LLM agent development with RAG, MCP, and multi-model architectures.

Experienced in building self-healing monitoring systems with automated alerting, deduplication, and escalation policies.

AI / LLM Ops

Orchestrating distributed LLM training (PyTorch FSDP, MoE), building production LLM agents, RAG pipelines, and AI-powered operations platforms.

GPU / HPC

Operating 500+ node GPU clusters, NVIDIA DCGM, CUDA/NCCL workloads, SLURM scheduling, InfiniBand fabric, and distributed compute orchestration.

SRE / Platform

Scalability, resilience, and observability for large systems. Custom monitoring stacks, automated remediation, and incident response for mission-critical systems.

Cloud / DevOps

Infrastructure as Code, CI/CD pipelines, container orchestration, and multi-cloud architecture across AWS, Azure, GCP, OpenStack, and OCI.

Skills

AI / LLM

  • LLM Agent Development
  • RAG (pgvector)
  • Model Context Protocol (MCP)
  • Prompt Engineering
  • Claude / OpenAI / Ollama APIs
  • NL-to-SQL Compilation
  • AIOps & AI-Driven Automation

GPU / HPC

  • NVIDIA DCGM
  • CUDA / NCCL
  • NVLink Diagnostics
  • SLURM
  • InfiniBand / RDMA / RoCE
  • OpenMPI
  • HPC Benchmarking (IOR, HPL)

ML Training

  • PyTorch / torchrun
  • FSDP (Distributed Training)
  • Mixture of Experts (MoE)
  • MosaicML Composer
  • Flash-Attention
  • Weights & Biases / TensorBoard
  • Training Orchestration (500+ nodes)

Observability

  • Prometheus / Telegraf
  • Grafana
  • TimescaleDB
  • Custom Monitor Stacks
  • Slack-Integrated Alerting

Languages

  • Python
  • Bash
  • Nix
  • YAML

Cloud

  • AWS
  • Azure
  • GCP
  • Oracle Cloud (OCI)
  • OpenStack
  • Ceph Storage

IaC / DevOps

  • Terraform
  • Ansible
  • Chef / SaltStack
  • Git / CI/CD
  • Jenkins / Hydra

Containers

  • Docker
  • Kubernetes (CKA)
  • NVIDIA Container Runtime
  • Docker Compose

Networking

  • InfiniBand (ConnectX-6)
  • Mellanox UFM
  • Dell SONiC
  • SNMP / RDMA / UCX
  • Network Automation

Data

  • PostgreSQL / TimescaleDB
  • Redis
  • Elasticsearch
  • Pandas / NumPy

Hardware / DC

  • Dell iDRAC / Redfish
  • IPMI / BIOS (racadm)
  • VAST Data / NFS
  • PSU / Thermal Monitoring
  • 500+ Node Cluster Ops

Frameworks

  • FastAPI / Uvicorn
  • Slack Bolt SDK
  • Chainlit
  • REST / Redfish APIs

Linux

  • RPM / DEB / NixOS
  • Server Hardening
  • Kernel Tuning
  • SSH Orchestration

Certifications

NVIDIA

NVIDIA-Certified Associate: AI Infrastructure and Operations

996e5a60-e96b-4a43-a37a-42d267a207f4

NVIDIA

NVIDIA-Certified Associate: Generative AI LLMs

a4647c10-d4bf-4dde-9452-514413e6e291

HashiCorp

Certified: Terraform Associate (003)

aeceb2f3-33ac-4ad4-a94a-dbb177fc619a

CNCF

Certified Kubernetes Administrator

LF-a2o0xdlc96

OpenStack

Certified OpenStack Administrator

COA-1600-0067-0100

AWS

AWS Certified Cloud Practitioner

K40MK0MDFEEQ13CJ

Mirantis

OpenStack Administrator Certification Professional Level

200-422-511

Red Hat

Red Hat Certified Administrator

150-132-256

Linux Foundation

Linux Foundation Certified Engineer

LFCE-1500-0058-0200

EC-Council

Certified Ethical Hacker v9

ECC55284410087

Axelos

ITIL Foundation Certificate in IT Service Management

5426913.20427512

IBM

IBM Certified System Administrator - AIX 7

03005008

SUSE

SUSE Enterprise Architect

10186201

SUSE

SUSE Certified Instructor

10186201

SUSE

Certified Engineer Enterprise Linux

10186201

SUSE

Certified Engineer OpenStack Cloud

10186201

SUSE

Certified Administrator Enterprise Storage (Ceph)

10186201

SUSE

Certified Administrator Systems Management (SUSE Manager)

10186201

HOME-LAB

sysBOFH

My personal flake-based NixOS configuration system that supports reproducible setups across NixOS desktops, servers, and macOS machines using shared, modular components.

DevOps Nix Shell

Portable, reproducible Nix shell environments with all your DevOps tools and dotfiles. Run them anywhere, instantly.

BOFH NixVim

This is my functional and minimalist NixVim configuration, featuring a carefully curated set of essential plugins for an efficient, distraction-free editing experience.

OpenStack Lab

Automated OpenStack deployment pipeline powered by Terraform and Ansible. Build, provision, and configure with ease.

Ansible Lab

A personal, fully scalable Ansible lab powered by Docker. It is based on openSUSE Tumbleweed and can be customized to use any Linux base image. Perfect for testing and developing automation workflows.

CEPH Lab

Automated Ceph homelab deployment on scalable, multi-node architectures, using Terraform and Ansible for infrastructure provisioning and orchestration.

Get In Touch

Feel free to reach out through any of these channels.