
GPU Infrastructure & Networking Architect

About the Job
The GPU Infrastructure & Networking Architect is responsible for the Low-Level Design (LLD) and Day-0 implementation of the physical GPU compute and DPU-centric networking fabric within the CloudXP Sovereign AI Cloud. This role owns the end-to-end design of the GB300 NVL72 bare-metal compute layer, NVIDIA BlueField-3 DPU deployment in DPF mode, Spectrum-X GPU TAN, Spectrum-3 ancillary fabric, and F5/Netris internet egress. The architect translates the High-Level Design (HLD) into deployable, validated configurations aligned to the NVIDIA Reference Architecture.
Key Responsibilities
1. Design Infrastructure LLD & Day-0 Configuration
• Produce rack-level and node-level LLD for GB300 NVL72 bare-metal GPU nodes including cabling, power, and cooling topology
• Design and document BF-3 DPU deployment in DPF mode across all node types (GB300 and ancillary); validate DOCA DPF Operator configuration and lifecycle
• Define NVIDIA NICo zero-trust enrollment workflow for GB300 nodes; document pre-boot attestation sequences and BMC integration
• Produce IP addressing schemes for GPU TAN (Spectrum-X), ancillary compute fabric (Spectrum-3), OOB management, and storage planes
• Author Day-0 runbooks for Spectrum-X spine-leaf bring-up, rail-optimised RoCE configuration, and MTU/ECN tuning for GB300 workloads
2. Design Networking & Egress
• Design distributed BGP egress using FRR DaemonSet with /32 EIP injection; configure eBGP peering between AS 65100 (compute) and AS 65000 (Spectrum-3 SN4600C spine)
• Own F5 AWAF hardware egress configuration as the sole internet exit path; define F5 BNK + DPF co-deployment on BF-3 for north-south traffic
• Design OVN-Kubernetes overlay for ancillary nodes and DPF Host-Trusted mode integration; validate SF+VF coexistence on BF-3
• Produce VXLAN segment maps, VRF isolation design, and HBN VRF scaling analysis for ZoneVPC DaemonSet model
• Define RDMA/RoCE network policies and lossless fabric requirements for GB300 scale-out training workloads
• Implement Netris-based cloud virtual functions and
3. Design Storage Connectivity
• Design VAST NFS mount architecture for GB300 nodes; validate data-path performance at scale and define NFS tuning parameters
• Design NetApp ONTAP block storage connectivity for ancillary KubeVirt VMs; document iSCSI/NVMe-oF path configuration
• Produce storage LLD covering StorageGRID object connectivity, zone affinity, and multi-tenancy namespace isolation
4. Implement Validation & NVIDIA Alignment
• Execute hardware bring-up validation against NVIDIA GB300 NVL72 Dual-Plane Networking Reference Architecture
• Coordinate with NVIDIA field engineering on DTS Prometheus port accessibility in DPF mode and UFM 6.x metric naming compatibility
• Produce test plans and acceptance criteria for network fabric, GPU TAN, and egress path; participate in NVIDIA NCP validation reviews
Required Skills & Experience
Must-Have
• 10+ years in data-centre infrastructure architecture with 3+ years on GPU/AI cluster deployments at scale (100+ nodes)
• Hands-on experience with NVIDIA BlueField DPUs (BF-2 or BF-3); knowledge of DOCA SDK, DPF Operator, and OVN-K integration
• Deep expertise in BGP (eBGP/iBGP), RoCE/RDMA networking, and lossless Ethernet fabric design (PFC, ECN, DCQCN)
• Proficiency with NVIDIA Spectrum switches (UFM, SHARP, rail-optimised topology) or comparable InfiniBand/Ethernet AI fabrics
• Experience with F5 BIG-IP (hardware AWAF / BIG-IP Next for Kubernetes); familiarity with BIG-IP as Kubernetes ingress/egress
• Strong Linux networking background: VXLAN, VRF, VLAN, OVN/OVS, kernel datapath, SR-IOV, VF/SF configuration
• Experience with Netris implementation and customization
• Proficiency in Python and/or Go for automation scripts and infrastructure-as-code tooling; Ansible/Terraform for Day-0 provisioning
Nice-to-Have
• Familiarity with NVIDIA NICo zero-trust bare-metal enrollment and attestation workflows
• Experience with NetApp ONTAP and VAST Data NFS storage platforms in high-performance compute environments
• Knowledge of NVIDIA UFM (Unified Fabric Manager) 6.x and Telemetry Streaming for GPU fabric observability
• Prior engagement with NVIDIA Cloud Partner (NCP) programme or Sovereign AI Cloud deployments
• Understanding of Kubernetes CNI plugins (OVN-Kubernetes, Cilium) and their interaction with DPU offload
About the company
Industry
Media & Telecommunication...
Company Size
10001+ Employees
Headquarters
Navi Mumbai, Maharashtra
Other open jobs from Jio
