GPU Infrastructure & Networking Architect

Mumbai
Mumbai Suburban
Navi Mumbai
Bangalore
Full-Time
Senior: 10 to 20 years
Posted on Jun 09 2026

About the Job

Skills

GPU
GB300
Netris
NVL72 bare-metal
NVIDIA
F5

The GPU Infrastructure & Networking Architect is responsible for the Low-Level Design (LLD) and Day-0 implementation of the physical GPU compute and DPU-centric networking fabric within the CloudXP Sovereign AI Cloud. This role owns the end-to-end design of the GB300 NVL72 bare-metal compute layer, NVIDIA BlueField-3 DPU deployment in DPF mode, Spectrum-X GPU TAN, Spectrum-3 ancillary fabric, and F5/Netris internet egress — translating the High-Level Design (HLD) into deployable, validated configurations aligned to the NVIDIA Reference Architecture.


Key Responsibilities


1.   Design Infrastructure LLD & Day-0 Configuration

•    Produce rack-level and node-level LLD for GB300 NVL72 bare-metal GPU nodes including cabling, power, and cooling topology

•    Design and document BF-3 DPU deployment in DPF mode across all node types (GB300 and ancillary); validate DOCA DPF Operator configuration and lifecycle

•    Define NVIDIA NICo zero-trust enrollment workflow for GB300 nodes; document pre-boot attestation sequences and BMC integration

•    Produce IP addressing schemes for GPU TAN (Spectrum-X), ancillary compute fabric (Spectrum-3), OOB management, and storage planes

•    Author Day-0 runbooks for Spectrum-X spine-leaf bring-up, rail-optimised RoCE configuration, and MTU/ECN tuning for GB300 workloads


2.   Design Networking & Egress

•    Design distributed BGP egress using FRR DaemonSet with /32 EIP injection; configure eBGP peering between AS 65100 (compute) and AS 65000 (Spectrum-3 SN4600C spine)

•    Own F5 AWAF hardware egress configuration as the sole internet exit path; define F5 BNK + DPF co-deployment on BF-3 for north-south traffic

•    Design OVN-Kubernetes overlay for ancillary nodes and DPF Host-Trusted mode integration; validate SF+VF coexistence on BF-3

•    Produce VXLAN segment maps, VRF isolation design, and HBN VRF scaling analysis for ZoneVPC DaemonSet model

•    Define RDMA/RoCE network policies and lossless fabric requirements for GB300 scale-out training workloads

•    Implement Netris-based cloud virtual functions and


3.   Design Storage Connectivity

•    Design VAST NFS mount architecture for GB300 nodes; validate data-path performance at scale and define NFS tuning parameters

•    Design NetApp ONTAP block storage connectivity for ancillary KubeVirt VMs; document iSCSI/NVMe-oF path configuration

•    Produce storage LLD covering StorageGRID object connectivity, zone affinity, and multi-tenancy namespace isolation


4.   Implement Validation & NVIDIA Alignment

•    Execute hardware bring-up validation against NVIDIA GB300 NVL72 Dual-Plane Networking Reference Architecture

•    Coordinate with NVIDIA field engineering on DTS Prometheus port accessibility in DPF mode and UFM 6.x metric naming compatibility

•    Produce test plans and acceptance criteria for network fabric, GPU TAN, and egress path; participate in NVIDIA NCP validation reviews


Required Skills & Experience


Must-Have

•    10+ years in data-centre infrastructure architecture with 3+ years on GPU/AI cluster deployments at scale (100+ nodes)

•    Hands-on experience with NVIDIA BlueField DPUs (BF-2 or BF-3); knowledge of DOCA SDK, DPF Operator, and OVN-K integration

•    Deep expertise in BGP (eBGP/iBGP), RoCE/RDMA networking, and lossless Ethernet fabric design (PFC, ECN, DCQCN)

•    Proficiency with NVIDIA Spectrum switches (UFM, SHARP, rail-optimised topology) or comparable InfiniBand/Ethernet AI fabrics

•    Experience with F5 BIG-IP (hardware AWAF / BIG-IP Next for Kubernetes); familiarity with BIG-IP as Kubernetes ingress/egress

•    Strong Linux networking background: VXLAN, VRF, VLAN, OVN/OVS, kernel datapath, SR-IOV, VF/SF configuration

•    Experience with Netris implementation and customization

•    Proficiency in Python and/or Go for automation scripts and infrastructure-as-code tooling; Ansible/Terraform for Day-0 provisioning


Nice-to-Have

•    Familiarity with NVIDIA NICo zero-trust bare-metal enrollment and attestation workflows

•    Experience with NetApp ONTAP and VAST Data NFS storage platforms in high-performance compute environments

•    Knowledge of NVIDIA UFM (Unified Fabric Manager) 6.x and Telemetry Streaming for GPU fabric observability

•    Prior engagement with NVIDIA Cloud Partner (NCP) programme or Sovereign AI Cloud deployments

•    Understanding of Kubernetes CNI plugins (OVN-Kubernetes, Cilium) and their interaction with DPU offload

About the company

We are the force behind the meteoric rise of India's leading telecom operator, Jio, with 400 million+ customers. In addition, we have powered an extensive list of digital apps and services that have delivered functionality, usability, engagement, scale and loyalty. We provide solutions for consumers (B2C) and enterprises (B2B). We have an end-to-end 5G solution consisting of 5G Radio, a com ...

Industry

Media & Telecommunication...

Company Size

10001+ Employees

Headquarters

Navi Mumbai, Maharashtra
