This document outlines the specifications required to become a Runpod secure cloud partner. These requirements establish the baseline; for new partners, Runpod will also perform a due diligence process prior to selection, covering business health, prior performance, and corporate alignment.
Meeting these technical and operational requirements does not guarantee selection.
New partners
Existing partners
A new revision will be released annually, beginning in October 2025. Minor mid-year revisions may be made as needed to account for changes in market conditions, roadmap, or customer needs.
The minimum deployment size is 100 kW of GPU server capacity.
GPUs must be NVIDIA Ampere generation or newer.
Requirement | Specification |
---|---|
Cores | Minimum 4 physical CPU cores per GPU + 2 for system operations |
Clock Speed | Minimum 3.5 GHz base clock, with boost clock of at least 4.0 GHz |
Recommended CPUs | AMD EPYC 9654 (96 cores, up to 3.7 GHz), Intel Xeon Platinum 8490H (60 cores, up to 4.8 GHz), AMD EPYC 9474F (48 cores, up to 4.1 GHz) |
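As a quick sanity check, the per-GPU core rule above translates directly into a minimum per-server core count. The sketch below is illustrative only; the `min_physical_cores` helper is hypothetical and not part of any Runpod tooling.

```python
# Illustrative sizing helper for the per-GPU core rule (4 cores per GPU + 2 for system operations).
# min_physical_cores is a hypothetical helper, not part of any Runpod tooling.
def min_physical_cores(gpus_per_server: int, cores_per_gpu: int = 4, system_cores: int = 2) -> int:
    return gpus_per_server * cores_per_gpu + system_cores

for gpus in (4, 8):
    print(f"{gpus}x GPU server: >= {min_physical_cores(gpus)} physical cores")
# An 8x GPU server needs >= 34 physical cores, which fits comfortably within the
# recommended 48- and 96-core EPYC parts listed above.
```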
GPU VRAM | Minimum PCIe Interface |
---|---|
8/10/12/16 GB | PCIe 3.0 x16 |
20/24/32/40/48 GB | PCIe 4.0 x16 |
80 GB | PCIe 5.0 x16 |
Exceptions list:
Main system memory must have ECC.
GPU Configuration | Recommended RAM |
---|---|
8x 80 GB VRAM | >= 2048 GB DDR5 |
8x 40/48 GB VRAM | >= 1024 GB DDR5 |
8x 24 GB VRAM | >= 512 GB DDR4/5 |
8x 16 GB VRAM | >= 256 GB DDR4/5 |
There are two types of required storage: boot and working arrays. These are two separate drive arrays that isolate host operating system activity (boot array) from customer workloads (working array).
Requirement | Specification |
---|---|
Redundancy | >= 2n redundancy (RAID 1) |
Size | >= 500 GB (post-RAID) |
Disk Perf - Sequential read | 2,000 MB/s |
Disk Perf - Sequential write | 2,000 MB/s |
Disk Perf - Random Read (4K QD32) | 100,000 IOPS |
Disk Perf - Random Write (4K QD32) | 10,000 IOPS |
Component | Requirement |
---|---|
Redundancy | >= 2n redundancy (RAID 1 or RAID 10) |
Size | 2 TB+ NVMe per GPU for 24/48 GB GPUs; 4 TB+ NVMe per GPU for 80 GB GPUs (post-RAID) |
Disk Perf - Sequential read | 6,000 MB/s |
Disk Perf - Sequential write | 5,000 MB/s |
Disk Perf - Random Read (4K QD32) | 400,000 IOPS |
Disk Perf - Random Write (4K QD32) | 40,000 IOPS |
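To spot-check a boot or working array against the sequential and random targets above, a short fio run is usually sufficient. The sketch below is minimal and assumes `fio` is installed; the test file path and job sizes are placeholders to adjust for the array under test.

```python
# Hedged sketch: verify an array against the sequential/random targets above using fio.
# The test file path is a placeholder; run against the array being validated.
import json
import subprocess

def run_fio(rw: str, bs: str, iodepth: int, testfile: str = "/mnt/working/fio.test") -> dict:
    """Run a 60-second fio job and return the parsed JSON result."""
    cmd = [
        "fio", "--name=check", f"--filename={testfile}", "--direct=1",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
        "--runtime=60", "--time_based", "--size=10G", "--output-format=json",
    ]
    return json.loads(subprocess.run(cmd, capture_output=True, text=True, check=True).stdout)

seq = run_fio("read", "1M", 8)        # compare bandwidth against the sequential read target
rand = run_fio("randread", "4k", 32)  # compare IOPS against the 4K QD32 random read target
print("sequential read MB/s:", seq["jobs"][0]["read"]["bw"] / 1024)  # fio reports bw in KiB/s
print("random read IOPS:", rand["jobs"][0]["read"]["iops"])
```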
Each datacenter must have a storage cluster that provides shared storage to all GPU servers. The hardware is provided by the partner; the storage cluster licensing is provided by Runpod. All storage servers must be accessible by all GPU compute machines.
Component | Requirement |
---|---|
Minimum Servers | 4 |
Minimum Storage size | 200 TB raw (100 TB usable) |
Connectivity | 200 Gbps between servers/data-plane |
Network | Private subnet |
Component | Requirement |
---|---|
CPU | AMD Genoa: EPYC 9354P (32-Core, 3.25-3.8 GHz), EPYC 9534 (64-Core, 2.45-3.7 GHz), or EPYC 9554 (64-Core, 3.1-3.75 GHz) |
RAM | 256 GB or higher, DDR5/ECC |
Requirement | Specification |
---|---|
Redundancy | >= 2n redundancy (RAID 1) |
Size | >= 500 GB (post-RAID) |
Disk Perf - Sequential read | 2,000 MB/s |
Disk Perf - Sequential write | 2,000 MB/s |
Disk Perf - Random Read (4K QD32) | 100,000 IOPS |
Disk Perf - Random Write (4K QD32) | 10,000 IOPS |
Component | Requirement |
---|---|
Redundancy | None (JBOD); Runpod will assemble the disks into an array. Disk sizes of 7-14 TB are recommended. |
Disk Perf - Sequential read | 6,000 MB/s |
Disk Perf - Sequential write | 5,000 MB/s |
Disk Perf - Random Read (4K QD32) | 400,000 IOPS |
Disk Perf - Random Write (4K QD32) | 40,000 IOPS |
Servers should have spare disk slots for future expansion without deployment of new servers.
Disks should be distributed evenly among machines (e.g., 7 TB x 8 disks x 4 servers = 224 TB total raw space).
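For illustration, the same arithmetic in code, assuming the 2x overhead implied by the 200 TB raw / 100 TB usable figure in the cluster requirements above:

```python
# Illustrative capacity arithmetic; the 2x redundancy factor is an assumption taken from
# the 200 TB raw / 100 TB usable requirement above.
disk_tb, disks_per_server, servers = 7, 8, 4
raw_tb = disk_tb * disks_per_server * servers   # 224 TB raw across the cluster
usable_tb = raw_tb / 2                          # ~112 TB usable at assumed 2x redundancy
print(f"{raw_tb} TB raw, ~{usable_tb:.0f} TB usable")
```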
Once a storage cluster exceeds 90% utilization of a single CPU core on the leader node during peak hours, a dedicated metadata server is required. Metadata tracking is a single-process operation, so single-threaded performance is the most important metric (see the sampling sketch after the table below).
Component | Requirement |
---|---|
CPU | AMD Ryzen Threadripper 7960X (24-Cores, 4.2-5.3 GHz) |
RAM | 128 GB or higher, DDR5/ECC |
Boot disk | >= 500 GB, RAID 1 |
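A simple way to tell whether the leader node has crossed the 90% single-core threshold is to sample per-core utilization from `/proc/stat` during peak hours. The sketch below is illustrative only and not Runpod tooling.

```python
# Hedged sketch: sample per-core utilization on the storage leader node and flag whether
# any single core is pinned above the 90% threshold described above.
import time

def per_core_busy(interval: float = 5.0) -> list[float]:
    def snapshot():
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu") and line[3].isdigit():
                    fields = [int(x) for x in line.split()[1:]]
                    idle = fields[3] + fields[4]  # idle + iowait ticks
                    stats[line.split()[0]] = (sum(fields), idle)
        return stats

    a = snapshot()
    time.sleep(interval)
    b = snapshot()
    return [1 - (b[c][1] - a[c][1]) / (b[c][0] - a[c][0]) for c in a]

hottest = max(per_core_busy())
print(f"hottest core: {hottest:.0%}",
      "-> dedicated metadata server required" if hottest > 0.9 else "")
```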
Each datacenter should have CPU servers to accommodate CPU-only Pods and Serverless workers. Runpod will also use these servers to host additional features for which a GPU is not required (e.g., the S3-compatible API).
Component | Requirement |
---|---|
Minimum Servers | 2 |
Minimum Storage size | 8 TB usable |
Connectivity | 200 Gbps between servers/data-plane |
Network | Private subnet; public IP and >990 ports open |
Component | Requirement |
---|---|
CPU | AMD EPYC 9004 ‘Genoa’ (Zen 4) or better, with a minimum of 32 cores and a 3+ GHz clock speed |
RAM | 1 TB or higher, DDR5/ECC |
Component | Requirement |
---|---|
Redundancy | >= 2n redundancy (RAID 1 or RAID 10) |
Size | 8 TB+ |
Disk Perf - Sequential read | 6,000 MB/s |
Disk Perf - Sequential write | 5,000 MB/s |
Disk Perf - Random Read (4K QD32) | 400,000 IOPS |
Disk Perf - Random Write (4K QD32) | 40,000 IOPS |
Component | Requirement |
---|---|
Redundancy | >= 2n redundancy (RAID 1) |
Size | >= 500 GB (post-RAID) |
Disk Perf - Sequential read | 2,000 MB/s |
Disk Perf - Sequential write | 2,000 MB/s |
Disk Perf - Random Read (4K QD32) | 100,000 IOPS |
Disk Perf - Random Write (4K QD32) | 10,000 IOPS |
- Ubuntu Server 22.04 LTS
- Linux kernel 6.5.0-15 or later production version (Ubuntu HWE kernel)
- SSH remote connection capability
- IOMMU disabled for non-VM systems
- Server BIOS/firmware updated to the latest stable version
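The sketch below is a minimal, illustrative check of this host baseline (kernel version, OS release, and whether an IOMMU-off flag is present on the kernel command line); the parsing is deliberately simplified.

```python
# Hedged sketch: check the host OS baseline above (Ubuntu 22.04, HWE kernel 6.5+,
# IOMMU disabled on bare-metal hosts). Simplified parsing for illustration only.
import platform

def check_host() -> None:
    release = platform.release()                       # e.g. "6.5.0-15-generic"
    kernel = tuple(int(x) for x in release.split("-")[0].split("."))
    print("kernel:", release, "OK" if kernel >= (6, 5, 0) else "too old")

    with open("/etc/os-release") as f:
        os_info = dict(line.rstrip().split("=", 1) for line in f if "=" in line)
    print("os:", os_info.get("PRETTY_NAME", "").strip('"'))

    with open("/proc/cmdline") as f:
        cmdline = f.read()
    # On non-VM hosts the IOMMU should be off, e.g. via amd_iommu=off / intel_iommu=off.
    print("iommu off flag present:", "iommu=off" in cmdline)

check_host()
```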
Component | Requirement |
---|---|
NVIDIA Drivers | Version 550.54.15 or later production version |
CUDA | Version 12.4 or later production version |
NVIDIA Persistence | Activated for GPUs of 48 GB or more |
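Driver version and persistence mode can be confirmed per GPU with `nvidia-smi`; the sketch below assumes the NVIDIA driver is already installed. Enabling persistence itself is typically done with `nvidia-smi -pm 1` (or the persistence daemon) and requires root.

```python
# Hedged sketch: confirm driver version and persistence mode for each GPU via nvidia-smi.
import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,name,memory.total,driver_version,persistence_mode",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, name, mem, driver, persistence = [f.strip() for f in line.split(",")]
    # For GPUs with 48 GB or more VRAM, persistence should report "Enabled".
    print(idx, name, mem, "driver", driver, "persistence", persistence)
```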
Requirement | Specification |
---|---|
Utility Feeds | - Minimum of two independent utility feeds from separate substations - Each feed capable of supporting 100% of the data center’s power load - Automatic transfer switches (ATS) for seamless switchover between feeds with UL 1008 certification (or regional equivalent) |
UPS | - N+1 redundancy for UPS systems - Minimum of 15 minutes runtime at full load |
Generators | - N+1 redundancy for generator systems - Generators must be able to support 100% of the data center’s power load - Minimum of 48 hours of on-site fuel storage at full load - Automatic transfer to generator power within 10 seconds of utility failure |
Power Distribution | - Redundant power distribution paths (2N) from utility to rack level - Redundant Power Distribution Units (PDUs) in each rack - Remote power monitoring and management capabilities at rack level |
Testing and Maintenance | - Monthly generator tests under load for a minimum of 30 minutes - Quarterly full-load tests of the entire backup power system, including UPS and generators - Annual full-facility power outage test (coordinated with Runpod) - Regular thermographic scanning of electrical systems - Detailed maintenance logs for all power equipment - 24/7 on-site facilities team for immediate response to power issues |
Monitoring and Alerting | - Real-time monitoring of all power systems - Automated alerting for any power anomalies or threshold breaches |
Capacity Planning | - Maintain a minimum of 20% spare power capacity for future growth - Annual power capacity audits and forecasting |
Fire Suppression | - Maintain datacenter fire suppression systems in compliance with NFPA 75 and 76 (or regional equivalent) |
Requirement | Specification |
---|---|
Internet Connectivity | - Minimum of two diverse and redundant internet circuits from separate providers - Each connection should be capable of supporting 100% of the data center’s bandwidth requirements - BGP routing implemented for automatic failover between circuit providers - 100 Gbps minimum total bandwidth capacity |
Speed Requirements | - Preferred: >= 10 Gbps sustained upload/download speed per server - Minimum: >= 5 Gbps sustained upload/download speed per server - Speed measurements should be based on sustained throughput over a 60-second interval during a typical workload (see the verification sketch after this table) |
Core Infrastructure | - Redundant core switches in a high-availability configuration (e.g., stacking, VSS, or equivalent) |
Distribution Layer | - Redundant distribution switches with multi-chassis link aggregation (MLAG) or equivalent technology - Minimum 100 Gbps uplinks to core switches |
Access Layer | - Redundant top-of-rack switches in each cabinet - Minimum 100 Gbps server connections for high-performance compute nodes |
DDoS Protection | - Must have a DDoS mitigation solution, either on-premises or on-demand cloud-based |
Quality of service | Maintain network performance within the following parameters: - Network utilization levels must remain below 80% on any link during peak hours - Packet loss must not exceed 0.1% (1 in 1000) on any network segment - P95 round-trip time (RTT) within the data center should not exceed 4 ms - P95 jitter within the datacenter should not exceed 3 ms |
Testing and Maintenance | - Regular failover testing of all redundant components (minimum semi-annually) - Annual full-scale disaster recovery test - Maintenance windows for network updates and patches, with minimal service disruption scheduled at least 1 week in advance |
Capacity Planning | - Maintain a minimum of 40% spare network capacity for future growth - Regular network performance audits and capacity forecasting |
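The per-server speed and intra-datacenter RTT/jitter targets above can be spot-checked with standard tools. The sketch below assumes `iperf3` is installed and an iperf3 server is running on a peer host inside the datacenter; the peer hostname is a placeholder.

```python
# Hedged sketch: spot-check sustained throughput (60-second window) and intra-DC latency.
# PEER is a placeholder hostname for an iperf3 server on another machine in the datacenter.
import json
import subprocess

PEER = "iperf-peer.example.internal"

tput = json.loads(subprocess.run(
    ["iperf3", "-c", PEER, "-t", "60", "-J"],
    capture_output=True, text=True, check=True).stdout)
gbps = tput["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"sustained throughput: {gbps:.1f} Gbps (minimum >= 5, preferred >= 10)")

ping = subprocess.run(["ping", "-c", "100", "-i", "0.2", PEER],
                      capture_output=True, text=True, check=True).stdout
print(ping.splitlines()[-1])  # rtt min/avg/max/mdev line; compare against the 4 ms / 3 ms targets
```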
To qualify as a Runpod secure cloud partner, the parent organization must adhere to at least one of the following compliance standards:
Additionally, partners must comply with the following operational standards:
Requirement | Description |
---|---|
Data Center Tier | Abide by Tier III+ Data Center Standards |
Security | 24/7 on-site security and technical staff |
Physical security | Runpod servers must be held in an isolated secure rack or cage in an area accessible only to partner or approved DC personnel. Physical access to this area must be tracked and logged. |
Maintenance | All maintenance resulting in disruption or downtime must be scheduled at least 1 week in advance. Large disruptions must be coordinated with Runpod at least 1 month in advance. |
Runpod will review evidence of:
For detailed information on maintenance scheduling, power system management, and network operations, please refer to our documentation.