
Sponsored Feature: In today’s dynamic technological environment, service providers such as cloud service providers (CSPs), managed service providers (MSPs), software-as-a-service (SaaS) providers, and enterprise private cloud operators face a myriad of challenges in the modern datacenter.
The vast landscape of technologies that make up the modern datacenter, and determine how efficiently it operates, is evolving swiftly, with cost management quickly becoming a perpetual concern for all service providers. Outlined below are five best practices for CSPs to scale the modern datacenter.
- Incorporating Liquid Cooling to Enhance Performance
Over the past three decades, datacenter output has grown consistently. That growth is now being driven by rapidly increasing demand for AI server technologies; however, the constant issue of environmental sustainability still looms. New servers containing the latest CPUs and GPUs are quickly approaching the limits of air cooling and will require liquid cooling to keep microprocessors and accelerators running within their design limits. In addition, if the datacenter power budget has become a persistent issue, CSPs should consider liquid cooling to reduce overall datacenter power usage effectiveness (PUE) by minimizing HVAC cooling power.
Many datacenters have a power budget of 10 to 12 kilowatts per rack, which becomes very challenging for a full rack of servers, GPU systems, and storage. New systems optimized for AI may draw up to 10 kilowatts per server, pushing power per rack to as much as 100 kilowatts. A properly tested liquid cooling solution allows for higher-density servers and GPU-accelerated servers, and the external heat exchanger is much more efficient than conventional HVAC cooling. Liquid cooling infrastructure must be planned before the racks are delivered, and working with a company experienced in liquid cooling at the rack level is critical to an efficient datacenter.
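To make that arithmetic concrete, here is a minimal sketch of the rack power and PUE calculation. The per-server draw comes from the figures above; the PUE values for air versus liquid cooling are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope rack power and PUE arithmetic.
# Per-server draw matches the figures above; the PUE values are
# illustrative assumptions, not measurements.

def rack_it_power_kw(servers_per_rack, kw_per_server):
    """IT (server) power drawn by one rack."""
    return servers_per_rack * kw_per_server

def facility_power_kw(it_power_kw, pue):
    """Total facility power = IT power x PUE (PUE >= 1.0)."""
    return it_power_kw * pue

if __name__ == "__main__":
    it_kw = rack_it_power_kw(servers_per_rack=10, kw_per_server=10.0)
    print(f"IT load per AI rack: {it_kw:.0f} kW (vs. a 10-12 kW legacy budget)")

    # Assumed PUE: ~1.6 for conventional HVAC cooling, ~1.1 with liquid cooling.
    for label, pue in [("air-cooled HVAC", 1.6), ("liquid-cooled", 1.1)]:
        print(f"{label:>16}: {facility_power_kw(it_kw, pue):.0f} kW total facility power")
```

The lower the PUE, the smaller the share of facility power spent on cooling rather than on the servers themselves.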
The coming generation of high-performance liquid cooling infrastructure is ready to support CSPs in the datacenter. New solutions from Supermicro have been engineered and tested to support high density and the high thermal design power (TDP) of today's CPUs and GPUs. These solutions have undergone rigorous validation and testing at the system, rack, and cluster levels, ensuring the highest level of consistency and reliability.
- Suppliers And Timely Technology Refreshes
Growth has been the one constant in the technological landscape. Even so, waiting for the latest and greatest technology is a futile tactic, because new technologies and improvements are always on the way. Strategically planning around critical technology transitions and implementing an upgrade or migration strategy maximizes the benefit to the buyer.
Furthermore, the expansion of services and the simultaneous growth of technology do not always come with an increase in staffing and resources. It is essential for CSPs to partner with a reliable supplier that provides cutting-edge servers, storage, and networking solutions, pre-tested and assembled into a rack with the correct software stack. This relationship can alleviate some of the challenges of running a datacenter, leading to quicker deployment of new services or enhancement of existing ones. As a leader in supplying rack-scale solutions to CSPs of all sizes, Supermicro has significant and relevant experience in product development as well as supply chain logistics, service and support, and sizing and testing. Having access to a supplier with deep partnerships who can share transition plans, cost impacts, and supply chain issues with you is critical.
In addition, a disaggregated or modular server and rack approach can mean upgrading specific components or servers without replacing all of the components or the entire chassis. New generations of servers that can perform substantially more work per watt may also need more power, so the design of a new datacenter should not be limited by the rack power requirements of the initial servers and racks. By working closely with a supplier such as Supermicro, CSPs will be better able to understand the criteria and the means necessary for future technology transitions in the datacenter.
- Staying Up to Date with the Latest Server Designs
To address the problem of cost management, adopting new technologies can increase performance at lower cost. For example, depending on the required service level agreements (SLAs), code base, and degree of matrix processing, AI workloads can run on CPUs or GPUs. Some workloads can be offloaded from the CPU to a data processing unit (DPU), which combines a network interface with its own processing capability.
Some workloads would, however, benefit from a custom approach using a field-programmable gate array (FPGA). The introduction of CXL 2.0 (Compute Express Link) provides another layer in the memory hierarchy, sitting below directly attached DRAM but above SSDs. It also enables the concept of pooled memory, which can be flexibly allocated to any of the CPUs in a given system and mitigates the issue of stranded memory: memory that is directly attached to a CPU but not fully utilized. These new technologies may benefit the workload and software stack for the intended service. Testing new technologies in a proof-of-concept (POC) setting before large-scale deployment is also essential, and working with a hardware partner on early POC testing is key to gaining a competitive advantage.
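As a toy illustration of the stranded memory problem that pooling addresses, the sketch below uses hypothetical per-server DRAM sizes and demand figures; it is not a model of any particular CXL implementation.

```python
# Toy illustration of stranded memory and CXL pooling. All figures are assumed.
local_dram_gb = 512                 # DRAM directly attached to each server
demand_gb = [600, 200, 550, 120]    # hypothetical per-server memory demand

# Without pooling: demand above local DRAM cannot be met, while idle DRAM
# on other servers sits stranded.
unmet = sum(max(d - local_dram_gb, 0) for d in demand_gb)
stranded = sum(max(local_dram_gb - d, 0) for d in demand_gb)

# With a CXL memory pool: the same capacity can be allocated to whichever
# host needs it, so the stranded surplus can cover the shortfall.
covered = min(unmet, stranded)

print(f"Unmet demand without pooling:   {unmet} GB")
print(f"Idle (stranded) DRAM elsewhere: {stranded} GB")
print(f"Shortfall coverable from pool:  {covered} GB")
```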
While the initial conversation may be about which server or servers to acquire for the desired workloads, it will quickly shift to rack-scale integration. As the number of racks at a site increases, it is essential to understand the workings and limitations of the entire datacenter. The datacenter must be considered as a whole, from the separation of cold and hot aisles, forced-air cooling, and the sizing of chillers and fans, all the way to electrical distribution. Cooling technology must be considered at the start, because the datacenter's physical infrastructure will differ depending on the CSP's choice of air or liquid cooling.
- Measurement, Management, And The Supply Chain
To accurately assess your current datacenter's efficiency, use instrumentation to measure CPU, storage, and network utilization; tools are also available to do this at the cluster level. These tools can provide valuable information on where bottlenecks are occurring and where resources are over- or under-utilized. In addition, the temperature of CPUs and servers can be measured, which can identify potential issues before they cause failures.
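As a minimal sketch of node-level instrumentation, the snippet below uses the open source psutil library as an assumed tooling choice; any monitoring stack already deployed at the cluster level serves the same purpose.

```python
# Minimal per-node utilization snapshot using psutil (an assumed tooling
# choice; cluster-level tools aggregate similar metrics across many nodes).
import psutil

def utilization_snapshot():
    """Collect basic CPU, memory, disk, and network counters for one host."""
    net = psutil.net_io_counters()
    return {
        "cpu_percent": psutil.cpu_percent(interval=1),      # averaged over 1 s
        "memory_percent": psutil.virtual_memory().percent,  # RAM in use
        "disk_percent": psutil.disk_usage("/").percent,     # root filesystem
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

if __name__ == "__main__":
    for metric, value in utilization_snapshot().items():
        print(f"{metric}: {value}")
```

Collected over time and across hosts, snapshots like this reveal which resources are the bottleneck and which are sitting idle.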
A datacenter for a cloud provider will most likely be used by many customers simultaneously, so a job management scheduler will be needed to maintain the efficiency of the datacenter's operations. With finite resources, not all requests for compute, storage, or networking can be satisfied immediately, and jobs or applications will have to be scheduled or fit in as the required resources become available, or until additional capacity can be acquired.
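The sketch below is a toy illustration of that scheduling logic: a greedy first-in, first-out queue over a fixed pool of CPUs and GPUs. Production schedulers such as Slurm or Kubernetes add priorities, preemption, and fair-share policies on top of this basic idea.

```python
# Toy resource-aware scheduler: jobs wait in FIFO order until the pool of
# free CPUs and GPUs can satisfy their request.
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    gpus: int


class SimpleScheduler:
    def __init__(self, total_cpus, total_gpus):
        self.free_cpus = total_cpus
        self.free_gpus = total_gpus
        self.queue = deque()    # jobs waiting for resources
        self.running = []       # jobs currently holding resources

    def submit(self, job):
        self.queue.append(job)
        self._dispatch()

    def finish(self, job):
        """Return a completed job's resources to the pool and re-run dispatch."""
        self.running.remove(job)
        self.free_cpus += job.cpus
        self.free_gpus += job.gpus
        self._dispatch()

    def _dispatch(self):
        # Greedily start queued jobs, in order, while resources remain.
        while (self.queue
               and self.queue[0].cpus <= self.free_cpus
               and self.queue[0].gpus <= self.free_gpus):
            job = self.queue.popleft()
            self.free_cpus -= job.cpus
            self.free_gpus -= job.gpus
            self.running.append(job)


if __name__ == "__main__":
    sched = SimpleScheduler(total_cpus=64, total_gpus=8)
    sched.submit(Job("training", cpus=32, gpus=8))
    sched.submit(Job("inference", cpus=16, gpus=4))   # waits until GPUs free up
    print([j.name for j in sched.running])            # ['training']
```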
In terms of managing the supply chain, the old adage applies: a chain is only as strong as its weakest link, so identify the weakest link among your suppliers. While we're not advocating a supply chain hierarchy or caste system, simplifying the supply chain around key suppliers is a best practice for ordering, installation, and support. A single supplier who can provide servers, storage, networking, third-party software solutions, and rack integration, and who can even integrate unique third-party hardware into a single system, is ideal.
- Manufacturing Expertise and Experience in Building a CSP
It is an industry secret that almost all large original equipment manufacturers (OEMs) have outsourced their products' manufacturing, design, and supply chain to original design manufacturers (ODMs) and contract manufacturers (CMs); the OEMs are primarily focused on marketing and selling those products. It is valuable to work with a company that designs all its products, from chassis and motherboards to power supplies, and manufactures them in locations geographically close to customers. For the customer, this means a datacenter supplier can be much more flexible, deliver faster, and ultimately reduce the total cost of ownership through fewer intermediaries, faster transportation, and economies of scale.
As with adopting new technologies in the datacenter, putting all your eggs in one supplier's basket can be a risky decision, so choose that supplier carefully. Selecting a datacenter solution provider is not the time or place for on-the-job learning, or for working with a company that is more focused on its own managed service offerings or on making laptops. Working instead with a B2B company such as Supermicro, which is focused specifically on the datacenter and has spent decades working with service providers, building large-scale HPC clusters, and powering solutions for the largest hyperscalers, OEMs, and enterprises, is of great benefit to CSPs.
Conclusion
The efficient operation of a datacenter as a CSP requires meticulous planning and a close working relationship with full-service providers. Several decisions will affect the start-up times, SLAs, and overall efficiency of the datacenter. Whether designing and implementing a publicly shared datacenter or an on-premises one, plan carefully, understand server and rack technology, and explore the vast landscape of new technologies and solutions that will keep the datacenter running for years to come.
Michael McNerney is senior vice president of marketing and network security at Supermicro.