Broadcom, AMD and Intel, among others, join forces to break InfiniBand monopoly

2024-03-25 17:17

The Ultra Ethernet Consortium (UEC) announced this week that the next-generation interconnect group has grown to 55 members. As the organization works to develop the initial version of the Ultra Ethernet standard, it has also released some of the first technical details about the upcoming specification.

  The UEC was founded in the summer of 2023 to develop a new interconnect standard to meet the needs of AI and HPC data centers, positioning it as a de facto (if not de jure) replacement for InfiniBand, which is largely controlled by NVIDIA. The UEC began accepting new members back in November and has added 45 new members in just five months, underscoring the huge interest in the new technology. The consortium currently counts 55 member companies and 715 participants working across eight technical working groups.

  The UEC has a lot of work on its hands. As explained in the organization's development blog post, the consortium is building an Ethernet-based unified communications stack to deliver high-performance networking for AI and high-performance computing clusters. Its technical goals include developing specifications, APIs, and source code for Ultra Ethernet communications, updating existing protocols, and introducing new mechanisms for telemetry, signaling, security, and congestion management. In particular, Ultra Ethernet introduces UEC Transport (UET) to improve network utilization and reduce tail latency, accelerating RDMA (Remote Direct Memory Access) operations over Ethernet. Key features include multipath packet spraying, flexible ordering, and advanced congestion control to ensure efficient and reliable data transfer.

  These enhancements are designed to meet the needs of large AI and HPC clusters, with separate profiles for each type of deployment. And while the changes are made surgically to enhance the technology, as much of existing Ethernet is reused as possible to preserve cost efficiency and interoperability.

  Founding members of the alliance include AMD, Arista, Broadcom, Cisco, Eviden (an Atos company), HPE, Intel, Meta, and Microsoft. After the Ultra Ethernet Consortium (UEC) began accepting new members in October 2023, a number of industry giants joined the alliance, including Baidu, Dell, Huawei, IBM, Nokia, Lenovo, Supermicro, and Tencent.

  The alliance now plans to publicly release the initial 1.0 version of the UEC specification sometime in the third quarter of 2024.

  "There has always been a perception that UEC is meeting the needs of the industry," said J Metz, chair of the UEC Steering Committee. "There is a strong desire for an open, accessible, Ethernet-based network designed specifically to meet the needs of AI and high-performance computing workloads. This level of participation is encouraging; it helps us achieve our goals of broad interoperability and stability."

  While it's clear that the Ultra Ethernet Consortium is gaining support across the industry, it's still unclear where other industry giants such as AWS and Google stand. The hardware companies involved can design Ultra Ethernet support into their chips and systems, but the technology ultimately exists to serve large data center and HPC operators. So it will be interesting to see how much interest those operators show in the emerging Ethernet standard (and how quickly they adopt it) once hardware that implements it is ready.

  UEC Moves Toward v1.0 Specification

  The Ultra Ethernet Consortium (UEC) continues to move toward its v1.0 specification set. These specifications will provide implementers with the means to deploy technological innovations in data center networking to serve AI and HPC workloads. Since UEC retains the existing AI framework and HPC library APIs, it is expected that existing workloads can migrate to UEC without any changes.

  As outlined in the UEC 1.0 Overview white paper, UEC is committed to optimizing AI and HPC workloads by modernizing RDMA (Remote Direct Memory Access) operations over Ethernet. UEC Transport (UET) achieves this through a range of innovations that enable higher network utilization and lower tail latency, both of which are critical to reducing AI and HPC job completion times.

  Network performance is increasingly becoming a limiting factor as AI models and HPC workloads grow in size and require larger clusters. In addition, the network's contribution to power consumption and overall TCO continues to rise. A modest investment in UEC networking technology will provide a rapid return on investment.

  UEC Stack Overview: UEC follows a familiar layered approach, from the widely deployed AI frameworks and HPC libraries, through libfabric and the *CCL libraries, to UEC Transport (UET), and down to Ethernet. This approach is familiar to network engineers and preserves existing Ethernet software and tools. The UEC APIs also promote interoperability while providing room for vendor differentiation.

  The UEC stack simplifies and improves the performance of network software. By choosing libfabric as its northbound API, the UEC stack is designed to integrate into existing frameworks that typically use libfabric; a minimal libfabric sketch follows the two points below. Two key aspects of the UEC stack are worth noting:

  RDMA operations are optimized to better match workload expectations, minimize hardware complexity, and reduce state.

  Ultra Ethernet Transport (UET) provides a variety of transport services that enhance the RDMA hardware and fulfill the key objectives set forth in the UEC 1.0 overview white paper.
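  Because libfabric is the stack's northbound API, existing libfabric code is the natural on-ramp for applications. The following minimal sketch uses only standard, publicly documented libfabric calls and is not UET-specific; under UEC, the provider returned by fi_getinfo would simply be a vendor's UET provider, with the application code otherwise unchanged.

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>
#include <stdio.h>

int main(void)
{
    struct fi_info *hints, *info;
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_ep *ep;

    /* Describe what we need: a reliable datagram endpoint with two-sided
     * messaging and one-sided RMA (the RDMA-style operations that AI
     * frameworks and *CCL libraries ultimately rely on). */
    hints = fi_allocinfo();
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA;

    if (fi_getinfo(FI_VERSION(1, 14), NULL, NULL, 0, hints, &info)) {
        fprintf(stderr, "no matching libfabric provider found\n");
        return 1;
    }
    printf("using provider: %s\n", info->fabric_attr->prov_name);

    /* Open the fabric, a domain on it, and an endpoint. */
    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_endpoint(domain, info, &ep, NULL);

    /* ... bind completion queues, enable the endpoint, then post
     * fi_send()/fi_read()/fi_write() operations as usual ... */

    fi_close(&ep->fid);
    fi_close(&domain->fid);
    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
```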

  Some of the key features described in the white paper are:

  Multi-path packet spraying

  Flexible ordering

  Easily configurable congestion control mechanisms

  End-to-end telemetry

  Multiple transport delivery services

  Switch offloading (i.e., in-network collectives)

  Security as a first-class citizen, co-designed with the transport

  Ethernet link and physical layer enhancements (optional)

  Multi-Path Packet Spraying: UET provides multi-path packet delivery and fine-grained load balancing through packet spraying. Under UET, each flow can use all paths to its destination at the same time, enabling balanced use of all network paths. Rather than relying on simple ECMP hashing to select network links, load imbalance is avoided by coordinating path selection throughout the fabric, between endpoints and switches, under the guidance of real-time congestion management. This fine-grained load balancing improves network utilization and reduces tail latency.
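  A toy comparison makes the contrast with per-flow ECMP hashing concrete. The program below is illustrative only: the path_load array and the greedy steering rule stand in for the real-time congestion signals that UET actually uses.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_PATHS 4

/* Per-path congestion estimates; in UET these would come from real-time
 * congestion management and telemetry, here they are simply made up. */
static double path_load[NUM_PATHS] = {0.1, 0.7, 0.2, 0.4};

/* Classic ECMP: one hash of the flow's 5-tuple pins every packet of that
 * flow to a single path, so an unlucky hash can overload one link. */
static int ecmp_path(uint32_t flow_hash)
{
    return (int)(flow_hash % NUM_PATHS);
}

/* Packet spraying: each packet of the same flow may take a different path,
 * steered toward whichever path currently looks least loaded. */
static int sprayed_path(void)
{
    int best = 0;
    for (int p = 1; p < NUM_PATHS; p++)
        if (path_load[p] < path_load[best])
            best = p;
    path_load[best] += 0.05;   /* account for the packet just placed there */
    return best;
}

int main(void)
{
    uint32_t flow_hash = 0xdeadbeefu;
    for (int pkt = 0; pkt < 8; pkt++)
        printf("packet %d: ecmp path=%d, sprayed path=%d\n",
               pkt, ecmp_path(flow_hash), sprayed_path());
    return 0;
}
```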

  Flexible Ordering: The UEC stack provides a libfabric API that allows applications to express workload-specific requirements for message and packet ordering and directs UET to select the appropriate transport service. For example, an AI collective consisting of many packets can be treated as a "bulk transfer" in which the application only needs to know when a given portion of the message has reached its destination. However, multipathing for improved load balancing may result in packets being received out of order. Other technologies compensate for this behavior simply by adding reordering buffers, which increases cost, latency, complexity, and power consumption. In contrast, UET provides an innovative solution that combines multipathing, flexible ordering, and congestion control to achieve efficient receive-endpoint implementations without reordering buffers. UET's use of direct data placement (i.e., zero-copy) on a per-packet basis is one of the mechanisms that enables these efficient implementations.
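  The idea behind per-packet direct data placement can be sketched in a few lines. The header fields below are hypothetical and do not reflect the UET wire format; the point is only that when every packet names its own offset within the destination buffer, out-of-order arrival needs no reorder buffer, and completion is simply "all bytes have landed."

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical packet descriptor: each packet carries the byte offset of
 * its payload within the destination message buffer. */
struct sprayed_pkt {
    uint64_t msg_offset;      /* where this payload belongs in the message */
    uint32_t payload_len;
    const uint8_t *payload;
};

/* Direct data placement: data is written (or DMA'd by the NIC) straight to
 * its final location as each packet arrives, in whatever order it arrives. */
static void place_packet(uint8_t *msg_buf, const struct sprayed_pkt *pkt,
                         uint64_t *bytes_received)
{
    memcpy(msg_buf + pkt->msg_offset, pkt->payload, pkt->payload_len);
    *bytes_received += pkt->payload_len;   /* done when all bytes have landed */
}

int main(void)
{
    uint8_t msg[12] = {0};
    uint64_t received = 0;

    /* Packets arrive out of order because they took different paths. */
    struct sprayed_pkt p2 = { 8, 4, (const uint8_t *)"IJKL" };
    struct sprayed_pkt p0 = { 0, 4, (const uint8_t *)"ABCD" };
    struct sprayed_pkt p1 = { 4, 4, (const uint8_t *)"EFGH" };

    place_packet(msg, &p2, &received);
    place_packet(msg, &p0, &received);
    place_packet(msg, &p1, &received);

    if (received == sizeof(msg))
        printf("message complete: %.12s\n", (const char *)msg);
    return 0;
}
```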

  Congestion Handling: Large-scale AI clusters and HPC networks create unique traffic patterns that require significant improvements over traditional data center congestion handling, especially as network speeds and the bandwidth requirements of accelerators increase. Dedicated AI/HPC networks with microsecond round-trip times, parallel applications, and tightly orchestrated multi-flow traffic require innovative sender-based congestion control methods.

  Network optimization of fundamental collectives such as AllReduce and All-to-All is critical to reducing job completion times. To deliver network performance for these collectives, coordinated congestion control across multiple paths is essential to steer packet spraying. Again, UET provides an efficient and scalable solution. For example, All-to-All collectives can result in incast events in which the final switch before the receiver suffers sustained traffic overload. To avoid these situations, UET defines an optional receiver-based congestion control that augments sender-based congestion control by granting credit to senders.

  Recognizing that UEC-compliant networks may need to support different libfabric request types at the same time, another area in which UET differentiates itself is its ability to maintain effective congestion control when some traffic is sprayed across multiple paths and other traffic is not. This occurs, for example, when mixing single-path and multipath UET transport services (described below).
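  The receiver-credit idea described above can be illustrated with a small sketch. The structure and numbers below are invented for illustration and are not taken from the UET specification; they show only how a receiver that grants credit against its own drain rate prevents an incast from overrunning the last switch.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative receiver-side credit allocator.  The receiver grants each
 * sender only as much data as its last-hop link and buffers can absorb in
 * the current interval, so an All-to-All incast cannot overrun the final
 * switch before the receiver. */
struct receiver_cc {
    uint64_t budget_bytes;    /* what the receiver can drain this interval */
    uint64_t granted_bytes;   /* credit already handed out this interval   */
};

static uint64_t grant_credit(struct receiver_cc *rc, uint64_t want)
{
    uint64_t avail = rc->budget_bytes - rc->granted_bytes;
    uint64_t grant = want < avail ? want : avail;
    rc->granted_bytes += grant;
    return grant;             /* the sender may transmit at most this much */
}

int main(void)
{
    struct receiver_cc rc = { .budget_bytes = 100000, .granted_bytes = 0 };

    /* Three senders in an incast each want to push 60 KB at once. */
    for (int s = 0; s < 3; s++)
        printf("sender %d granted %llu bytes\n",
               s, (unsigned long long)grant_credit(&rc, 60000));
    return 0;
}
```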

  Telemetry: Congestion handling is assisted by Explicit Congestion Notification (ECN), which is widely supported by data center switches and is now further assisted by UET's innovative endpoint congestion-mitigation algorithms. UET also adds optional support for switch-based telemetry, which reduces control-plane signaling time, enabling fast detection of and response to short congestion events. This fast response time is especially important at higher link speeds, where congestion events may arise faster and may be of much shorter duration.

  An example of such telemetry is UET's use of packet trimming, in which a compatible switch truncates, or "trims", a congested packet (rather than discarding it) and forwards the packet header and associated congestion information to the receiver. This delivers accurate and timely information about lost packets to the receiving endpoint, resulting in faster mitigation of incast congestion. By combining this mechanism with a more focused response to congestion events using selective acknowledgment, the response to congestion and loss is faster and more bandwidth-efficient than traditional heavyweight Go-Back-N approaches. In the case of packet loss, Go-Back-N, as used by older RDMA technologies, requires the sender to retransmit long sequences of packets that may already be in flight to the receiver or even already received.
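  The bandwidth argument against Go-Back-N is easy to quantify with a toy model. The sketch below (illustrative, not UET's actual acknowledgment format) compares how many packets each strategy retransmits when a single packet in a 64-packet window is trimmed or lost.

```c
#include <stdint.h>
#include <stdio.h>

#define WINDOW 64

/* Go-Back-N: once packet k is lost, everything from k to the end of the
 * window is retransmitted, even if most of it already arrived. */
static int gobackn_retransmits(const uint8_t received[WINDOW])
{
    for (int i = 0; i < WINDOW; i++)
        if (!received[i])
            return WINDOW - i;
    return 0;
}

/* Selective acknowledgment: the receiver reports exactly which packets are
 * missing, so only the holes are retransmitted. */
static int sack_retransmits(const uint8_t received[WINDOW])
{
    int holes = 0;
    for (int i = 0; i < WINDOW; i++)
        if (!received[i])
            holes++;
    return holes;
}

int main(void)
{
    uint8_t received[WINDOW];
    for (int i = 0; i < WINDOW; i++)
        received[i] = 1;
    received[3] = 0;   /* a single trimmed or lost packet early in the window */

    printf("Go-Back-N retransmits %d packets; selective ack retransmits %d\n",
           gobackn_retransmits(received), sack_retransmits(received));
    return 0;
}
```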

  Multiple Transport Delivery Services: Application requirements dictate the choice of the appropriate UET service. UET offers a number of different transport services, including:

  Reliable Ordered Delivery (ROD): ROD delivers all packets associated with the same message sequentially, preserving the order between messages. It is designed for applications that require message ordering (e.g., match ordering for MPI or OpenSHMEM put-with-signal semantics) while benefiting from UET's congestion control and security enhancements.

  Reliable Unordered Delivery (RUD): RUD is designed for operations where packets must be delivered to the application but can tolerate reordering in the network (e.g., large collectives). RUD's innovative, AI-optimized approach to multipath packet spraying, which handles out-of-order delivery at the receiver, eliminates the need for reordering buffers, resulting in high network utilization and low tail latency.

  Reliable Unordered Delivery with Idempotent Operations (RUDI): RUDI is designed for applications at extreme scale. The semantic contract with the application is that if a packet is received multiple times and written into the application's buffers, the application's end result does not change (idempotency means that repeating an operation does not change the outcome). RUDI therefore allows packets to be delivered to the receiver more than once before the entire message is complete. The motivation for the RUDI transport service is to minimize the state required at the receiver and thus achieve higher scale. It is suited to operations such as bulk payload delivery that must run at very large scale but require few additional semantics.
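  A rough way to picture how these three services divide the space is a simple selector keyed on the application's ordering and idempotency needs. The enum and function below are purely illustrative; in practice the choice is expressed through the libfabric API and the UEC profiles rather than application code like this.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mapping from application requirements to the three UET
 * delivery services described above. */
enum uet_service { ROD, RUD, RUDI };

static enum uet_service pick_service(bool needs_msg_ordering,
                                     bool ops_are_idempotent)
{
    if (needs_msg_ordering)
        return ROD;    /* e.g. MPI match ordering, put-with-signal semantics */
    if (ops_are_idempotent)
        return RUDI;   /* duplicates are harmless, so keep minimal receiver state */
    return RUD;        /* typical choice for large AI collectives */
}

int main(void)
{
    const char *name[] = { "ROD", "RUD", "RUDI" };
    printf("MPI tagged traffic      -> %s\n", name[pick_service(true,  false)]);
    printf("large AI collective     -> %s\n", name[pick_service(false, false)]);
    printf("idempotent bulk writes  -> %s\n", name[pick_service(false, true)]);
    return 0;
}
```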

  Switch Offloading (or In-Network Aggregation): UEC defines the mechanisms needed to offload aggregation operations (e.g., AllReduce) to the network. This improves bandwidth utilization and minimizes the execution time of certain collective operations, whose completion sets the stage for the next phase of computation in many AI and HPC jobs. While optional to implement, switch offloading can provide application-level performance improvements, not just network-level improvements. When UEC v1.0 becomes available, it will be the first time such technology is available and standardized over Ethernet links!
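  The benefit of in-network aggregation is easiest to see in miniature. The sketch below models a single switch summing the contributions arriving on its downstream ports and forwarding one reduced vector upstream; it illustrates the idea only and is not the UEC offload protocol.

```c
#include <stdint.h>
#include <stdio.h>

#define CHILDREN 4   /* downstream ports feeding this switch */
#define VEC_LEN  8   /* elements of the vector being reduced  */

/* In-network reduction in miniature: the switch sums the contributions
 * arriving on its ports and forwards a single reduced vector upstream,
 * instead of every endpoint exchanging full vectors with every other. */
static void switch_reduce(const int32_t in[CHILDREN][VEC_LEN],
                          int32_t out[VEC_LEN])
{
    for (int i = 0; i < VEC_LEN; i++) {
        out[i] = 0;
        for (int c = 0; c < CHILDREN; c++)
            out[i] += in[c][i];
    }
}

int main(void)
{
    int32_t contribs[CHILDREN][VEC_LEN];
    int32_t reduced[VEC_LEN];

    for (int c = 0; c < CHILDREN; c++)
        for (int i = 0; i < VEC_LEN; i++)
            contribs[c][i] = c + 1;   /* each endpoint contributes rank + 1 */

    switch_reduce(contribs, reduced);
    printf("reduced[0] = %d (expected 1+2+3+4 = 10)\n", reduced[0]);
    return 0;
}
```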

  Security: The UEC transport layer and its transport services are co-designed with security that addresses the threat model while providing efficient and cost-effective mechanisms for message integrity, confidentiality, and replay prevention. Note that security can present scalability challenges given the expected speed and size of UEC networks. Special attention therefore needs to be paid to minimizing overhead and to large-scale key management, for example by providing efficient group-key mechanisms. With potentially hundreds of thousands of endpoints and link speeds of 1.6 Tb/s on the horizon, scalable key management becomes a hard requirement.
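  One well-known building block for scalable replay prevention is a per-sender sliding window, as used in protocols such as IPsec. The sketch below is a generic illustration of that technique, not the UEC design; it shows why a receiver needs only a small, fixed amount of state per sender even when a shared group key is used.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Generic sliding anti-replay window: a high-water mark plus a 64-bit
 * bitmap of recently seen sequence numbers, kept per sender. */
struct replay_window {
    uint64_t highest_seq;   /* highest packet sequence number seen so far */
    uint64_t bitmap;        /* which of the previous 64 numbers were seen */
};

static bool accept_packet(struct replay_window *w, uint64_t seq)
{
    if (seq > w->highest_seq) {                      /* new high-water mark */
        uint64_t shift = seq - w->highest_seq;
        w->bitmap = (shift >= 64) ? 0 : (w->bitmap << shift);
        w->bitmap |= 1;
        w->highest_seq = seq;
        return true;
    }
    uint64_t back = w->highest_seq - seq;
    if (back >= 64 || ((w->bitmap >> back) & 1))
        return false;                                /* too old, or a replay */
    w->bitmap |= (1ULL << back);
    return true;
}

int main(void)
{
    struct replay_window w = {0, 0};
    int a = accept_packet(&w, 5);   /* accepted           */
    int b = accept_packet(&w, 7);   /* accepted           */
    int c = accept_packet(&w, 5);   /* rejected as replay */
    printf("seq 5: %d  seq 7: %d  seq 5 again: %d\n", a, b, c);
    return 0;
}
```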

  Profiles: While AI and HPC workloads are converging, they still have somewhat different requirements. Most AI workloads generate larger messages and are bandwidth sensitive, while many HPC workloads rely on message ordering, generate very short messages, and are highly latency sensitive. With MPI, HPC is even more sensitive to packet and message ordering, and recent AI inference workloads may be more latency sensitive than AI training. The extreme bandwidth required by AI demands highly optimized NICs (Network Interface Cards), where chip area, power consumption, and state management must all be taken into account. UEC therefore offers multiple profiles to meet different workload requirements. Each profile is an aggregation of transport services, libfabric APIs, and required functionality tailored to the workload. Profiles enable product flexibility and optimization, promote interoperability, and allow for vendor differentiation.
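  Conceptually, a profile can be pictured as a bundle of required services and features. The structure below is a hypothetical simplification: the field names and the specific capabilities assigned to the "AI" and "HPC" entries are illustrative guesses based on the workload descriptions above, not the normative profile definitions.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical profile descriptor; field names and assignments are
 * illustrative only, not the UEC profile specifications. */
struct uec_profile {
    const char *name;
    bool needs_rod;                /* strict ordering, e.g. MPI match order   */
    bool needs_rud;                /* unordered bulk delivery for collectives */
    bool needs_rudi;               /* idempotent delivery at extreme scale    */
    bool wants_innet_collectives;  /* switch offload of AllReduce etc.        */
    bool latency_sensitive;
};

int main(void)
{
    struct uec_profile profiles[] = {
        { "AI",  false, true,  true,  true,  false },
        { "HPC", true,  true,  false, true,  true  },
    };

    for (int i = 0; i < 2; i++)
        printf("%-3s: ROD=%d RUD=%d RUDI=%d in-net=%d latency-sensitive=%d\n",
               profiles[i].name, profiles[i].needs_rod, profiles[i].needs_rud,
               profiles[i].needs_rudi, profiles[i].wants_innet_collectives,
               profiles[i].latency_sensitive);
    return 0;
}
```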

  Ethernet Layer Compliance: The UEC architecture is intended to be compatible and interoperable with existing Ethernet switches. However, UEC specifies additional optional extensions to improve support for AI and HPC. For example, UET supports Link Level Reliability (LLR). In the high-bandwidth environments typical of tail-latency-sensitive AI deployments, a single poorly performing link (e.g., one with intermittently elevated bit error rates (BER)) can slow down an entire parallel application. LLR provides a fast, hardware-based response that mitigates such link performance issues.
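  LLR amounts to hop-local retransmission: the transmit side of a link holds recently sent frames and repairs a corrupted frame on that hop instead of letting the loss surface as an end-to-end retransmission. The sketch below is a toy model of that idea; the buffer size, structure, and NACK-driven retry are illustrative, not the LLR wire protocol.

```c
#include <stdint.h>
#include <stdio.h>

#define LLR_BUF 8   /* frames kept for local retry (illustrative size) */

/* Toy model of link-level retry: the transmit side of one link keeps
 * copies of recently sent frames and resends locally on a neighbor NACK,
 * instead of letting the loss become an end-to-end retransmission. */
struct llr_tx {
    uint32_t pending[LLR_BUF];   /* frame ids awaiting link-level ack */
    int count;
};

static void llr_send(struct llr_tx *tx, uint32_t frame_id)
{
    if (tx->count < LLR_BUF)
        tx->pending[tx->count++] = frame_id;   /* keep a copy for local retry */
    printf("tx frame %u\n", frame_id);
}

static void llr_on_nack(struct llr_tx *tx, uint32_t frame_id)
{
    for (int i = 0; i < tx->count; i++)
        if (tx->pending[i] == frame_id)
            printf("link-level retransmit of frame %u (invisible end-to-end)\n",
                   frame_id);
}

int main(void)
{
    struct llr_tx tx = { {0}, 0 };
    llr_send(&tx, 1);
    llr_send(&tx, 2);
    llr_on_nack(&tx, 2);   /* a corrupted frame is repaired on this hop only */
    return 0;
}
```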

  Lossy and Lossless Networks: UET provides excellent performance on lossy networks by means of improved congestion control aided by multipathing and network telemetry, but it is also designed to operate on lossless networks. In fact, it makes lossless Ethernet easier to tune and avoids contributing to congestion spreading, which has always been a concern on lossless networks.

  What makes a UEC device a UEC device?

  Consistent with industry trends over the past two decades, the key innovation in UEC lies at the endpoint. A UEC-compliant network must have UET-enabled endpoints (i.e., endpoints that adhere to the relevant UET schemas and APIs defined and summarized in one of the UEC profiles). UEC profile support brings together simplified RDMA, zero-copy multipathing, higher network utilization, and reduced tail latency, all of which improve AI and HPC job completion times.

  As mentioned above, UET can run on existing data center switches, but performs even better with advanced telemetry and support for UEC's enhanced Ethernet, transport and software layers.

  A Growing Community

  UEC is now a large consortium. As of this writing, it has 55 member companies, with more than 750 members actively participating in eight working groups. Anticipating the next phase of its architectural and public development, the UEC has added four new working groups: Storage, Management, Performance and Debugging, and Compliance. These four working groups complement the existing Transport, Software, Link Layer, and PHY Layer working groups. The UEC has become an extremely powerful force in the field of AI and HPC networking, driving a great deal of innovation in the Ethernet ecosystem.