The programmable data plane fundamentally changes how network elements are built and managed, and it eases the long-standing tension between flexibility and performance. The key to this balance is a good abstraction of packet processing. A common one is the match-action pipeline, first proposed by OpenFlow. Under this abstraction, a packet processor is modeled as a series of match-action pipeline stages, each of which matches on packet fields and applies an action to the packets flowing through it. This abstraction can be mapped to FPGAs and next-generation ASICs.
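To make the match-action abstraction concrete, here is a minimal Python sketch. It is only an illustration: the `Stage` class, the exact-match tables, and field names such as `dst_ip` and `egress_port` are assumptions made for this example, not part of OpenFlow, P4, or any real switch API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Packet = Dict[str, object]          # header field name -> value
Action = Callable[[Packet], None]   # an action mutates the packet in place


@dataclass
class Stage:
    """One pipeline stage: an exact-match table keyed on a single field."""
    match_field: str
    table: Dict[object, Action] = field(default_factory=dict)
    default_action: Optional[Action] = None

    def process(self, pkt: Packet) -> None:
        key = pkt.get(self.match_field)
        action = self.table.get(key, self.default_action)
        if action is not None:
            action(pkt)


def run_pipeline(stages: List[Stage], pkt: Packet) -> Packet:
    # Every packet visits every stage, in order, whether or not it matches.
    for stage in stages:
        stage.process(pkt)
    return pkt


def set_egress_port(port: int) -> Action:
    def action(pkt: Packet) -> None:
        pkt["egress_port"] = port
    return action


def decrement_ttl(pkt: Packet) -> None:
    pkt["ttl"] = int(pkt["ttl"]) - 1


def mark_drop(pkt: Packet) -> None:
    pkt["drop"] = True


# Hypothetical two-stage pipeline: forwarding lookup, then TTL handling for IPv4.
stages = [
    Stage("dst_ip", {"10.0.0.1": set_egress_port(1), "10.0.0.2": set_egress_port(2)}, mark_drop),
    Stage("eth_type", {0x0800: decrement_ttl}),
]
```

Calling `run_pipeline(stages, {"dst_ip": "10.0.0.2", "eth_type": 0x0800, "ttl": 64})` walks the packet through both stages in a fixed order, which is the property that distinguishes a pipeline from the thread-based design discussed later in the article.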
Against this background, the platform- and architecture-independent P4 language emerged as a programming language that can quickly and easily describe packet processing behavior. To support P4, programmable data planes mostly adopt the Reconfigurable Match Tables (RMT) abstract forwarding model, which defines packet processing behaviors such as programmable parsing, programmable matching, and programmable actions.
As a result, when P4 is mentioned, most people immediately picture an architecture made of a parser, a deparser, and a multi-stage match-action flow table, and think of Barefoot’s Tofino chip. This architecture is often assumed to be the only hardware architecture for P4. In practice, however, there are several other implementation architectures.
For example, at this year’s SIGCOMM conference in late August, researchers from MIT and Juniper are jointly publishing a paper that introduces the architecture of the Trio chip. The Trio chip does not use a pipeline architecture to support the P4 language. The paper mainly compares Trio with switching chips built on the traditional PISA pipeline architecture in new application scenarios. A partial translation of the paper is provided below. Interested readers can also first read Zhashen’s article “Trio 6, Express 5 and Silicon One” for a more comprehensive understanding of the relevant content and background.
This article introduces Trio, a programmable chipset for Juniper Networks MX Series routers and switches. Trio’s architecture is based on a multi-threaded programmable packet processing engine and a layered high-capacity memory system, which makes it fundamentally different from pipeline-based architectures. Trio can handle non-homogeneous packet processing rates for a wide variety of network use cases and protocols gracefully, making it an ideal platform for emerging in-network applications.
We first describe the basic building blocks of the Trio chipset, including its multithreaded packet forwarding and packet processing engine. Then, we discuss Trio’s programming language, called microcode. To demonstrate Trio’s flexible Microcode-based programming environment, we describe two use cases.
First, we demonstrate Trio’s ability to perform in-network aggregation for distributed machine learning. Second, we propose and design an in-network straggler mitigation technique using Trio’s timer threads. We prototype both use cases on a testbed using three real DNN models (ResNet50, DenseNet161, and VGG11) to demonstrate Trio’s ability to mitigate stragglers while performing in-network aggregation. Our evaluations show that Trio outperforms current pipeline-based solutions by a factor of 1.8 when there are stragglers in the cluster.
Data-intensive applications are the foundation of today’s online services. As Moore’s Law slows, hardware accelerators are struggling to meet the performance demands of emerging cloud applications such as machine learning, databases, storage, and data analytics. Further progress is limited by the amount of computation and memory a single server can accommodate, driving the need for efficient distributed systems for data-intensive applications.
The advent of programmable switches, such as Intel’s Tofino [2, 20, 22], created opportunities to design new packet processing protocols and compilers [17, 20, 24, 44, 45, 58, 69, 71]. Tofino switches also paved the way for in-network computing [23, 60, 74] to accelerate applications such as caching, database query processing [50, 73], machine learning training [36, 48, 55, 63, 77], inference, and consensus protocols [27, 28, 52]. The key idea of in-network computing is to exploit the unique vantage point of switches to perform part of the computation directly within the network, thereby reducing latency and improving performance. Although programmable switches have been an important enabler of this new paradigm, the Protocol Independent Switch Architecture (PISA) [2, 20, 22, 58] is often unsuitable for emerging in-network applications, which limits further development and hinders the widespread adoption of in-network computing applications [35, 37, 67].
This paper presents Trio’s programmable architecture for in-network computing. Trio is Juniper’s programmable chipset, with an existing multi-billion-dollar customer base. It has been deployed in hundreds of thousands of routers and switches in core, edge, and data center environments around the world, and it has been used in production equipment for over a decade. Trio is built on a custom processor core with an instruction set optimized for networking applications. As a result, the chipset has the performance of a traditional ASIC while enjoying the flexibility of a fully programmable processor, allowing new functions to be installed via software. Trio’s flexible architecture enables it to support features and protocols developed long after the chipset is released. Trio’s processor cores have access to a large, high-performance memory system that stores data and state related to system configuration and packets. This memory system is critical to the scalability of emerging applications with large memory footprints.
Trio’s architecture is fundamentally different from Tofino’s. Trio is not pipelined, so different packets do not necessarily follow the same physical path on the chip. Inbound packets are processed independently by thousands of parallel threads (see §2 for details). These threads use a run-to-completion model [12, 70], in which a thread executes all the instructions required to finish processing its current packet. Trio has dedicated logic to ensure that packets of the same flow are delivered in order, but packets of different flows can be processed out of order, which makes it efficient at handling a mix of concurrent applications. As a result, Trio handles different packet processing rates gracefully: it can provide sub-line-rate support for applications that require rich per-packet processing while maintaining line rate for applications with simple per-packet processing needs. In contrast, in a PISA-based switch, packets in the same pipeline must traverse all stages of the pipeline regardless of the P4 program, and deploying a P4 program has only two outcomes: it either fits completely or fails completely. As a result, PISA-based switches cannot support flexible packet processing rates; they trade programmability for line-rate packet processing.
In this paper, we first describe the basic building blocks of the Trio chipset, including details of its packet processing engine and the surrounding memory system (§2). Next, we describe Trio’s programming language, called Microcode (§3). We then illustrate Trio’s flexible Microcode design (§4) with in-network aggregation for machine learning training as a first use case. We present in-network straggler mitigation as a second use case to demonstrate Trio’s unique ability to launch efficient timer-based threads (§5). We show that straggler mitigation is straightforward to implement in Trio, whereas, to our knowledge, efficient straggler mitigation in PISA-based devices is challenging, if not impossible. We implement both use cases on a testbed with a Juniper MX480 device, one 64x100 Gbps Tofino switch, and six ASUS ESC4000A-E10 servers, each with an Nvidia A100 GPU and a 100 Gbps Mellanox ConnectX-5 network card. We train three DNN models (ResNet50, DenseNet161, and VGG11) to demonstrate Trio’s ability to mitigate stragglers while performing in-network aggregation. Our evaluations show that Trio outperforms the state-of-the-art in-network aggregation platform SwitchML by a factor of 1.8 when stragglers are present in the cluster. Juniper Networks will continue to evolve the Trio chipset to provide higher bandwidth, lower power consumption, and more functionality for existing and emerging applications, while also developing the software infrastructure to support additional use cases. We invite the networking community to identify new use cases that can leverage Trio’s programmable architecture.
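The two use cases can be pictured with a small Python sketch: an aggregator sums gradient chunks from workers and releases a result either when every worker has contributed or when a timer finds that a chunk has waited too long. This is not the paper’s Trio-ML implementation, whose details are not covered in this translation; the `Aggregator` class, the slot layout, and the timeout policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

# (chunk id, element-wise sum, number of workers that contributed)
Result = Tuple[int, List[int], int]


@dataclass
class Slot:
    total: List[int]
    first_arrival: float
    seen: Set[int] = field(default_factory=set)


class Aggregator:
    """Hypothetical in-network aggregator with timer-based straggler handling."""

    def __init__(self, num_workers: int, timeout: float) -> None:
        self.num_workers = num_workers
        self.timeout = timeout
        self.slots: Dict[int, Slot] = {}

    def on_packet(self, worker: int, chunk: int, values: List[int],
                  now: float) -> Optional[Result]:
        slot = self.slots.get(chunk)
        if slot is None:
            slot = Slot(total=[0] * len(values), first_arrival=now)
            self.slots[chunk] = slot
        if worker in slot.seen:          # ignore duplicate contributions
            return None
        slot.seen.add(worker)
        slot.total = [a + b for a, b in zip(slot.total, values)]
        if len(slot.seen) == self.num_workers:
            del self.slots[chunk]        # complete: release the full sum
            return (chunk, slot.total, self.num_workers)
        return None

    def on_timer(self, now: float) -> List[Result]:
        # Periodic timer work: release partial sums for chunks that waited too long.
        expired = [c for c, s in self.slots.items()
                   if now - s.first_arrival >= self.timeout]
        results: List[Result] = []
        for chunk in expired:
            slot = self.slots.pop(chunk)
            results.append((chunk, slot.total, len(slot.seen)))
        return results
```

In this sketch the contributor count travels with each result so that receivers can rescale a partial sum. How a late packet from a straggler should be treated after its chunk has been released is a protocol decision the sketch leaves open.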
02 Structure of Trio
Since its introduction in 2009, the Trio chipset has gone through six generations, with various performance points and architectures. This section details the latest architecture of Trio. First, we give a high-level overview of packet forwarding and processing in Trio-based routers. We then turn to the details of Trio’s packet processing engine. Finally, we explain Trio’s memory types and read-modify-write operations.
1. Trio-based router architecture
Figure 1 illustrates the high-level differences between Trio-based routers (or switches) and PISA-based switches. Every Trio-based device has two important components: (i) a packet forwarding engine and (ii) a packet processing engine, described below.
Packet Forwarding Engine (PFE). The PFE is the central processing element of the Trio forwarding plane and is used to systematically move packets in and out of the device. A Trio-based device consists of one or more PFEs. Depending on the generation, Trio chipsets support different packet processing bandwidths. Trio’s first-generation PFE uses multiple chips to support 40 Gbps of network bandwidth. Today, Trio’s sixth-generation PFE supports 1.6 Tbps in a single chip. Small routers may have only one PFE, while large routers have multiple PFEs connected by an interconnect fabric, as shown in Figure 1(a). By providing any-to-any connectivity between PFEs, the interconnect fabric expands the bandwidth of the device far beyond what a single chip can support. Each PPE handles packets in both ingress and egress directions. Packets arrive at the system through an ingress PFE and leave through an egress PFE.
Packet Processing Engine (PPE). Each PFE has hundreds of multi-threaded Packet Processing Engines (PPEs), as shown in Figure 2. Each PPE supports dozens of threads processing different packets simultaneously. Unlike Tofino’s architecture, where pipelines cannot access each other’s registers, PPE threads in a PFE can efficiently share state through shared memory. Section 2.2 explains the thread-based design of the PPE in more detail.
Parallel packet processing. The hardware logic of the PFE automatically divides each incoming packet into a head and a tail (similar to PISA’s header and payload). The head is the first part of the packet and is usually large enough to hold all the headers needed to process the packet (its size varies with each generation of Trio devices, but is usually around 200 bytes). The tail, also called the trailer, consists of the remaining bytes of the packet, if any. When a new packet arrives, a hardware module inside the PFE, called the scheduling module, sends the packet head to a PPE according to availability, and the PPE spawns a new thread for this packet head. Packet tails are kept in the packet buffers of the PFE’s memory and queuing subsystem to avoid storing large numbers of bytes in the PPE thread. By default, each thread works on one packet. Many PPE threads work in parallel to provide the required processing bandwidth.
Reorder engine. When packet processing is complete, the modified packet head is sent to the reorder engine. The reorder engine holds the updated packet head until all earlier-arriving packets of the same flow have been processed, to ensure in-order delivery. It then sends the modified packet head to the memory and queuing subsystem to be queued for transmission.
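The reorder engine’s behavior can be sketched in a few lines of Python. This is a software model of the rule stated above (release a processed packet only after all earlier packets of the same flow), not a description of the hardware; the `ReorderEngine` class, the global sequence numbers, and the use of a flow key are assumptions made for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Hashable, List


class ReorderEngine:
    """Releases processed packet heads in per-flow arrival order."""

    def __init__(self) -> None:
        self._next_seq = 0
        # flow -> sequence numbers still in flight, in arrival order
        self._in_flight: Dict[Hashable, Deque[int]] = defaultdict(deque)
        # sequence number -> processed head waiting for earlier packets
        self._done: Dict[int, bytes] = {}

    def on_arrival(self, flow: Hashable) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._in_flight[flow].append(seq)
        return seq

    def on_complete(self, flow: Hashable, seq: int, head: bytes) -> List[bytes]:
        """Record a finished packet and return every head that may now leave."""
        self._done[seq] = head
        released: List[bytes] = []
        queue = self._in_flight[flow]
        # Release from the front of the flow's queue while the oldest is done.
        while queue and queue[0] in self._done:
            released.append(self._done.pop(queue.popleft()))
        if not queue:
            del self._in_flight[flow]
        return released
```

Threads may finish in any order. In the sketch, a packet of flow A that completes early waits only for older packets of flow A, while packets of flow B pass straight through.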
2. Packet processing engine
Trio’s PPE provides capabilities that are difficult or impossible to achieve with fixed processing pipelines or existing specialized processing units. Each PPE is a VLIW (Very Long Instruction Word) multithreaded microcode engine core. Each microinstruction controls multiple ALUs, operand and result selection, and complex multi-way branching. Because so much work is done by one microinstruction, each instruction takes multiple clock cycles to execute. Because each PFE typically serves many packets at the same time, a PPE does not require high single-threaded performance. Each thread has at most one instruction in the datapath at a time: Trio does not dispatch the next instruction of a thread into the PPE pipeline until the thread’s previous instruction has exited the pipeline. Therefore, no data forwarding between instructions of the same thread is needed, since a subsequent instruction never starts before the previous instruction’s write-back is complete.
PPE thread. A PPE thread is usually started when a packet header arrives at the PPE and destroyed when the PPE has finished processing the packet. Thread destruction is handled automatically by hardware logic in the chip, although the programmer can control when a thread gives up execution. Threads can also be started in response to certain internal events, including statistics collection and timers (see §5 for more details). External events can spawn new threads through a similar mechanism. The PPEs in the ingress and egress PFEs work together to handle all the functions required to process packets (e.g., packet parsing, route lookup, packet rewriting).
Local storage per thread. Each PPE has two main forms of internal storage. First, each thread has a dedicated local memory pool (1.25 KBytes). Local memory can be accessed on any byte boundary, using pointer registers or addresses contained in microinstructions. Before a PPE thread starts, the packet headers are loaded into the thread’s local memory. When a packet is sent, the modified packet header is unloaded from the thread’s local storage. The use of pointer registers allows efficient access to packet headers and other types of data structures. Second, each thread has 32 64-bit general purpose registers that are private to it.
Local storage (memory and registers) holds information specific to the packet being processed. State shared across packets is kept in a shared memory system accessible by all PPEs.
ALU type. There are two types of ALU: (i) Condition ALUs and (ii) Move ALUs. Condition ALUs are used for arithmetic or logical operations, producing 32-bit data results, and/or for comparison operations, producing 1-bit condition results. Move ALUs produce 32-bit results that can be written to a register or local memory. The results of a Condition ALU can be used as input to a Move ALU. This ALU organization allows per-instruction resources to be flexibly divided between sequencing control (described next) and generating logical/arithmetic results for storage in registers/memory.
Importantly, each ALU operand and each Move ALU result can be a bitfield of arbitrary length (up to 32 bits) at an arbitrary bit offset. This has two main benefits. First, it improves the efficiency of accessing fields of different sizes in packet headers. Second, it improves the utilization of memory and register capacity, because each piece of data uses only the bits it needs. Trio has ALUs in both the PPEs and the shared memory system. The former operate on registers and local memory, while the latter operate on data stored in the shared memory system. Operations on packet tails are also supported by moving parts of the packet tail into the local memory of the PPE thread.
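The value of bitfield operands is easiest to see with packet headers, whose fields rarely align to byte or word boundaries. The Python helpers below emulate in software what the text says the ALUs do natively. The function names and the convention of counting bit offsets from the least significant bit are assumptions of this sketch, not Trio’s actual operand encoding.

```python
def extract_bits(word: int, offset: int, length: int) -> int:
    """Return `length` bits of `word`, starting `offset` bits above the LSB."""
    if not 1 <= length <= 32:
        raise ValueError("bitfield length must be between 1 and 32 bits")
    if offset < 0:
        raise ValueError("bit offset must be non-negative")
    return (word >> offset) & ((1 << length) - 1)


def insert_bits(word: int, offset: int, length: int, value: int) -> int:
    """Return `word` with the selected bitfield replaced by `value`."""
    if not 1 <= length <= 32:
        raise ValueError("bitfield length must be between 1 and 32 bits")
    if offset < 0:
        raise ValueError("bit offset must be non-negative")
    mask = ((1 << length) - 1) << offset
    return (word & ~mask) | ((value << offset) & mask)


# First 32-bit word of an IPv4 header: version (4 bits), IHL (4 bits),
# DSCP/ECN (8 bits), total length (16 bits).
first_word = 0x45000054
version = extract_bits(first_word, 28, 4)
ihl = extract_bits(first_word, 24, 4)
total_length = extract_bits(first_word, 0, 16)
remarked = insert_bits(first_word, 16, 8, 0xB8)   # rewrite the DSCP/ECN byte
```

On a general-purpose CPU each of these accesses costs a shift and a mask. Folding them into operand selection is what lets a single microinstruction work on several odd-sized fields at once.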
Sequencing logic. The condition results of one or more Condition ALUs may be used by the sequencing logic to select the next microinstruction to execute. Each microinstruction includes the address of a target block of one to eight microinstructions. Any or all of the condition results can be ignored, and the way condition results are combined is very flexible. Much of the work in packet processing involves complex conditional branching, especially during parsing. Trio’s ability to perform complex multi-way branching in a single instruction is a good match for the needs of packet processing applications. The PPE supports a call-return mechanism for subroutines, which can be nested up to eight levels deep.
Efficient hashing. Efficient load balancing is an important requirement for all routers/switches. In a Trio-based system, a Microcode program is responsible for specifying which packet fields are included in the hash calculation. This allows complete flexibility in deciding which packet fields contribute to load balancing, including the ability to choose fields from the headers of protocols that have not yet been invented. The hash function in Trio is a high-quality hash function implemented in dedicated logic, so it is more efficient than a similar hash function implemented in software. The combination of programmable field selection and a hard-wired hash function gives the PPE an unprecedented balance of flexibility and efficiency.
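The split between programmable field selection and a fixed hash function can be illustrated as follows. The sketch uses CRC-32 from Python’s standard library purely as a stand-in, since the article does not say which hash function Trio implements; the field names, the `flow_hash` and `pick_next_hop` helpers, and the modulo-based next-hop choice are likewise assumptions.

```python
import zlib
from typing import Dict, List, Sequence


def flow_hash(pkt: Dict[str, bytes], fields: Sequence[str]) -> int:
    """Hash only the fields the program selected, in the order given."""
    # Length-prefix each field so adjacent fields cannot run together.
    key = b"".join(len(pkt[f]).to_bytes(2, "big") + pkt[f] for f in fields)
    return zlib.crc32(key)   # stand-in for the hard-wired hash function


def pick_next_hop(pkt: Dict[str, bytes], fields: Sequence[str],
                  next_hops: List[str]) -> str:
    """Choose one of several equal-cost next hops from the flow hash."""
    return next_hops[flow_hash(pkt, fields) % len(next_hops)]


pkt = {
    "src_ip": bytes([10, 0, 0, 1]),
    "dst_ip": bytes([10, 0, 0, 2]),
    "src_port": (1234).to_bytes(2, "big"),
    "new_proto_id": b"\x2a",   # a field from some hypothetical future header
}
hops = ["hop-a", "hop-b", "hop-c", "hop-d"]
choice = pick_next_hop(pkt, ["src_ip", "dst_ip"], hops)
```

Only the list of field names changes when a new protocol appears. The hash itself stays fixed, which is what makes it reasonable to implement in dedicated logic.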
Flexible programming. There is no fixed limit on the number or types of headers that a PPE can handle. The PPE can therefore easily create new headers or consume/delete existing headers in packets using Trio’s Microcode routines (§3). As new protocols are developed, the Trio packet processing architecture can be adapted by enhancing the software running on the PPE. Thanks to its multi-threaded structure, the PPE can also create or consume packets itself to perform tasks such as keepalives at rates much higher than a control plane CPU can support. Importantly, processing cycles are fungible between applications, which allows the packet processing requirements of different applications to be handled gracefully. Thus, a Trio-based system can provide a lower packet rate for applications that need more processing per packet, a higher packet rate for applications with simpler packet processing, or a combination of both.
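The trade-off between per-packet work and packet rate is simple arithmetic, sketched below. The instruction budget and per-packet costs are made-up numbers chosen for the example; the article gives no such figures for Trio.

```python
from typing import Iterable, Tuple


def packet_rate(instr_per_second: float, instr_per_packet: float) -> float:
    """Packets per second when every packet costs the same number of instructions."""
    return instr_per_second / instr_per_packet


def mixed_packet_rate(instr_per_second: float,
                      mix: Iterable[Tuple[float, float]]) -> float:
    """Packets per second for a traffic mix of (share of packets, instructions per packet)."""
    mix = list(mix)
    if abs(sum(share for share, _ in mix) - 1.0) > 1e-9:
        raise ValueError("packet shares must sum to 1")
    average_cost = sum(share * cost for share, cost in mix)
    return instr_per_second / average_cost


# Hypothetical budget: 1e12 instructions per second across all threads.
simple_only = packet_rate(1e12, 100)                        # light per-packet work
half_heavy = mixed_packet_rate(1e12, [(0.5, 100), (0.5, 300)])
```

With these numbers, moving half of the traffic to an application that needs three times the work halves the achievable packet rate instead of making the program undeployable.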
3. Shared memory system
Recent Trio chipsets support several GBytes of memory per PFE. This section provides an overview of Trio’s shared memory system.
Advantages of shared memory. For switches and routers, some data structures, such as counters and (traffic) policers, need to be modified at a high rate. To support efficient access to these data structures by hundreds of PPE threads, Trio’s shared memory system is the single place where all threads access and modify this data.
All data accesses (read, write, and read-modify-write) to the shared memory system are handled by a read-modify-write engine located near the shared memory system. When multiple threads access the same memory location at the same time, there is no need to move data from one thread to another. Instead, data modification happens inside the read-modify-write engine. This allows high-speed updating of data near memory, well suited for packet processing applications.
In contrast, the cache-line-based coherency model used by traditional processors requires data to be moved into threads during access; this can cause long delays when multiple threads try to modify the same memory location. While this model can support more complex and general operations on data, it performs poorly for data structures that can be accessed by hundreds of threads.
Memory types. The Trio memory system is optimized to provide high access rates for relatively small (8-byte) requests. To achieve the desired combination of bandwidth, latency, and capacity, the memory system uses two types of memory, as shown in Figure 3: (i) high-bandwidth on-chip memory with an access latency of about 70 ns from the PPE; and (ii) large, high-bandwidth off-chip memory based on DRAM, with an access latency of around 300 ns to 400 ns from the PPE. On-chip memory is implemented as heavily multi-banked SRAM and is typically used for frequently accessed data structures.
The off-chip memory is fronted by an on-chip cache of several megabytes which, like the on-chip SRAM, is heavily multi-banked to provide high throughput. The on-chip SRAM and off-chip DRAM cache sizes are software configurable (typically 2-8 MBytes and 8-24 MBytes, respectively). Off-chip DRAM capacity is a few gigabytes. On-chip and off-chip memories are architecturally equivalent and occupy different ranges of a unified address space. They differ only in capacity, latency, and available bandwidth. This allows data structures to be placed in the type of memory that best matches their capacity and bandwidth requirements.
Memory transactions. The memory system supports read and write operations of different sizes, from 8 bytes to 64 bytes (in 8-byte increments). Trio can sustain the full memory system bandwidth with 8-byte accesses. In addition, a rich set of read-modify-write operations is supported, including packet/byte counters, policers, logical fetch-and-op (And/Or/Xor/Clear), fetch-and-swap, masked write, and 32-bit add. These operations are carried out by the read-modify-write engine, described below.
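As a rough guide to what these operation names mean, the Python sketch below models a few of them on 32-bit words. It shows the general idea of fetch-and-op style primitives rather than Trio’s exact semantics: the `SharedMemory` class, the word size, and in particular the reading of “Clear” as “clear the bits set in the operand” are assumptions.

```python
MASK32 = 0xFFFFFFFF


class SharedMemory:
    """Toy word-addressed memory offering read-modify-write primitives."""

    def __init__(self, num_words: int) -> None:
        self.words = [0] * num_words

    def fetch_and_op(self, addr: int, op: str, operand: int) -> int:
        """Apply a logical operation in place and return the previous value."""
        old = self.words[addr]
        if op == "and":
            new = old & operand
        elif op == "or":
            new = old | operand
        elif op == "xor":
            new = old ^ operand
        elif op == "clear":
            new = old & ~operand      # assumption: clear the bits set in operand
        else:
            raise ValueError(f"unknown operation: {op}")
        self.words[addr] = new & MASK32
        return old

    def fetch_and_swap(self, addr: int, value: int) -> int:
        """Store a new value and return the previous one."""
        old = self.words[addr]
        self.words[addr] = value & MASK32
        return old

    def masked_write(self, addr: int, value: int, mask: int) -> None:
        """Overwrite only the bits selected by `mask`."""
        old = self.words[addr]
        self.words[addr] = ((old & ~mask) | (value & mask)) & MASK32

    def add32(self, addr: int, delta: int) -> None:
        """32-bit add that wraps around on overflow."""
        self.words[addr] = (self.words[addr] + delta) & MASK32
```

Each method is a single request from the caller’s point of view. The caller never reads a value, modifies it locally, and writes it back, which is where lost updates would otherwise creep in.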
Read-modify-write engine. Packet processing requires extremely high-rate read-modify-write operations. Processing a packet may involve updates to multiple counters, actions on one or more policers, and other actions required by the application. The naive way to handle a read-modify-write operation is to have a thread take ownership of a memory location while the operation is in progress, but this approach cannot meet the efficiency requirements of packet processing. Instead, Trio offloads read-modify-write operations to its memory system, where each read-modify-write engine handles a range of memory locations.
If multiple requests for the same memory location arrive at the same time, the engine processes them one after another to guarantee consistent updates. When read, write, and read-modify-write operations to a memory location are mixed, no explicit coherence command needs to be issued. Each read-modify-write engine processes memory requests at 8 bytes per clock cycle. A single read-modify-write engine for the entire shared memory system therefore could not provide the memory bandwidth needed to process packets at a high enough rate. To address this, Trio has multiple banks of SRAM and off-chip cache, each with its own read-modify-write engine, so that read-modify-write processing bandwidth scales with raw memory bandwidth.
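A toy model helps show why several engines are needed and why requests to one location are still serialized. In the Python sketch below, each engine owns a share of the address space and retires one request per cycle. The interleaving of addresses across engines (`addr % num_engines`) and the one-request-per-cycle accounting are simplifying assumptions, not details taken from the article.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class RmwEngine:
    """Owns part of the address space and applies one request per cycle."""

    def __init__(self) -> None:
        self.queue: Deque[Tuple[int, int]] = deque()
        self.memory: Dict[int, int] = {}

    def submit(self, addr: int, delta: int) -> None:
        self.queue.append((addr, delta))

    def step(self) -> None:
        if self.queue:
            addr, delta = self.queue.popleft()
            # The update happens here, next to the memory, in arrival order.
            self.memory[addr] = self.memory.get(addr, 0) + delta


class BankedMemory:
    def __init__(self, num_engines: int) -> None:
        self.engines: List[RmwEngine] = [RmwEngine() for _ in range(num_engines)]

    def _engine_for(self, addr: int) -> RmwEngine:
        return self.engines[addr % len(self.engines)]   # assumed interleaving

    def add(self, addr: int, delta: int) -> None:
        self._engine_for(addr).submit(addr, delta)

    def read(self, addr: int) -> int:
        return self._engine_for(addr).memory.get(addr, 0)

    def run(self) -> int:
        """Step all engines in lockstep until idle; return the cycles used."""
        cycles = 0
        while any(engine.queue for engine in self.engines):
            for engine in self.engines:
                engine.step()
            cycles += 1
        return cycles
```

In this model, one hundred increments aimed at a single counter take one hundred cycles however many engines exist, while the same number spread over four engines finish in twenty-five. No update is lost in either case, because each location is only ever modified by its own engine.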
Crossbar and shared memory performance. Trio’s Crossbar is designed to support all read-modify-write engines, so Crossbar itself does not limit memory performance. If the load presented to a read-modify-write engine exceeds the throughput of 8 bytes per cycle, backpressure is generated through the Crossbar. Juniper has increased the number of read-modify-write engines in each generation of Trio chips so that memory bandwidth increases with packet processing bandwidth.
03 Trio’s programming environment
This section provides an overview of Trio’s programming environment. Section 1 describes the Trio programming language and the programming toolchain for Trio devices. Section 2 provides an example of packet filtering programmed in Trio microcode.
1. Trio’s programming language and toolchain
The programming language for Trio-based devices is a C-like language called Microcode. Programmers implement all packet processing operations in Microcode, including packet parsing, route lookup, packet rewriting, and in-network computations (if any). Figure 4 shows the tools needed to program a new application on Trio. To program a new application on the Trio, the programmer uses the Microcode language to write the new application and adds the new Microcode program to the existing codebase. The programmer then uses Trio’s compiler to generate a software image and configure the target device.
Expression syntax. Microcode supports C-style expressions. Supported variable types include scalar types (label, bool, and integers of various sizes) and composite types (struct and union). Microcode also supports pointers and arrays, conditionals, function calls, gotos, and switch statements.
Instruction boundaries. A Microcode program consists of multiple instructions. Each Microcode instruction can perform only a limited amount of work, and the programmer must explicitly specify instruction boundaries. Typically, a Microcode instruction can read four registers or two local memory locations, and write two registers or two local memory locations.
Variable storage class. When defining a new variable in Microcode, the programmer needs to specify where to store the variable. There are three types of variable storage classes: memory (the PPE’s local memory and registers), bus (representing variables as input to the ALU), and virtual (representing constant values). Access to data stored in shared memory systems, such as forwarding tables, is achieved through external transactions as specified below.
External transactions. A PPE can issue external transactions (XTXNs) to other modules via the Crossbar, such as the shared memory system, hash lookup/insert/delete, high-performance filters, and counter/policer blocks. XTXNs can be synchronous or asynchronous. With a synchronous XTXN, the PPE thread is suspended until the XTXN reply is received; with an asynchronous XTXN, the PPE thread continues to run. A PPE can also fetch data from the packet tail through XTXNs. In this case, the packet tail is sent from the memory and queuing subsystem, through the Crossbar, to the PPE’s local memory.
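The difference between synchronous and asynchronous transactions can be mimicked with Python’s `asyncio`. This is an analogy, not Microcode: `xtxn_read`, the dictionary standing in for shared memory, and the sleep that models Crossbar and memory latency are all invented for the sketch.

```python
import asyncio
from typing import Dict, Tuple


async def xtxn_read(memory: Dict[int, int], addr: int,
                    latency: float = 0.01) -> int:
    """Stand-in for a read request sent to the shared memory system."""
    await asyncio.sleep(latency)      # models Crossbar plus memory latency
    return memory[addr]


async def thread_sync(memory: Dict[int, int]) -> int:
    # Synchronous style: the thread is suspended at each request
    # until that request's reply has arrived.
    a = await xtxn_read(memory, 0)
    b = await xtxn_read(memory, 1)
    return a + b


async def thread_async(memory: Dict[int, int]) -> Tuple[int, int]:
    # Asynchronous style: issue both requests, keep executing,
    # and collect the replies only when they are needed.
    pending_a = asyncio.create_task(xtxn_read(memory, 0))
    pending_b = asyncio.create_task(xtxn_read(memory, 1))
    local_work = sum(range(10))       # work that does not need the replies
    a = await pending_a
    b = await pending_b
    return a + b, local_work


memory = {0: 5, 1: 7}
sync_result = asyncio.run(thread_sync(memory))
async_result = asyncio.run(thread_async(memory))
```

Both styles compute the same sum. The asynchronous version overlaps the two request latencies with each other and with local work, which is the benefit the text attributes to asynchronous XTXNs.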
An XTXN consists of a request from the PPE to the target and a reply from the target back to the PPE. The format of an XTXN depends on the target block. For example, a read request sent to the shared memory system takes the memory address as an argument, and the data is returned in the XTXN reply register.
Compiler. To compile Microcode programs, programmers use a tool called the Trio Compiler (TC). TC maps the source code of an instruction to the various resources that the instruction can control, which includes mapping variables to their underlying storage and placing the instruction in Microcode memory within the PPE. TC has the characteristics of both a compiler and an assembler. On the compiler side, TC supports the translation of high-level C-style expressions into hardware instructions.
On the assembler side, the source code given to TC must contain the division into instructions: the programmer marks the start and end of each block of code that represents one instruction. If the code assigned to one instruction does not fit in that instruction, compilation fails, because TC cannot spread the required actions across multiple instructions. TC does not have separate compilation and linking stages. It requires the full source code, rather than individual modules, to generate a binary. This binary contains data to initialize PPE resources, such as Microcode memory and local memory. It also defines required symbols, such as the address in local memory where the packet header starts. This binary is part of the Junos software image that Trio’s ASIC driver uses for device initialization.
vMX virtual router. Juniper Networks is working to give third parties access to programming Trio-based devices. As a first step, Juniper Networks developed the vMX virtual router. vMX is a virtualized universal routing platform consisting of a virtual control plane (VCP) and a virtual forwarding plane (VFP). The VCP is powered by the Junos operating system, and the VFP runs a Microcode engine optimized for the x86 environment. vMX is available as licensed software and can be deployed on x86-based servers and on cloud services such as Amazon Web Services.
Advanced forwarding interface. In Trio, packet forwarding is a sequence of operations performed by the PFE. Each operation can be represented by a node on the graph of potential packet forwarding operations. The PFE performs a series of operations on a single packet based on its type/field. The Juniper Networks Advanced Forwarding Interface (AFI) provides partial programmability, allowing third-party developers to control and manage part of this forwarding path graph through a small virtual container called a sandbox. The sandbox enables developers to add, remove, and change the sequence of operations for specific packets.
2. Microcode program example (omitted)
04 Trio discussion and future use cases
Trio for in-network telemetry. Most network operators require telemetry, or insight into the traffic in their networks, for capacity planning, service level agreement monitoring, security mitigation, and other purposes. Current network devices typically rely on packet sampling, with further processing done by an internal processor embedded in the device or by external monitoring equipment. Due to the high volume of traffic passing through the device and the limited processing and bandwidth available for monitoring, only a small fraction of packets (1 in tens of thousands, or fewer) are selected for monitoring, and the decision to sample a packet is often made blindly, based on a simple time interval. The flexibility of Trio’s packet processing and the availability of operational resources make it suitable for in-network telemetry. For example, service providers can take advantage of Trio’s large memory to track incoming packets and keep enough information for telemetry. In addition, Trio’s timer threads are suitable for periodic monitoring and anomaly analysis. To provide network operators with smarter telemetry, machine learning-based classification techniques can be applied to each packet based on the packet fields that Trio has already extracted for routing. Finally, data structures can be stored more efficiently, reducing the transmission bandwidth and processing cycles required of external monitoring devices.
Trio for in-network security. To mitigate DDoS attacks, Trio-based MX systems can identify and drop malicious packets, leveraging the chipset’s high performance and flexible packet filtering mechanisms. Trio also acts as a flow-based secure fast-forwarding path on the SRX security platform. Trio can perform additional sophisticated in-network security processing on incoming packets, by aggregating features or running inference on ML models installed by service providers, to identify and mitigate anomalies in traffic. Unlike appliance-based solutions, Trio’s programmable architecture performs anomaly detection on the network data path, enabling low-latency threat mitigation.
Packet loss in Trio-ML. Data centers running a variety of applications can experience transient traffic spikes, which in turn can cause loss of aggregation packets. A practical in-network aggregation system requires a degree of resilience that enables long-running jobs to survive such events. SwitchML suggests how to achieve this resiliency. The Trio-ML implementation has provisions to support this solution, but it is not part of the current code; we leave it to future work.
Future open source initiatives. We are considering several future open source ideas. First, we plan to add full support for P4 programming to Trio. Juniper Engineering has made initial efforts toward this goal, but recent revisions and enhancements to the P4 core specification should allow more flexibility and more functionality to be exposed through the P4 interface. Second, we plan to create a domain-specific language that enables third-party developers to use the full forwarding path capabilities of the Trio chipset. Juniper Networks is exploring developments in this area and welcomes feedback from the community.
In-network computing using programmable switches. Several previous papers have proposed in-network computing by exploiting some form of programmability within the network. These approaches fall into two categories: (1) using PISA-based architectures for wire-speed computing [18, 48, 63]; (2) using on-chip FPGAs for sub-wire-speed computing. Our in-network ML aggregation use case is closely related to Sharp, SwitchML, ATP, PANAMA, and Flare. Sharp is Mellanox's proprietary design for dedicated ML training clusters; it assumes that network bandwidth can be fully reserved. In contrast, we consider a network where multiple users and applications share links. SwitchML and ATP use commercially available Tofino switches to perform gradient aggregation. Although Tofino switches can process packets at wire speed, their pipeline structure has limited programmability, which makes straggler mitigation within the network extremely challenging. We use SwitchML as the baseline for Trio-ML. For our use case, SwitchML is an apples-to-apples comparison, making it a more appropriate baseline than ATP. More specifically, ATP's performance improvement depends on both in-network aggregation and additional parameter servers, whereas SwitchML and Trio-ML are more alike in that both use only switches or routers for aggregation. PANAMA's in-network aggregation hardware can support flexible packet processing, but it is based on FPGAs acting as a bump-in-the-wire, making it impractical for large-scale deployments. In contrast, this paper aims to leverage Trio's programmable architecture to design new stateful in-network applications from scratch. Several key features of Trio make these new applications possible. First, Trio's large memory and fast access to packet tail data enable efficient computation within the network.
Second, Trio's shared memory system provides several gigabytes of storage, which is more than enough for data storage even when there are stragglers or when multiple applications run at the same time. Finally, Trio places no limit on the number of instructions executed for a single packet, so Microcode programs can issue all the computational instructions required for large packets.
Straggler mitigation. There has been considerable prior work on understanding and mitigating the effects of stragglers in distributed systems [13-15, 25, 26, 30, 31, 33, 34, 38, 46, 51, 56, 57, 59, 61, 72, 75, 78, 79]. In particular, Harlap et al. proposed FlexRR to mitigate the impact of stragglers on distributed learning jobs. FlexRR requires peer-to-peer communication between workers to detect slow workers and reassign work. In contrast, we consider mitigating the effects of stragglers inside the network, without any messaging between workers and without parameter servers. Tandon et al. and Raviv et al. proposed a coding-theoretic framework that alleviates stragglers in distributed learning by duplicating training data among workers; Trio-ML, however, does not require data duplication.
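This section does not describe how Trio-ML detects or handles slow workers, so the following Python sketch shows only one plausible way in which stragglers could be handled inside the network without any worker-to-worker messaging: the aggregation point sums gradients as they arrive and, if a timer expires before every worker has contributed, emits a partial result. The class `BlockAggregator`, its methods, and the logical `now` timestamps are all hypothetical.

```python
class BlockAggregator:
    """Toy in-network aggregator for one gradient block (hypothetical design)."""

    def __init__(self, num_workers, timeout):
        self.num_workers = num_workers
        self.timeout = timeout
        self._reset()

    def _reset(self):
        self.total = None
        self.contributors = set()
        self.first_arrival = None

    def _emit(self):
        """Return the aggregate and who contributed, then clear the block."""
        result = (self.total, sorted(self.contributors))
        self._reset()
        return result

    def on_gradient(self, worker_id, gradient, now):
        """Add one worker's gradient; return a result once all workers arrived."""
        if worker_id in self.contributors:
            return None  # ignore duplicates, e.g. retransmissions
        if self.total is None:
            self.total = list(gradient)
            self.first_arrival = now
        else:
            self.total = [a + b for a, b in zip(self.total, gradient)]
        self.contributors.add(worker_id)
        if len(self.contributors) == self.num_workers:
            return self._emit()
        return None

    def on_timer(self, now):
        """Periodic check: emit a partial result if the block has aged out."""
        if self.first_arrival is not None and now - self.first_arrival >= self.timeout:
            return self._emit()
        return None


# Three workers are expected, but worker 2 is a straggler and never reports.
agg = BlockAggregator(num_workers=3, timeout=5)
r1 = agg.on_gradient(0, [1, 2], now=0)
r2 = agg.on_gradient(1, [3, 4], now=1)
early = agg.on_timer(now=3)    # block is not old enough yet
partial = agg.on_timer(now=6)  # timeout reached: partial aggregate is emitted
```

The workers never exchange messages with each other here; the decision to stop waiting is made entirely at the aggregation point, which is the property the paragraph above contrasts with FlexRR.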
Alternatives to the traditional switch architecture. The research community has been investigating alternative switch architectures to address some of the limitations of PISA-based architectures, such as the lack of shared memory and shallow pipeline depth. The most competitive example is dRMT (Disaggregated Programmable Switching). The dRMT switch architecture implements a centralized, shared memory pool that is accessible to all match-action stages. Instead of executing match-action stages in a pipeline, dRMT groups these stages into a cluster and executes them in round-robin order.
A control logic unit schedules these stages to maximize cluster throughput while respecting program dependencies. However, the centralized memory pool is reached through a crossbar that connects the stages to the memory, and only one stage can access the memory in a given clock cycle. When an application needs to access memory from multiple stages, this can slow program execution.
In Trio, multiple threads can send memory access requests to the same memory location at the same time, and Trio's read-modify-write engine processes these requests one after another, ensuring that updates stay consistent. Additionally, dRMT's memory accesses through the crossbar are scheduled at compile time, which reduces the flexibility to incrementally update and recompile application code.
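The idea of serializing concurrent updates through a single read-modify-write engine can be illustrated with a toy Python model. It is not a description of Trio's hardware: the class `ReadModifyWriteEngine`, the `add` request, and the address `0x10` are hypothetical, and a software queue stands in for whatever arbitration the real engine uses.

```python
import queue
import threading


class ReadModifyWriteEngine:
    """Toy model: one engine thread applies queued updates one at a time."""

    def __init__(self):
        self.memory = {}
        self.requests = queue.Queue()
        self.worker = threading.Thread(target=self._run)
        self.worker.start()

    def _run(self):
        while True:
            request = self.requests.get()
            if request is None:  # sentinel: stop the engine
                break
            address, delta = request
            # Read, modify, and write all happen inside the engine,
            # so two updates to the same address never interleave.
            self.memory[address] = self.memory.get(address, 0) + delta

    def add(self, address, delta):
        """Called by packet threads: hand the update to the engine."""
        self.requests.put((address, delta))

    def shutdown(self):
        """Process all outstanding requests, then stop the engine thread."""
        self.requests.put(None)
        self.worker.join()


def packet_thread(engine, address, count):
    """Stand-in for a packet-processing thread incrementing a shared counter."""
    for _ in range(count):
        engine.add(address, 1)


engine = ReadModifyWriteEngine()
threads = [
    threading.Thread(target=packet_thread, args=(engine, 0x10, 1000))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
engine.shutdown()
final_count = engine.memory[0x10]
```

Because the packet threads only submit requests and never touch the counter themselves, no update is lost even though all eight threads target the same address.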
The complexity of the crossbar scheduling algorithm can limit the ability of the architecture to scale to a larger number of match-action processors. In contrast, Trio's crossbar is scheduled in real time, providing efficient access to memory. This dynamic scheduling mechanism has enabled Trio to scale from 16 PPEs in the first generation to 160 PPEs in the sixth generation, and it will continue to scale in the future.
Furthermore, in dRMT, the packet parser and deparser are located outside the match-action processors. Any parsing of inner packet headers that depends on a lookup result (as with MPLS-encapsulated packets) requires the packet to be recirculated through the parser. In contrast, Trio's PPE is a fully programmable processor that handles packet parsing and deparsing, as well as the rest of packet lookup and processing, on a run-to-completion basis. Trio's multi-threaded PPE also allows packets to be processed by different Microcode programs according to their processing requirements.
This article describes Trio, Juniper's programmable chipset, and its use in emerging data-intensive networking applications. Trio has been in production for over a decade and has built a large customer base with a multi-billion dollar market share. We describe Trio's multithreaded, programmable packet forwarding and packet processing engine. We then illustrate Trio's Microcode and programming environment with two use cases: in-network aggregation for distributed machine learning training and in-network straggler mitigation. Our evaluations show that Trio outperforms today's pipeline-based solutions by a factor of 1.8.
Editor: Huang Fei