Why RDMA and TCP for GigE Vision are a step backwards

The GigE Vision standard uses a UDP-based protocol called GVSP (GigE Vision Stream Protocol) to transfer images over Ethernet. As data rates have increased, some manufacturers have had difficulty achieving full performance with GVSP, particularly when data rates approach or exceed 10 Gbps. This has led to experimentation with TCP and RDMA to mitigate those difficulties, but Emergent Vision sees these options as converging the GigE Vision standard with point-to-point interfaces like CoaXPress (CXP) and USB.

Zero copy image transfer

With GigE Vision, the problem lies in the need to dissect the many Ethernet packets at the receiver so that the image data reaches the application in contiguous form, which means splitting off the packet headers. This can be done in software, but at a large cost: triple the memory bandwidth and higher CPU utilization. (This, incidentally, is the baseline RDMA proponents use when comparing traditional GigE Vision with RDMA.) We avoid this cost by using the built-in header-splitting features of modern NICs (network interface cards) to perform Zero copy image transfer.
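To make the software path concrete, here is a minimal Python sketch of receiver-side reassembly: each packet's header is parsed and its payload is copied into a contiguous frame buffer. The packet layout (a 4-byte index followed by payload) is a toy format invented for illustration, not the real GVSP header, and the function names are hypothetical. The point is only that every payload byte is touched by the CPU, which is the work a header-splitting NIC takes over.

```python
import struct

# Toy header: 4-byte big-endian packet index.
# This is an assumption for illustration, NOT the real GVSP packet layout.
HEADER_FMT = ">I"
HEADER_LEN = struct.calcsize(HEADER_FMT)


def packetize(image: bytes, payload_size: int) -> list[bytes]:
    """Split an image into packets, each prefixed with the toy header."""
    packets = []
    for index, offset in enumerate(range(0, len(image), payload_size)):
        payload = image[offset:offset + payload_size]
        packets.append(struct.pack(HEADER_FMT, index) + payload)
    return packets


def reassemble(packets: list[bytes], image_size: int, payload_size: int):
    """Strip headers in software and copy payloads into one contiguous frame.

    Returns the frame and the number of payload bytes the CPU had to copy.
    """
    frame = bytearray(image_size)
    copied = 0
    for packet in packets:
        (index,) = struct.unpack_from(HEADER_FMT, packet)
        payload = memoryview(packet)[HEADER_LEN:]
        offset = index * payload_size
        frame[offset:offset + len(payload)] = payload  # the extra CPU copy
        copied += len(payload)
    return bytes(frame), copied


image = bytes(range(256)) * 10            # 2560-byte stand-in for an image
packets = packetize(image, payload_size=1000)
frame, copied = reassemble(list(reversed(packets)), len(image), 1000)
```

In this sketch `copied` equals the full image size, even when packets arrive out of order. With header splitting, the NIC places each payload at its final offset, so this copy loop disappears from the host software.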

GigE Vision with Support for TCP

TCP is one protocol explored by some to improve the performance of GigE Vision. Some even claim that it is a guaranteed transfer mechanism, which is completely false. TCP is not a Zero copy process, so it triples the required memory bandwidth. In addition, TCP is point-to-point, which converges GigE Vision with CXP and USB and all but eliminates its benefits over those interfaces, especially since CXP is adopting the Ethernet physical layer in newer revisions to address its own deficiencies. In all senses, TCP is a non-starter for performance applications.
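The "triple" figure can be put into numbers with a short Python sketch. The article does not break the factor down, so the reading used here is an assumption: on a copy-based path the NIC writes each packet to memory, the CPU reads it back, and the CPU writes the payload to its final buffer, giving three memory passes instead of one. The function name is hypothetical.

```python
def memory_traffic_gbps(stream_gbps: float, zero_copy: bool) -> float:
    """Estimate host memory traffic for one incoming image stream.

    Assumed model (not from the article): a zero-copy path touches memory
    once (NIC write); a copy-based path touches it three times
    (NIC write, CPU read, CPU write to the contiguous frame buffer).
    """
    passes = 1 if zero_copy else 3
    return stream_gbps * passes


for rate in (10, 25, 100):
    print(rate, memory_traffic_gbps(rate, True), memory_traffic_gbps(rate, False))
```

Under this model a single 25 Gbps stream costs about 75 Gbps of memory traffic on a copy-based path, and the gap grows linearly with the number of cameras per server.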

GigE Vision with Support for RDMA/RoCE

RDMA/RoCE (RDMA over Converged Ethernet) is another protocol explored by some for the same reasons. Some now claim that this is the guaranteed transfer mechanism, which is again false. RDMA is a Zero copy process, which is its primary benefit, but, as with TCP, it is a point-to-point protocol and incurs network overhead to support its connected nature. It is important to remember that RDMA and TCP were really designed for large data transfers on the internet, with many hops through switches and routers and with dropped and out-of-order packets. In machine vision, the systems are closed, with controlled routing if switches are used. Note also that TCP and RDMA are far from being ratified into the GigE Vision standard. Rest assured that Emergent, as the high-speed Ethernet camera leader, will integrate RDMA support if and when it is ratified. This would be a small effort and would be backward compatible with all existing products we sell and support.

Image: Emergent Vision Technologies

Zero copy GigE Vision with mature GVSP Protocol

Zero copy with header splitting is indeed possible with modern NICs from Nvidia/Mellanox, Broadcom, Intel, and Marvell. Emergent has implementations deployed with Nvidia/Mellanox and Broadcom, the primary NICs explored by those experimenting with RDMA/RoCE, which eliminates any concerns surrounding interoperability. In fact, Emergent has been using this same method for over 15 years and has the maximum design-in densities of any interface standard, with reliability to match. The same approach is also used for SMPTE ST 2110 in the massive media and entertainment market.

Zero copy does not guarantee zero data loss in any interface or protocol implementation. Any performance system still needs proper design and margining to achieve the desired results. This goes for CXP, RDMA/RoCE, and even optimized GVSP implementations. But we can guarantee that an optimal GVSP implementation will equal or better RDMA/RoCE without turning GigE Vision into a point-to-point protocol and eliminating what has made GigE Vision the most popular interface over the years. It is important to note that when the retransmission feature of RDMA is engaged, it is a sign of a backup in the system, which in turn often means undesired latency and jitter. It is also important to note that CXP doesn’t use resends or flow control, yet it is able to sustain high data transfer rates with optimal receiver performance and low latency and jitter. Much of this can be attributed to adequate buffering on the purpose-built frame grabbers required for CXP. Low-cost NICs often lack sufficient buffering capability; however, modern NICs with ample physical buffering are readily available at cost-effective price points.

It is worth noting that at 25 Gbps and higher, PoE (Power over Ethernet) is dead. Thus, new deployments should focus on SFP technologies and distributed power systems. It is also noteworthy that, even at 10GigE speeds, the big NIC providers do not support PoE, which forces camera vendors to sell their proprietary card solutions.

GPU Direct – Better than Zero copy

Zero copy minimizes CPU and memory bandwidth utilization by writing to memory only once, but we can avoid that transfer altogether by writing directly to the GPU. This is called GPU Direct. It makes sense in many performance applications to send data directly to the GPU for processing and then pass the lower-bandwidth results to the CPU and memory for user or system interaction. Emergent has supported GPU Direct with Nvidia GPUs on Windows and Linux for over four years in a variety of applications. Nvidia RTX A6000/A5000/A4000, Orin, and Xavier are used in many applications with Emergent cameras. Unfortunately for RDMA users, Nvidia/Mellanox only allows GPU Direct on Windows for select partners such as Emergent, and Windows is where 80% of machine vision applications continue to be deployed. Linux, however, does remain an option for RDMA with GPU Direct for all.

Image: Emergent Vision Technologies

Integrated interface and processing cards

Zero copy is great. GPU Direct improves on this. But the ultimate achievement would be to receive and process the data from the cameras all on one card. In this case, the CPU, memory, and other server resources are not used at all. Emergent supports AMD/Xilinx Alveo cards for this very purpose and has multiple performance applications leveraging this technology. Emergent is also working closely with Nvidia to bring BlueField NIC support. Think of BlueField as the merging of Nvidia NICs with Nvidia GPUs. In both cases, the computer can be a very low-end PC which primarily supplies power to the chosen card.

Multicast

While not used by all applications, many of Emergent’s largest deployments utilize the multicast feature of Ethernet. Point-to-point protocols like TCP and RDMA cannot, by their nature, support multicasting. RDMA does have other modes of operation which essentially remove its flow control and packet retransmission features, but this is tantamount to the current GVSP standard. Multicast has two primary benefits: redundancy and distributed processing. Redundancy gives critical systems the fastest failover to avoid downtime. In larger systems, switches are present and backup servers can be set up to take over camera streams when one server has a problem. Distributed processing becomes especially important as the number and speed of cameras increase, and it also depends very much on the type of processing required. Some applications will simply take multicast camera data to another system for display while the heavy processing is done in other systems. Even on the same server, the switch can send virtually zero-delay copies to different GPUs for parallel processing. It is nice to start with a technology that allows for such an architecture even if it is not immediately required. One representative deployment uses 240 Emergent Bolt HB-25000SB 25GigE 25MP cameras running at 90fps across six mid-range servers. That is 40x 25GigE cameras per mid-range server, which is unparalleled and miles ahead of any solution out there.
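As a rough illustration of why multicast suits redundancy and distributed processing, the Python sketch below shows how any number of receivers can subscribe to the same UDP stream simply by joining a group; the sender does not need to know how many there are. The group address and port in the usage note are hypothetical, and a plain socket like this is nowhere near fast enough for a 25GigE camera stream. It only shows the join mechanism using the standard `socket` module.

```python
import socket
import struct


def membership_request(group: str, interface: str = "0.0.0.0") -> bytes:
    """Pack the ip_mreq structure expected by IP_ADD_MEMBERSHIP:
    4 bytes of group address followed by 4 bytes of local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def open_multicast_receiver(group: str, port: int,
                            interface: str = "0.0.0.0") -> socket.socket:
    """Open a UDP socket and join a multicast group on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Allow several receivers on the same host to bind the same port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, interface))
    return sock
```

A primary server and a backup server could each call `open_multicast_receiver("239.192.0.1", 50000)` (illustrative values) and both would receive the same stream from the switch, which is the basis of the failover and parallel-processing arrangements described above.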

The big picture

Many camera manufacturers focus their attention on enabling the transfer of data from camera to receiver. They claim success once sensor data has arrived in system memory, leaving the integrator or customer responsible for moving that data to processing nodes. In some applications, system memory and the CPU are sufficient for managing and processing the incoming data streams, particularly when post-processing can be employed. In others, however, particularly where multiple 10GigE, 25GigE, or 100GigE streams are in use, real-time processing requires offload technologies that move the data to more capable processing nodes. This seldom comes up in the concepts and proposals for alternative interfaces or protocols. We need to see the big picture. Over the past 15 years, Emergent has pioneered and developed 10GigE, 25GigE, and 100GigE area scan and line scan cameras and created an ecosystem to support the most reliable, highest-speed imaging applications.