This post is a follow-up to “Implementing a CPU-based real-time video analytics pipeline,” where we discussed a CPU-based end-to-end video analytics pipeline. As seen in that post, a CPU-based pipeline runs into severe performance bottlenecks. Here we discuss how we address and overcome these bottlenecks using FPGAs as hardware accelerators. We explain Megh’s Video Analytics Solution (VAS) stack, which offloads the compute-intensive stages of the video analytics pipeline to the FPGA. We also provide a competitive analysis, comparing it with the CPU-based solution.
The computational requirements for real-time stream processing map very well to FPGA architecture, making FPGAs the preferred hardware accelerator for real-time analytics. The stages of the video analytics pipeline remain unchanged from the CPU reference pipeline. However, all three stages—ingestion, transformation, and inference—are processed inline on the FPGA while the CPU remains mostly idle. For the solution presented here, the pipeline is split across two Intel A10 FPGA PACs. The first two stages are performed on the first FPGA and the last stage on the second.
During the ingestion stage, UDP data packets received from a remote video streaming source are filtered and ingested directly into the FPGA through the onboard NIC. During the transformation stage, which follows immediately, the raw data packets are decoded into image frames and resized. In the final stage, the resized frames are passed through Megh’s Deep Learning Engine to extract the inferred labels, which are then sent to the host for post-processing.
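To make the data flow concrete, here is a minimal software model of the three stages chained as streaming generators. It is only an illustration of the stage ordering, not Megh’s FPGA implementation: the packet layout, the port number, the frame size, and the toy "inference" rule are all assumptions made up for this sketch.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical packet layout for this sketch: (destination UDP port, payload).
Packet = Tuple[int, bytes]


def ingest(packets: Iterable[Packet], port: int) -> Iterator[bytes]:
    """Ingestion: keep only the payloads addressed to the video stream's port."""
    for dst_port, payload in packets:
        if dst_port == port:
            yield payload


def transform(payloads: Iterable[bytes], size: Tuple[int, int]) -> Iterator[Dict]:
    """Transformation: stand-in for decode + resize; tags each frame with a target size."""
    for frame_id, payload in enumerate(payloads):
        yield {"frame_id": frame_id, "size": size, "data": payload}


def infer(frames: Iterable[Dict]) -> Iterator[Tuple[int, str]]:
    """Inference: stand-in for a deep learning engine; emits (frame_id, label)."""
    for frame in frames:
        label = "non-empty" if frame["data"] else "empty"
        yield frame["frame_id"], label


# Each stage consumes the previous one lazily, so data flows through inline.
packets = [(5004, b"a"), (9999, b"x"), (5004, b"")]
labels = list(infer(transform(ingest(packets, port=5004), size=(224, 224))))
```

The packet sent to port 9999 is dropped at ingestion, so only two frames reach the inference stage. Only the labels leave the chain, which mirrors how only inferred labels are sent back to the host.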
Megh’s VAS stack facilitates FPGA offload of the entire pipeline, with minimal to no code changes to the CPU reference pipeline implementation.
The stack includes SIRA FPGA Hardware Libraries and ARKA Runtime. SIRA FPGA Hardware Libraries provide highly optimized accelerator function units that implement the various stages of a video analytics pipeline. ARKA Runtime exposes APIs to the application layer, allowing FPGA offload. These APIs provide methods for initializing the video analytics service and binding it to the target resources, loading weights dynamically, and registering a callback function to receive the inferred results from the device.
Here is the performance of the CPU+FPGA-based heterogeneous solution.
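The three API capabilities described above (initialization bound to target resources, dynamic weight loading, and callback registration) can be sketched with a mock service. This is not the ARKA Runtime API: the class name, method names, parameters, and the weights file name below are hypothetical and exist only to show the call pattern an application might follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Signature assumed for this sketch: callback(frame_id, labels).
ResultCallback = Callable[[int, List[str]], None]


@dataclass
class MockVideoAnalyticsService:
    """Hypothetical stand-in for a runtime handle; NOT the real ARKA Runtime API."""

    device_ids: List[int]                      # target resources the service binds to
    weights_path: Optional[str] = None         # currently loaded weights, if any
    _callback: Optional[ResultCallback] = field(default=None, repr=False)

    def load_weights(self, path: str) -> None:
        """Swap in a new set of weights without re-creating the service."""
        self.weights_path = path

    def register_callback(self, callback: ResultCallback) -> None:
        """Register the function that receives inferred results."""
        self._callback = callback

    def deliver(self, frame_id: int, labels: List[str]) -> None:
        """Simulate the device pushing one inference result to the host."""
        if self._callback is None:
            raise RuntimeError("no callback registered")
        self._callback(frame_id, labels)


results = []
service = MockVideoAnalyticsService(device_ids=[0, 1])   # bind to two devices
service.load_weights("model.bin")                        # hypothetical file name
service.register_callback(lambda fid, labels: results.append((fid, labels)))
service.deliver(0, ["person"])                           # simulated device result
```

The callback style keeps the host code passive: it does not poll the device, and post-processing runs only when a result arrives.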
Infrastructure | One server with a Xeon E5 3106 Bronze + 2x Arria10 PACs |
Video Specification | 1080p, H.264 Encoded |
Throughput | ~240 FPS |
Latency | <100ms |
Here is a cost comparison for the same throughput.
8 Channel | CPU | FPGA |
CPU | Xeon E5 8180 Platinum (2 servers) | Xeon E5 3106 Bronze (1 server) |
FPGA | None | 2x Arria10 PACs |
Throughput | 240 FPS** | 240 FPS |
Cost* | $60,000 ($30,000 x 2) | $16,000 ($6000 + 2 x $5000) |
* TCO analysis is based on server cost estimates provided by Dell.com.
** Projected throughput.
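The figures in the cost comparison can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the table; the per-channel rate assumes that "8 Channel" means eight concurrent video streams sharing the 240 FPS total, which the table does not state explicitly.

```python
# Costs from the comparison table (USD).
cpu_only_cost = 2 * 30_000            # two CPU-only servers
fpga_cost = 6_000 + 2 * 5_000         # one server plus two FPGA cards

# Throughput from the table; the channel count is an assumed reading of "8 Channel".
throughput_fps = 240
channels = 8

cost_ratio = cpu_only_cost / fpga_cost            # how many times cheaper the FPGA setup is
fps_per_channel = throughput_fps / channels       # per-stream frame rate under the assumption
cpu_cost_per_fps = cpu_only_cost / throughput_fps
fpga_cost_per_fps = fpga_cost / throughput_fps

print(f"cost ratio: {cost_ratio:.2f}x")
print(f"FPS per channel: {fps_per_channel:.0f}")
print(f"cost per FPS: CPU ${cpu_cost_per_fps:.2f}, FPGA ${fpga_cost_per_fps:.2f}")
```

With these inputs the cost ratio works out to 3.75, and the per-channel rate to 30 FPS.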
Our solution provides a 10x improvement in throughput over the CPU-only reference implementation, while reducing the total cost of ownership by more than 3x ($60,000 versus $16,000 at the same throughput). The inline processing of data allows us to achieve much lower, deterministic latencies.
This high-throughput, low-latency heterogeneous solution makes a compelling case for using FPGAs in video analytics.