Implementing a CPU+FPGA-based real-time video analytics pipeline

This post is a follow-up to “Implementing a CPU-based real-time video analytics pipeline,” where we built a CPU-based end-to-end video analytics pipeline. As we saw in that post, a CPU-only pipeline runs into severe performance bottlenecks. Here we discuss how we overcome these bottlenecks by using FPGAs as hardware accelerators. We explain Megh’s Video Analytics Solution (VAS) stack, which offloads the compute-intensive stages of the video analytics pipeline to the FPGA, and we compare its performance and cost against the CPU-based solution.

The computational requirements of real-time stream processing map very well to the FPGA architecture, making FPGAs the preferred hardware accelerator for real-time analytics. The stages of the video analytics pipeline remain unchanged from the CPU reference pipeline, but all three (ingestion, transformation, and inference) are processed inline on the FPGA while the CPU remains mostly idle. In the solution presented here, the pipeline is split across two Intel Arria 10 FPGA Programmable Acceleration Cards (PACs): the ingestion and transformation stages run on the first FPGA, and the inference stage runs on the second.

High-level architecture of the FPGA-based video analytics pipeline.

During the ingestion stage, UDP data packets received from a remote video streaming source are filtered and ingested directly into the FPGA through the onboard NIC. During the transformation stage, which follows immediately, the raw packets are decoded into image frames, which are then resized. In the final stage, the resized frames are passed through Megh’s Deep Learning Engine to extract inference labels, which are sent to the host for post-processing.
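The three stages above form a simple dataflow. The sketch below is purely illustrative: in the actual solution these stages run inline on the FPGAs, and all function names and data shapes here are hypothetical stand-ins, not Megh's implementation.

```python
# Illustrative sketch of the three-stage dataflow (ingestion ->
# transformation -> inference). All names here are hypothetical;
# in the real pipeline each stage runs on the FPGA, not in Python.

def ingest(udp_packets):
    """Stage 1: filter and ingest raw UDP packets (done at the on-board NIC)."""
    return [p for p in udp_packets if p.get("valid")]

def transform(packets, size=(300, 300)):
    """Stage 2: decode packets into image frames and resize them."""
    return [{"frame": p["payload"], "size": size} for p in packets]

def infer(frames):
    """Stage 3: run resized frames through the inference engine."""
    return [{"label": "object", "frame": f} for f in frames]

# Only the inference labels are sent back to the host for post-processing.
packets = [{"valid": True, "payload": b"nal-unit-0"},
           {"valid": False, "payload": b"corrupt"},
           {"valid": True, "payload": b"nal-unit-1"}]
labels = infer(transform(ingest(packets)))
print(len(labels))  # 2 (the corrupt packet is filtered during ingestion)
```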

Megh’s VAS stack facilitates FPGA offload of the entire pipeline with minimal to zero code changes to the CPU reference pipeline implementation.

The solution stack.

The stack includes the SIRA FPGA Hardware Libraries and the ARKA Runtime. The SIRA FPGA Hardware Libraries provide highly optimized accelerator function units that implement the various stages of a video analytics pipeline. The ARKA Runtime exposes APIs to the application layer to enable FPGA offload. These APIs provide methods to initialize the video analytics service (binding it to the target FPGA resources), to load model weights dynamically, and to register a callback function that receives the inference results from the device.
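As a rough sketch of that callback-driven flow, consider the mock below. The class and method names are hypothetical and do not reflect ARKA Runtime's actual API; the sketch only shows the bind/load/register pattern described above.

```python
# Hypothetical sketch of a callback-driven runtime API of the kind
# described above. These names do NOT reflect ARKA Runtime's real API.

class AnalyticsService:
    def __init__(self, fpga_resources):
        # Initialization binds the service to the target FPGA resources.
        self.resources = fpga_resources
        self.weights = None
        self.callback = None

    def load_weights(self, weights):
        # Model weights can be loaded onto the device dynamically.
        self.weights = weights

    def on_result(self, callback):
        # Register a callback to receive inference results from the device.
        self.callback = callback

    def _deliver(self, result):
        # Stand-in for the device pushing an inference result to the host.
        if self.callback:
            self.callback(result)

results = []
svc = AnalyticsService(fpga_resources=["arria10-pac-0", "arria10-pac-1"])
svc.load_weights({"conv1": [0.1, 0.2]})
svc.on_result(results.append)
svc._deliver({"frame": 0, "label": "person", "score": 0.93})
print(results[0]["label"])  # person
```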

Here is the performance of the CPU+FPGA-based heterogeneous solution:

Infrastructure:       One server with a Xeon Bronze 3106 CPU + 2x Arria 10 PACs
Video specification:  1080p, H.264-encoded
Throughput:           ~240 FPS

Here is a cost comparison between the two solutions at the same throughput:

8 channels     CPU-only                           CPU+FPGA
CPU            Xeon Platinum 8180 (2 servers)     Xeon Bronze 3106 (1 server)
FPGA           -                                  2x Arria 10 PACs
Throughput     240 FPS**                          240 FPS
Cost*          $60,000 ($30,000 x 2)              $16,000 ($6,000 + 2 x $5,000)

* TCO analysis is based on server cost estimates provided by
** Projected throughput.

Our solution provides a 10x improvement in throughput over the CPU-only reference implementation, while reducing the total cost of ownership by 3x. The inline processing of data allows us to achieve much lower, deterministic latencies.
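Based only on the hardware prices listed in the table (excluding power, cooling, and other TCO components), the cost figures work out as follows:

```python
# Hardware cost comparison from the table above, per 8 channels at ~240 FPS.
cpu_only = 30_000 * 2          # two Xeon Platinum 8180 servers
cpu_fpga = 6_000 + 2 * 5_000   # one Xeon Bronze 3106 server + 2 Arria 10 PACs
print(cpu_only, cpu_fpga, round(cpu_only / cpu_fpga, 2))  # 60000 16000 3.75
```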

Throughput of the entire pipeline (ingestion + transformation + inference) in frames per second (FPS).
Total cost of ownership per video channel processed.

This high-throughput, low-latency heterogeneous solution makes a compelling case for the use of FPGAs for video analytics use cases.


Megh Computing

Enabling the 3rd Wave of Computing in the Data Center