Megh’s flexible, high-performance deep learning engine

As deep learning (DL) becomes more pervasive, the need for fast, efficient computation is increasing. Central processing units (CPUs) and graphics processing units (GPUs) are typically used for acceleration, despite the limits imposed by their fixed architectures. Field-programmable gate arrays (FPGAs) have long been noted for their flexibility, but until recently they fell short because of lower floating-point arithmetic performance and the specialized expertise needed to program them. FPGAs are now becoming more popular because of their better reduced-precision performance and power utilization compared to CPUs and GPUs, and because optimized libraries are available. They also support tight integration with data ingest, allowing low latency and high throughput for streaming applications, especially for single-datum (batch size 1) inference.

Megh is developing an AI solution that is tightly integrated with the Megh Computing Platform and brings DL into customer pipelines across multi-FPGA deployments. The Deep Learning Engine (DLE) has been designed from the ground up for streaming inference. It consists of a library of high-performance, mixed-precision DL primitives that are drop-in replacements for TensorFlow and PyTorch layers. Our approach optimizes heavily for deployment criteria such as desired power, area, performance, and number of FPGAs. The result is a thin control scheme in which the order of data ingest determines the order of execution. Nearly all the control structures typically required in DL accelerators are removed, leaving as much FPGA area as possible for arithmetic units. These units are allocated on a layer-by-layer basis, so that the resulting hardware runs at near-100% computational efficiency.
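The layer-by-layer allocation described above can be illustrated with a small sketch. This is not Megh's allocator: the greedy strategy, the function names (`allocate_units`, `pipeline_efficiency`), and the example layer costs are all assumptions chosen for illustration. The idea is that in a streaming pipeline every layer processes one frame per interval, so arithmetic units are best handed to whichever layer is currently the bottleneck.

```python
from math import ceil


def allocate_units(layer_macs, total_units):
    """Greedily split a budget of multiply-accumulate (MAC) units across layers.

    layer_macs maps a layer name to the MACs it needs per input frame.
    Hypothetical strategy: start with one unit per layer, then repeatedly
    give one more unit to the layer that needs the most cycles per frame.
    """
    if total_units < len(layer_macs):
        raise ValueError("need at least one unit per layer")
    alloc = {name: 1 for name in layer_macs}
    for _ in range(total_units - len(layer_macs)):
        bottleneck = max(alloc, key=lambda n: layer_macs[n] / alloc[n])
        alloc[bottleneck] += 1
    return alloc


def pipeline_efficiency(layer_macs, alloc):
    """Fraction of unit-cycles doing useful work at the bottleneck frame rate."""
    interval = max(ceil(layer_macs[n] / alloc[n]) for n in layer_macs)
    return sum(layer_macs.values()) / (interval * sum(alloc.values()))


# Illustrative per-frame MAC counts (made-up numbers, not a real model).
example_layers = {"conv1": 100, "conv2": 300, "fc": 100}
example_alloc = allocate_units(example_layers, total_units=5)
example_efficiency = pipeline_efficiency(example_layers, example_alloc)
```

With these made-up costs, the heaviest layer receives three of the five units, every layer finishes a frame in the same number of cycles, and no unit sits idle. Real designs also have to account for memory bandwidth and routing, which this sketch ignores.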

While power-user application developers can access the DL primitives directly, we also provide a DLE compiler that parses TensorFlow and PyTorch models and generates an optimized DLE configuration. Our primitive-based approach isn’t restricted to popular DL topologies and gives customers total flexibility when designing their networks. Our customers shouldn’t have to change their topology or hire HDL or system-integration experts to optimize, profile, and deploy their workloads. Our quantized-layer flow is 100% compliant with the TensorFlow Quantization Specification, with floating-point fallback. As a result, it requires no retraining or calibration beyond the standard TensorFlow or PyTorch quantization flow. This allows application developers to take advantage of FPGAs’ superior reduced-precision performance without specialized third-party retraining libraries.
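For readers unfamiliar with reduced-precision inference, the sketch below shows the affine int8 mapping used by the TensorFlow Lite 8-bit quantization scheme, where `real_value = (int8_value - zero_point) * scale`. It is a generic illustration, not DLE code: the helper names (`choose_params`, `quantize`, `dequantize`) and the example value range are assumptions.

```python
def choose_params(x_min, x_max, qmin=-128, qmax=127):
    """Pick scale and zero_point so [x_min, x_max] maps onto [qmin, qmax].

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate range: every value is 0.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an int8 code, saturating at the range limits."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the real value represented by an int8 code."""
    return (q - zero_point) * scale


# Hypothetical activation range for one layer.
scale, zero_point = choose_params(-1.0, 3.0)
code = quantize(0.5, scale, zero_point)
approx = dequantize(code, scale, zero_point)
```

Values inside the chosen range round-trip with an error of at most half a quantization step, and values outside it saturate. A floating-point fallback simply means that layers without a quantized implementation are computed in floating point instead.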

DLE is initially deployed in Megh’s Video Analytics Solution (VAS), which requires image classification and object detection using CNN models. The application streams video directly over Ethernet to an FPGA that performs H.264 video decode, then passes the raw RGB frames over Ethernet to the DLE in another FPGA. The solution greatly assists in inventory management, helping to detect fraud and to support restocking. Megh is also developing its Text Analytics Solution with RNN models for financial analytics, network analytics, and other use cases.
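Passing decoded frames between FPGAs moves far more data than the compressed stream, which a quick calculation makes concrete. The sketch below is only arithmetic: the 1080p resolution, 30 frames per second, and 10 Gbit/s link speed are assumed values for illustration, not figures from the VAS deployment.

```python
def raw_rgb_bits_per_second(width, height, fps, bytes_per_pixel=3):
    """Bandwidth needed to move uncompressed RGB frames, in bits per second."""
    return width * height * bytes_per_pixel * 8 * fps


def link_utilization(bits_per_second, link_bits_per_second):
    """Fraction of a link's raw capacity consumed by one stream."""
    return bits_per_second / link_bits_per_second


# Assumed stream: 1920x1080 at 30 frames per second, 24-bit RGB.
stream_bps = raw_rgb_bits_per_second(1920, 1080, 30)  # 1_492_992_000 bits/s
# Assumed inter-FPGA link: 10 Gbit/s Ethernet (protocol overhead ignored).
utilization = link_utilization(stream_bps, 10_000_000_000)
```

Under these assumptions a single decoded stream needs about 1.49 Gbit/s, more than a 1 Gbit/s link can carry but roughly 15% of a 10 Gbit/s link.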

Learn more about Megh’s DLE and other products at

Megh Computing

Enabling the 3rd Wave of Computing in the Data Center