Megh’s Deep Learning Engine usages

Video analytics use cases

Enterprise users are increasingly interested in implementing complex video analytics use cases that provide business value beyond typical applications. These involve multi-stage models for object detection and image classification with custom trained models that are integrated to solve business problems. Some examples include:

Segment

Use case

Deep learning tasks

Retail

Cashier-less store

Pose detection, object detection, classification, etc.

Manufacturing

Worker safety

Pose detection, object detection, classification, etc.

Entertainment

Tracking family

Object detection, re-identification, classification

Deep learning models for object detection and image classification

The two graphs below show the accuracy vs. complexity of implementing the newer models for object detection (EfficientDet) and image classification (EfficientNet) vs. the traditional models.

GPU implementation

The throughput of YOLOv3/v4 and EfficientDet object detection models on Nvidia V100 is shown below. These newer models use depth-wise convolutions of some form that do not map well to GPUs. YOLOv3/v4 operate at approximately 100 fps on the V100, while EfficientDet can only achieve approximately 50 fps (though it has much lower complexity).

For image classification, ResNet models can run at approximately 1000 fps on a T4, but EfficientNet models that can deliver higher Top1 accuracy at lower complexity run at approximately 50 fps for batch = 1 on a T4.

FPGA implementation

Unlike the GPUs, we are able to map these newer models to the FPGA, delivering very high throughput. The wider sparse networks map very well to the FPGA using DLE’s pipeline architecture. Below are the current and projected performance numbers for these networks on FPGAs.

Object detection models (fps)

Ver 1 (Sept)

Ver 2 (Oct)

Ver 3 (Nov)

MobileNetv2 SSD

1000

1300

2000

TinyYOLOv3

350

600

1000

TinyYOLOv4

370

650

1100

YOLOv3

90

110

200

YOLOv4

100

125

250

EfficientDet D0

250

600

900

Image classification models (fps)

Ver 1 (Sept)

Ver 2 (Oct)

Ver 3 (Nov)

MobileNetv2

1000

1400

2200

ResNet50

550

850

1200

EfficientNet B0

900

1200

1600

We are mapping these models to DLE in phases:

  • Ver 1 is the baseline performance by running the model through the current version of DLE compiler using on-chip (BRAM and URAM) and DDR memory for weights and the DSPs mapped for single multiplication.
  • Ver 2 is with DSP multiplier optimization. This will allow us to double the number of multiplications supported by sharing a DSP for two sets of operations. This should enable further scaling and more throughput.
  • Ver 3 is with dual clock optimization and involves running the DSPs at 400+ MHz and the core logic at half that speed. This will allow increasing throughput further.

Megh’s DLE implemented on FPGAs delivers more than 10x the performance for models like EfficicentNet and EfficientDet compared to GPUs. This higher throughput coupled with the flexibility and scalability of Megh’s Video Analytics Solution offers a very compelling value proposition for implementing complex video analytics use cases.

Learn more about Megh’s DLE and other products at megh.com.

Share this page
Share on facebook
Facebook
Share on google
Google+
Share on twitter
Twitter
Share on linkedin
LinkedIn