/ CNN

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

A series of notes, articles, and code for building, converting, and accelerating CNNs on embedded devices using FPGAs.

These methods are very important for OSSDC.org Mono or Stereo #SmartCamera designs.

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

https://dl.acm.org/citation.cfm?doid=3020078.3021741

http://www.csl.cornell.edu/~zhiruz/pdfs/bnn-fpga2017.pdf

Binarized Convolutional Neural Networks on Software-Programmable FPGAs
https://github.com/cornell-zhang/bnn-fpga

Python-based implementation of BNN:
Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1
https://github.com/MatthieuCourbariaux/BinaryNet
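
Constraining weights and activations to +1 or -1 is what makes these networks cheap in hardware: a dot product can be computed with a bitwise XNOR followed by a population count instead of multiply-accumulate units. The snippet below is only a minimal sketch of that idea in plain Python. It is not taken from BinaryNet or bnn-fpga, and the helper names `pack_bits` and `binary_dot` are made up for illustration.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into one integer.

    Bit i is 1 when values[i] == +1 and 0 when values[i] == -1.
    (Hypothetical helper, not part of any library.)
    """
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return word


def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XNOR marks the positions where both vectors agree (product +1);
    every other position contributes -1, so dot = 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ w_bits) & mask).count("1")
    return 2 * matches - n


activations = [1, -1, 1, 1]
weights = [1, 1, -1, 1]
packed = binary_dot(pack_bits(activations), pack_bits(weights), len(weights))
reference = sum(a * w for a, w in zip(activations, weights))
print(packed, reference)
```

On an FPGA the same XNOR and popcount map to LUTs rather than DSP multipliers, which is the usual motivation for binarization given in this line of work.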

A great list of papers and projects on accelerating neural network models through compression:

acceleration-model-compression
https://github.com/handong1587/handong1587.github.io/blob/master/_posts/deep_learning/2015-10-09-acceleration-model-compression.md

Binarized Neural Networks with Separable Filters for Efficient Hardware Acceleration
IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Jul. 2017.

https://arxiv.org/abs/1707.04693
https://arxiv.org/pdf/1707.04693.pdf

More papers from one of the authors:
http://www.csl.cornell.edu/~zhiruz/publications.html

Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

http://csl.yale.edu/~rajit/ps/hlsface.pdf

https://nitish2112.github.io/publication/face-detect/

https://github.com/cornell-zhang/facedetect-fpga

Accelerating Face Detection on Zynq-7020 Using High Level Synthesis

https://forums.xilinx.com/t5/Vivado-High-Level-Synthesis-HLS/Accelerating-Face-Detection-on-Zynq-7020-Using-High-Level/td-p/767203

Re: Accelerating Face Detection on Zynq-7020 Using High Level Synthesis
06-04-2017 01:23 PM

@heyuning the work necessary for optimization is not too difficult. Basically, most CV algorithms assume the full frame is available and accessible at one time, but this is not true for hardware implementations. You need to go through the loops in the algorithm and, for each one, ask: what is the access pattern here? What data does it use? Then store the minimum number of rows you need to compute the necessary partial products.
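
The advice in the reply above is commonly implemented as a line buffer: for a K x K window, only the last K rows of the frame are kept while pixels stream in. The following is a rough software sketch of that pattern, assuming a "valid" (no padding) sliding window and rows that arrive one at a time. The function name `stream_conv` is hypothetical, and the code is not from the facedetect-fpga project or the forum thread.

```python
from collections import deque


def stream_conv(rows, kernel):
    """Yield output rows of a 'valid' KxK sliding-window sum of products.

    `rows` is any iterable that produces one image row at a time, so the
    full frame never has to be in memory. Only the most recent K rows are
    held, mirroring a hardware line buffer.
    """
    k = len(kernel)
    line_buffer = deque(maxlen=k)  # the oldest row is dropped automatically
    for row in rows:
        line_buffer.append(list(row))
        if len(line_buffer) < k:
            continue  # not enough rows buffered yet for one window
        width = len(line_buffer[0])
        out = []
        for x in range(width - k + 1):
            acc = 0
            for dy in range(k):
                for dx in range(k):
                    acc += kernel[dy][dx] * line_buffer[dy][x + dx]
            out.append(acc)
        yield out


frame = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
for out_row in stream_conv(iter(frame), box):
    print(out_row)
```

For a 3x3 window the buffer never holds more than three rows, regardless of frame height. In an HLS design the same structure would typically be mapped to on-chip block RAM.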

To learn more and contribute, I invite you to join the discussions at http://ossdc.org/

For fresh news about self driving cars, artificial intelligence and robotics, follow me on Twitter at @gtarobotics