Software Release

New! ScaleHLS+HIDA: Open Source

ScaleHLS is a High-level Synthesis (HLS) framework built on MLIR. It compiles HLS C/C++ or PyTorch models into optimized HLS C/C++, from which downstream tools such as AMD Vitis HLS generate high-efficiency RTL designs. Because MLIR lets optimizations be tuned to particular algorithms at different representation levels, ScaleHLS is more scalable and customizable for applications with intrinsic structural or functional hierarchies. On a set of neural networks modeled in PyTorch, ScaleHLS-generated hardware designs deliver up to 3825x higher performance than baseline designs that contain no pragma directives and are optimized only by Xilinx Vivado HLS. Furthermore, HIDA (ScaleHLS 2.0) achieves 8.54x higher throughput on average than ScaleHLS. Meanwhile, despite being fully automated and able to handle various applications, HIDA achieves 1.29x higher throughput than DNNBuilder, a state-of-the-art RTL-based neural network accelerator on FPGAs.

Download ScaleHLS+HIDA

New! ISDC: Open Source

ISDC is a novel feedback-guided iterative system of difference constraints (SDC) scheduling algorithm for high-level synthesis (HLS). ISDC leverages subgraph extraction-based low-level feedback from downstream tools like logic synthesizers to iteratively refine HLS scheduling. Technical innovations include: (1) An enhanced SDC formulation that effectively integrates low-level feedback into the linear-programming (LP) problem; (2) A fanout and window-based subgraph extraction mechanism driving the feedback cycle; (3) A no-human-in-loop ISDC flow compatible with a wide range of downstream tools and process design kits (PDKs). Evaluation shows that ISDC reduces register usage by 28.5% against an industrial-strength open-source HLS tool.
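
As a rough illustration of the idea behind SDC scheduling with feedback, the sketch below encodes a schedule as difference constraints of the form s[v] - s[u] >= d and re-solves it after the delay estimates change. It is not the ISDC implementation: the pairwise chaining rule, the delay numbers, and the function names are assumptions made for this example, and a Bellman-Ford style relaxation stands in for the LP solver mentioned above.

```python
def build_constraints(deps, delay_ns, clock_ns):
    """Turn data dependences into difference constraints (u, v, d): s[v] - s[u] >= d.

    Simplification (assumption): only pairwise chaining is checked, so v may
    share a cycle with u when delay_ns[u] + delay_ns[v] fits in one clock period.
    """
    constraints = []
    for u, v in deps:
        needs_register = delay_ns[u] + delay_ns[v] > clock_ns
        constraints.append((u, v, 1 if needs_register else 0))
    return constraints


def solve_sdc(num_ops, constraints):
    """Earliest non-negative start cycles satisfying every constraint, or None."""
    start = [0] * num_ops
    # Bellman-Ford style longest-path relaxation; a pass with no change means done.
    for _ in range(num_ops + 1):
        changed = False
        for u, v, d in constraints:
            if start[u] + d > start[v]:
                start[v] = start[u] + d
                changed = True
        if not changed:
            return start
    return None  # still changing: the constraints contain a positive cycle


deps = [(0, 1), (1, 2)]          # a three-operation dependence chain
pessimistic = [3.0, 3.0, 3.0]    # initial delay estimates in ns (made up)
refined = [2.0, 2.0, 3.5]        # estimates after downstream feedback (made up)
before = solve_sdc(3, build_constraints(deps, pessimistic, 5.0))  # [0, 1, 2]
after = solve_sdc(3, build_constraints(deps, refined, 5.0))       # [0, 0, 1]
```

In this toy case the refined delays let the first two operations share a cycle, so one fewer value has to be registered between stages.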

Download ISDC

New! PandoGen: Open Source

The ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this work, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Our novel method, called PandoGen, trains protein language models towards the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30× larger. Our method forecasts unseen lineages months in advance.

Download PandoGen

New! AccShield: Open Source

Machine learning accelerators such as the Tensor Processing Unit (TPU) are already being deployed in the hybrid cloud, and we foresee such accelerators proliferating in the future. In such scenarios, secure access to the acceleration service and the trustworthiness of the underlying accelerators become a concern. In this work, we present AccShield, a new method for extending trusted execution environments (TEEs) to cloud accelerators that takes both isolation and multi-tenancy into account. We demonstrate the feasibility of accelerator TEEs with a proof of concept on an FPGA board. Experiments with our prototype implementation also provide concrete results and insights for different design choices, such as link encryption, isolation using partitioning, and memory encryption.

Download AccShield

New! FSLAM: Open Source

Simultaneous Localization and Mapping (SLAM) is one of the main components of autonomous navigation systems. With the increase in popularity of drones, autonomous navigation on low-power systems is seeing widespread application. Most SLAM algorithms are computationally intensive and struggle to run in real-time on embedded devices with reasonable accuracy. We propose an FPGA-based SLAM system, named FSLAM, that accelerates the computationally intensive visual feature extraction and matching on hardware. FSLAM is based on a Zynq-family SoC and runs 8.5x, 1.55x and 1.35x faster compared to an ARM CPU, Intel Desktop CPU, and a state-of-the-art FPGA system respectively, while averaging a 2x improvement in accuracy compared to prior work on FPGA.

Download FSLAM

New! NimBlock: Open Source

This project focuses on enabling virtualization features to facilitate fine-grained FPGA sharing. We employ an overlay architecture which enables arbitrary, independent user logic to share portions of a single FPGA by dividing the FPGA into independently reconfigurable slots. We then explore scheduling possibilities to effectively time- and space-multiplex the virtualized FPGA. The Nimblock scheduling algorithm balances application priorities and performance degradation to improve response time and reduce deadline violations. We achieve up to 5.7× lower average response times when compared to a no-sharing and no-virtualization scheduling algorithm and up to 2.1× average response time improvement over competitive scheduling algorithms that support sharing within our virtualization environment.

Download NimBlock

New! HELLO: Open Source

HELLO is a new DNA variant calling tool, where we use novel DNN (Deep Neural Network) architectures and customized variant inference functions that account for the underlying nature of sequencing data. Our method allows vastly smaller DNNs to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. Our improved accuracy and problem-specific customization of DNN models could enable more accurate pipelines and further method development in the field. Available since 2021.

Download HELLO

Anand Ramachandran, Steven Lumetta, Eric Klee, and Deming Chen, “HELLO: Improved Neural Network Architectures and Methodologies for Small Variant Calling”, BMC Bioinformatics, 2021.

WinoCNN: Open Source

WinoCNN combines a systolic array with the fast Winograd algorithm for CNN acceleration. The system supports flexible convolution kernel sizes without sacrificing DSP efficiency, through algorithmic, architectural, and on-chip memory subsystem designs and optimizations. Overall, our accelerator delivers high throughput and state-of-the-art DSP efficiency compared to previous accelerator implementations. Available since 2021.
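
To show why the Winograd algorithm saves multipliers, here is a minimal sketch of the textbook 1D case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The tile size and function names are chosen for illustration only; the description above does not say which tile sizes WinoCNN uses.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap 1D convolution over a 4-element tile, 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform: depends only on the kernel, so it can be precomputed.
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2
    # Four element-wise multiplications on the transformed input tile.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_conv(d, g):
    """Reference sliding-window result using 6 multiplications."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


tile, kernel = [1, 2, 3, 4], [2, 4, 6]
fast = winograd_f23(tile, kernel)      # [28.0, 40.0]
reference = direct_conv(tile, kernel)  # [28, 40]
```

The saved multiplications are what matter on an FPGA, where each multiply typically occupies a DSP block while the extra additions map to ordinary logic.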

Download WinoCNN

Xinheng Liu, Yao Chen, Cong Hao, Ashutosh Dhar, and Deming Chen, “WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs”, Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors, July 2021.

TwinDNN: Open Source

The TwinDNN system pairs a high-accuracy, heavy-duty network with a low-latency, lightweight (e.g., highly compressed) network. Its hierarchical inference logic runs the high-accuracy network only when the prediction of the low-latency network is not considered confident. TwinDNN can recover up to 94% of the accuracy drop caused by extreme network compression, with more than 90% speedup. Available since 2021.
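
The hierarchical inference logic can be sketched as a confidence check on the fast network's output. This is a minimal illustration, not TwinDNN's code: the two models are hypothetical callables that return class probabilities, and the 0.9 threshold is an assumed value.

```python
def argmax(probs):
    """Index of the largest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def twin_infer(x, fast_model, accurate_model, threshold=0.9):
    """Return (label, which_model); fall back to the accurate model when unsure."""
    probs = fast_model(x)
    label = argmax(probs)
    if probs[label] >= threshold:
        return label, "fast"
    # The fast prediction is not confident enough: pay for the heavy network.
    return argmax(accurate_model(x)), "accurate"


# Hypothetical stand-ins for the two networks.
def fast_stub(x):
    return [0.95, 0.05] if x == "easy" else [0.55, 0.45]


def accurate_stub(x):
    return [0.10, 0.90]


easy_result = twin_infer("easy", fast_stub, accurate_stub)  # (0, "fast")
hard_result = twin_infer("hard", fast_stub, accurate_stub)  # (1, "accurate")
```

The threshold trades speed for accuracy: a higher value sends more inputs to the heavy network.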

Download TwinDNN

Hyunmin Jeong and Deming Chen, “TwinDNN: A Tale of Two Deep Neural Networks”, Proceedings of IEEE International Conference on Application-specific Systems, Architectures and Processors, July 2021.

SkyNet: Open Source

SkyNet is a new hardware-efficient DNN model specialized in object detection and tracking. SkyNet was developed based on the SkyNet Design Methodology to facilitate edge AI solutions, and demonstrated in the 56th IEEE/ACM Design Automation Conference System Design Contest (DAC-SDC), a low power object detection challenge for real-life unmanned aerial vehicle (UAV) applications. SkyNet won the First Place Award for both GPU and FPGA tracks of the contest in 2019. Available since 2019.

Download SkyNet

Xiaofan Zhang, Haoming Lu, Cong Hao, Jiachen Li, Bowen Cheng, Yuhong Li, Kyle Rupnow, Jinjun Xiong, Thomas Huang, Honghui Shi, Wen-Mei Hwu, and Deming Chen, “SkyNet: a Hardware-Efficient Method for Object Detection and Tracking on Embedded Systems,” Proceedings of Machine Learning and Systems (MLSys 2020), March 2020.

DNNBuilder: Open Source

This package provides a novel solution that automatically converts a Caffe-trained DNN to an FPGA RTL-level implementation without any hardware programming effort. It also provides uniform APIs for users' AI recognition tasks. Developers without any FPGA programming experience can deploy FPGA-accelerated deep learning services for both cloud and edge computing by providing only their trained Caffe model. The DNNBuilder paper won the IEEE/ACM William J. McCalla ICCAD Best Paper Award in 2018. Available since 2019.

Download DNNBuilder

Xiaofan Zhang, Junsong Wang, Chao Zhu, Yonghua Lin, Jinjun Xiong, Wen-mei Hwu, and Deming Chen, “DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs”, Proceedings of IEEE/ACM International Conference on Computer-Aided Design, November 2018. (Best Paper Award)

μL2Q: Open Source

This open-source package introduces an ultra-low loss quantization (μL2Q) method that provides DNN quantization schemes based on comprehensive quantitative data analysis. μL2Q transforms the original data into a data space with a standard normal distribution, and then finds the optimal parameters that minimize the quantization loss for a target bitwidth. Our method delivers consistent accuracy improvements over state-of-the-art quantization solutions at the same compression ratio. Available since 2019.
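
The two steps described above, standardizing the data and then choosing parameters that minimize the quantization loss, can be illustrated with a small sketch. Everything specific here is an assumption for illustration: the uniform mid-rise levels, the grid search over a step size, and the function names are not taken from the μL2Q package.

```python
import math
import statistics


def quantize(weights, bits, step):
    """Standardize, snap to 2**bits uniform levels of width `step`, map back."""
    mu = statistics.fmean(weights)
    sigma = statistics.pstdev(weights)  # assumes the weights are not all equal
    half = 2 ** (bits - 1)
    out = []
    for w in weights:
        z = (w - mu) / sigma                  # standardized value
        idx = math.floor(z / step)            # index of the bin containing z
        idx = max(-half, min(half - 1, idx))  # clamp to the available levels
        out.append((idx + 0.5) * step * sigma + mu)  # bin centre, original scale
    return out


def l2_loss(original, quantized):
    """Sum of squared quantization errors."""
    return sum((a - b) ** 2 for a, b in zip(original, quantized))


def best_step(weights, bits, candidates):
    """Pick the candidate step size with the smallest L2 quantization loss."""
    return min(candidates, key=lambda s: l2_loss(weights, quantize(weights, bits, s)))


weights = [-1.5, -0.5, 0.1, 0.4, 0.9, 2.0]   # made-up example data
candidates = [0.25, 0.5, 1.0, 2.0]           # made-up search grid
step = best_step(weights, 2, candidates)
quantized = quantize(weights, 2, step)
```

With 2 bits, every weight is replaced by one of at most four values, and the search simply keeps the step that distorts the data least.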

Download μL2Q

Cheng Gong, Ye Lu, Cong Hao, Xiaofan Zhang, Tao Li, Deming Chen, and Yao Chen, “μL2Q: An Ultra-Low Loss Quantization Method for DNN Compression,” Proceedings of International Joint Conference on Neural Networks (IJCNN), July 2019.

T-DLA: Open Source

T-DLA (Ternarized Deep Learning Accelerator) is an open-source microprocessor designed specifically for accelerating DNN models trained with ternarized weights. It is the first instruction-based DLA design targeting ternary-quantized weights. The T-DLA system delivers up to 0.4 TOPS at 2.58 W power consumption. On ImageNet with the ResNet-18 model, it is 873.6× and 5.1× faster than a Xeon E5-2630 CPU and an Nvidia 1080 Ti GPU, respectively. Available since 2019.
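
For readers unfamiliar with ternarized weights, the sketch below shows the general idea: each weight is restricted to -1, 0, or +1, so a dot product needs only additions and subtractions. The magnitude-threshold rule and the function names are assumptions for illustration; they are not taken from the T-DLA training flow or instruction set.

```python
def ternarize(weights, threshold):
    """Map each weight to -1, 0, or +1 using a magnitude threshold (assumed rule)."""
    return [0 if abs(w) < threshold else (1 if w > 0 else -1) for w in weights]


def ternary_dot(activations, ternary_weights):
    """Dot product with ternary weights: add, subtract, or skip. No multiplies."""
    acc = 0
    for a, t in zip(activations, ternary_weights):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
    return acc


tern = ternarize([0.8, -0.05, -0.6, 0.1], threshold=0.3)  # [1, 0, -1, 0]
result = ternary_dot([2, 5, 3, 7], tern)                  # 2 - 3 = -1
```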

Download T-DLA

Yao Chen, Kai Zhang, Cheng Gong, Cong Hao, Xiaofan Zhang, Tao Li, and Deming Chen, “T-DLA: An Open-source Deep Learning Accelerator for Ternarized DNN Models on Embedded FPGA,” Proceedings of IEEE Computer Society Annual Symposium on VLSI, July 2019.

DNN IP: Open Source

This IP package includes an open-source IP repository designed specifically for machine learning applications. The IPs include standard convolution IPs, depth-wise separable convolution IPs, pooling IPs, a bounding box regression IP, and a Long-term Recurrent Convolutional Network IP. Each IP comes with an introduction, an interface description, a description of inputs and outputs, parameter configuration, and resource and performance data. The IPs are developed in C/C++. The source code is synthesizable, and RTL code can be generated conveniently using Xilinx Vivado HLS. Available since 2019.

Download DNN IPs

Thanos: Open Source

This open-source package introduces Thanos, a fast graph partitioning tool which uses the cross-decomposition algorithm that iteratively partitions a graph. It also produces balanced loads of partitions. The algorithm is well suited for parallel GPU programming which leads to fast and high-quality graph partitioning solutions. Experimental results show that we have achieved a 30x speedup and 35% better edge cut reduction compared to the CPU version of the popular graph partitioning tool METIS on average. Available since 2019.

Download Thanos

Dae Hee Kim, Rakesh Nagi, and Deming Chen, “Thanos: High-Performance CPU-GPU Based Graph Partitioning Using Cross-Decomposition,” Proceedings of IEEE/ACM Asia and South Pacific Design Automation Conference, January 2020.

CLOUD-DNN: Open Source

Cloud-DNN is an open-source framework that maps DNN (deep neural network) models trained by Caffe to FPGAs in the cloud for inference acceleration. It takes the input *.prototxt DNN description, generates corresponding C++ network description, and then produces the final hardware accelerator IPs through high-level synthesis. The goal of Cloud-DNN is to provide more flexible and user-friendly DNN acceleration on cloud-FPGAs (e.g., AWS F1).

Download CLOUD-DNN

Yao Chen, Jiong He, Xiaofan Zhang, Cong Hao, and Deming Chen, “Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs”, Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays, February 2019.

RIP: Open Source

This open source project contains three inter-related software packages (fast software modeling, fast hardware modeling and design space exploration, and hardware/software co-design), for the ultimate task of automated near-optimal hardware/software partitioning targeting either sophisticated SoC designs or computing on heterogeneous systems.

Download RIP

W. Zuo, W. Kemmerer, J. B. Lim, L.-N. Pouchet, A. Ayupov, T. Kim, K. Han, and D. Chen, “A polyhedral-based SystemC modeling and generation framework for effective low-power design space exploration,” Proceedings of IEEE/ACM International Conference on Computer-Aided Design, November 2015. (Best Paper Award)
W. Kemmerer, W. Zuo, and D. Chen, "Parallel Code-Specific CPU Simulation with Dynamic Phase Convergence Modeling for HW/SW Co-Design", Proceedings of IEEE/ACM International Conference on Computer-Aided Design, November 2016.
W. Zuo, L.-N. Pouchet, A. Ayupov, T. Kim, C.-W. Lin, S. Shiraishi, and D. Chen, “Accurate High-level Modeling and Automated Hardware/Software Co-design for Effective SoC Design Space Exploration,” Proceedings of IEEE/ACM Design Automation Conference, June 2017.

FCUDA: Open Source

A source-to-source transformation framework that can take CUDA code, generate functionally equivalent synthesizable C code, and map to an FPGA implementation using high-level synthesis for high performance and energy-efficient reconfigurable computation.

Download FCUDA

T. Nguyen, Y. Chen, K. Rupnow, S. Gurumani, and D. Chen, "SoC, NoC and Hierarchical Bus Implementations of Applications on FPGAs Using the FCUDA Flow", Proceedings of IEEE Computer Society Annual Symposium on VLSI, July 2016.
Y. Chen, T. Nguyen, Y. Chen, S. T. Gurumani, Y. Liang, K. Rupnow, J. Cong, W.M. Hwu, and D. Chen, “FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs with the FCUDA Flow,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016.
T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen, “FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDA-to-FPGA Compiler,” Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays, February 2016.
Y. Chen, S. T. Gurumani, Y. Liang, G. Li, D. Guo, K. Rupnow, and D. Chen, “FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2015.
A. Papakonstantinou, K. Gururaj, J. Stratton, D. Chen, J. Cong, and W.M. Hwu, "Efficient Compilation of CUDA Kernels for High-Performance Computing on FPGAs," ACM Transactions on Embedded Computing Systems, Special Issue on Application-Specific Processors, Vol. 13, Issue 2, September 2013.
A. Papakonstantinou, D. Chen, W.M. Hwu, J. Cong, and Y. Liang, "Throughput-oriented Kernel Porting onto FPGAs," Proceedings of IEEE/ACM Design Automation Conference, June 2013.
S. Gurumani, K. Rupnow, Y. Liang, H. Cholakkail, and D. Chen, "High Level Synthesis of Multiple Dependent CUDA Kernels for FPGAs," Proceedings of IEEE/ACM Asia and South Pacific Design Automation Conference, January 2013. (Invited)
S. Gurumani, J. Tolar, Y. Chen, Y. Liang, K. Rupnow, and D. Chen, "Integrated CUDA-to-FPGA Synthesis with Network-on-Chip," Proceedings of IEEE International Symposium on Field-Programmable Custom Computing Machines, May 2014.
A. Papakonstantinou, K. Gururaj, J. Stratton, D. Chen, J. Cong, and W.M. Hwu, "FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs," Proceedings of IEEE Symposium on Application Specific Processors, July 2009. (Best Paper Award)
A. Papakonstantinou, Y. Liang, J. Stratton, K. Gururaj, D. Chen, W.M. Hwu and J. Cong, "Multilevel Granularity Parallelism Synthesis on FPGAs," Proceedings of IEEE International Symposium on Field-Programmable Custom Computing Machines, May 2011. (Best Paper Award)

H.264 High Level Synthesis Benchmark

Fully synthesizable H.264 Video Decoder code, which can be synthesized into RTL with high-level synthesis for FPGA implementation and achieve real-time decoding.

Download H.264 Benchmark

X. Liu, Y. Chen, T. Nguyen, S. Gurumani, K. Rupnow, and D. Chen, “High Level Synthesis of Complex Applications: An H.264 Video Decoder”, Proceedings of ACM/SIGDA International Symposium on Field Programmable Gate Arrays, February 2016.

TMDFET SPICE Models

SPICE transistor models of flexible Transition Metal Dichalcogenide Field-Effect Transistors, TMDFET.

Download TMDFET HSPICE Models

Download TMDFET Verilog-A Models

Y-Y Chen, M. Gholipour, and D. Chen, "Flexible Transition Metal Dichalcogenide Field-Effect Transistors: A Circuit-Level Simulation Study of Delay and Power under Bending, Process Variation, and Scaling," Proceedings of IEEE/ACM Asia and South Pacific Design Automation Conference, Jan. 2016.
M. Gholipour, Y.Y. Chen, and D. Chen, “Compact Modeling to Device- and Circuit-Level Evaluation of Flexible TMD Field-Effect Transistors,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume: 37, Issue: 4, Page(s): 820 - 831, April 2018.

GNRFET HSPICE Models

HSPICE transistor models of two types of Graphene Nano-Ribbon Field-Effect Transistors, MOS-GNRFET and SB-GNRFET.

Download GNRFET HSPICE Models

Y-Y. Chen, A. Rogachev, A. Sangai, G. Iannaccone, G. Fiori, and D. Chen, "A SPICE-Compatible Model of Graphene Nano-Ribbon Field-Effect Transistors Enabling Circuit-Level Delay and Power Analysis Under Process Variation," Proceedings of IEEE/ACM Design, Automation & Test in Europe, March 2013.
Y-Y. Chen, A. Sangai, M. Gholipour, and D. Chen, "Schottky-Barrier-Type Graphene Nano-Ribbon Field-Effect Transistors: A Study on Compact Modeling, Process Variation, and Circuit Performance," Proceedings of IEEE/ACM International Symposium on Nanoscale Architectures, July 2013.
Y-Y. Chen, A. Sangai, M. Gholipour, and D. Chen, "Graphene Nano-Ribbon Field-Effect Transistors as Future Low-Power Devices," Proceedings of IEEE/ACM International Symposium on Low Power Electronics and Design, September 2013. (Invited)
Y-Y Chen, A. Sangai, M. Gholipour, and D. Chen, "Effects of Process Variation on the Circuit-Level Performance of Graphene Nano-Ribbon Field-Effect Transistors," Workshop on Variability Modeling and Characterization, November 2013.
M. Gholipour, Y-Y. Chen, A. Sangai, and D. Chen, "Highly Accurate SPICE-Compatible Modeling for Single- and Double-Gate GNRFETs with Studies on Technology Scaling," Proceedings of IEEE/ACM Design, Automation & Test in Europe, March 2014.
M. Gholipour, Y.Y. Chen, A. Sangai, N. Masoumi, and D. Chen, “Analytical SPICE-Compatible Model of Schottky-Barrier-Type GNRFETs With Performance Analysis”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, March 2015.
Y.Y. Chen, A. Sangai, A. Rogachev, M. Gholipour, G. Iannaccone, G. Fiori, and D. Chen, “A SPICE-Compatible Model of MOS-Type Graphene Nano-ribbon Field-Effect Transistors enabling Gate- and Circuit-level Delay and Power Analysis under Process Variation,” IEEE Transactions on Nanotechnology, Volume 14, Issue 6, pp. 1068-1082, November 2015.

BLESS

Bloom-filter-based Error Correction Tool for NGS DNA reads.
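
For context, a Bloom filter is a compact bit array that answers "has this item been added?" with no false negatives and a small false-positive rate, which makes it a memory-efficient way to remember k-mers. The sketch below shows the general data structure only. The hashing scheme, the sizes, and the `suspect_kmers` helper are assumptions for illustration and do not reflect how BLESS selects or corrects k-mers.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def kmers(read, k):
    """All length-k substrings of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def suspect_kmers(read, k, trusted):
    """k-mers of `read` that were never added to the `trusted` filter."""
    return [km for km in kmers(read, k) if km not in trusted]


trusted = BloomFilter(num_bits=1024, num_hashes=3)
for km in kmers("ACGTACGTGA", 4):     # made-up trusted read
    trusted.add(km)
flagged = suspect_kmers("ACGTACGTGA", 4, trusted)  # [] since all were added
```

A k-mer reported as absent is definitely new, while a k-mer reported as present may occasionally be a false positive, so a larger bit array lowers the chance of missing an error.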

Download BLESS

Y. Heo, X-L. Wu, D. Chen, J. Ma, and W-M Hwu, "BLESS: Bloom-filter-based Error Correction Solution for High throughput Sequencing Reads," Bioinformatics, 2014, doi: 10.1093/bioinformatics/btu030.
Yun Heo, Anand Ramachandran, Wen-Mei Hwu, Jian Ma, Deming Chen, "BLESS 2: Accurate, memory-efficient, and fast error correction method," Bioinformatics, Volume 32, Issue 15, Pages 2369–2371, 1 August 2016. https://doi.org/10.1093/bioinformatics/btw146.