Parameterised automated generation of convolvers implemented in FPGAs

Ernest Jamro
 

Supervisor
dr hab. inż. Kazimierz Wiatr, prof. n. AGH

A DISSERATION SUBMITED TO
UNIVERSITY OF MINING AND METALLURGY
DEPARTMENT OF ELECTRONICS
IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

Kraków, Poland
June 2001


Home Page
Download the thesis
Abstract
Contents



Abstract
 Silicon technology now allows us to build chips consisting of tens of millions of transistors. As a result, more and more projects are constrained by the design time and complexity rather than available chip resources. This thesis describes a (C++ based) Automated Tool for generation 2-dimentional Convolvers (2D FIR filters) implemented in FPGAs (AuToCon). The AuToCon can automatically generates a VHDL description of a wide range of convolvers giving the list of parameters, such as: an input width, a convolution kernel size, coefficient values, a pipelining option, etc.
A novel synthesis approach has been proposed: the AuToCon does not assume any cost-relations between available memories, adders, multiplexers and flip-flops resources, these values are input parameters to the AuToCon. Even different memory types can be freely defined. Consequently, migration from one device family to another is rather effortless. Furthermore, within the same FPGA, cost-relations between different resources might differ and depend on the number of available resources (some resources might be already allocated by other designs incorporated into the same FPGA). Therefore, the AuToCon generates different circuits, i.e. allocates different resources, according to the cost-relations between the FPGA resources.
FPGAs, in comparison to ASICs, can be quickly reconfigured, therefore design functionality can be significantly improved by constant propagation through functional reconfiguration. In the course of this work, different architectures have been studied: constant coefficient architecture, where coefficient values are built-in the circuit. This architecture is the most hardware efficient, however any coefficient change requires the circuit to be redesigned. The second solution is variable coefficient option (usage of fully functional multipliers) which consumes much more chip-area, but coefficient can be changed without restrictions. There is also a mid-way solution for which coefficient is dynamically changed by employing in-circuit reconfiguration.
The AuToCon considers a wide range of possible architectures, employing sophisticated optimisation techniques such as exhaustive search, greedy algorithms, simulated annealing and genetic programming. These techniques have been employed e.g. to optimise the adders tree. As a result, the AuToCon does not only significantly reduces design time but also a generated circuit is, in most cases, more hardware efficient than a hand-crafted counterpart and comparable commercial solutions.
The greatest effort has been put into development of the AuToCon. Nevertheless this thesis presents a wide range of novel architectural solutions and algorithms, such as: a novel binary to Cannonic Sign Digit conversion algorithm, usage of different memory modules, implementation of dual port memories for Dynamic Constant Coefficient Multipliers and adaptive systems, extensive usage of Multiplierless Multiplication in FPGAs, advance optimisation techniques for LUT-based Multiplication, novel structure of Irregular Distributed Arithmetic Convolver, and the algorithm which trade-offs between multiplierless and LUT-based convolution.
In the course of this work, implementation of the convolver on different architectures, such as general-purpose processors, DSPs, dedicated VLSI convolvers and FPGAs, has been presented. As a result, FPGA implementation usually outperforms the other solutions, and the developed synthesis tool significantly reduces design time and hardware requirements of a convolver. In conclusion, as convolution or similar operations (e.g. a sum-of-products) are fundamental operations in most digital signal processing systems, this work might be a crucial contribution in electronic digital designs.
 


Contents
GLOSSARY OF TERMS 7
THESIS 10
1. INTRODUCTION 11
1.1. Convolution Operation 11
1.2. Design Automated Tools 12
1.3. Overview of the thesis 14
2. DIFFERENT MACHINES IMPLEMENTING CONVOLUTION 15
2.1. General purpose processors 15
2.1.1. Loop unrolling 16
2.1.2. Superscalar architecture 17
2.1.3. Very Long Instruction Word (VLIW) 20
2.1.4. SIMD 21
2.1.5. Implementation results 23
2.2. Digital Signal Processors (DSPs) 24
2.2.1. Parallel Processors 24
2.2.2. DSP TMS320C80 28
2.2.3. TigerSHARC 30
2.3. Dedicated VLSI Convolution Processors 32
2.4. Field Programmable Gate Arrays (FPGAs) 34
2.5. Conclusions 36
3. CONSTANT COEFFICIENT MULTIPLICATION (KCM) 39
3.1. Multiplierless multiplication (MM) 40
3.1.1. Canonic Signed Digit Representation 40
3.1.2. Modified algorithm for conversion to the CSD representation 42
3.1.3. Substructure sharing 44
3.1.4. Experimental results 44
3.2. LUT based Multiplication (LM) 46
3.2.1. Concept 46
3.2.2. Implementation in FPGAs 47
3.3. Comparison of the multipliers 52
3.3.1. Area 52
3.3.2. Speed 53
3.4. Conclusions 54
4. ARCHITECTURES OF MULTIPLIERS 56
4.1. Dynamic Constant Coefficient Multiplier (DKCM) 57
4.2. Memory Multiplexers 58
4.3. RAM programming unit (RPU) 58
4.4. Implementation results for the DKCM 60
4.5. Implementation of the DKCM versus the KCM 64
4.6. Implementation of the DKCM versus the VCM 65
4.7. Conclusions 69
5. CONVOLUTION IN FPGAS 71
5.1. Previous Works 71
5.2. Symmetry of Convolution Coefficients 73
5.3. LUT based Convolver (LC) 74
5.3.1. Concept 74
5.3.2. Constant coefficients LUT based Convolver (KLC) 76
5.3.3. Dynamic Constant coefficients LUT based Convolver (DKLC) 77
5.4. Distributed Arithmetic Convolver (DAC) 78
5.4.1. Concept 78
5.4.2. Irregular Distributed Arithmetic Convolver (IDAC) 79
5.5. Multiplierless Convolution (MC) 81
5.5.1. Substructure Sharing (SS) 81
5.6. IDAC versus MC 84
5.7. Implementation Results 87
5.7.1. Canonic Sign Digit (CSD) conversion 87
5.7.2. Sub-structure sharing (SS) 87
5.7.4. Irregular Distributed Arithmetic Convolver 91
5.7.5. Approximated coefficients� cost for the MC and IDAC 92
5.7.6. MC vs. IDAC Algorithm 95
5.8. Conclusions 98
6. OPTIMISATION OF THE ADDERS TREE 99
6.1. Implementation of adders in FPGAs 99
6.2. Addition parameters 101
6.2.1. Input parameters 101
6.2.2. Correlation between inputs 101
6.2.3. Summary of the input parameters 105
6.2.4. Addition tree structure 106
6.2.5. Filters Example 107
6.3. Greedy algorithm 108
6.4. Exhaustive search 109
6.4.1. Concept 109
6.4.2. Constrained Search (CS) 110
6.4.3. Implementation Results 111
6.5. Simulated Annealing (SA) 112
6.5.1. Principle 112
6.5.2. Implementation results 114
6.6. Genetic Programming (GP) 115
6.6.1. Encoding scheme 115
6.6.2. Fitness evaluation 116
6.6.3. Selection 116
6.6.4. Crossover 116
6.6.5. Mutation 120
6.7. Implementation results 121
6.8. Conclusions 122
7. CONCLUSIONS 123
APPENDIX A. BRIEF DESCRIPTION OF THE AUTOCON 128
REFERENCES 137