Proceedings of the Institute for System Programming of the RAS (Oct 2018)
Implementation of Loop Pipelining and Assignment Inlining in the C-to-HDL Translator
Abstract
Implementing algorithms for field-programmable gate arrays using a hardware description language is a complex task. Therefore, it would be useful to have a tool that can efficiently translate an algorithm from a high-level language to a hardware description language. In this paper we consider the C-to-HDL translator, which translates C functions into Verilog modules, the translation process, and two important optimizations implemented at the hardware description level: assignment inlining and loop pipelining. The basic structure and ABI of C-to-HDL follow those of the C-to-Verilog translator, which inspired us to create our tool. We also use the LLVM infrastructure for translating the LLVM bitcode representation into Verilog code. Simple arithmetic operations are executed within a single cycle. For complex arithmetic operations and loads/stores from/to memory, first, a set of assignments loading the instruction operands, the memory address, and the memory access mode (read or write) is generated and placed on the first cycle of the instruction's execution; second, a final assignment transferring the operation result to a virtual register is generated.

The following optimizations are implemented and greatly improve execution performance. Instruction scheduling is performed, taking into account that the number of instructions executed in parallel is limited only by the free chip space of the FPGA, while memory operations are limited by the number of memory channels and blocks on the FPGA. Where possible, assignments to temporary registers are inlined to form more complex operations, without producing overly long dependence chains that would limit the FPGA clock frequency. Finally, software pipelining following the usual modulo scheduling scheme is also performed, with the same relaxed resource constraints as described for instruction scheduling.

Experimental results demonstrate that these optimizations significantly improve the performance of the generated code (up to 4x).
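To make the two-step scheme for memory operations concrete, the Verilog sketch below shows how a generated finite state machine might drive a load: the first control step sets the memory address and the access mode, and the next step copies the returned data into a virtual register. This is a minimal illustration under assumed signal names (mem_addr, mem_we, mem_rdata, reg_v1), not the actual output of C-to-HDL.

module load_example (
  input  wire        clk,
  input  wire        rst,
  input  wire [31:0] base_addr,   // pointer argument of the C function
  input  wire [31:0] mem_rdata,   // data returned by the memory block
  output reg  [31:0] mem_addr,    // address driven to the memory block
  output reg         mem_we,      // access mode: 0 = read, 1 = write
  output reg  [31:0] reg_v1       // virtual register receiving the result
);
  reg [1:0] state;

  always @(posedge clk) begin
    if (rst) begin
      state <= 2'd0;
    end else begin
      case (state)
        // First cycle of the load: drive the address and the access mode.
        2'd0: begin
          mem_addr <= base_addr;
          mem_we   <= 1'b0;
          state    <= 2'd1;
        end
        // Final assignment: move the loaded word into the virtual register.
        2'd1: begin
          reg_v1 <= mem_rdata;
          state  <= 2'd2;
        end
        default: state <= state;  // done
      endcase
    end
  end
endmodule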
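The effect of assignment inlining can be illustrated by the hypothetical before/after pair below (signal names are ours, not generated by the tool). The first module spends two control steps and an extra temporary register on the sum of three values; after inlining, the temporary is folded into a single assignment computed in one cycle, at the cost of a longer combinational chain, which is why the optimization must be limited by the dependence-chain length.

module sum3_before (
  input  wire        clk,
  input  wire [31:0] a, b, c,
  output reg  [31:0] t2
);
  reg [31:0] t1;
  always @(posedge clk) begin
    t1 <= a + b;        // step 1: partial sum kept in a temporary register
    t2 <= t1 + c;       // step 2: uses the previous cycle's t1
  end
endmodule

module sum3_after (
  input  wire        clk,
  input  wire [31:0] a, b, c,
  output reg  [31:0] t2
);
  always @(posedge clk) begin
    t2 <= (a + b) + c;  // temporary inlined into one single-cycle assignment
  end
endmodule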
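Finally, a rough sketch of a pipelined loop body with an initiation interval of one cycle: while iteration i is being multiplied, the product of iteration i-1 is accumulated, so a new iteration starts every cycle. Again, this is a hand-written illustration of the modulo scheduling idea under assumed names, not code emitted by the translator.

module mac_pipelined (
  input  wire        clk,
  input  wire        rst,
  input  wire        in_valid,    // a new loop iteration enters this cycle
  input  wire [15:0] a_i, b_i,    // loop-body inputs for iteration i
  output reg  [39:0] acc          // loop-carried accumulator
);
  reg [31:0] prod;                // pipeline register between the two stages
  reg        prod_valid;

  always @(posedge clk) begin
    if (rst) begin
      prod_valid <= 1'b0;
      acc        <= 40'd0;
    end else begin
      // Stage 1: multiply the operands of iteration i.
      prod       <= a_i * b_i;
      prod_valid <= in_valid;
      // Stage 2 (same cycle): accumulate the product of iteration i-1.
      if (prod_valid)
        acc <= acc + prod;
    end
  end
endmodule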