Department of Electrical and Computer Engineering
323 Benton Hall, University of Florida
Interests: Reconfigurable computing, FPGAs, GPUs, synthesis, compilers, CAD, architecture, embedded systems
Elastic computing (not to be confused with Amazon's Elastic Compute Cloud) is an optimization framework for multi-core heterogeneous systems that enables mainstream designers to take advantage of accelerators such as GPUs, FPGAs, and multi-core processors more effectively. From a designer's point of view, elastic computing provides a library of specialized elastic functions. Unlike traditional functions, however, elastic functions contain a knowledge base of different algorithms, optimizations, and parallelizations for the given function. This knowledge base enables the elastic computing framework to transparently optimize a function for different runtime conditions, while parallelizing the implementation across available resources.
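The idea of a function that carries several interchangeable implementations can be illustrated with a small sketch. This is not the elastic computing framework itself: the `ElasticFunction` class, the `register` and `select` methods, and the two cost models are hypothetical names and numbers chosen only for illustration, and the sketch selects among sequential sorts without parallelizing across resources.

```python
import math
from typing import Callable, List, Tuple

Impl = Callable[[List[int]], List[int]]
CostModel = Callable[[int], float]


def insertion_sort(data: List[int]) -> List[int]:
    """Simple quadratic sort with low overhead on short inputs."""
    out = list(data)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge_sort(data: List[int]) -> List[int]:
    """O(n log n) sort with higher fixed overhead."""
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    left, right = merge_sort(data[:mid]), merge_sort(data[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


class ElasticFunction:
    """Hypothetical knowledge base: implementations plus cost estimates."""

    def __init__(self) -> None:
        self._candidates: List[Tuple[str, CostModel, Impl]] = []

    def register(self, label: str, cost: CostModel, impl: Impl) -> None:
        self._candidates.append((label, cost, impl))

    def _best(self, n: int) -> Tuple[str, CostModel, Impl]:
        # Pick the candidate with the lowest estimated cost for input size n.
        return min(self._candidates, key=lambda c: c[1](n))

    def select(self, n: int) -> str:
        return self._best(n)[0]

    def __call__(self, data: List[int]) -> List[int]:
        return self._best(len(data))[2](data)


# The cost models below are made-up estimates, not measured data.
elastic_sort = ElasticFunction()
elastic_sort.register("insertion", lambda n: n * n / 4, insertion_sort)
elastic_sort.register("merge", lambda n: n * math.log2(max(n, 2)) + 50, merge_sort)
```

A caller simply invokes `elastic_sort(values)`; which implementation runs depends on the input size, which stands in here for the runtime conditions the framework would consider.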
This work is supported by the National Science Foundation grant CNS-0914474. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Despite significant performance and energy advantages for important application domains, FPGAs have limited usage due to low application-design productivity compared to CPUs and GPUs. Two main sources of low productivity are long compile times, often requiring hours or even days, and a lack of application portability that prevents design reuse. Intermediate fabrics address these problems via virtual reconfigurable architectures implemented atop FPGAs. By matching virtual resource granularity to application requirements, intermediate fabrics can perform place and route orders of magnitude faster than commercial tools. Furthermore, by hiding the underlying FPGA from the application, intermediate fabrics enable application portability across potentially any FPGA with enough resources to implement the intermediate fabric.
This work is supported by National Science Foundation grant CNS-1149285 and the I/UCRC Program of the National Science Foundation under Grant No. EEC-0642422.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Warp processors are hybrid SoC (system on a chip) devices that dynamically optimize software by synthesizing hardware implemented in an on-chip FPGA. From a software developer's point of view, a warp processor initially executes an application like any other microprocessor, but after some period of time the application transparently executes more efficiently, with improved performance and reduced energy. This transparency allows for synthesis to be integrated into any existing application development tool flow, allowing developers to use their existing languages and compilers. Warp processors completely hide synthesis from software developers, who often avoid hardware design due to the difficult and time-consuming process of register-transfer level specification. Also, the dynamic nature of warp processing enables dynamic optimizations not possible in existing static approaches, such as phase-based optimizations.
To perform synthesis at runtime, warp processors have a
specialized architecture capable of profiling the executing software, decompiling
computation kernels, synthesizing the decompiled kernels, and then mapping,
placing, and routing the kernels into an on-chip FPGA. The main challenge in
the design of warp processors is the design of these CAD tools, which must run
in an on-chip environment - a difficult task considering that these tools
normally run on powerful workstations. We have implemented a complete on-chip CAD tool
flow that executes in just a few seconds on an ARM microprocessor, resulting
in a hardware/software system that is often 10x faster than software
execution. We are currently extending warp processors to handle multithreaded
applications, by synthesizing custom accelerators for executing threads. Early
results show that multithreaded warp processing can achieve more than 100x
speedups compared to software execution on multi-core systems with
up to 64 cores.
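As a rough illustration of the profiling step, the sketch below finds the most frequently executed loops in a trace of taken branches by counting backward branches per target address. This is an assumption about how such a profiler could work, not a description of the warp processor's actual profiler; the function name `hottest_loops` and the trace format are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hottest_loops(
    branches: Iterable[Tuple[int, int]], top: int = 1
) -> List[Tuple[int, int]]:
    """Return the `top` loop-head addresses with their iteration counts.

    `branches` is a hypothetical trace of taken branches, each given as
    (branch_address, target_address).
    """
    counts: Counter = Counter()
    for src, dst in branches:
        # A taken branch to an earlier or equal address closes a loop
        # iteration, so its target is treated as a loop head.
        if dst <= src:
            counts[dst] += 1
    return counts.most_common(top)


# Made-up trace: one hot loop, one cold loop, and some forward branches.
trace = [(0x40, 0x20)] * 100 + [(0x90, 0x80)] * 5 + [(0x10, 0x50)] * 3
kernel_candidates = hottest_loops(trace, top=2)
```

In this made-up trace, the loop headed at address `0x20` dominates execution, so it would be the first candidate kernel to hand to the later decompilation and synthesis steps.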
Much of my research has focused on one of the enabling technologies of warp
processors - synthesis from software binaries. Because the dynamic synthesis performed by warp processors must
operate on a software binary rather than high-level code, the resulting
hardware can potentially be much slower than hardware synthesized from high-level code, due
to the loss of high-level information during software compilation. To make
synthesis from software binaries feasible, I have adapted existing decompilation techniques
and introduced new
techniques to recover high-level information needed for effective synthesis.
By using these techniques, I have shown that for many applications, including
a commercial H.264 decoder,
synthesis from software binaries can in fact achieve results similar, or even
identical, to those of high-level synthesis approaches. Synthesis from
software binaries can also be used independently of warp processors, providing
similar transparency advantages for desktop CAD, in addition to supporting
synthesis of library code, legacy code, and hand-optimized assembly.