You will again be working in groups for this project.
Turn in the project using the e-learning website (there is a link on the main course page). Please zip all files (DimeTalk project, C files, VHDL, bitfile, readme.txt explaining the status of the lab, and a screenshot of the output) and attach the archive to your submission. Please do not turn in other files that are generated during synthesis. Feel free to turn in files used for testing (testbenches, etc.).
You are welcome to use the provided glue logic entity. Note that it has to be modified slightly for the control signals required by this lab.
In this lab, you will be implementing a fully pipelined accumulator circuit. The circuit will utilize one blockRAM to continually feed 4 inputs per cycle into the accumulator, and one blockRAM to store an output each cycle. In software, you will initially transfer data from the microprocessor into the input blockRAM, start the accumulator circuit, and then wait for completion, at which point the software will read data from the output blockRAM and output it to the screen.
The pseudo-code you will be implementing on the FPGA is shown below. Note that your actual code will look nothing like this; it is simply intended to help you understand the functionality.
    #define OUTPUT_SIZE 512
    #define INPUT_SIZE (OUTPUT_SIZE*4)

    unsigned char b[INPUT_SIZE];
    unsigned int a[OUTPUT_SIZE];
    int i, j;

    for (i=0, j=0; j < OUTPUT_SIZE; i += 4, j++) {
        a[j] = b[i] + b[i+1] + b[i+2] + b[i+3];
    }
For the first part of the lab, you will be describing the following accumulator datapath:
As shown, the circuit takes 4 8-bit inputs and accumulates them into a single 10-bit output. The green boxes are registers. The control details are not shown, but you may do anything you want as long as the circuit is implemented as a pipeline. The simplest control would be an enable on each register.
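To reason about pipeline behavior before writing any VHDL, a small cycle-level model in C can help. The sketch below assumes a two-stage adder tree (two pairwise sums registered in stage 1, the final sum registered in stage 2) with a single shared enable. The stage count, the register placement, and the names `acc_pipe_t` and `acc_pipe_clock` are assumptions made for illustration; follow the datapath figure for the real structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a two-stage adder-tree pipeline.
   The register placement is an assumption, not the lab's figure. */
typedef struct {
    uint16_t sum01;  /* stage-1 register: in0 + in1 (needs 9 bits)  */
    uint16_t sum23;  /* stage-1 register: in2 + in3 (needs 9 bits)  */
    uint16_t total;  /* stage-2 register: final sum (needs 10 bits) */
} acc_pipe_t;

/* Advance the model by one clock edge. When en is 0, all registers hold. */
static void acc_pipe_clock(acc_pipe_t *p, int en,
                           uint8_t in0, uint8_t in1,
                           uint8_t in2, uint8_t in3)
{
    assert(p != NULL);
    if (!en) {
        return;
    }
    /* Update stage 2 from the OLD stage-1 values before overwriting them,
       which mimics all registers loading on the same clock edge. */
    p->total = (uint16_t)(p->sum01 + p->sum23);
    p->sum01 = (uint16_t)(in0 + in1);
    p->sum23 = (uint16_t)(in2 + in3);
}
```

In this model a set of inputs presented on one clock edge appears in `total` after the following edge, while a new set of inputs can be accepted on every edge. That is the property the lab asks for: one result per cycle once the pipeline is full.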
The overall circuit (VHDL and DimeTALK) you will be implementing is shown below:
The overall circuit consists of 2 blockRAMs (each needs to be large enough to store 512 32-bit words), one which feeds data into the accumulator and one which stores data from the accumulator. The address generators control each blockRAM, essentially producing a new address each cycle. The controller waits for a go signal from the microprocessor (implemented using a memory map), reads in a data size (also implemented using a memory map), enables the input address generator, enables the datapath, and then waits the appropriate number of cycles before enabling the output address generator. Upon completion, the circuit should assert a done signal (implemented using a memory map) that allows the microprocessor to continue execution. The control signals are intentionally vague, which should give you some flexibility on implementation options. As long as the circuit feeds 4 new inputs to the datapath each cycle, you can implement the control however you like.
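One way to think about "waiting the appropriate number of cycles" is a shift register of valid bits that travels alongside the data. The C sketch below models only that timing, not the data. `PIPE_LATENCY` is a hypothetical value standing in for the combined BRAM read and datapath latency of your design, and `ctrl_result_t` and `run_controller` are illustrative names, not part of the provided glue logic.

```c
#include <assert.h>

/* Assumed number of cycles between issuing an input address and the
   matching sum being ready to write. Hypothetical; depends on your design. */
#define PIPE_LATENCY 3

typedef struct {
    unsigned in_addr;   /* next input BRAM address              */
    unsigned out_addr;  /* next output BRAM address             */
    unsigned cycles;    /* cycles elapsed since go was asserted */
    int done;           /* set once all outputs are written     */
} ctrl_result_t;

/* Simulate the controller for `size` input words after go is asserted. */
static ctrl_result_t run_controller(unsigned size)
{
    ctrl_result_t r = {0, 0, 0, 0};
    unsigned valid = 0;  /* valid-bit shift register, bit 0 is the newest */

    assert(PIPE_LATENCY >= 1 && PIPE_LATENCY < 32);

    while (r.out_addr < size) {
        /* A valid bit leaving the delay line enables the output
           address generator for this cycle. */
        if ((valid >> (PIPE_LATENCY - 1)) & 1u) {
            r.out_addr++;
        }
        /* The input address generator runs until all words are issued. */
        unsigned in_en = (r.in_addr < size) ? 1u : 0u;
        valid = ((valid << 1) | in_en) & ((1u << PIPE_LATENCY) - 1u);
        if (in_en) {
            r.in_addr++;
        }
        r.cycles++;
    }
    r.done = 1;
    return r;
}
```

Under these assumptions the run takes `size + PIPE_LATENCY` cycles in total, and the done signal is raised only after the last output address has been produced.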
Notice that the datapath takes 4 8-bit inputs and has one 10-bit output. However, the DimeTALK BRAM node only has 32-bit inputs/outputs. Therefore, to get 4 inputs to the datapath each cycle, you must divide each 32-bit input into 4 8-bit inputs. Similarly, you must concatenate 22 0's onto each output before writing it to the BRAM.
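The split and zero-extension described above can be illustrated in C. The sketch below is only a software model of what the VHDL slicing and concatenation would do; the helper names `lane` and `accumulate_word` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Extract byte lane n (0 = least significant) from a 32-bit BRAM word. */
static uint8_t lane(uint32_t word, unsigned n)
{
    assert(n < 4);
    return (uint8_t)((word >> (8u * n)) & 0xFFu);
}

/* Sum the four 8-bit lanes of one input word. The largest possible sum is
   4 * 255 = 1020, which fits in 10 bits, so the upper 22 bits of the
   32-bit word written to the output BRAM are always zero. */
static uint32_t accumulate_word(uint32_t word)
{
    uint32_t sum = (uint32_t)lane(word, 0) + lane(word, 1)
                 + lane(word, 2) + lane(word, 3);
    return sum & 0x3FFu;  /* 10-bit result, zero-padded to 32 bits */
}
```

For example, the word `0xC3B1D2B1` holds the bytes 195, 177, 210 and 177, and `accumulate_word` returns 759 for it.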
Main Challenges:
Suggestions:
Your C code will initialize the input blockRAM, start the accumulator circuit, wait for the circuit to complete, and then read the data from the output blockRAM. You should use an input size of 512 32-bit words (512*4 8-bit inputs).
One challenge here is that you can only transfer a 32-bit word to the FPGA, but the circuit needs 4 8-bit inputs. Therefore, you must pack 4 inputs into each 32-bit word that you transfer to the blockRAM. You can easily do this with bit manipulation operators in C (|, <<, &). An example is shown below. Feel free to use this code in your design.
    unsigned int dat1[512];
    int i;

    /* rand() is declared in <stdlib.h> */
    for (i=0; i<512; i++) {
        dat1[i] = ((rand() & 0xff) << 24) |
                  ((rand() & 0xff) << 16) |
                  ((rand() & 0xff) << 8)  |
                  ((rand() & 0xff));
    }
This code initializes each 32-bit element of the input array with 4 8-bit random numbers. You may perform initialization with different code if you like, but you should fill all the inputs with 8-bit random numbers. To test the circuit, the C code should also perform the accumulate in software, so you can easily compare the output of the FPGA to the software output. After performing the accumulate in software and on the FPGA, you should output to the screen the four numbers that were added, their result in software, and their result in hardware. You should also report the execution time for the software code and the hardware code, and the corresponding speedup. For the hardware execution time, start the timer as soon as you enable the circuit (i.e., do not include the time to transfer the data to the blockRAM). The output should look something like this:
    195+177+210+177 => Sw Result: 759, Hw Result: 759
    51+229+151+29 => Sw Result: 460, Hw Result: 460
    153+227+81+150 => Sw Result: 611, Hw Result: 611
    226+86+64+208 => Sw Result: 584, Hw Result: 584
    220+202+211+41 => Sw Result: 674, Hw Result: 674
    184+131+124+218 => Sw Result: 657, Hw Result: 657
    116+212+31+179 => Sw Result: 538, Hw Result: 538
    52+206+25+119 => Sw Result: 402, Hw Result: 402
    0+107+41+51 => Sw Result: 199, Hw Result: 199
    208+192+207+106 => Sw Result: 713, Hw Result: 713
    36+160+128+7 => Sw Result: 331, Hw Result: 331
    246+64+87+212 => Sw Result: 609, Hw Result: 609
    11+44+125+67 => Sw Result: 247, Hw Result: 247
    175+249+30+36 => Sw Result: 490, Hw Result: 490
    207+188+215+131 => Sw Result: 741, Hw Result: 741
    140+112+250+12 => Sw Result: 514, Hw Result: 514
    91+36+64+44 => Sw Result: 235, Hw Result: 235
    100+143+150+136 => Sw Result: 529, Hw Result: 529
    Sw Execution Time: 0.007774
    FPGA Execution Time: 0.003636
    FPGA Speedup = 2.138164
Note: The reported performance numbers were for a very large input. You will likely not see this much improvement.
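The software side of the test (reference accumulate, comparison against the FPGA results, and timing) could be organized along the following lines. This is a minimal sketch: the helper names are made up for illustration, the transfer to and from the blockRAMs is deliberately left out because it depends on the DimeTalk software interface, and the use of `clock()` from `<time.h>` is an assumption. `clock()` measures processor time, so you may prefer a different timer on your platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Software reference: sum the four bytes packed into each 32-bit word. */
static void sw_accumulate(const uint32_t *in, uint32_t *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t w = in[i];
        out[i] = ((w >> 24) & 0xFFu) + ((w >> 16) & 0xFFu)
               + ((w >> 8) & 0xFFu) + (w & 0xFFu);
    }
}

/* Count positions where the software and hardware results differ. */
static size_t count_mismatches(const uint32_t *sw, const uint32_t *hw, size_t n)
{
    assert(sw != NULL && hw != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (sw[i] != hw[i]) {
            bad++;
        }
    }
    return bad;
}

/* Convert two clock() samples into elapsed seconds. */
static double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A typical use would sample `clock()` immediately before and after `sw_accumulate`, do the same around the start-and-wait-for-done sequence on the FPGA, and report the speedup as the software time divided by the hardware time. A mismatch count of zero from `count_mismatches` indicates that the hardware agrees with the software reference for every word.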