Lab 4 - Fully-Pipelined Accumulator (Due October 27, 11:59 PM)

You will again be working in groups for this project.

Turn in the project using the e-learning website (there is a link on the main course page). Please zip all files (DimeTalk project, C files, VHDL, bitfile, readme.txt explaining the status of the lab, and a screenshot of the output) and attach the zip file to your submission. Please do not turn in other files that are generated during synthesis. Feel free to turn in files used for testing (testbenches, etc.).

You are welcome to use the provided glue logic entity. Note that it has to be modified slightly for the control signals required by this lab.

Introduction

In this lab, you will be implementing a fully pipelined accumulator circuit. The circuit will utilize one blockRAM to continually feed 4 inputs per cycle into the accumulator, and one blockRAM to store an output each cycle. In software, you will initially transfer data from the microprocessor into the input blockRAM, start the accumulator circuit, and then wait for completion, at which point the software will read data from the output blockRAM and output it to the screen.

The pseudo-code you will be implementing on the FPGA is shown below. Note that your actual code will look nothing like this; it is simply intended to help you understand the functionality.

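#define OUTPUT_SIZE 512
#define INPUT_SIZE (OUTPUT_SIZE * 4)   /* parenthesized so the macro expands safely */
unsigned char b[INPUT_SIZE];           /* 8-bit inputs */
unsigned int a[OUTPUT_SIZE];           /* accumulated outputs */
int i, j;

/* each output is the sum of 4 consecutive inputs */
for (i = 0, j = 0; j < OUTPUT_SIZE; i += 4, j++) {
  a[j] = b[i] + b[i+1] + b[i+2] + b[i+3];
}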

Part 1 - VHDL

For the first part of the lab, you will be describing the following accumulator datapath:

As shown, the circuit takes 4 8-bit inputs and accumulates them into a single 10-bit output. The green boxes are registers. The control details are not shown, but you may do anything you want as long as the circuit is implemented as a pipeline. The simplest control would simply be an enable to each register.
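If it helps to reason about the pipeline before writing VHDL, the datapath can be modeled in C one clock edge at a time. The sketch below is only a model under assumptions: it assumes a two-level adder tree with a register after each adder level (so a latency of 2 cycles), which may not match the figure exactly. The names acc_pipe_t and acc_pipe_clock are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pipeline state: one field per register in the model. */
typedef struct {
    uint16_t s1_lo;   /* stage 1: in0 + in1 (fits in 9 bits)  */
    uint16_t s1_hi;   /* stage 1: in2 + in3 (fits in 9 bits)  */
    uint16_t s2_sum;  /* stage 2: final sum (fits in 10 bits) */
} acc_pipe_t;

/* Advance the model by one clock edge.
 * Stage 2 is computed from the OLD stage 1 values before stage 1 is
 * overwritten, which mimics all registers loading on the same edge. */
void acc_pipe_clock(acc_pipe_t *p,
                    uint8_t in0, uint8_t in1, uint8_t in2, uint8_t in3)
{
    assert(p != NULL);
    p->s2_sum = (uint16_t)(p->s1_lo + p->s1_hi);
    p->s1_lo  = (uint16_t)(in0 + in1);
    p->s1_hi  = (uint16_t)(in2 + in3);
}
```

In this model, a set of inputs applied on one clock edge appears in s2_sum after the following edge, and a new set of inputs can be applied on every edge. The largest possible sum is 4*255 = 1020, which is why 10 bits are enough for the output.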

The overall circuit (VHDL and DimeTALK) you will be implementing is shown below:

The overall circuit consists of 2 blockRAMs (each needs to be large enough to store 512 32-bit words), one which feeds data into the accumulator and one which stores data from the accumulator. The address generators control each blockRAM, essentially producing a new address each cycle. The controller waits for a go signal from the microprocessor (implemented using a memory map), reads in a data size (also implemented using a memory map), enables the input address generator, enables the datapath, and then waits the appropriate number of cycles before enabling the output address generator. Upon completion, the circuit should assert a done signal (implemented using a memory map) that allows the microprocessor to continue execution. The control signals are intentionally vague, which should give you some flexibility on implementation options. As long as the circuit feeds 4 new inputs to the datapath each cycle, you can implement the control however you like.

Notice that the datapath takes 4 8-bit inputs and has one 10-bit output. However, the DimeTALK BRAM node only has 32-bit inputs/outputs. Therefore, to get 4 inputs to the datapath each cycle, you must divide each 32-bit input into 4 8-bit inputs. Similarly, you must concatenate 22 0's onto each output before writing it to the BRAM.
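The splitting and zero-extension described above can be written out in C as a reference for what the VHDL slicing and concatenation should produce. This is a sketch, not required code. The byte order is an assumption (in[0] is taken from bits 31..24, matching the packing example in Part 2), and the function names split_word and extend_sum are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split one 32-bit BRAM word into four 8-bit datapath inputs.
 * Assumption: in[0] comes from the most significant byte. */
void split_word(uint32_t word, uint8_t in[4])
{
    assert(in != NULL);
    in[0] = (uint8_t)(word >> 24);
    in[1] = (uint8_t)(word >> 16);
    in[2] = (uint8_t)(word >> 8);
    in[3] = (uint8_t)word;
}

/* Zero-extend a 10-bit sum to a 32-bit BRAM word:
 * the upper 22 bits are 0, the lower 10 bits hold the sum. */
uint32_t extend_sum(uint16_t sum10)
{
    return (uint32_t)(sum10 & 0x3FFu);
}
```

For example, the word 0xC3B1D2B1 splits into 195, 177, 210, and 177. Since addition is commutative, the byte order does not change the sum, but it does matter when you print the four inputs next to the result.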

Main Challenges:

Previously, you accessed the blockRAM using functions in software. In this lab, your VHDL code must interface with the Nallatech blockRAM entity. To do this, read the Nallatech reference manual to understand the interface and timing issues.
The timing of your controller must be very precise in order to keep all of the components synchronized. For example, the output address generator must start precisely when the first output of the accumulator is generated.
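One way to think through the timing challenge is to write down, for each cycle after go, which enables should be asserted. The C sketch below does this under stated assumptions: the blockRAM read latency (1 cycle here) and the datapath latency (2 cycles here) are placeholder values, not numbers from the Nallatech manual or from the figure, so substitute the real latencies of your design. The names ctrl_out_t and ctrl_outputs are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BRAM_READ_LATENCY 1  /* assumption: check the Nallatech manual */
#define DATAPATH_LATENCY  2  /* assumption: number of register stages  */

/* Control outputs for one cycle of the model. */
typedef struct {
    int in_en;   /* input address generator enabled   */
    int out_en;  /* output address generator enabled  */
    int done;    /* all outputs have been written     */
} ctrl_out_t;

/* cycle counts clock cycles since go was seen (0 = first address issued);
 * size is the number of 32-bit input words to process. */
ctrl_out_t ctrl_outputs(uint32_t cycle, uint32_t size)
{
    const uint32_t delay = BRAM_READ_LATENCY + DATAPATH_LATENCY;
    ctrl_out_t o;

    assert(size > 0);
    o.in_en  = cycle < size;                            /* one address per cycle */
    o.out_en = cycle >= delay && cycle < delay + size;  /* first result arrives after delay */
    o.done   = cycle >= delay + size;                   /* last result written */
    return o;
}
```

With these placeholder latencies and a size of 512, the output address generator would be enabled from cycle 3 through cycle 514, and done would be asserted from cycle 515 onward. Comparing a table like this against your simulation waveform, cycle by cycle, is a quick way to spot an off-by-one error.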

Suggestions:

Test every component of your VHDL. You can simulate the functionality of blockRAMs by using Xilinx Core Generator (CoreGen), or you can write your own test entity that is similar to a blockRAM. You do not want to debug your circuit after implementing it on the Nallatech board. You should be absolutely positive it works before trying to use the board. Therefore, you should have a testbench for each main component (controller, datapath, address generator, etc.). Monitor your simulation output cycle by cycle to make sure everything is working exactly as expected. If you don't, it will be extremely difficult to work out timing issues.
I would recommend using signals instead of variables for important values. You can monitor internal signal values during simulation, but not variables. The more signals you can monitor, the better.

Part 2 - C Code

Your C code will initialize the input blockRAM, start the accumulator circuit, wait for the circuit to complete, and then read the data from the output blockRAM. You should use an input size of 512 32-bit words (512*4 8-bit inputs).

One challenge here is that you can only transfer a 32-bit word to the FPGA, but the circuit needs 4 8-bit inputs. Therefore, you must pack 4 inputs into each 32-bit word that you transfer to the blockRAM. You can easily do this with bit manipulation operators in C (|, <<, &). An example is shown below. Feel free to use this code in your design.

    unsigned int dat1[512];
    for(i=0;i<512;i++)
    {
        dat1[i] = ((rand() & 0xff) << 24) |
            ((rand() & 0xff) << 16) |
            ((rand() & 0xff) << 8) |
            ((rand() & 0xff));
    }

This code initializes each 32-bit element of the input array with 4 8-bit random numbers. You may perform initialization with different code if you like, but you should fill all the inputs with 8-bit random numbers. To test the circuit, the C code should also perform the accumulate in software, so you can easily compare the output of the FPGA to the software output. After performing the accumulate in software and on the FPGA, you should output to the screen the four numbers that were added, their result in software, and their result in hardware. You should also report the execution time for the software code and the hardware code, and the corresponding speedup. For the hardware execution time, start the timer as soon as you enable the circuit (i.e. don't include the time to transfer the data to the blockRAM). The output should basically look something like this:

195+177+210+177 => Sw Result: 759, Hw Result: 759
51+229+151+29 => Sw Result: 460, Hw Result: 460
153+227+81+150 => Sw Result: 611, Hw Result: 611
226+86+64+208 => Sw Result: 584, Hw Result: 584
220+202+211+41 => Sw Result: 674, Hw Result: 674
184+131+124+218 => Sw Result: 657, Hw Result: 657
116+212+31+179 => Sw Result: 538, Hw Result: 538
52+206+25+119 => Sw Result: 402, Hw Result: 402
0+107+41+51 => Sw Result: 199, Hw Result: 199
208+192+207+106 => Sw Result: 713, Hw Result: 713
36+160+128+7 => Sw Result: 331, Hw Result: 331
246+64+87+212 => Sw Result: 609, Hw Result: 609
11+44+125+67 => Sw Result: 247, Hw Result: 247
175+249+30+36 => Sw Result: 490, Hw Result: 490
207+188+215+131 => Sw Result: 741, Hw Result: 741
140+112+250+12 => Sw Result: 514, Hw Result: 514
91+36+64+44 => Sw Result: 235, Hw Result: 235
100+143+150+136 => Sw Result: 529, Hw Result: 529
Sw Execution Time: 0.007774
FPGA Execution Time: 0.003636
FPGA Speedup = 2.138164

Note: The reported execution times were measured for a very large input. You will likely not see this much improvement.
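The software side of the comparison might be organized along the following lines. This is a sketch, not required code: the function names are made up for this illustration, the byte order matches the packing example above, and timing uses the standard C clock() function, which is only one possible timer. The DimeTalk calls that start the circuit and read the output blockRAM are deliberately not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Software reference: sum the four bytes packed into each 32-bit word. */
void sw_accumulate(const uint32_t *in, uint32_t *out, size_t n)
{
    size_t i;
    assert(in != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        out[i] = (in[i] >> 24) + ((in[i] >> 16) & 0xff) +
                 ((in[i] >> 8) & 0xff) + (in[i] & 0xff);
    }
}

/* Count the entries where the hardware and software results disagree. */
size_t count_mismatches(const uint32_t *sw, const uint32_t *hw, size_t n)
{
    size_t i, bad = 0;
    for (i = 0; i < n; i++) {
        if (sw[i] != hw[i])
            bad++;
    }
    return bad;
}

/* Print one row in the same format as the sample output. */
void print_row(uint32_t word, uint32_t sw, uint32_t hw)
{
    printf("%u+%u+%u+%u => Sw Result: %u, Hw Result: %u\n",
           (unsigned)(word >> 24), (unsigned)((word >> 16) & 0xff),
           (unsigned)((word >> 8) & 0xff), (unsigned)(word & 0xff),
           (unsigned)sw, (unsigned)hw);
}

/* Convert a pair of clock() readings to seconds. */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To time the software version, read clock() immediately before and after sw_accumulate. To time the hardware version, read it when you assert go and again when done is seen. The speedup is the software time divided by the hardware time. Note that clock() can have coarse resolution, so for short runs you may need to repeat the computation several times and divide.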

Common Tool Problems

To do post-place+route simulation in Xilinx ISE, the top level entity has to use std_logic or std_logic_vector.
Xilinx ISE does not like generic values in the top level entity.
VHDL files imported into DIMETalk need to use std_logic.
DIMETalk will mess up when importing VHDL unless you define each input/output on a separate line.
Unless you use the names dt_clk and reset for the clock and reset, you will need to manually specify which signal is the clock and which is the reset.
If your code looks like it should, but you are getting strange problems, try creating a new project. This applies to both ISE and DIMETalk. I personally went through 3 ISE projects for this lab, even though the code was completely correct.