Lab 10: CPU Datapath

I dreamed that I slowly rose up, passing through the dataplane, through the datasphere, into and through the megasphere, and finally arrived at an unfamiliar place, a place I had never dreamed of before… This place had infinite space, colors that were serene and indescribable, no horizon, no sky, no earth, or any physical area that humans call the ground.

— “The Fall of Hyperion”, Dan Simmons

The goal of this experiment is to build three key components of the CPU datapath (the register file, the ALU, and the data memory) ahead of implementing a single-cycle CPU, and to verify each of them through functional simulation.

Register file

A register file is the storage unit in a CPU that holds temporary data during instruction execution. RV32I, the base version of the RISC-V instruction set that we are going to implement, has 32 general-purpose registers. RV32I is a load-store architecture: all data must first be read from memory into a register by a load instruction before arithmetic or logical operations can be performed on it. Each arithmetic operation may need to read two source registers and write one destination register in the same cycle. To support this high-speed, multi-port parallel access, we cannot simply instantiate a general-purpose RAM; the register file has to be written separately in Verilog.

General Purpose Registers in RV32I

RV32I has thirty-two 32-bit general-purpose registers, x0 to x31, addressed with 5 bits. Register x0 always reads as 0 and cannot be changed.

For the aliases of the other registers and the register usage conventions, see Table 9. Note that some registers are saved by the caller across a function call, while others are saved by the callee. Keep this in mind when mixing C and assembly code.

Table 9 Definition and usage of general-purpose registers in RV32I
Register	Name	Use	Saver
x0	zero	Constant 0	–
x1	ra	Return Address	Caller
x2	sp	Stack Pointer	Callee
x3	gp	Global Pointer	–
x4	tp	Thread Pointer	–
x5~x7	t0~t2	Temp	Caller
x8	s0/fp	Saved/Frame pointer	Callee
x9	s1	Saved	Callee
x10~x11	a0~a1	Arguments/Return Value	Caller
x12~x17	a2~a7	Arguments	Caller
x18~x27	s2~s11	Saved	Callee
x28~x31	t3~t6	Temp	Caller

Register file Implementation

../_images/regfile.png — Fig. 70 Register file block diagram

Fig. 70 shows the interface of the register file, which contains thirty-two 32-bit registers. The register file must support two reads and one write at the same time. It therefore takes two read addresses, Ra and Rb, corresponding to rs1 and rs2 in RISC-V assembly, and one write address, Rw, corresponding to rd. All addresses are 5 bits wide. The write data busW is 32 bits wide, and the write enable RegWr is a one-bit, active-high signal. The register file outputs two 32-bit values, busA and busB. Writes are clocked by WrClk. In terms of timing, reads can be asynchronous, that is, the outputs change as soon as the addresses change, and writes can take place on the falling edge of the clock. Note that register x0 requires special treatment: it always holds all zeros. Please think about how to implement the x0 register yourself.
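To make the interface concrete, here is a minimal sketch of one possible register file. The port names follow the text above; the module name regfile and the particular way x0 is handled (forcing reads of address 0 to zero and dropping writes to it) are assumptions made for illustration, and other treatments are equally valid. Compare it against your own design instead of treating it as the reference solution.

```verilog
// Sketch of a 32 x 32-bit register file: two asynchronous read ports,
// one write port clocked on the falling edge of WrClk.
// The module name and the x0 handling are illustrative assumptions.
module regfile (
    input  wire        WrClk,  // write clock
    input  wire        RegWr,  // write enable, active high
    input  wire [4:0]  Ra,     // read address A (rs1)
    input  wire [4:0]  Rb,     // read address B (rs2)
    input  wire [4:0]  Rw,     // write address (rd)
    input  wire [31:0] busW,   // write data
    output wire [31:0] busA,   // read data A
    output wire [31:0] busB    // read data B
);
    reg [31:0] regs [31:0];

    // Asynchronous reads; a read of x0 is forced to zero.
    assign busA = (Ra == 5'd0) ? 32'b0 : regs[Ra];
    assign busB = (Rb == 5'd0) ? 32'b0 : regs[Rb];

    // Write on the falling edge; writes to x0 are dropped.
    always @(negedge WrClk) begin
        if (RegWr && (Rw != 5'd0))
            regs[Rw] <= busW;
    end
endmodule
```

Because the reads are purely combinational, a value written on the falling edge is visible on busA or busB as soon as the write completes, without waiting for another clock edge.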

ALU

The ALU is one of the core datapath components of the CPU and is responsible for performing its arithmetic and logical operations. We already implemented a simple ALU in a previous experiment, so only minor modifications are required here. To meet the computational requirements of RV32I, the ALU control signals are redefined as shown in Table 10, and the logic diagram of the ALU is shown in Fig. 71. The ALU performs addition/subtraction, shifting, comparison, and the logic operations (XOR, OR, AND) on the input data in parallel. The final ALUout is chosen from the outputs of these units by an 8-to-1 multiplexer whose select input can be driven directly by ALUctr[2:0]. The other control signals inside the ALU are A/L, which selects arithmetic or logical shifting; L/R, which selects left or right shifting; U/S, which selects unsigned or signed comparison; and S/A, which selects subtraction or addition. These signals must be set according to the required operation; please design their logic yourself. Note: comparisons and equality checks should be performed with subtraction.

../_images/alu2.png — Fig. 71 ALU logic diagram

Table 10 Meaning of control signal ALUctr
ALUctr[3]	ALUctr[2:0]	ALU Operation
0	000	Select the adder output; perform addition
1	000	Select the adder output; perform subtraction
\(\times\)	001	Select the shifter output; left shift
0	010	Perform subtraction and select the signed set-less-than result; Less is set according to the signed comparison
1	010	Perform subtraction and select the unsigned set-less-than result; Less is set according to the unsigned comparison
\(\times\)	011	Pass ALU input B directly to the output
\(\times\)	100	Select the XOR output
0	101	Select the shifter output; logical right shift
1	101	Select the shifter output; arithmetic right shift
\(\times\)	110	Select the logic OR output
\(\times\)	111	Select the logic AND output
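As a cross-check for Table 10, the following is a behavioral sketch of an ALU that follows the encoding above. It is not the structural design of Fig. 71: the internal A/L, L/R, U/S, and S/A signals are folded into expressions, and the module name alu_sketch, the input names A and B, and the Zero output are assumptions made for illustration.

```verilog
// Behavioral sketch of the ALUctr encoding in Table 10.
// Names other than ALUctr, ALUout and Less are illustrative assumptions.
module alu_sketch (
    input  wire [31:0] A,
    input  wire [31:0] B,
    input  wire [3:0]  ALUctr,
    output reg  [31:0] ALUout,
    output wire        Less,   // meaningful for ALUctr[2:0] == 010
    output wire        Zero    // assumed output: subtraction result is zero
);
    // Subtract for "000 with ALUctr[3] = 1" and for both comparisons.
    wire is_cmp = (ALUctr[2:0] == 3'b010);
    wire sub    = is_cmp | ((ALUctr[2:0] == 3'b000) & ALUctr[3]);

    // Shared adder: A + B, or A + ~B + 1 when subtracting.
    wire [31:0] b_eff = B ^ {32{sub}};
    wire [32:0] sum   = {1'b0, A} + {1'b0, b_eff} + {32'b0, sub};
    wire [31:0] adder = sum[31:0];
    wire        carry = sum[32];
    wire        ovf   = (A[31] == b_eff[31]) && (adder[31] != A[31]);

    // Signed less-than: sign XOR overflow. Unsigned: borrow, i.e. no carry.
    wire less_signed   = adder[31] ^ ovf;
    wire less_unsigned = ~carry;
    assign Less = ALUctr[3] ? less_unsigned : less_signed;
    assign Zero = (adder == 32'b0);

    // Shifts use the low five bits of B as the shift amount.
    wire [31:0] sll = A << B[4:0];
    wire [31:0] srl = A >> B[4:0];
    wire [31:0] sra = $signed(A) >>> B[4:0];

    // 8-to-1 output multiplexer selected by ALUctr[2:0].
    always @(*) begin
        case (ALUctr[2:0])
            3'b000:  ALUout = adder;
            3'b001:  ALUout = sll;
            3'b010:  ALUout = {31'b0, Less};
            3'b011:  ALUout = B;
            3'b100:  ALUout = A ^ B;
            3'b101:  ALUout = ALUctr[3] ? sra : srl;
            3'b110:  ALUout = A | B;
            3'b111:  ALUout = A & B;
            default: ALUout = 32'b0;
        endcase
    end
endmodule
```

The arithmetic right shift is computed on its own wire so that the signed operand is not mixed with unsigned operands in one expression, which would turn the shift into a logical one.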

Data memory

The data memory stores global variables, the stack, and other data while the CPU is running. We recommend implementing a data memory of at least 128 KB. The data memory must support clocked (edge-triggered) reads and writes. The word length of RV32I is 32 bits, but besides 32-bit accesses the data memory must also support byte (8-bit) and half-word (16-bit) accesses. Since a single-cycle CPU has to complete all operations of an instruction within one cycle, the data RAM needs independent read and write clocks: reads are performed on the rising edge of the system clock (halfway through the clock cycle), and writes are performed on the falling edge (the end of the clock cycle). We recommend implementing the data memory with a dual-port RAM (RAM 2 PORT). The large-capacity SRAM on the DE10-Standard development board supports independent read and write clocks and can generally provide more than 128 KB of data storage.

To support byte (8-bit) and half-word (16-bit) reads and writes, you need to adapt the memory generated by the IP core to some extent. The implementation does not need to handle misaligned addresses for 4-byte or 2-byte accesses: assume that the lower two bits of the address are 00 for a 4-byte access and that the lowest bit of the address is 0 for a 2-byte access.

Specifically, MemOP is a 3-bit signal that controls the data memory access format. 010 selects a 4-byte access; 001 selects a 2-byte access, sign-extended on reads; 000 selects a 1-byte access, sign-extended on reads; 101 selects a 2-byte access, zero-extended on reads; and 100 selects a 1-byte access, zero-extended on reads.

The correspondence between MemOP and memory operations in RV32 is as follows:

Table 11 Correspondence between memory access instructions and MemOP
Instruction	MemOP	Operation
lb rd,imm12(rs1)	000	\(R[rd] \leftarrow SEXT(M_{1B}[ R[rs1] + SEXT(imm12) ])\)
lh rd,imm12(rs1)	001	\(R[rd] \leftarrow SEXT(M_{2B}[ R[rs1] + SEXT(imm12) ])\)
lw rd,imm12(rs1)	010	\(R[rd] \leftarrow M_{4B}[ R[rs1] + SEXT(imm12) ]\)
lbu rd,imm12(rs1)	100	\(R[rd] \leftarrow \{24'b0, M_{1B}[ R[rs1] + SEXT(imm12) ]\}\)
lhu rd,imm12(rs1)	101	\(R[rd] \leftarrow \{16'b0, M_{2B}[ R[rs1] + SEXT(imm12) ]\}\)
sb rs2,imm12(rs1)	000	\(M_{1B}[ R[rs1] + SEXT(imm12) ] \leftarrow R[rs2][7:0]\)
sh rs2,imm12(rs1)	001	\(M_{2B}[ R[rs1] + SEXT(imm12) ] \leftarrow R[rs2][15:0]\)
sw rs2,imm12(rs1)	010	\(M_{4B}[ R[rs1] + SEXT(imm12) ] \leftarrow R[rs2]\)

For read operations, we can always read 32 bits at a time, use MemOP to determine whether 8, 16, or 32 bits are needed, select the corresponding byte or half-word using the lower two bits of the address, and then sign- or zero-extend it to form the 32-bit read result.
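The read path described above can be sketched as a small combinational block placed after the RAM output. The module name load_align and the signal names word, addr_lo, and dout are hypothetical; the sketch assumes the little-endian byte order of RV32I and aligned addresses, as stated earlier.

```verilog
// Sketch: extract and extend the loaded value from a 32-bit RAM word.
// Module and signal names are illustrative assumptions.
module load_align (
    input  wire [31:0] word,     // 32-bit word read from the RAM
    input  wire [1:0]  addr_lo,  // lower two bits of the byte address
    input  wire [2:0]  MemOP,    // access format, see Table 11
    output reg  [31:0] dout      // value to write into rd
);
    // Addressed byte and half-word (little-endian layout).
    wire [7:0]  b = word[8*addr_lo +: 8];
    wire [15:0] h = addr_lo[1] ? word[31:16] : word[15:0];

    always @(*) begin
        case (MemOP)
            3'b000:  dout = {{24{b[7]}}, b};    // lb:  sign-extend byte
            3'b001:  dout = {{16{h[15]}}, h};   // lh:  sign-extend half-word
            3'b010:  dout = word;               // lw:  full word
            3'b100:  dout = {24'b0, b};         // lbu: zero-extend byte
            3'b101:  dout = {16'b0, h};         // lhu: zero-extend half-word
            default: dout = 32'b0;              // unused encodings
        endcase
    end
endmodule
```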

For write operations, since it is necessary to write specific 8-bit or 16-bit data within 32 bits without damaging other bits, careful consideration is required in the implementation. We provide the following three solutions for your reference.

Use a dual-port RAM with the single-byte write enable signal supported by the IP core.

This is the method we recommend. As shown in Fig. 72, when configuring the dual-port RAM in Quartus, Step 3 offers the option of generating a single-byte write enable signal. For example, for a 32-bit RAM the IP core generates an active-high byte enable signal byteena_a[3:0]. To write all four bytes of a 32-bit word at once, set this signal to 4'b1111; to write only the lower 8 bits, set it to 4'b0001. Byte and half-word writes therefore only require setting the corresponding byte enable bits and arranging the write data in the right byte positions of a 32-bit write.

../_images/rammask.png — Fig. 72 Single-byte write enable configuration in dual-port RAM
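As an illustration of this first approach, the sketch below derives the byte enable mask and the 32-bit write data for sb, sh, and sw. The module name store_align and the signal names rs2_data, addr_lo, byteena, and wdata are hypothetical. The stored byte or half-word is replicated across all lanes, so whichever lanes are enabled receive the right data; this is one simple option, not the only one.

```verilog
// Sketch: byte enables and write data for sb / sh / sw.
// Module and signal names are illustrative assumptions.
module store_align (
    input  wire [31:0] rs2_data,  // register value to store
    input  wire [1:0]  addr_lo,   // lower two bits of the byte address
    input  wire [2:0]  MemOP,     // access format, see Table 11
    output reg  [3:0]  byteena,   // drives byteena_a of the RAM
    output reg  [31:0] wdata      // drives the RAM write data
);
    always @(*) begin
        case (MemOP)
            3'b000: begin  // sb: enable the single addressed byte lane
                byteena = 4'b0001 << addr_lo;
                wdata   = {4{rs2_data[7:0]}};
            end
            3'b001: begin  // sh: enable the lower or upper half-word
                byteena = addr_lo[1] ? 4'b1100 : 4'b0011;
                wdata   = {2{rs2_data[15:0]}};
            end
            3'b010: begin  // sw: enable all four lanes
                byteena = 4'b1111;
                wdata   = rs2_data;
            end
            default: begin // not a store format: write nothing
                byteena = 4'b0000;
                wdata   = 32'b0;
            end
        endcase
    end
endmodule
```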

Read the original data before writing, modify it, and then write 32 bits at once.

The RAM generated by the IP core does not support initializing its data multiple times during simulation, so for simulation we can also replace the IP core RAM with single-byte write enable by a RAM we write ourselves. Observe that in a single-cycle CPU, the CPU performs at most one memory operation, either a read or a write, per cycle. Therefore, to write 8 bits without altering the neighboring bits, we can use the rising clock edge to read the word that is about to be written, instead of the word at the read address, modify it, and write it back on the falling edge. Note that the write enable signal and the write address must be valid by the rising edge of the clock. The following example implementation has the same interface as the dual-port RAM generated by the IP core.

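module testdmem(
  input      [3:0]  byteena_a,  // byte write enables, active high
  input      [31:0] data,       // write data
  input      [14:0] rdaddress,  // read address (word index)
  input             rdclock,    // read clock
  input      [14:0] wraddress,  // write address (word index)
  input             wrclock,    // write clock
  input             wren,       // write enable
  output reg [31:0] q           // read data
);

  reg  [31:0] tempout;          // old contents of the word being written
  wire [31:0] tempin;           // merged word to write back
  reg  [31:0] ram [32767:0];    // 32K words x 32 bits = 128 KB

  // On the read clock edge: in a write cycle, fetch the word at the write
  // address so it can be merged; otherwise perform a normal read.
  always @(posedge rdclock)
  begin
    if (wren)
      tempout <= ram[wraddress];
    else
      q <= ram[rdaddress];
  end

  // Byte-wise merge: take new data where the byte enable is set,
  // keep the old contents elsewhere.
  assign tempin[7:0]   = (byteena_a[0]) ? data[7:0]   : tempout[7:0];
  assign tempin[15:8]  = (byteena_a[1]) ? data[15:8]  : tempout[15:8];
  assign tempin[23:16] = (byteena_a[2]) ? data[23:16] : tempout[23:16];
  assign tempin[31:24] = (byteena_a[3]) ? data[31:24] : tempout[31:24];

  // On the write clock edge: write the merged 32-bit word back.
  always @(posedge wrclock)
  begin
    if (wren)
      ram[wraddress] <= tempin;
  end
endmodule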

Non-standard RAM modules

Quartus is unlikely to map such hand-written RAM modules onto M10K memory blocks, which can lead to insufficient device resources or very long compilation times. For code that actually runs on the board, we recommend using IP core-generated RAM for any storage larger than 64k.

Use four 8-bit RAMs and combine them into a 32-bit memory.

This method builds the 32-bit RAM from four 8-bit RAMs, each responsible for one byte of the 32-bit word. For example, RAM0 holds the bytes whose addresses have lower two bits 00, RAM1 holds those with lower two bits 01, and so on. To read or write 32 bits at once, connect the corresponding data bits and the upper 30 bits of the address to all four RAMs so that they operate simultaneously, giving a single access of \(4\times8=32\) bits. If only 8 bits need to be written, the write enable of each RAM can be derived from the lower two bits of the address so that only one RAM is written. The main drawback of this method is that each of the four RAMs must be initialized separately, which is somewhat cumbersome.
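The four-RAM organization can be sketched as follows. Behavioral arrays stand in for the four 8-bit RAMs here; on the board each one would more likely be an IP core-generated RAM. The module name dmem_4bank and its port names are hypothetical, and the 15-bit word address simply matches a 128 KB memory.

```verilog
// Sketch: a 32-bit data memory built from four 8-bit RAMs.
// Behavioral arrays stand in for the individual RAMs; names are assumptions.
module dmem_4bank (
    input  wire        rdclock,  // read clock
    input  wire        wrclock,  // write clock
    input  wire        wren,     // write enable
    input  wire [3:0]  byteena,  // one enable bit per 8-bit RAM
    input  wire [14:0] address,  // word address (byte address without its low 2 bits)
    input  wire [31:0] data,     // write data
    output reg  [31:0] q         // read data
);
    reg [7:0] ram0 [0:32767];    // bytes at addresses ending in 00
    reg [7:0] ram1 [0:32767];    // bytes at addresses ending in 01
    reg [7:0] ram2 [0:32767];    // bytes at addresses ending in 10
    reg [7:0] ram3 [0:32767];    // bytes at addresses ending in 11

    // Read all four RAMs in parallel and concatenate into one word.
    always @(posedge rdclock)
        q <= {ram3[address], ram2[address], ram1[address], ram0[address]};

    // Write only the RAMs whose enable bit is set.
    always @(posedge wrclock) begin
        if (wren) begin
            if (byteena[0]) ram0[address] <= data[7:0];
            if (byteena[1]) ram1[address] <= data[15:8];
            if (byteena[2]) ram2[address] <= data[23:16];
            if (byteena[3]) ram3[address] <= data[31:24];
        end
    end
endmodule
```

The per-RAM enable bits can be derived from MemOP and the lower two address bits in the same way as the single-byte write enable of the IP core RAM.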

Lab check-in contents

Online test

Please complete the implementation of the CPU’s register file, ALU, and data memory separately, and pass two online tests.

In-person check-in can be requested.

The online testing system for the course has high requirements for timing and implementation. If you are unable to pass the online test, you can write your own test bench and have it verified in-person by a teaching assistant to complete the check-in process.
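If you write your own test bench, a small self-checking one is usually enough to start with. The sketch below exercises the testdmem example from the data memory section: it writes a full word, overwrites a single byte, and reads the word back. The test bench name, the task names store and check, the clock period, and the test values are all assumptions made for illustration; the read clock is the system clock and the write clock is its inverse, as described earlier.

```verilog
`timescale 1ns/1ps
// Sketch of a self-checking test bench for the testdmem example.
// Task names, timing and test values are illustrative assumptions.
module testdmem_tb;
    reg         clk       = 1'b0;
    reg  [3:0]  byteena_a = 4'b0000;
    reg  [31:0] data      = 32'b0;
    reg  [14:0] rdaddress = 15'd0;
    reg  [14:0] wraddress = 15'd0;
    reg         wren      = 1'b0;
    wire [31:0] q;

    // Read on the rising edge of clk, write on its falling edge.
    testdmem dut (
        .byteena_a(byteena_a), .data(data),
        .rdaddress(rdaddress), .rdclock(clk),
        .wraddress(wraddress), .wrclock(~clk),
        .wren(wren), .q(q)
    );

    always #5 clk = ~clk;

    // One store cycle: drive the inputs just after a falling edge, so the
    // old word is fetched on the rising edge and the merged word is
    // written back on the next falling edge.
    task store(input [14:0] a, input [3:0] be, input [31:0] d);
        begin
            @(negedge clk); #1;
            wraddress = a; byteena_a = be; data = d; wren = 1'b1;
            @(negedge clk); #1;
            wren = 1'b0;
        end
    endtask

    // One load cycle: set the read address, sample q after the rising edge.
    task check(input [14:0] a, input [31:0] expected);
        begin
            @(negedge clk); #1;
            rdaddress = a;
            @(posedge clk); #1;
            if (q !== expected)
                $display("FAIL addr=%0d q=%h expected=%h", a, q, expected);
            else
                $display("PASS addr=%0d q=%h", a, q);
        end
    endtask

    initial begin
        store(15'd3, 4'b1111, 32'h11223344);  // full-word write
        store(15'd3, 4'b0010, 32'h0000AA00);  // overwrite byte 1 only
        check(15'd3, 32'h1122AA44);           // other bytes must be intact
        $finish;
    end
endmodule
```

The same pattern of driving inputs after one edge and checking outputs after the next carries over to the register file and the ALU.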