### Lecture 08: RISC-V Single-Cycle Implementation

**CSE 564 Computer Architecture Summer 2017** 

Department of Computer Science and Engineering Yonghong Yan <u>yan@oakland.edu</u> www.secs.oakland.edu/~yan

#### Acknowledgements

- The notes cover Appendix C of the textbook, but we use RISC-V instead of MIPS ISA
  - Slides for general RISC ISA implementation are adapted from Lecture slides for "Computer Organization and Design, Fifth Edition: The Hardware/Software Interface" textbook for general RISC ISA implementation
  - Slides for RISC-V single-cycle implementation are adapted from Computer Science 152: Computer Architecture and Engineering, Spring 2016 by Dr. George Michelogiannakis from UC Berkeley

### Introduction

- CPU performance factors *CPU* Time = <u>Instructions</u> \* Cycles \*Time Program Instruction Cycle
  - Instruction count
    - Determined by ISA and compiler
  - CPI and Cycle time
    - Determined by CPU hardware
- Simple subset, shows most aspects
  - Memory reference: lw, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j

#### **Components of a Computer**



# The CPU

- Processor (CPU): the active part of the computer that does all the work (data manipulation and decisionmaking)
- Datapath: portion of the processor that contains hardware necessary to perform operations required by the processor (the brawn)
- Control : portion of the processor (also in hardware) that tells the datapath what needs to be done (the brain)

#### **Datapath and Control**

- Datapath designed to support data transfers required by instructions
- Controller causes correct transfers to happen



# **Logic Design Basics**

- Information encoded in binary
  - Low voltage = 0, High voltage = 1
  - One wire per bit
  - Multi-bit data encoded on multi-wire buses
- Combinational circuit
  - Operate on data
  - Output is a function of input
- State (sequential) circuit
  - Store information

# **Combinational Circuits**

AND-gate
 Y = A & B





- Multiplexer
  - Y = S ? I1 : I0

Arithmetic/Logic Unit
Y = F(A, B)



# **Sequential Circuits**

- Register: stores data in a circuit
  - Uses a clock signal to determine when to update the stored value
  - Edge-triggered: update when Clk changes from 0 to 1



#### **Edge-Triggered D Flip Flops**



# **Sequential Circuits**

- Register with write control
  - Only updates on clock edge when write control input is 1
  - Used when stored value is required later



# **Clocking Methodology**

- Combinational logic transforms data during clock cycles
  - Between clock edges
  - Input from state elements, output to state element
  - Longest delay determines clock period



#### Single cycle data paths



#### Hardware Elements of CPU

- Combinational circuits
  - Mux, Decoder, ALU, ...





- Synchronous state elements
  - Flipflop, Register, Register file, SRAM, DRAM



Edge-triggered: Data is sampled at the rising edge

#### **Register Files**

Reads are combinational



#### **Register File Implementation**

 RISC-V integer instructions have at most 2 register source operands



#### **A Simple Memory Model**



Reads and writes are always completed in one cycle

- a Read can be done any time (i.e. combinational)
- a Write is performed at the rising clock edge if it is enabled

=> the write address and data must be stable at the clock edge

Later in the course we will present a more realistic model of memory

#### **Five Stages of Instruction Execution**

- Stage 1: Instruction Fetch
- Stage 2: Instruction Decode
- Stage 3: ALU (Arithmetic-Logic Unit)
- Stage 4: Memory Access
- Stage 5: Register Write

#### **Stages of Execution on Datapath**



# Stages of Execution (1/5)

- There is a wide variety of instructions: so what general steps do they have in common?
- Stage 1: Instruction Fetch
  - The 32-bit instruction word must first be fetched from memory
    - the cache-memory hierarchy
  - also, this is where we Increment PC
    - PC = PC + 4, to point to the next instruction: byte addressing so + 4

# Stages of Execution (2/5)

- Stage 2: Instruction Decode: gather data from the fields (decode all necessary instruction data)
  - 1. read the opcode to determine instruction type and field lengths
  - 2. read in data from all necessary registers
    - for add, read two registers
    - for addi, read one register
    - for jal, no reads necessary

| 31      | $25 \hspace{0.1in} 24$ | $20 \ 19$ | 15                    | 5 14                        | $12\ 11$ | 7                     | 6      | 0 |
|---------|------------------------|-----------|-----------------------|-----------------------------|----------|-----------------------|--------|---|
| funct7  | rs                     | 2         | rs1                   | funct3                      |          | rd                    | opcode |   |
| 7       | 5                      |           | <b>5</b>              | 3                           |          | 5                     | 7      |   |
| 000000  | $\mathbf{src}$         | c2        | $\operatorname{src1}$ | ADD/SLT/SI                  | ЛU       | $\operatorname{dest}$ | OP     |   |
| 000000  | sre                    | ${ m e}2$ | $\operatorname{src1}$ | AND/OR/XC                   | DR       | dest                  | OP     |   |
| 000000  | $\mathbf{src}$         | :2        | $\operatorname{src1}$ | $\mathrm{SLL}/\mathrm{SRL}$ |          | $\operatorname{dest}$ | OP     |   |
| 0100000 | $\mathbf{src}$         | :2        | $\operatorname{src1}$ | SUB/SRA                     |          | $\operatorname{dest}$ | OP     |   |

# Stages of Execution (3/5)

- Stage 3: ALU (Arithmetic-Logic Unit): the real work of most instructions is done here
  - AL operations:
    - arithmetic (+, -, \*, /), shifting, logic (&, |), comparisons (slt)
  - loads and stores
    - lw \$t0, 40(\$t1)
    - the address we are accessing in memory = the value in \$t1 PLUS the value 40
    - Addition is done in this stage
  - Conditional branch
    - Comparison is done in this stage (one solution)

# **Stages of Execution (4/5)**

- Stage 4: Memory Access: only load and store instructions
  - the others remain idle during this stage or skip it all together
  - since these instructions have a unique step, we need this extra stage to account for them
  - as a result of the cache system, this stage is expected to be fast

# Stages of Execution (5/5)

- Stage 5: Register Write
  - most instructions write the result of some computation into a register
  - examples: arithmetic, logical, shifts, loads, slt
  - what about stores, branches, jumps?
    - don't write anything into a register at the end
    - these remain idle during this fifth stage or skip it all together

#### **Stages of Execution on Datapath**



# **Instruction Execution**

- $PC \rightarrow$  instruction memory, fetch instruction
- Register numbers  $\rightarrow$  register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch condition and target address
  - Access data memory for load/store
  - PC ← target address or PC + 4

# **CPU Components**



### **Multiplexers**



# **Control Signals**



# **Building a Datapath**

- Datapath
  - Elements that process data and addresses in the CPU
    - Registers, ALUs, mux's, memories, ...
- We will build a RISCV datapath incrementally
  - Refining the overview design

#### **Instruction Fetch**



## **R-Format Instructions**

- Read two register operands
- Perform arithmetic/logical operation
- Write register result



a. Registers

b. ALU

### **Load/Store Instructions**

- Read register operands
- Calculate address using 12-bit offset
  - Use ALU, but sign-extend offset
- Load: Read memory and update register
- Store: Write register value to memory



# **Branch Instructions**

- Read register operands
- Compare operands
  - Use ALU, subtract and check Zero output
- Calculate target address
  - Sign-extend displacement
  - Shift left 2 places (word displacement)
  - Add to PC + 4
    - Already calculated by instruction fetch

#### **Branch Instructions**



# **Composing the Elements**

- First-cut data path does an instruction in one clock cycle
  - Each datapath element can only do one function at a time
  - Hence, we need separate instruction and data memories
- Use multiplexers where alternate data sources are used for different instructions

# **R-Type/Load/Store Datapath**



# **Full Datapath**



# **ALU Control**

- ALU used for
  - Load/Store: F = add
  - Branch: F = subtract
  - R-type: F depends on funct field

| ALU control | Function         |  |  |  |  |
|-------------|------------------|--|--|--|--|
| 0000        | AND              |  |  |  |  |
| 0001        | OR               |  |  |  |  |
| 0010        | add              |  |  |  |  |
| 0110        | subtract         |  |  |  |  |
| 0111        | set-on-less-than |  |  |  |  |
| 1100        | NOR              |  |  |  |  |

# **ALU Control**

- Assume 2-bit ALUOp derived from opcode
  - Combinational logic derives ALU control

| opcode | ALUOp | Operation        | funct  | funct ALU function |      |
|--------|-------|------------------|--------|--------------------|------|
| lw     | 00    | load word        | XXXXXX | add                | 0010 |
| SW     | 00    | store word       | XXXXXX | add                | 0010 |
| beq    | 01    | branch equal     | XXXXXX | subtract           | 0110 |
| R-type | 10    | add              | 100000 | add                | 0010 |
|        |       | subtract         | 100010 | subtract           | 0110 |
|        |       | AND              | 100100 | AND                | 0000 |
|        |       | OR               | 100101 | OR                 | 0001 |
|        |       | set-on-less-than | 101010 | set-on-less-than   | 0111 |

### **Datapath: Reg-Reg ALU Instructions**



### **Datapath: Reg-Imm ALU Instructions**



## **Conflicts in Merging Datapath**



### **Datapath for ALU Instructions**



## Load/Store Instructions



## **RISC-V Conditional Branches**

| 7 5     |    | 5  | 5 3       |    |     | 5 7 |        | 7  | BEQ/BNE |   |   |           |
|---------|----|----|-----------|----|-----|-----|--------|----|---------|---|---|-----------|
| imm rs2 |    | 52 | rs1 func3 |    | imm |     | opcode |    | BLT/BGE |   |   |           |
| 31      | 25 | 24 | 20        | 19 | 15  | 14  | 12     | 11 | _       | 6 | 0 | BLTU/BGEU |

- Compare two integer registers for equality (BEQ/BNE) or signed magnitude (BLT/BGE) or unsigned magnitude (BLTU/ BGEU)
- 12-bit immediate encodes branch target address as a signed offset from PC, in units of 16-bits (i.e., shift left by 1 then add to PC).

#### Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)



### Full RISCV1Stage Datapath



### Hardwired Control is pure Combinational Logic



## **ALU Control & Immediate Extension**



## **Hardwired Control Table**

| Opcode               | ImmSel               | Op2Sel                 | FuncSel | MemWr | RFWen | WBSel | WASel | PCSel |
|----------------------|----------------------|------------------------|---------|-------|-------|-------|-------|-------|
| ALU                  | ALU * Re             |                        | Func    | no    | yes   | ALU   | rd    | pc+4  |
| ALUi                 | IType <sub>12</sub>  | Гуре <sub>12</sub> Imm |         | no    | yes   | ALU   | rd    | pc+4  |
| LW                   | IType <sub>12</sub>  | Imm                    | +       | no    | yes   | Mem   | rd    | pc+4  |
| SW                   | SType <sub>12</sub>  | Imm                    | +       | yes   | no    | *     | *     | pc+4  |
| BEQ <sub>true</sub>  | SBType <sub>12</sub> | *                      | *       | no    | no    | *     | *     | br    |
| BEQ <sub>false</sub> | SBType <sub>12</sub> | *                      | *       | no    | no    | *     | *     | pc+4  |
| J                    | *                    | *                      | *       | no    | no    | *     | *     | jabs  |
| JAL                  | *                    | *                      | *       | no    | yes   | РС    | X1    | jabs  |
| JALR                 | *                    | *                      | *       | no    | yes   | РС    | rd    | rind  |

Op2Sel= Reg / Imm WASel = rd / X1 WBSel = ALU / Mem / PC PCSel = pc+4 / br / rind / jabs

# **Single-Cycle Hardwired Control**

clock period is sufficiently long for all of the following steps to be "completed":

- 1. Instruction fetch
- 2. Decode and register fetch
- 3. ALU operation
- 4. Data fetch if required
- 5. Register write-back setup time
- => tC > tlFetch + tRFetch + tALU+ tDMem+ tRWB

At the rising edge of the following clock, the PC, register file and memory are updated

# **Implementation in Real**

- Load-Store RISC ISAs designed for efficient pipelined implementations
  - Inspired by earlier Cray machines (CDC 6600/7600)
- RISC-V ISA implemented using Chisel hardware construction language
  - Chisel: <u>https://chisel.eecs.berkeley.edu/</u>
  - Getting started:
    - <u>https://chisel.eecs.berkeley.edu/2.2.0/getting-started.html</u>
  - Check resource page for slides and other info

## **Chisel in one slides**

```
    Module
```

- 10
- Wire
- Reg
- Mem

```
import Chisel._
```

```
class GCD extends Module {
 val io = new Bundle {
   val a = UInt(INPUT, 16)
   val b = UInt(INPUT, 16)
   val e = Bool(INPUT)
   val z = UInt(OUTPUT, 16)
   val v = Bool(OUTPUT)
  }
 val x = Reg(UInt())
 val y = Reg(UInt())
 when (x > y) \{ x := x - y \}
 unless (x > y) \{ y := y - x \}
 when (io.e) { x := io.a; y := io.b }
  io.z := x
  io.v := y === UInt(0)
} ...
```

## **UCB RISC-V Sodor**

- https://github.com/ucb-bar/riscv-sodor
  - Single-cycle:
    - https://github.com/ucb-bar/riscv-sodor/tree/master/src/ rv32\_1stage

