In a positional numeral system, the radix or base is the number of unique digits, including the digit
zero, used to represent numbers. For example, for the decimal system (the most common system in use today) the radix
is ten, because it uses the ten digits from 0 through 9.
In any standard positional numeral system, a number is conventionally written as $(x)_y$ with $x$ as the string of
digits and $y$ as its base, although for base ten the subscript is usually assumed (and omitted, together with the
pair of parentheses), as it is the most common way to express value. For example, $(100)_{10}$ is equivalent to
$100$ (the decimal system is implied in the latter) and represents the number one hundred, while $(100)_2$ (in the
binary system with base 2) represents the number four.
This representation is unique. Let $b$ be a positive integer greater than 1. Then every positive integer $a$ can be
expressed uniquely in the form
$${\displaystyle a=r_{m}b^{m}+r_{m-1}b^{m-1}+\dotsb +r_{1}b+r_{0},}$$
where $m$ is a nonnegative integer and the $r$'s are integers such that
$0 < r_m < b$ and $0 \leq r_i < b$ for $i = 0, 1, \ldots, m - 1$.
def convert_to_decimal(num, base):
    """Interpret num (an int or string of digits) in the given base and return its decimal value."""
    num = str(num)
    power = len(num) - 1  # place value of the leftmost digit
    decimal = 0
    for digit in num:
        # int(digit, base) also handles letter digits such as 'A' in base 16
        decimal += int(digit, base) * (base ** power)
        power -= 1
    return decimal
# A few checks: 135 (base 10) -> 135, 1440 (base 7) -> 567, 'A32' (base 16) -> 2610.
print(convert_to_decimal(135, 10))    # 135
print(convert_to_decimal(1440, 7))    # 567
print(convert_to_decimal('A32', 16))  # 2610
Recall that every positive integer can be expressed uniquely in the form
$${\displaystyle a=r_{m}b^{m}+r_{m-1}b^{m-1}+\dotsb +r_{1}b+r_{0},}$$
where $m$ is a nonnegative integer and the $r$'s are integers such that
$0 < r_m < b$ and $0 \leq r_i < b$ for $i = 0, 1, \ldots, m - 1$.
The number in base $b$ is written as $r_m r_{m - 1} \dots r_1 r_0$. So, if you are given a decimal number,
you are given the value $a$ and you want to find the individual digits $r_i$ for $i = 0, 1, \ldots, m$.
To do this, you can just divide by $b$ and keep track of the remainder. Let's see what happens:
$$a / b = (r_m b^{m - 1} + r_{m - 1} b^{m - 2} + \dots + r_2 b + r_1) \operatorname{R} r_0$$
Now, the remainder is the $r_0$ value, and the quotient is the $a$ value for the next iteration. So, you can
just keep dividing by $b$ and keeping track of the remainders. Denote the quotient by
$q_1 = r_m b^{m - 1} + r_{m - 1} b^{m - 2} + \dots + r_2 b + r_1$.
$$q_1 / b = (r_m b^{m - 2} + r_{m - 1} b^{m - 3} + \dots + r_2) \operatorname{R} r_1$$
Continue the process until you get a quotient of 0. The remainders are the $r_i$ values in reverse order.
This is the same process as in the example above; a code sketch follows.
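A minimal sketch of this repeated-division algorithm in Python, assuming bases up to 16 (the name convert_from_decimal is our own; it inverts convert_to_decimal above):

def convert_from_decimal(a, base):
    """Return the base-`base` digit string of the nonnegative integer a."""
    digits = "0123456789ABCDEF"  # enough digit symbols for bases up to 16
    result = ""
    while a > 0:
        a, r = divmod(a, base)        # quotient becomes the next a; remainder is a digit
        result = digits[r] + result   # remainders come out in reverse order
    return result or "0"

print(convert_from_decimal(567, 7))    # '1440'
print(convert_from_decimal(2610, 16))  # 'A32'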
Binary Numbers
A binary number is just a special case of the radix representation where the base is 2.
In computer programming, an integer overflow occurs when an arithmetic operation attempts to create a
numeric value that is outside the range that can be represented with a given number of digits, either
higher than the maximum or lower than the minimum representable value.
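Python integers do not overflow, but we can sketch what happens in a fixed-width 8-bit two's-complement register by masking to 8 bits (the helper to_int8 is our own illustration, not a standard function):

def to_int8(x):
    """Reduce x to an 8-bit two's-complement value in -128..127."""
    x &= 0xFF  # keep only the low 8 bits, as 8-bit hardware would
    return x - 256 if x >= 128 else x

print(to_int8(127 + 1))   # -128: overflow wraps past the maximum
print(to_int8(-128 - 1))  # 127: wraps past the minimum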
Scientific notation is a way of expressing numbers that are too large or too small to be conveniently
written in decimal form, since doing so would require writing out an unusually long string of digits.
In scientific notation, nonzero numbers are written in the form
$$m \times 10^n$$
or $m$ times ten raised to the power of $n$, where $n$ is an integer, and the coefficient $m$ is a nonzero real number
(usually between 1 and 10 in absolute value, and nearly always written as a terminating decimal).
Normalized number
A number is normalized when it is written in scientific notation with one nonzero decimal digit
before the decimal point. Thus, a real number written in normalized scientific notation looks as follows:
$${\displaystyle \pm d_{0}.d_{1}d_{2}d_{3}\dots \times 10^{n}}$$
where $n$ is an integer, ${\textstyle d_{0},d_{1},d_{2},d_{3},\ldots ,}$ are the digits of the number in base
10, and ${\displaystyle d_{0}}$ is not zero.
Radix Point
The character used to separate the integer and fractional part of a number in a digital representation of a
number, regardless of the base used.
"decimal" point for base 10.
"binary" point for base 2.
Floating Point vs. Fixed-point Notation
Floating Point
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately,
using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base.
For example, 12.345 can be represented as a base-ten floating-point number:
$$12.345 = \underbrace{12345}_{\text{significand}} \times {\underbrace{10}_{\text{base}}}^{\overbrace{-3}^{\text{exponent}}}$$
Fixed-point Notation
In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed
number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two
fractional digits, representing the cents (1/100 of a dollar); a sketch of this follows.
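A minimal sketch of the dollars-and-cents case, assuming amounts are stored as integer cents (the variable names are our own):

price_cents = 1999  # $19.99 stored as 1999 cents
tax_cents = 5       # $0.05
total_cents = price_cents + tax_cents      # exact integer addition, no rounding error
dollars, cents = divmod(total_cents, 100)  # split back into dollar and cent parts
print(f"${dollars}.{cents:02d}")           # $20.04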
Terminology
$I.F \cdot B^E$
$I$ - integer
$F$ - fraction
$I.F$ - significand (mantissa)
$B$ - base
$E$ - exponent
Notes
binary & normalized - first digit is always 1
base usually implicit - standard is 2
Representation
Standard: IEEE 754
single precision: 32 bits
double precision: 64 bits
quadruple precision: 128 bits
Integer (always 1) not represented
this is called implicit bit
need a way to represent 0
We still need to represent the fraction, the exponent, and the sign.
In order to interpret a floating-point number, you must know the number of bits used for the sign
(usually 1), the exponent part, and the fraction part, as well as the bias.
A normal 32-bit value is interpreted as $(-1)^S \cdot (1 + F) \cdot 2^{E - 127}$; the all-ones exponent field,
as in $(-1)^S \cdot (1 + F) \cdot 2^{255 - 127}$, is reserved for special values such as NaN (Not a Number).
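To make the fields concrete, this sketch decodes a 32-bit IEEE 754 single-precision value with Python's standard struct module (sign 1 bit, exponent 8 bits, fraction 23 bits, bias 127):

import struct

def decode_float32(x):
    """Split a float, stored as IEEE 754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # the raw 32 bits, big-endian
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased exponent field
    fraction = bits & 0x7FFFFF      # 23 fraction bits (implicit leading 1 not stored)
    return sign, exponent - 127, fraction

print(decode_float32(1.0))   # (0, 0, 0): +1.0 is 1.0 * 2^0
print(decode_float32(-6.0))  # (1, 2, 4194304): -1.5 * 2^2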
Rationale of the Format
All numbers with sign "+", arranged by size from +0 up to +Inf correspond to all the bit sequences (0 0...0
0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine
numbers, or to find the next smaller or larger machine number.
Biased Exponent
unsigned, interpreted as x - bias
IEEE 754 32-bit: 8-bit exponent, bias is 127
i.e., range of exponent -126...127
-127 is reserved for subnormal and interpreted as -126
128 is reserved for special cases infinity and NaN
Comparing Two Floating-point Number
Comparison with constant 0
equal: all bits 0
less: sign bit set
greater: else (any other bit set)
Comparison between numbers
equal: all bits identical
less: check sign bit
greater: compare bits left to right
Arithmetic
Addition
align radix points
use normal addition
Multiplication
add exponents
multiply significands
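As a small worked illustration (our own numbers): to add $1.01_2 \times 2^2$ and $1.1_2 \times 2^0$, first align the radix points by rewriting the smaller operand with the larger exponent, then add the significands:
$$1.01_2 \times 2^2 + 0.011_2 \times 2^2 = 1.101_2 \times 2^2 = 6.5_{10}$$
For multiplication, the exponents add and the significands multiply: $(1.1_2 \times 2^2) \cdot (1.1_2 \times 2^1) = 10.01_2 \times 2^3$, which is renormalized to $1.001_2 \times 2^4 = 18_{10}$.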
Simplified 8-Bit Model
This section is mostly for examples. If you are interested, please check the slides.
Special Cases
Overflow
Overflow is still possible.
represent as infinity
also for division by zero
Invalid Result - NaN
special cases, like 0/0 or ∞*0 or sqrt(-1)
can safely propagate during computation
Both overflow and NaN can propagate during computation.
This allows us to have no exceptions (like integer division by zero) and no silent errors (like integer overflow).
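A short demonstration of this propagation in Python (floats here are IEEE 754 double precision):

big = 1e308 * 10         # overflow: the result becomes inf
print(big)               # inf
print(big + 1)           # inf: the special value propagates through later operations
print(big - big)         # nan: inf - inf is an invalid operation
print(float("inf") * 0)  # nan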
Underflow - assume no subnormal
gap between 0 and smallest positive number
subtraction result cannot be represented
Subnormal Representation
complexity & overhead
not always used
Representable Numbers
The representable numbers of a representation format are the numbers that can be exactly represented in the
format.
We cannot represent all fractional numbers in the floating-point format. This is a tradeoff of using
integers to represent fractions.
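The classic demonstration in Python: 0.1 has no finite binary expansion, so the nearest representable double stands in for it.

print(0.1 + 0.2)         # 0.30000000000000004: neither operand is exactly representable
print(0.1 + 0.2 == 0.3)  # False
print(0.25 + 0.5)        # 0.75: sums of powers of two are exact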
Floating Point Caveats
Overflow - result too large
Rounding error - result cannot be represented
Underflow - result too small to represent; handled by subnormal representation
NaN - not-a-number
Consequences
The distributive law is not guaranteed (nor are associativity and other familiar identities)! You should pay attention to the order of calculations; a short demonstration follows.
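A small sketch (IEEE 754 doubles) showing that grouping changes results:

# Associativity fails: the same three terms, grouped differently.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6
# A large intermediate value can swallow a small term entirely.
x = 1e16
print((x + 1) - x)        # 0.0: adjacent doubles near 1e16 are 2 apart, so the 1 is lost
print((x - x) + 1)        # 1.0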
Fixed Point Representation
Interpret first N bits as integer, rest as fraction
can use integer arithmetic operations
split can be application-specific
often used for time
similar to, e.g., using nanoseconds instead of seconds
Often used in operating system kernels
faster
no floating point context needed
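A minimal sketch of such a split, assuming a hypothetical 16.16 format (upper 16 bits integer part, lower 16 bits fraction); all arithmetic is plain integer arithmetic:

FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # the value 1.0 in 16.16 fixed point

def to_fixed(x):
    return int(round(x * ONE))   # encode a real number as a 16.16 integer

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS  # integer multiply, then rescale

a = to_fixed(1.5)
b = to_fixed(2.25)
print(fixed_mul(a, b) / ONE)  # 3.375, computed entirely with integer operations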
Non-numerical data types
Bytes
8 bits = 1 byte.
Signed range: -128 to 127
Unsigned range: 0 to 255
Two hexadecimal digits
Word
In computing, a word is the natural unit of data used by a particular processor design. A word is a
fixed-sized datum handled as a unit by the instruction set or the hardware of the processor. The number of
bits or digits in a word (the word size, word width, or word length) is an important characteristic of any
specific processor design or computer architecture.
Endianness can become an issue when moving data external to the computer, as when transmitting data
between different computers, or when a programmer investigates internal computer bytes of data from a
memory dump, and the endianness used differs from expectation. In these cases, the endianness of the data
must be understood and accounted for.
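To make byte order concrete, here is a quick check from Python using the standard struct module to pack the same 32-bit integer both ways:

import struct

n = 0x01020304
print(struct.pack(">I", n).hex())  # '01020304': big-endian, most significant byte first
print(struct.pack("<I", n).hex())  # '04030201': little-endian, least significant byte first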
In electronics, a multiplexer (or mux; spelled sometimes as multiplexor), also known as a data selector, is a
device that selects between several analog or digital input signals and forwards the selected input to a single
output line.
Clock
Clock cycle: beat of computer
Clock signal
Electrical signals propagate fast
but not infinitely fast: remember gate delays
Rising edge called a tick
like a drummer for a dragon boat, keeps things in sync
Cycle Execution
One instruction per clock cycle
fixed cycle length must cover slowest instruction
what about complex instructions?
One instruction – multiple clock cycles
each instruction is divided into subtasks
some instructions take fewer cycles than others
lw takes 5 cycles
add takes 4 cycles
Typical Instruction Cycle
IF: instruction fetch
retrieve the instruction from memory
ID: instruction decode
decode the instruction and load register values needed
sometimes referred to as register read (RR)
EX: execute
operate the ALU
MEM: memory access
access memory
WB: write back
write results back to registers
CPU Clocking
Do work, update state, do work, update state...
Split instruction into pipeline stages
One pipeline stage per clock cycle
MIPS has 5 pipeline stages
IF – Instruction Fetch
Look at address in memory stored in PC
Load the 32-bit value from that address
this is the next instruction
pass this value on to ID
Increment PC by 4
ID – Instruction Decode
Get 32-bit binary instruction from IF
normally a direct conversion from assembly code to machine code
Decode it – MIPS Reference Sheet
what instruction is it?
what registers does it need?
load the needed register values
what immediate values does it need?
if it is a branch:
is the branch condition met?
if so, update PC accordingly
EX – Execute
Get from ID the:
32-bit input register contents
instruction to do
for lw and sw this is an addition of the offset
destination register
Use the ALU to do the math for the instruction
Pass on to MEM the:
32-bit result of the math
destination register and instruction
MEM – Memory Access
Get from EX the:
32-bit result of the math
destination register and instruction
If instruction is lw or sw
load from or store to memory
otherwise do nothing (just pass on values)
Pass on to WB the:
32-bit result, or value loaded from memory (for lw)
destination register and instruction
WB – Write Back
Get from MEM the:
32-bit result of the math or loaded value
destination register and instruction
If instruction is not sw
put result or loaded value into destination register
Performance
Execution Time
elapsed time:
total response time
includes processing, wait times, idle time
perceived performance
CPU time
time spent processing instructions
made up of user time and system time
CPU Clocking
Clock period
also called cycle time
duration of a clock cycle in units of time
SI units of time in seconds per clock cycle
ps = $10^{-12}$ s
ns = $10^{-9}$ s
μs = $10^{-6}$ s
ms = $10^{-3}$ s
example
250 ps = 0.25 ns = $250 \times 10^{-12}$ s
Clock frequency
inverse of clock period
measured in cycles per second: Hertz (Hz)
SI units of Hz:
THz = $10^{12}$ Hz
GHz = $10^{9}$ Hz
MHz = $10^{6}$ Hz
kHz = $10^{3}$ Hz
example
a processor with a clock period of 250 ps = $250 \times 10^{-12}$ s has a clock frequency of $1 / (250 \times 10^{-12}) = 4 \times 10^{9}$ Hz = 4 GHz
different instruction types can take different numbers of cycles
not all ISAs use the 5-stage pipeline
CPI is defined as the average number of clock cycles per instruction for a program or program fragment.
CPI is determined by CPU hardware, and different instructions might require a different number of cycles,
so a single program will have an average CPI.
Instruction count
total number of instructions executed in a program
total clock cycles = instruction count × CPI
Instruction count is determined by program, instruction set, and compiler.
CPU Time
time spent executing instructions in a program
only the instructions that are actually executed (e.g. instructions in loops)
does not include waiting for input or other devices
Performance equation
CPU time = Instruction count × CPI × Clock cycle time
or, CPU time = (Instruction count × CPI) / Clock rate
CPU Time = $\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}}
\times \frac{\text{Seconds}}{\text{Clock cycle}}$
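A quick sanity check of the equation with made-up numbers (all figures below are assumptions for illustration only):

instruction_count = 2_000_000  # assumed: 2 million instructions executed
cpi = 1.5                      # assumed: average clock cycles per instruction
clock_rate = 2e9               # assumed: 2 GHz, i.e. 2e9 cycles per second

cpu_time = instruction_count * cpi / clock_rate
print(cpu_time)  # 0.0015 seconds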
Performance Summary
Performance depends on
algorithm: affects IC, possibly CPI
programming language: affects IC, CPI
compiler: affects IC, CPI
instruction set architecture: affects IC, CPI
hardware: affects clock cycle
Pipelining
Analogy: laundry
wash – dry – fold – put away
Analogy: industrial assembly line
Sequential vs. pipelined execution
Latency vs. Throughput
startup latency for individual operation vs.
overall latency for sequence of operations
Pipeline Speedup
If all stages are balanced
i.e., all take the same time
time between instructions$_\text{pipelined}$ = time between instructions$_\text{serial}$ / # of stages
If stages are not balanced, speedup is less
Speedup due to increased throughput
latency for each instruction unchanged, (maybe) even slowed down a bit
Instruction Set for Pipelining
Constant length instructions (fetch, decode)
Few instruction formats, source fields same
register operands can be fetched while decoding
Memory operands only in load and store
one memory access per instruction
compute address during execute
no separate stage needed
MIPS has all of this
Pipeline Hazards
Instructions are not completely independent
Hazard: condition that blocks pipelined flow
Structural: combination of instruction types
resource is busy
Data: dependency between instructions
need to wait for data read/write
Control: dependency between instructions
control depends on previous instruction
If hazard cannot be resolved
introduce wait stages: called stall or bubble
Structural Hazard
Example: Instruction fetch vs. load/store
Solution: more resources or more time
for example use different memories for instructions and data
We will assume that our MIPS pipeline design avoids structural hazards
Data Hazard
add $s0, $t0, $t1
sub $t2, $s0, $t3
the sub reads $s0, which the add has not yet written back
reorder instructions, if possible
Forwarding / Bypassing
Use result when it is computed
don't wait for it to be stored in register
requires extra connections in the data path
Load-Use Data Hazard
Can't always avoid stalls by forwarding
if value not computed when needed
can't forward backward in time
Forwarding connections diagram: see the slides.
Control Hazard
Conditional branch determines control flow
fetching next instruction depends on outcome
In MIPS pipeline
compare registers and compute result early
add hardware to ID stage
All branch instructions are control hazards
Branch Prediction
Longer pipelines can't determine outcome early
stall penalty becomes more significant
Predict outcome of branch
only stall if prediction is wrong
static and dynamic methods
More Realistic Branch Prediction
Static branch prediction
based on typical branch behaviour at the high level
loop: predict backward branches taken
if: predict forward branches not taken
Dynamic branch prediction
hardware measures actual branch behaviour
e.g., record recent history of each branch
assume future behaviour will continue trend
Pipeline Summary
Pipelining improves performance by increasing instruction throughput
multiple instructions executed in parallel
unchanged latency compared to single cycle models
Subject to hazards
structural (not in MIPS), data, control
Instruction set affects pipeline complexity
RISC vs. CISC
Reduced Instruction Set Computer (RISC) and Complex Instruction Set Computer (CISC)
RISC
Emphasis on software
Single-clock, reduced instruction only
Uniform instruction format
Simple addressing modes
Typically larger code sizes
Few data types in hardware
Direct execution of machine code
Single result write at end
CISC
Emphasis on hardware
Includes multi-cycle complex instructions
Variable length instruction format
Complex addressing and memory access
Small code size
Complex data types in hardware
Hybrid assembly language: microcode
indirect execution
Memory Characteristics
Cost per unit storage
Performance
access latency
throughput
Persistency
Memory Hierarchy
Going down the hierarchy, memory gets
cheaper
bigger
slower
further away from the CPU
Some examples are:
Registers
Cache
Main memory
Disk
Network
Off-site archive (tape, optical disk, etc.)
Registers
Very expensive, thus limited
Access at instruction speed
very low latency
very high throughput
can access in less than a cycle in ID and WB
Not persistent
lose values on power-off
Main Memory
Cheap & large
Noticeable access latency
approximately 100x slower than registers
Not persistent
Random Access Memory (RAM)
send address to memory controller on memory bus
memory controller responds with value
Disk
Hard Disk Drive (HDD) or Solid State Drive (SSD)
Very large storage
hundreds of gigabytes to multiple terabytes
Very large access latency
thousands of times slower than main memory
Persistent
Random Access Memory (RAM)
Array
internally: two-dimensional matrix
Index-based direct access
via memory bus
Fetching memory
CPU sends address to memory controller
controller responds with data
Static RAM (SRAM)
similar to previously shown memory circuit
stable storage, as long as power is applied
multiple transistors per bit
expensive, but faster – used for caches
Dynamic RAM (DRAM)
bit stored in capacitor, needs refreshing
single transistor per bit
cheaper, but slower
Memory Technology
Typical performance and cost figures
as of 2008
ns = nanosecond = 1 billionth ($10^{-9}$) of a second

Technology | Access Time            | \$/GB
SRAM       | 0.5-2.5 ns             | \$2000-\$5000
DRAM       | 50-70 ns               | \$20-\$75
Disk       | 5,000,000-7,000,000 ns | \$0.20-\$2
Ideally: have large amounts of fast memory
Memory Stalls
Delay until response from memory controller
What does the CPU do in the meantime?
compute something else? details later...
Memory compromise: fast vs. cheap
satisfy as many requests as fast as possible
Locality
The set of data used during a certain time period is called the working set
Assumption:
size (working set) << size (all data/memory)
Working set is much smaller than available memory
Locality Principle
Temporal locality
same data item likely to be used again soon
e.g., loop
Spatial locality
close data item likely to be used soon
e.g., iteration through array
Example: books in library and on desk.
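A small illustration of both kinds of locality in code:

data = list(range(1000))

total = 0
for x in data:  # spatial locality: consecutive elements sit next to each other in memory
    total += x  # temporal locality: `total` is reused on every iteration
print(total)    # 499500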
Caching
A cache is a small amount of fast memory used to keep recently used (and nearby) data quickly accessible. It
exploits the locality principle.
Basic challenges
limited fast memory: replacement strategy
shared remote data: invalidation
Data vs. Instruction cache
Typical Cache Types
Memory caching
between cache and main memory
instruction and data cache (cf. structural hazard)
Disk caching
between main memory and disk
Processor caching
between multiple CPUs
Network caching
between local disk and remote server
Terminology
When a memory access happens the CPU checks the cache first:
hit -> data found in the cache
miss -> data not found, go get it from main memory, causing a long pipeline stall
Timing
time cost of direct fetch: hit time
time cost of copy: miss penalty
Block
A collection of sequential bytes in memory
Recall byte-addressable, i.e. each byte has its own address
Memory Caching
Questions to answer:
How should the cache be organized?
where to store (and look for) data?
how to know whether data is present?
how to handle instructions vs. data in memory?
How to manage a finite amount of cache memory?
What happens when data is written to memory?
Cache Miss
On cache hit, CPU proceeds normally
On cache miss
stall the CPU pipeline
fetch block from next level of hierarchy
instruction cache miss
restart instruction fetch
data cache miss
complete data access
Direct-Mapped Cache
Assume M blocks of cache memory
Each block has size B
Request for address p
Mapping for cache block: c = (p/B) mod M
typically all numbers are powers of 2
Example:
32 blocks of memory, 8 blocks of cache
memory block number $p/B = 0 \ldots 31$
$c$ = lowest 3 bits of the memory block number
Cache Entries
Many memory blocks map to same cache block
Add tags to identify which one is present
Previous example: tag is remaining 2 bits
How do we tell if a cache block actually has data in it?
add valid bit
1 (Yes) if data is loaded, 0 (No) if empty
Example
Please check the course notes for the detailed examples.
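For a rough sense of the mapping, here is a sketch of the direct-mapped lookup using the figures from the example above (the block size of 4 bytes is our own assumption):

BLOCK_SIZE = 4  # bytes per block (B); assumed to be a power of two
NUM_BLOCKS = 8  # cache blocks (M)

def cache_slot(p):
    """Map byte address p to (cache index, tag) for a direct-mapped cache."""
    block = p // BLOCK_SIZE     # memory block number p/B
    index = block % NUM_BLOCKS  # c = (p/B) mod M: lowest 3 bits of the block number
    tag = block // NUM_BLOCKS   # remaining upper bits identify which block is cached
    return index, tag

print(cache_slot(0))    # (0, 0)
print(cache_slot(100))  # block 25 -> index 1, tag 3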
Block Size Considerations
Larger blocks should reduce miss rate
due to spatial locality
But in a fixed-sized cache...
larger blocks means fewer of them
more competition might mean increased miss rate
Larger miss penalty
can override benefit from reduced miss rate
early restart and critical-word-first can help
Writing
Write through
update both cache and main memory
each write takes longer
Write back – many updates in same block
only update cache, mark block as dirty
update main memory on eviction (or in background)
Write buffer
dedicated buffer for dirty blocks
Associative Caches
Fully associative
allow a given block to go in any cache entry
requires all entries to be searched at once
hardware comparator per entry (expensive)
n-way set associative
each set contains n entries
block number determines which set
(block number) modulo (# sets in cache)
only search entries in a given set – cheaper
Replacement Policy
Direct mapped: no choice
Associative: prefer empty/invalid entry
Otherwise: evict the least recently used (LRU) entry: a replacement scheme in which the block replaced is the one that has been unused for the longest time.
difficult/costly with increasing associativity
Alternative: random replacement
simple and fast, not too much worse than LRU
It’s a design trade-off.
How Much Associativity
Increased associativity decreases miss rate
but with diminishing return
Cost of content-addressable-memory (CAM)
best price point at limited size
often smaller than desired cache size
Depends on level in hierarchy
Cache Performance
CPU time
program execution cycles, including cache hit time