In a positional numeral system, the radix or base is the number of unique digits, including the digit
zero, used to represent numbers. For example, for the decimal system (the most common system in use today) the radix
is ten, because it uses the ten digits from 0 through 9.
In any standard positional numeral system, a number is conventionally written as $(x)_y$ with $x$ as the string of
digits and $y$ as its base, although for base ten the subscript is usually assumed (and omitted, together with the
pair of parentheses), as it is the most common way to express value. For example, $(100)_{10}$ is equivalent to
$100$ (the decimal system is implied in the latter) and represents the number one hundred, while $(100)_2$ (in the
binary system with base 2) represents the number four.
This representation is unique. Let $b$ be a positive integer greater than 1. Then every positive integer $a$ can be
expressed uniquely in the form
$${\displaystyle a=r_{m}b^{m}+r_{m-1}b^{m-1}+\dotsb +r_{1}b+r_{0},}$$
where $m$ is a nonnegative integer and the $r$'s are integers such that
$0 < r_m < b$ and $0 \leq r_i < b$ for $i = 0, 1, \ldots, m - 1$.
def convert_to_decimal(num, base):
    """Interpret num (an int or string of digits) in the given base and return its decimal value."""
    num = str(num)
    power = len(num) - 1  # place value of the leftmost digit
    decimal = 0
    for digit in num:
        # int(digit, base) also handles letter digits such as 'A' in base 16
        decimal += int(digit, base) * (base ** power)
        power -= 1
    return decimal
# A few checks: 135 (base 10) -> 135, 1440 (base 7) -> 567, 'A32' (base 16) -> 2610.
print(convert_to_decimal(135, 10))    # 135
print(convert_to_decimal(1440, 7))    # 567
print(convert_to_decimal('A32', 16))  # 2610
Recall that every positive integer can be expressed uniquely in the form
$${\displaystyle a=r_{m}b^{m}+r_{m-1}b^{m-1}+\dotsb +r_{1}b+r_{0},}$$
where $m$ is a nonnegative integer and the $r$'s are integers such that
$0 < r_m < b$ and $0 \leq r_i < b$ for $i = 0, 1, \ldots, m - 1$.
The number in base $b$ is written as $r_m r_{m - 1} \dots r_1 r_0$. So, if you are given a decimal number,
you are given the value $a$ and you want to find the individual digits $r_i$ for $i = 0, 1, \ldots, m$.
To do this, you can just divide by $b$ and keep track of the remainder. Let's see what happens:
$$a / b = (r_m b^{m - 1} + r_{m - 1} b^{m - 2} + \dots + r_2 b + r_1) \operatorname{R} r_0$$
Now, the remainder is the $r_0$ value, and the quotient is the $a$ value for the next iteration. So, you can
just keep dividing by $b$ and keeping track of the remainders. Denote the quotient by
$q_1 = r_m b^{m - 1} + r_{m - 1} b^{m - 2} + \dots + r_2 b + r_1$.
$$q_1 / b = (r_m b^{m - 2} + r_{m - 1} b^{m - 3} + \dots + r_2) \operatorname{R} r_1$$
Continue the process until you get a quotient of 0. The remainders are the $r_i$ values in reverse order.
This is the same process as in the example above; a code sketch follows.
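A minimal sketch of this repeated-division algorithm in Python, assuming bases up to 16 (the name convert_from_decimal is our own; it inverts convert_to_decimal above):

def convert_from_decimal(a, base):
    """Return the base-`base` digit string of the nonnegative integer a."""
    digits = "0123456789ABCDEF"  # enough digit symbols for bases up to 16
    result = ""
    while a > 0:
        a, r = divmod(a, base)        # quotient becomes the next a; remainder is a digit
        result = digits[r] + result   # remainders come out in reverse order
    return result or "0"

print(convert_from_decimal(567, 7))    # '1440'
print(convert_from_decimal(2610, 16))  # 'A32'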
Binary Numbers
A binary number is just a special case of the radix representation where the base is 2.
In computer programming, an integer overflow occurs when an arithmetic operation attempts to create a
numeric value that is outside the range that can be represented with a given number of digits, either
higher than the maximum or lower than the minimum representable value.
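Python integers do not overflow, but we can sketch what happens in a fixed-width 8-bit two's-complement register by masking to 8 bits (the helper to_int8 is our own illustration, not a standard function):

def to_int8(x):
    """Reduce x to an 8-bit two's-complement value in -128..127."""
    x &= 0xFF  # keep only the low 8 bits, as 8-bit hardware would
    return x - 256 if x >= 128 else x

print(to_int8(127 + 1))   # -128: overflow wraps past the maximum
print(to_int8(-128 - 1))  # 127: wraps past the minimum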
Scientific notation is a way of expressing numbers that are too large or too small to be conveniently
written in decimal form, since doing so would require writing out an unusually long string of digits.
In scientific notation, nonzero numbers are written in the form
$$m \times 10^n$$
or $m$ times ten raised to the power of $n$, where $n$ is an integer, and the coefficient $m$ is a nonzero real number
(usually between 1 and 10 in absolute value, and nearly always written as a terminating decimal).
Normalized number
A number is normalized when it is written in scientific notation with one nonzero decimal digit
before the decimal point. Thus, a real number written in normalized scientific notation looks as follows:
$${\displaystyle \pm d_{0}.d_{1}d_{2}d_{3}\dots \times 10^{n}}$$
where $n$ is an integer, ${\textstyle d_{0},d_{1},d_{2},d_{3},\ldots ,}$ are the digits of the number in base
10, and ${\displaystyle d_{0}}$ is not zero.
Radix Point
The character used to separate the integer and fractional part of a number in a digital representation of a
number, regardless of the base used.
"decimal" point for base 10.
"binary" point for base 2.
Floating Point vs. Fixed-point Notation
Floating Point
In computing, floating-point arithmetic (FP) is arithmetic that represents real numbers approximately,
using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base.
For example, 12.345 can be represented as a base-ten floating-point number:
$$12.345 = \underbrace{12345}_{\text{significand}} \times {\underbrace{10}_{\text{base}}}^{\overbrace{-3}^{\text{exponent}}}$$
Fixed-point Notation
In computing, fixed-point is a method of representing fractional (non-integer) numbers by storing a fixed
number of digits of their fractional part. Dollar amounts, for example, are often stored with exactly two
fractional digits, representing the cents (1/100 of a dollar); a sketch of this follows.
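A minimal sketch of the dollars-and-cents case, assuming amounts are stored as integer cents (the variable names are our own):

price_cents = 1999  # $19.99 stored as 1999 cents
tax_cents = 5       # $0.05
total_cents = price_cents + tax_cents      # exact integer addition, no rounding error
dollars, cents = divmod(total_cents, 100)  # split back into dollar and cent parts
print(f"${dollars}.{cents:02d}")           # $20.04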
Terminology
$I.F \cdot B^E$
$I$ - integer
$F$ - fraction
$I.F$ - significand (mantissa)
$B$ - base
$E$ - exponent
Notes
binary & normalized - first digit is always 1
base usually implicit - standard is 2
Representation
Standard: IEEE 754
single precision: 32 bits
double precision: 64 bits
quadruple precision: 128 bits
Integer (always 1) not represented
this is called implicit bit
need a way to represent 0
We still need to represent the fraction, the exponent, and the sign.
In order to interpret a floating-point number, you must know the number of bits used for the sign
(usually 1), the exponent part, and the fraction part, as well as the bias.
A normal 32-bit value is interpreted as $(-1)^S \cdot (1 + F) \cdot 2^{E - 127}$; the all-ones exponent field,
as in $(-1)^S \cdot (1 + F) \cdot 2^{255 - 127}$, is reserved for special values such as NaN (Not a Number).
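To make the fields concrete, this sketch decodes a 32-bit IEEE 754 single-precision value with Python's standard struct module (sign 1 bit, exponent 8 bits, fraction 23 bits, bias 127):

import struct

def decode_float32(x):
    """Split a float, stored as IEEE 754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # the raw 32 bits, big-endian
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF  # biased exponent field
    fraction = bits & 0x7FFFFF      # 23 fraction bits (implicit leading 1 not stored)
    return sign, exponent - 127, fraction

print(decode_float32(1.0))   # (0, 0, 0): +1.0 is 1.0 * 2^0
print(decode_float32(-6.0))  # (1, 2, 4194304): -1.5 * 2^2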
Rationale of the Format
All numbers with sign "+", arranged by size from +0 up to +Inf correspond to all the bit sequences (0 0...0
0...0) up to (0 1...1 0...0), arranged as binary integers. Therefore it is easy to compare two machine
numbers, or to find the next smaller or larger machine number.
Biased Exponent
unsigned, interpreted as x - bias
IEEE 754 32-bit: 8-bit exponent, bias is 127
i.e., range of exponent -126...127
-127 is reserved for subnormal and interpreted as -126
128 is reserved for special cases infinity and NaN
Comparing Two Floating-point Number
Comparison with constant 0
equal: all bits 0
less: sign bit set
greater: else (any other bit set)
Comparison between numbers
equal: all bits identical
less: check sign bit
greater: compare bits left to right
Arithmetic
Addition
align radix points
use normal addition
Multiplication
add exponents
multiply significands
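As a small worked illustration (our own numbers): to add $1.01_2 \times 2^2$ and $1.1_2 \times 2^0$, first align the radix points by rewriting the smaller operand with the larger exponent, then add the significands:
$$1.01_2 \times 2^2 + 0.011_2 \times 2^2 = 1.101_2 \times 2^2 = 6.5_{10}$$
For multiplication, the exponents add and the significands multiply: $(1.1_2 \times 2^2) \cdot (1.1_2 \times 2^1) = 10.01_2 \times 2^3$, which is renormalized to $1.001_2 \times 2^4 = 18_{10}$.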
Simplified 8-Bit Model
This section is mostly for examples. If you are interested, please check the slides.
Special Cases
Overflow
Overflow is still possible.
represent as infinity
also for division by zero
Invalid Result - NaN
special cases, like 0/0 or ∞*0 or sqrt(-1)
can safely propagate during computation
Both overflow and NaN can propagate during computation.
This allows us to have no exceptions (like integer division by zero) and no silent errors (like integer overflow).
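A short demonstration of this propagation in Python (floats here are IEEE 754 double precision):

big = 1e308 * 10         # overflow: the result becomes inf
print(big)               # inf
print(big + 1)           # inf: the special value propagates through later operations
print(big - big)         # nan: inf - inf is an invalid operation
print(float("inf") * 0)  # nan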
Underflow - assume no subnormal
gap between 0 and smallest positive number
subtraction result cannot be represented
Subnormal Representation
complexity & overhead
not always used
Representable Numbers
The representable numbers of a representation format are the numbers that can be exactly represented in the
format.
We cannot represent all fractional numbers in the floating-point format. This is a tradeoff of using
integers to represent fractions.
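The classic demonstration in Python: 0.1 has no finite binary expansion, so the nearest representable double stands in for it.

print(0.1 + 0.2)         # 0.30000000000000004: neither operand is exactly representable
print(0.1 + 0.2 == 0.3)  # False
print(0.25 + 0.5)        # 0.75: sums of powers of two are exact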
Floating Point Caveats
Overflow - result too large
Rounding error - result cannot be represented
Underflow - result too small to represent; handled by subnormal representation
NaN - not-a-number
Consequences
The distributive law is not guaranteed (nor are associativity and other familiar identities)! You should pay attention to the order of calculations; a short demonstration follows.
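A small sketch (IEEE 754 doubles) showing that grouping changes results:

# Associativity fails: the same three terms, grouped differently.
print((0.1 + 0.2) + 0.3)  # 0.6000000000000001
print(0.1 + (0.2 + 0.3))  # 0.6
# A large intermediate value can swallow a small term entirely.
x = 1e16
print((x + 1) - x)        # 0.0: adjacent doubles near 1e16 are 2 apart, so the 1 is lost
print((x - x) + 1)        # 1.0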
Fixed Point Representation
Interpret first N bits as integer, rest as fraction
can use integer arithmetic operations
split can be application-specific
often used for time
similar to, e.g., using nanoseconds instead of seconds
Often used in operating system kernels
faster
no floating point context needed
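A minimal sketch of such a split, assuming a hypothetical 16.16 format (upper 16 bits integer part, lower 16 bits fraction); all arithmetic is plain integer arithmetic:

FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # the value 1.0 in 16.16 fixed point

def to_fixed(x):
    return int(round(x * ONE))   # encode a real number as a 16.16 integer

def fixed_mul(a, b):
    return (a * b) >> FRAC_BITS  # integer multiply, then rescale

a = to_fixed(1.5)
b = to_fixed(2.25)
print(fixed_mul(a, b) / ONE)  # 3.375, computed entirely with integer operations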
Non-numerical data types
Bytes
8 bits = 1 byte.
Signed range: -128 to 127
Unsigned range: 0 to 255
Two hexadecimal digits
Word
In computing, a word is the natural unit of data used by a particular processor design. A word is a
fixed-sized datum handled as a unit by the instruction set or the hardware of the processor. The number of
bits or digits in a word (the word size, word width, or word length) is an important characteristic of any
specific processor design or computer architecture.
Endianness can become an issue when moving data external to the computer, as when transmitting data
between different computers, or when a programmer investigates internal computer bytes of data from a
memory dump, and the endianness used differs from expectation. In these cases, the endianness of the data
must be understood and accounted for.
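To make byte order concrete, here is a quick check from Python using the standard struct module to pack the same 32-bit integer both ways:

import struct

n = 0x01020304
print(struct.pack(">I", n).hex())  # '01020304': big-endian, most significant byte first
print(struct.pack("<I", n).hex())  # '04030201': little-endian, least significant byte first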
In electronics, a multiplexer (or mux; spelled sometimes as multiplexor), also known as a data selector, is a
device that selects between several analog or digital input signals and forwards the selected input to a single
output line.
Clock
Clock cycle: beat of computer
Clock signal
Electrical signals propagate fast
but not infinitely fast: remember gate delays
Rising edge called a tick
like a drummer for a dragon boat, keeps things in sync
Cycle Execution
One instruction per clock cycle
fixed cycle length must cover slowest instruction
what about complex instructions?
One instruction – multiple clock cycles
each instruction is divided into subtasks
some instructions take fewer cycles than others
lw takes 5 cycles
add takes 4 cycles
Typical Instruction Cycle
IF: instruction fetch
retrieve the instruction from memory
ID: instruction decode
decode the instruction and load register values needed
sometimes referred to as register read (RR)
EX: execute
operate the ALU
MEM: memory access
access memory
WB: write back
write results back to registers
CPU Clocking
Do work, update state, do work, update state...
Split instruction into pipeline stages
One pipeline stage per clock cycle
MIPS has 5 pipeline stages
IF – Instruction Fetch
Look at address in memory stored in PC
Load the 32-bit value from that address
this is the next instruction
pass this value on to ID
Increment PC by 4
ID – Instruction Decode
Get 32-bit binary instruction from IF
normally a direct conversion from assembly code to machine code
Decode it – MIPS Reference Sheet
what instruction is it?
what registers does it need?
load the needed register values
what immediate values does it need?
if it is a branch:
is the branch condition met?
if so, update PC accordingly
EX – Execute
Get from ID the:
32-bit input register contents
instruction to do
for lw and sw this is an addition of the offset
destination register
Use the ALU to do the math for the instruction
Pass on to MEM the:
32-bit result of the math
destination register and instruction
MEM – Memory Access
Get from EX the:
32-bit result of the math
destination register and instruction
If instruction is lw or sw
load from or store to memory
otherwise do nothing (just pass on values)
Pass on to WB the:
32-bit result, or value loaded from memory (for lw)
destination register and instruction
WB – Write Back
Get from MEM the:
32-bit result of the math or loaded value
destination register and instruction
If instruction is not sw
put result or loaded value into destination register
Performance
Execution Time
elapsed time:
total response time
includes processing, wait times, idle time
perceived performance
CPU time
time spent processing instructions
made up of user time and system time
CPU Clocking
Clock period
also called cycle time
duration of a clock cycle in units of time
SI units of time in seconds per clock cycle
ps = $10^{-12}$ s
ns = $10^{-9}$ s
μs = $10^{-6}$ s
ms = $10^{-3}$ s
example
250 ps = 0.25 ns = $250 \times 10^{-12}$ s
Clock frequency
inverse of clock period
measured in cycles per second: Hertz (Hz)
SI units of Hz:
THz = $10^{12}$ Hz
GHz = $10^{9}$ Hz
MHz = $10^{6}$ Hz
kHz = $10^{3}$ Hz
example
a processor with a clock period of 250 ps = $250 \times 10^{-12}$ s has a clock frequency of $1 / (250 \times 10^{-12}) = 4 \times 10^{9}$ Hz = 4 GHz
different instruction types can take different numbers of cycles
not all ISAs use the 5-stage pipeline
CPI is defined as the average number of clock cycles per instruction for a program or program fragment.
CPI is determined by CPU hardware, and different instructions might require a different number of cycles,
so a single program will have an average CPI.
Instruction count
total number of instructions executed in a program
total clock cycles = instruction count × CPI
Instruction count is determined by program, instruction set, and compiler.
CPU Time
time spent executing instructions in a program
only the instructions that are actually executed (e.g. instructions in loops)
does not include waiting for input or other devices
Performance equation
CPU time = Instruction count × CPI × Clock cycle time
or, CPU time = (Instruction count × CPI) / Clock rate
CPU Time = $\frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}}
\times \frac{\text{Seconds}}{\text{Clock cycle}}$
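A quick sanity check of the equation with made-up numbers (all figures below are assumptions for illustration only):

instruction_count = 2_000_000  # assumed: 2 million instructions executed
cpi = 1.5                      # assumed: average clock cycles per instruction
clock_rate = 2e9               # assumed: 2 GHz, i.e. 2e9 cycles per second

cpu_time = instruction_count * cpi / clock_rate
print(cpu_time)  # 0.0015 seconds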
Performance Summary
Performance depends on
algorithm: affects IC, possibly CPI
programming language: affects IC, CPI
compiler: affects IC, CPI
instruction set architecture: affects IC, CPI
hardware: affects clock cycle
Pipelining
Analogy: laundry
wash – dry – fold – put away
Analogy: industrial assembly line
Sequential vs. pipelined execution
Latency vs. Throughput
startup latency for individual operation vs.
overall latency for sequence of operations
Pipeline Speedup
If all stages are balanced
i.e., all take the same time
time between instructions$_\text{pipelined}$ = time between instructions$_\text{serial}$ / # of stages
If stages are not balanced, speedup is less
Speedup due to increased throughput
latency for each instruction unchanged, (maybe) even slowed down a bit
Instruction Set for Pipelining
Constant length instructions (fetch, decode)
Few instruction formats, source fields same
register operands can be fetched while decoding
Memory operands only in load and store
one memory access per instruction
compute address during execute
no separate stage needed
MIPS has all of this
Pipeline Hazards
Instructions are not completely independent
Hazard: condition that blocks pipelined flow
Structural: combination of instruction types
resource is busy
Data: dependency between instructions
need to wait for data read/write
Control: dependency between instructions
control depends on previous instruction
If hazard cannot be resolved
introduce wait stages: called stall or bubble
Structural Hazard
Example: Instruction fetch vs. load/store
Solution: more resources or more time
for example use different memories for instructions and data
We will assume that our MIPS pipeline design avoids structural hazards
Data Hazard
add $s0, $t0, $t1
sub $t2, $s0, $t3
the sub reads $s0, which the add has not yet written back
reorder instructions, if possible
Forwarding / Bypassing
Use result when it is computed
don't wait for it to be stored in register
requires extra connections in the data path
Load-Use Data Hazard
Can't always avoid stalls by forwarding
if value not computed when needed
can't forward backward in time
Forwarding connections diagram: see the slides.
Control Hazard
Conditional branch determines control flow
fetching next instruction depends on outcome
In MIPS pipeline
compare registers and compute result early
add hardware to ID stage
All branch instructions are control hazards
Branch Prediction
Longer pipelines can't determine outcome early
stall penalty becomes more significant
Predict outcome of branch
only stall if prediction is wrong
static and dynamic methods
More Realistic Branch Prediction
Static branch prediction
based on typical branch behaviour at the high level
loop: predict backward branches taken
if: predict forward branches not taken
Dynamic branch prediction
hardware measures actual branch behaviour
e.g., record recent history of each branch
assume future behaviour will continue trend
Pipeline Summary
Pipelining improves performance by increasing instruction throughput
multiple instructions executed in parallel
unchanged latency compared to single cycle models
Subject to hazards
structural (not in MIPS), data, control
Instruction set affects pipeline complexity
RISC vs. CISC
Reduced Instruction Set Computer (RISC) and Complex Instruction Set Computer (CISC)
RISC
Emphasis on software
Single-clock, reduced instruction only
Uniform instruction format
Simple addressing modes
Typically larger code sizes
Few data types in hardware
Direct execution of machine code
Single result write at end
CISC
Emphasis on hardware
Includes multi-cycle complex instructions
Variable length instruction format
Complex addressing and memory access
Small code size
Complex data types in hardware
Hybrid assembly language: microcode
indirect execution
Memory Characteristics
Cost per unit storage
Performance
access latency
throughput
Persistency
Memory Hierarchy
Going down the hierarchy, memory gets
cheaper
bigger
slower
further away from the CPU
Some examples are:
Registers
Cache
Main memory
Disk
Network
Off-site archive (tape, optical disk, etc.)
Registers
Very expensive, thus limited
Access at instruction speed
very low latency
very high throughput
can access in less than a cycle in ID and WB
Not persistent
lose values on power-off
Main Memory
Cheap & large
Noticeable access latency
approximately 100x slower than registers
Not persistent
Random Access Memory (RAM)
send address to memory controller on memory bus
memory controller responds with value
Disk
Hard Disk Drive (HDD) or Solid State Drive (SSD)
Very large storage
hundreds of gigabytes to multiple terabytes
Very large access latency
thousands of times slower than main memory
Persistent
Random Access Memory (RAM)
Array
internally: two-dimensional matrix
Index-based direct access
via memory bus
Fetching memory
CPU sends address to memory controller
controller responds with data
Static RAM (SRAM)
similar to previously shown memory circuit
stable storage, as long as power is applied
multiple transistors per bit
expensive, but faster – used for caches
Dynamic RAM (DRAM)
bit stored in capacitor, needs refreshing
single transistor per bit
cheaper, but slower
Memory Technology
Typical performance and cost figures
as of 2008
ns = nanosecond = 1 billionth ($10^{-9}$) of a second

Technology | Access Time            | \$/GB
SRAM       | 0.5-2.5 ns             | \$2000-\$5000
DRAM       | 50-70 ns               | \$20-\$75
Disk       | 5,000,000-7,000,000 ns | \$0.20-\$2
Ideally: have large amounts of fast memory
Memory Stalls
Delay until response from memory controller
What does the CPU do in the meantime?
compute something else? details later...
Memory compromise: fast vs. cheap
satisfy as many requests as fast as possible
Locality
The set of data used during a certain time period is called the working set
Assumption:
size (working set) << size (all data/memory)
Working set is much smaller than available memory
Locality Principle
Temporal locality
same data item likely to be used again soon
e.g., loop
Spatial locality
close data item likely to be used soon
e.g., iteration through array
Example: books in library and on desk.
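A small illustration of both kinds of locality in code:

data = list(range(1000))

total = 0
for x in data:  # spatial locality: consecutive elements sit next to each other in memory
    total += x  # temporal locality: `total` is reused on every iteration
print(total)    # 499500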
Caching
A cache is a small amount of fast memory used to keep recently used (and nearby) data quickly accessible. It
exploits the locality principle.
Basic challenges
limited fast memory: replacement strategy
shared remote data: invalidation
Data vs. Instruction cache
Typical Cache Types
Memory caching
between cache and main memory
instruction and data cache (cf. structural hazard)
Disk caching
between main memory and disk
Processor caching
between multiple CPUs
Network caching
between local disk and remote server
Terminology
When a memory access happens the CPU checks the cache first:
hit -> data found in the cache
miss -> data not found, go get it from main memory, causing a long pipeline stall
Timing
time cost of direct fetch: hit time
time cost of copy: miss penalty
Block
A collection of sequential bytes in memory
Recall byte-addressable, i.e. each byte has its own address
Memory Caching
Questions to answer:
How should the cache be organized?
where to store (and look for) data?
how to know whether data is present?
how to handle instructions vs. data in memory?
How to manage a finite amount of cache memory?
What happens when data is written to memory?
Cache Miss
On cache hit, CPU proceeds normally
On cache miss
stall the CPU pipeline
fetch block from next level of hierarchy
instruction cache miss
restart instruction fetch
data cache miss
complete data access
Direct-Mapped Cache
Assume M blocks of cache memory
Each block has size B
Request for address p
Mapping for cache block: c = (p/B) mod M
typically all numbers are powers of 2
Example:
32 blocks of memory, 8 blocks of cache
memory block number $p/B = 0 \ldots 31$
$c$ = lowest 3 bits of the memory block number
Cache Entries
Many memory blocks map to same cache block
Add tags to identify which one is present
Previous example: tag is remaining 2 bits
How do we tell if a cache block actually has data in it?
add valid bit
1 (Yes) if data is loaded, 0 (No) if empty
Example
Please check the course notes for the detailed examples.
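For a rough sense of the mapping, here is a sketch of the direct-mapped lookup using the figures from the example above (the block size of 4 bytes is our own assumption):

BLOCK_SIZE = 4  # bytes per block (B); assumed to be a power of two
NUM_BLOCKS = 8  # cache blocks (M)

def cache_slot(p):
    """Map byte address p to (cache index, tag) for a direct-mapped cache."""
    block = p // BLOCK_SIZE     # memory block number p/B
    index = block % NUM_BLOCKS  # c = (p/B) mod M: lowest 3 bits of the block number
    tag = block // NUM_BLOCKS   # remaining upper bits identify which block is cached
    return index, tag

print(cache_slot(0))    # (0, 0)
print(cache_slot(100))  # block 25 -> index 1, tag 3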
Block Size Considerations
Larger blocks should reduce miss rate
due to spatial locality
But in a fixed-sized cache...
larger blocks means fewer of them
more competition might mean increased miss rate
Larger miss penalty
can override benefit from reduced miss rate
early restart and critical-word-first can help
Writing
Write through
update both cache and main memory
each write takes longer
Write back – many updates in same block
only update cache, mark block as dirty
update main memory on eviction (or in background)
Write buffer
dedicated buffer for dirty blocks
Associative Caches
Fully associative
allow a given block to go in any cache entry
requires all entries to be searched at once
hardware comparator per entry (expensive)
n-way set associative
each set contains n entries
block number determines which set
(block number) modulo (# sets in cache)
only search entries in a given set – cheaper
Replacement Policy
Direct mapped: no choice
Associative: prefer empty/invalid entry
Otherwise: evict the least recently used (LRU) entry: a replacement scheme in which the block replaced is the one that has been unused for the longest time.
difficult/costly with increasing associativity
Alternative: random replacement
simple and fast, not too much worse than LRU
It’s a design trade-off.
How Much Associativity
Increased associativity decreases miss rate
but with diminishing return
Cost of content-addressable-memory (CAM)
best price point at limited size
often smaller than desired cache size
Depends on level in hierarchy
Cache Performance
CPU time
program execution cycles, including cache hit time