Source: people.inf.ethz.ch
Register file size limits GPU scalability
• Register file (RF) already accounts for 60% of on-chip storage
• But there is still demand for more registers to achieve maximum performance and concurrency
[Figure: register file capacity in KB, axis 0 to 1600. Bars: Available Register File; Average Required Register File (2.3x); Maximum Required Register File (5.9x)]
• Future slow memory accesses call for more threads
  • Multi-socket, multi-GPU, RDMA, NVM, etc.
• Compiler optimizations call for more registers per thread
  • Loop unrolling, thread coarsening, etc.
Need mechanisms to expand RF capacity (without large area/power overheads)
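A rough back-of-the-envelope sketch of why both trends above press on register file capacity: more registers per thread leaves room for fewer resident threads, and more threads need a larger file. The 256 KB register file, the 4-byte register width, and the function names are illustrative assumptions, not figures taken from the slides.

```python
def max_resident_threads(rf_bytes, regs_per_thread, bytes_per_reg=4):
    """Number of threads whose registers fit in the register file at once."""
    return rf_bytes // (regs_per_thread * bytes_per_reg)


def required_rf_bytes(threads, regs_per_thread, bytes_per_reg=4):
    """Register file capacity needed to keep `threads` threads resident."""
    return threads * regs_per_thread * bytes_per_reg


RF_BYTES = 256 * 1024  # assumed register file size (hypothetical)

# Doubling registers per thread halves the number of resident threads.
for regs in (32, 64, 128):
    print(regs, "regs/thread ->", max_resident_threads(RF_BYTES, regs), "threads")
```

Under these assumptions, keeping the thread count fixed while doubling registers per thread doubles the capacity that `required_rf_bytes` reports.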
How to make register files larger?
• Emerging technologies [Jing’13][Mao’14][Wang’15][Abdel-Majid’17]
• Register file compression [Lee’15]
• Register file virtualization [Jion’15][Vijaykumar’16][Kloosterman’17]
• Common challenge: Latency overhead
  • Example: 8x larger register file with NTV TFET
[Figure: Normalized IPC (axis 0 to 2) for lavaMD, lbm, leukocyte, myocyte, NN, sad, sgemm, STO, WP, GMEAN. Series: Ideal (no latency overhead) and Real. Annotation: 5.3x slower]
Goal: Tolerate register file latencies
Contributions
• Latency Tolerant Register File (LTRF)
  • “2-level” main register file + register cache
  • Performs prefetch ops while executing other warps
  • Paves the way for several power/area optimizations
• Compiler-driven Register Prefetching
  • Break control flow graph into “prefetch subgraphs”
  • Prefetch registers at the beginning of each subgraph
  • Interval analysis to identify prefetch subgraphs
LTRF tolerates up to 6x slower register files
Example LTRF use case: 8× larger RF, 34.8% higher performance
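The compiler-side idea above (split the control flow graph into single-entry subgraphs, then prefetch each subgraph's registers at its entry) can be sketched with classic interval partitioning. This is a minimal illustration and not necessarily the algorithm LTRF uses: the CFG, the per-block register sets, and the function names are all hypothetical. A real design would also have to keep each subgraph's register set within the register cache capacity, which this sketch does not attempt.

```python
def partition_into_intervals(cfg, entry):
    """Split a CFG into single-entry intervals (classic interval analysis).

    cfg maps each basic block to the list of its successors.
    Returns a dict: header block -> list of blocks in its interval.
    """
    preds = {n: set() for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    intervals = {}
    assigned = set()
    worklist = [entry]
    while worklist:
        header = worklist.pop(0)
        if header in assigned:
            continue
        members = [header]
        assigned.add(header)
        grew = True
        while grew:
            grew = False
            for n in cfg:
                if n in assigned or n == entry:
                    continue
                # Absorb a block only if every predecessor is already inside.
                if preds[n] and preds[n] <= set(members):
                    members.append(n)
                    assigned.add(n)
                    grew = True
        intervals[header] = members
        # Blocks entered from this interval but not absorbed start new intervals.
        for n in cfg:
            if n not in assigned and preds[n] & set(members) and n not in worklist:
                worklist.append(n)
    return intervals


def prefetch_sets(intervals, regs_used):
    """Registers to prefetch at each interval header: union over its blocks."""
    return {
        header: sorted(set().union(*(regs_used[b] for b in blocks)))
        for header, blocks in intervals.items()
    }


# Hypothetical CFG with two loops: B <-> C and D <-> E.
cfg = {"A": ["B"], "B": ["C", "D"], "C": ["B"],
       "D": ["E"], "E": ["D", "F"], "F": []}
regs_used = {"A": {"r0", "r1"}, "B": {"r1", "r2"}, "C": {"r2", "r3"},
             "D": {"r4"}, "E": {"r4", "r5"}, "F": {"r0"}}

intervals = partition_into_intervals(cfg, "A")
prefetch = prefetch_sets(intervals, regs_used)
print(intervals)
print(prefetch)
```

Each loop ends up inside one interval, so under this scheme the prefetch for a loop body is issued once at the loop header instead of on every register access.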
Outline
• Background and challenges
• The case for compiler-driven register prefetching in GPUs
• LTRF architecture and compiler support
• Evaluation methodology
• Results
Register file caching [Gebhart’ISCA11]
• Promising approach for latency tolerant register files
[Figure: Warp Scheduler; Main Register File (multiple banks), Crossbar, Register File Cache (multiple banks), Crossbar, Operand Collector, SIMD Units]
Unfortunately, classic demand fetch and replace yields low hit rate in register caches
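One way to see how a demand-fetched register cache can end up with a low hit rate is a toy LRU simulation in which several warps take turns and each touches its own registers. This is an illustrative sketch under assumed parameters (warp count, registers per warp, cache size, strict round-robin order), not a reproduction of the measurements behind the slide; the function names are hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(trace, capacity):
    """Hit rate of a demand-fetch cache with LRU replacement."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(trace)


def round_robin_trace(num_warps, regs_per_warp, rounds):
    """Each warp in turn reads all of its registers; repeat for `rounds`."""
    return [(w, r)
            for _ in range(rounds)
            for w in range(num_warps)
            for r in range(regs_per_warp)]


trace = round_robin_trace(num_warps=8, regs_per_warp=4, rounds=10)
small = lru_hit_rate(trace, capacity=16)  # cache smaller than the working set
large = lru_hit_rate(trace, capacity=32)  # cache holds every warp's registers
print(small, large)
```

In this setup the 16-entry cache never hits, because each warp's registers are evicted before that warp runs again, while the 32-entry cache misses only on the first round.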