Source: developer.download.nvidia.com
OpenCL™ Programming Guide for the CUDA™ Architecture
Version 4.2
3/9/2012
Table of Contents
Chapter 1. Introduction
1.1 From Graphics Processing to General-Purpose Parallel Computing
1.2 CUDA™: a General-Purpose Parallel Computing Architecture
1.3 A Scalable Programming Model
1.4 Document’s Structure
Chapter 2. OpenCL on the CUDA Architecture
2.1 CUDA Architecture
2.1.1 SIMT Architecture
2.1.2 Hardware Multithreading
2.2 Compilation
2.2.1 PTX
2.2.2 Volatile
2.3 Compute Capability
2.4 Mode Switches
2.5 Matrix Multiplication Example
Chapter 3. Performance Guidelines
3.1 Overall Performance Optimization Strategies
3.2 Maximize Utilization
3.2.1 Application Level
3.2.2 Device Level
3.2.3 Multiprocessor Level
3.3 Maximize Memory Throughput
3.3.1 Data Transfer between Host and Device
3.3.2 Device Memory Accesses
3.3.2.1 Global Memory
3.3.2.2 Local Memory
3.3.2.3 Shared Memory
3.3.2.4 Constant Memory
3.3.2.5 Texture Memory
3.4 Maximize Instruction Throughput
3.4.1 Arithmetic Instructions
3.4.2 Control Flow Instructions
3.4.3 Synchronization Instruction
Appendix A. CUDA-Enabled GPUs
Appendix B. Mathematical Functions Accuracy
B.1 Standard Functions
B.1.1 Single-Precision Floating-Point Functions
B.1.2 Double-Precision Floating-Point Functions
B.2 Native Functions
Appendix C. Compute Capabilities
C.1 Features and Technical Specifications
C.2 Floating-Point Standard
C.3 Compute Capability 1.x
C.3.1 Architecture
C.3.2 Global Memory
C.3.2.1 Devices of Compute Capability 1.0 and 1.1
C.3.2.2 Devices of Compute Capability 1.2 and 1.3
C.3.3 Shared Memory
C.3.3.1 32-Bit Strided Access
C.3.3.2 32-Bit Broadcast Access
C.3.3.3 8-Bit and 16-Bit Access
C.3.3.4 Larger Than 32-Bit Access
C.4 Compute Capability 2.x
C.4.1 Architecture
C.4.2 Global Memory
C.4.3 Shared Memory
C.4.3.1 32-Bit Strided Access
C.4.3.2 Larger Than 32-Bit Access
C.4.4 Constant Memory
C.5 Compute Capability 3.0
C.5.1 Architecture
C.5.2 Global Memory
C.5.3 Shared Memory