Filetype: PDF. File size: 1.54 MB. Source: developer.download.nvidia.com
OpenCL™ Programming Guide for the CUDA™ Architecture
Version 4.2
3/9/2012

Table of Contents

Chapter 1. Introduction ... 7
  1.1 From Graphics Processing to General-Purpose Parallel Computing ... 7
  1.2 CUDA™: a General-Purpose Parallel Computing Architecture ... 9
  1.3 A Scalable Programming Model ... 10
  1.4 Document’s Structure ... 11
Chapter 2. OpenCL on the CUDA Architecture ... 13
  2.1 CUDA Architecture ... 13
    2.1.1 SIMT Architecture ... 15
    2.1.2 Hardware Multithreading ... 16
  2.2 Compilation ... 17
    2.2.1 PTX ... 17
    2.2.2 Volatile ... 17
  2.3 Compute Capability ... 18
  2.4 Mode Switches ... 18
  2.5 Matrix Multiplication Example ... 18
Chapter 3. Performance Guidelines ... 27
  3.1 Overall Performance Optimization Strategies ... 27
  3.2 Maximize Utilization ... 27
    3.2.1 Application Level ... 27
    3.2.2 Device Level ... 28
    3.2.3 Multiprocessor Level ... 28
  3.3 Maximize Memory Throughput ... 30
    3.3.1 Data Transfer between Host and Device ... 31
    3.3.2 Device Memory Accesses ... 32
      3.3.2.1 Global Memory ... 32
      3.3.2.2 Local Memory ... 33
      3.3.2.3 Shared Memory ... 34
      3.3.2.4 Constant Memory ... 34
      3.3.2.5 Texture Memory ... 35
  3.4 Maximize Instruction Throughput ... 35
    3.4.1 Arithmetic Instructions ... 36
    3.4.2 Control Flow Instructions ... 38
    3.4.3 Synchronization Instruction ... 38
Appendix A. CUDA-Enabled GPUs ... 41
Appendix B. Mathematical Functions Accuracy ... 43
  B.1 Standard Functions ... 43
    B.1.1 Single-Precision Floating-Point Functions ... 43
    B.1.2 Double-Precision Floating-Point Functions ... 45
  B.2 Native Functions ... 47
Appendix C. Compute Capabilities ... 49
  C.1 Features and Technical Specifications ... 49
  C.2 Floating-Point Standard ... 51
  C.3 Compute Capability 1.x ... 52
    C.3.1 Architecture ... 52
    C.3.2 Global Memory ... 53
      C.3.2.1 Devices of Compute Capability 1.0 and 1.1 ... 53
      C.3.2.2 Devices of Compute Capability 1.2 and 1.3 ... 54
    C.3.3 Shared Memory ... 54
      C.3.3.1 32-Bit Strided Access ... 54
      C.3.3.2 32-Bit Broadcast Access ... 55
      C.3.3.3 8-Bit and 16-Bit Access ... 55
      C.3.3.4 Larger Than 32-Bit Access ... 55
  C.4 Compute Capability 2.x ... 56
    C.4.1 Architecture ... 56
    C.4.2 Global Memory ... 57
    C.4.3 Shared Memory ... 58
      C.4.3.1 32-Bit Strided Access ... 58
      C.4.3.2 Larger Than 32-Bit Access ... 58
    C.4.4 Constant Memory ... 59
  C.5 Compute Capability 3.0 ... 59
    C.5.1 Architecture ... 59
    C.5.2 Global Memory ... 60
    C.5.3 Shared Memory ... 62