Programming Massively Parallel Processors (大规模并行处理器程序设计). Publication date: July 2010. Publisher: Tsinghua University Press. Authors: David B. Kirk et al. (USA). Pages: 258.
Preface
Mass-market computing systems that combine multicore CPUs and many-core GPUs have brought terascale computing to the laptop and petascale computing to clusters. Armed with such computing power, we are at the dawn of pervasive use of computational experiments for science, engineering, health, and business disciplines. Many will be able to achieve breakthroughs in their disciplines using computational experiments that are of an unprecedented level of scale, controllability, and observability. This book provides a critical ingredient for the vision: teaching parallel programming to millions of graduate and undergraduate students so that computational thinking and parallel programming skills will be as pervasive as calculus.

We started with a course now known as ECE498AL. During the Christmas holiday of 2006, we were frantically working on the lecture slides and lab assignments. David was working the system trying to pull the early GeForce 8800 GTX GPU cards from customer shipments to Illinois, which would not succeed until a few weeks after the semester began. It also became clear that CUDA would not become public until a few weeks after the start of the semester. We had to work out the legal agreements so that we could offer the course to students under NDA for the first few weeks. We also needed to get the word out so that students would sign up, since the course was not announced until after the preenrollment period.
Overview
This book introduces the fundamental concepts of parallel programming and GPU architecture, explores in detail the various techniques used to build parallel programs, and uses case studies to demonstrate the entire development process of a parallel program, from the idea of parallel computing through to the final implementation of a practical and efficient parallel program.

Highlights of this book:
- Introduces computational thinking for parallelism, so that readers can carry this way of framing problems into high-performance parallel computing.
- Introduces the use of CUDA, a software development tool created by NVIDIA specifically for massively parallel environments.
- Shows how to use the CUDA programming model and OpenCL to achieve high performance and high reliability (a minimal kernel sketch follows below).
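As a taste of the programming model described above, here is a minimal CUDA sketch, not taken from the book: a data-parallel kernel in which each thread produces one output element, together with the host-side allocate, copy, launch, and copy-back steps. All names (vecAdd, h_a, d_a, and so on) are illustrative assumptions.

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // discard excess threads
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);  // expect 3.000000

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

The launch configuration rounds the block count up so that every element is covered; the in-kernel bounds check discards the excess threads.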
About the Authors
Authors: David B. Kirk (USA), Wen-mei W. Hwu (USA)
Table of Contents
Preface
Acknowledgments
Dedication
CHAPTER 1 INTRODUCTION
 1.1 GPUs as Parallel Computers
 1.2 Architecture of a Modern GPU
 1.3 Why More Speed or Parallelism?
 1.4 Parallel Programming Languages and Models
 1.5 Overarching Goals
 1.6 Organization of the Book
CHAPTER 2 HISTORY OF GPU COMPUTING
 2.1 Evolution of Graphics Pipelines
  2.1.1 The Era of Fixed-Function Graphics Pipelines
  2.1.2 Evolution of Programmable Real-Time Graphics
  2.1.3 Unified Graphics and Computing Processors
  2.1.4 GPGPU: An Intermediate Step
 2.2 GPU Computing
  2.2.1 Scalable GPUs
  2.2.2 Recent Developments
 2.3 Future Trends
CHAPTER 3 INTRODUCTION TO CUDA
 3.1 Data Parallelism
 3.2 CUDA Program Structure
 3.3 A Matrix-Matrix Multiplication Example
 3.4 Device Memories and Data Transfer
 3.5 Kernel Functions and Threading
 3.6 Summary
  3.6.1 Function Declarations
  3.6.2 Kernel Launch
  3.6.3 Predefined Variables
  3.6.4 Runtime API
CHAPTER 4 CUDA THREADS
 4.1 CUDA Thread Organization
 4.2 Using blockIdx and threadIdx
 4.3 Synchronization and Transparent Scalability
 4.4 Thread Assignment
 4.5 Thread Scheduling and Latency Tolerance
 4.6 Summary
 4.7 Exercises
CHAPTER 5 CUDA™ MEMORIES
 5.1 Importance of Memory Access Efficiency
 5.2 CUDA Device Memory Types
 5.3 A Strategy for Reducing Global Memory Traffic
 5.4 Memory as a Limiting Factor to Parallelism
 5.5 Summary
 5.6 Exercises
CHAPTER 6 PERFORMANCE CONSIDERATIONS
 6.1 More on Thread Execution
 6.2 Global Memory Bandwidth
 6.3 Dynamic Partitioning of SM Resources
 6.4 Data Prefetching
 6.5 Instruction Mix
 6.6 Thread Granularity
 6.7 Measured Performance and Summary
 6.8 Exercises
CHAPTER 7 FLOATING POINT CONSIDERATIONS
 7.1 Floating-Point Format
  7.1.1 Normalized Representation of M
  7.1.2 Excess Encoding of E
 7.2 Representable Numbers
 7.3 Special Bit Patterns and Precision
 7.4 Arithmetic Accuracy and Rounding
 7.5 Algorithm Considerations
 7.6 Summary
 7.7 Exercises
CHAPTER 8 APPLICATION CASE STUDY: ADVANCED MRI RECONSTRUCTION
 8.1 Application Background
 8.2 Iterative Reconstruction
 8.3 Computing FHd
  Step 1. Determine the Kernel Parallelism Structure
  Step 2. Getting Around the Memory Bandwidth Limitation
  Step 3. Using Hardware Trigonometry Functions
  Step 4. Experimental Performance Tuning
 8.4 Final Evaluation
 8.5 Exercises
CHAPTER 9 APPLICATION CASE STUDY: MOLECULAR VISUALIZATION AND ANALYSIS
CHAPTER 10 PARALLEL PROGRAMMING AND COMPUTATIONAL THINKING
CHAPTER 11 A BRIEF INTRODUCTION TO OPENCL™
CHAPTER 12 CONCLUSION AND FUTURE OUTLOOK
APPENDIX A MATRIX MULTIPLICATION HOST-ONLY VERSION SOURCE CODE
APPENDIX B GPU COMPUTE CAPABILITIES
Index
Excerpt
The raster operation (ROP) stage in Figure 2.2 performs the final raster operations on the pixels. It performs color raster operations that blend the color of overlapping/adjacent objects for transparency and antialiasing effects. It also determines the visible objects for a given viewpoint and discards the occluded pixels. A pixel becomes occluded when it is blocked by pixels from other objects according to the given viewpoint.

Figure 2.3 illustrates antialiasing, one of the ROP stage operations. Notice the three adjacent triangles with a black background. In the aliased output, each pixel assumes the color of one of the objects or the background. The limited resolution makes the edges look crooked and the shapes of the objects distorted. The problem is that many pixels are partly in one object and partly in another object or the background. Forcing these pixels to assume the color of one of the objects introduces distortion into the edges of the objects. The antialiasing operation gives each pixel a color that is blended, or linearly combined, from the colors of all the objects and background that partially overlap the pixel. The contribution of each object to the color of the pixel is the amount of the pixel that the object overlaps.

Finally, the frame buffer interface (FBI) stage in Figure 2.1 manages memory reads from and writes to the display frame buffer memory.
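To make the blending rule concrete, here is a minimal CUDA sketch of the coverage-weighted combination the excerpt describes. It assumes per-pixel coverage fractions have already been computed by rasterization and sum to 1 for each pixel; the kernel name antialiasBlend, the flat coverage layout, and the toy scene in main are illustrative assumptions, not the book's code.

#include <cuda_runtime.h>
#include <stdio.h>

// Coverage-weighted blend: each pixel's final color is the sum of each
// overlapping object's color weighted by the fraction of the pixel that
// the object covers. One thread blends one pixel.
__global__ void antialiasBlend(const float3 *objColor,  // one color per object
                               const float *coverage,   // numPixels x numObjs, row-major
                               float3 *pixel, int numPixels, int numObjs) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPixels) return;

    float3 c = make_float3(0.0f, 0.0f, 0.0f);
    for (int o = 0; o < numObjs; ++o) {
        float w = coverage[p * numObjs + o];  // fraction of pixel p covered by object o
        c.x += w * objColor[o].x;
        c.y += w * objColor[o].y;
        c.z += w * objColor[o].z;
    }
    pixel[p] = c;  // blended (linearly combined) pixel color
}

int main(void) {
    // Toy scene: 4 pixels, 2 "objects" (a red triangle and the black background).
    const int numPixels = 4, numObjs = 2;
    float3 h_obj[] = { make_float3(1.0f, 0.0f, 0.0f),    // object 0: red triangle
                       make_float3(0.0f, 0.0f, 0.0f) };  // object 1: black background
    // Coverage fractions per pixel; each row sums to 1.
    float h_cov[] = { 1.00f, 0.00f,   // pixel fully inside the triangle
                      0.60f, 0.40f,   // edge pixel: 60% triangle, 40% background
                      0.25f, 0.75f,   // corner pixel
                      0.00f, 1.00f }; // pure background
    float3 h_pix[numPixels];

    float3 *d_obj, *d_pix;
    float *d_cov;
    cudaMalloc(&d_obj, sizeof(h_obj));
    cudaMalloc(&d_cov, sizeof(h_cov));
    cudaMalloc(&d_pix, sizeof(h_pix));
    cudaMemcpy(d_obj, h_obj, sizeof(h_obj), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cov, h_cov, sizeof(h_cov), cudaMemcpyHostToDevice);

    antialiasBlend<<<1, 64>>>(d_obj, d_cov, d_pix, numPixels, numObjs);
    cudaMemcpy(h_pix, d_pix, sizeof(h_pix), cudaMemcpyDeviceToHost);

    for (int p = 0; p < numPixels; ++p)
        printf("pixel %d red channel: %.2f\n", p, h_pix[p].x);  // 1.00, 0.60, 0.25, 0.00

    cudaFree(d_obj); cudaFree(d_cov); cudaFree(d_pix);
    return 0;
}

Each thread blends exactly one pixel, the same one-thread-per-output-element pattern as in the earlier sketch.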