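David B. Kirk is a member of the U.S. National Academy of Engineering, an NVIDIA Fellow, and the former Chief Scientist of NVIDIA. He led the development of NVIDIA's graphics technology and is one of the founders of CUDA technology. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award in recognition of his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds a PhD in computer science from the California Institute of Technology.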
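Wen-mei W. Hwu is the AMD Jerry Sanders Chair Professor in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign and chief scientist of its parallel computing research center, where he leads the research of the IMPACT group and the CUDA Center of Excellence. He has made outstanding contributions to compiler design, computer architecture, microarchitecture, and parallel computing. He is an IEEE Fellow and an ACM Fellow, and has received numerous awards, including the ACM SIGARCH Maurice Wilkes Award. He is also a cofounder and the CTO of MulticoreWare. He holds a PhD in computer science from the University of California, Berkeley.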
Contents
Preface
Acknowledgements
CHAPTER 1 Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Challenges in Parallel Programming
1.6 Parallel Programming Languages and Models
1.7 Overarching Goals
1.8 Organization of the Book
References
CHAPTER 2 Data Parallel Computing
2.1 Data Parallelism
2.2 CUDA C Program Structure
2.3 A Vector Addition Kernel
2.4 Device Global Memory and Data Transfer
2.5 Kernel Functions and Threading
2.6 Kernel Launch
2.7 Summary
Function Declarations
Kernel Launch
Built-in (Predefined) Variables
Run-time API
2.8 Exercises
References
CHAPTER 3 Scalable Parallel Execution
3.1 CUDA Thread Organization
3.2 Mapping Threads to Multidimensional Data
3.3 Image Blur: A More Complex Kernel
3.4 Synchronization and Transparent Scalability
3.5 Resource Assignment
3.6 Querying Device Properties
3.7 Thread Scheduling and Latency Tolerance
3.8 Summary
3.9 Exercises
CHAPTER 4 Memory and Data Locality
4.1 Importance of Memory Access Efficiency
4.2 Matrix Multiplication
4.3 CUDA Memory Types
4.4 Tiling for Reduced Memory Traffic
4.5 A Tiled Matrix Multiplication Kernel
4.6 Boundary Checks
4.7 Memory as a Limiting Factor to Parallelism
4.8 Summary
4.9 Exercises