Bridging the Gap: Writing Portable Programs for CPU and GPU
Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin. Content: 1 Motivation: Audience etc., CUDA crash course, Quiz time. 2 Patterns: Oldschool ... afterwards ... Why write programs for CPU and GPU. Difference CPU/GPU: algorithms are designed differently, latency/throughput, memory bandwidth, number of ... talk ... Why it makes sense? Library/framework developers, embarrassingly parallel algorithms
124 pages | 4.10 MB | 5 months ago

TVM@Alibaba AI Labs
Alibaba AI Labs & TVM. Content: Part 1: ARM32 CPU; Part 2: HIFI4 DSP; Part 3: PowerVR GPU ... ARM32 CPU: Resolution, Quantization, Kernel, AliOS ... int8 * int8, int32 = int16 + int16 x int8 ... CPU: MTK8167S (ARM32 A35 1.5GHz); Model: MobileNetV2_1.0_224 ...
12 pages | 1.94 MB | 5 months ago

Au Units
Example: "CPU ticks" time units: constexpr uint64_t CPU_CLOCK_HZ = 400'000'000; // API to implement: std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks); ... std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks) { ... 0, CPU_CLOCK_HZ>; return std::chrono::nanoseconds{ num_cpu_ticks * NS_PER_TICK::num / NS_PER_TICK::den }; } ...
191 pages | 22.37 MB | 5 months ago

Taro: Task graph-based Asynchronous Programming Using C++ Coroutine
B1: CPU operation; B2: GPU operation. Existing TGPSs on Heterogeneous Computing - Challenge: tasks A, C, D, B1, B2 ... task_b = sched.emplace([](&){ // CPU code; // GPU code; }); // CPU thread finishes ... Atomic execution per task ... Runtime: CPU runs A, B1, C, then idle; GPU runs D, B2 ... Assume one CPU and one ...
84 pages | 8.82 MB | 5 months ago

PGAS in C++: A Portable Abstraction for Distributed Data Structures
... Advantages of PGAS: Asynchronous; RDMA operations executed by NIC; Allows irregular, one-sided access; Maps well to data structure ops. [Diagram labels: CPU, NIC, DRAM] ...
128 pages | 2.03 MB | 5 months ago

How Meta Made Debugging Async Code Easier with Coroutines and Senders
Walking the stack. [Diagram labels: CPU; ret*, prev*, data; frame*, instr*; process_file; coro::resume; ...] ... le) () at main.cpp:70 ...
131 pages | 907.41 KB | 5 months ago

Oracle VM VirtualBox 5.2.40 User Manual
... 4 Supported host operating systems; 1.5 Host CPU Requirements; 1.6 Installing VirtualBox ... 9.4.2 Guest graphics and mouse driver setup in depth; 9.5 CPU hot-plugging; 9.6 PCI passthrough ... 12.2.4 Frequency scaling effect on CPU usage; 12.2.5 Inaccurate Windows CPU usage reporting; 12.2.6 ...
387 pages | 4.27 MB | 6 months ago

Oracle VM VirtualBox 5.2.12 User Manual
... 4 Supported host operating systems; 1.5 Host CPU Requirements; 1.6 Installing VirtualBox ... 9.4.2 Guest graphics and mouse driver setup in depth; 9.5 CPU hot-plugging; 9.6 PCI passthrough ... 12.2.4 Frequency scaling effect on CPU usage; 12.2.5 Inaccurate Windows CPU usage reporting; 12.2.6 ...
380 pages | 4.23 MB | 6 months ago

Oracle VM VirtualBox 4.3.36 User Manual
... 9.4.2 Guest graphics and mouse driver setup in depth; 9.5 CPU hot-plugging; 9.6 PCI passthrough ... 12.2.4 Frequency scaling effect on CPU usage; 12.2.5 Inaccurate Windows CPU usage reporting; 12.2.6 Windows Vista guests ... 12.3.6 Windows guests may cause a high CPU load; 12.3.7 Long delays when accessing shared folders ...
380 pages | 3.79 MB | 6 months ago

Oracle VM VirtualBox 4.1.40 User Manual
... 9.4.2 Guest graphics and mouse driver setup in depth; 9.5 CPU hot-plugging; 9.6 PCI passthrough ... guests ... 12.3.6 Windows guests may cause a high CPU load; 12.3.7 Long delays when accessing shared folders ... 12.4.1 Linux guests may cause a high CPU load; 12.4.2 AMD Barcelona CPUs ...
310 pages | 4.87 MB | 6 months ago

139 results in total