Bridging the Gap: Writing Portable Programs for CPU and GPU
Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin. Content: 1 Motivation: Audience etc., Cuda crash course, Quiz time. 2 Patterns: Oldschool … afterwards … Motivation, Patterns, The dark path, Cuda proposal, Thank you. Why write programs for CPU and GPU: Difference CPU/GPU: Algorithms are designed differently, Latency/Throughput, Memory bandwidth, Number of … Why it makes sense? Library/Framework developers, Embarrassingly parallel algorithms
0 credits | 124 pages | 4.10 MB | 5 months ago
TVM@Alibaba AI Labs
Alibaba AI Labs & TVM. Content: Part 1: ARM32 CPU. Part 2: HIFI4 DSP. Part 3: PowerVR GPU. ARM 32 CPU: Resolution, Quantization, Orize Kernel ALIOS ent pl 1=int8 int8 * int8 int32 = int16 1 + int16 x int8. CPU: MTK8167S (ARM32 A35 1.5GHz). Model: MobileNetV2_1.0_224. [Chart: only axis tick values survive]
0 credits | 12 pages | 1.94 MB | 5 months ago
Au Units
Example: “CPU ticks” time units. constexpr uint64_t CPU_CLOCK_HZ = 400'000'000; // API to implement: std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks); … std::chrono::nanoseconds elapsed_time(uint64_t num_cpu_ticks) { … 0, CPU_CLOCK_HZ>; return std::chrono::nanoseconds{ num_cpu_ticks * NS_PER_TICK::num / NS_PER_TICK::den }; }
0 credits | 191 pages | 22.37 MB | 5 months ago
Taro: Task graph-based Asynchronous Programming Using C++ Coroutine
Existing TGPSs on Heterogeneous Computing - Challenge. [Task graph labels: A, C, D, B!, B"] B! : CPU operation, B" : GPU operation. task_b = sched.emplace([](&){ // CPU code; // GPU code; }); // CPU thread finishes. Atomic execution per task. [Timeline labels: CPU: A, B!, C, Idle; GPU: D, B"; Runtime] Assume one CPU and one
0 credits | 84 pages | 8.82 MB | 5 months ago
POCOAS in C++: A Portable Abstraction for Distributed Data Structures
Advantages of PGAS - Asynchronous - RDMA operations executed by NIC - Allows irregular, one-sided access - Maps well to data structure ops [Diagram labels: CPU, NIC, DRAM]
0 credits | 128 pages | 2.03 MB | 5 months ago
How Meta Made Debugging Async Code Easier with Coroutines and Senders
Walking the stack. [Diagram labels: CPU; frame*, instr*; ret*, prev*, data; process_file, coro::resume, ...] … le) () at main.cpp:70
0 credits | 131 pages | 907.41 KB | 5 months ago
Conda 24.7.x Documentation
Windows x86 64-bit. macOS Miniconda installer for: macOS with Apple Silicon 64-bit, macOS with Intel CPU 64-bit. Linux Miniconda installer for: Linux x86 64-bit. Homebrew: Run the following Homebrew command: … features present on the system that cannot be managed directly by conda, like system driver versions or CPU features. Virtual packages are not real packages and not displayed by conda list. Instead conda runs … modified during installation on ARM64 macOS (#10260) • Add __archspec virtual package to identify CPU microarchitecture (#9930) • Add __unix and __win virtual packages (#10214) • Add --no-capture-output
0 credits | 808 pages | 4.97 MB | 7 months ago
Conda 23.10.x Documentation
features present on the system that cannot be managed directly by conda, like system driver versions or CPU features. Virtual packages are not real packages and not displayed by conda list. Instead conda runs … modified during installation on ARM64 macOS (#10260) • Add __archspec virtual package to identify CPU microarchitecture (#9930) • Add __unix and __win virtual packages (#10214) • Add --no-capture-output … _to_friendly_bytes(input) _friendly_bytes_to_int(friendly_bytes) _parse_cpu_brand_string(cpu_string) _parse_cpu_brand_string_dx(cpu_string) _parse_dmesg_output(output) _parse_arch(arch_string_raw) _is_bit_set(reg
0 credits | 773 pages | 5.05 MB | 7 months ago
Conda 23.7.x Documentation
features present on the system that cannot be managed directly by conda, like system driver versions or CPU features. Virtual packages are not real packages and not displayed by conda list. Instead conda runs … _to_friendly_bytes(input) _friendly_bytes_to_int(friendly_bytes) _parse_cpu_brand_string(cpu_string) _parse_cpu_brand_string_dx(cpu_string) _parse_dmesg_output(output) _parse_arch(arch_string_raw) _is_bit_set(reg … _get_cpu_info_from_cpuid_actual() Warning! This function has the potential to crash the Python runtime. _get_cpu_info_from_cpuid_subprocess_wrapper(queue) _get_cpu_info_from_cpuid() Returns the CPU info
0 credits | 795 pages | 4.91 MB | 7 months ago
Conda 23.11.x Documentation
features present on the system that cannot be managed directly by conda, like system driver versions or CPU features. Virtual packages are not real packages and not displayed by conda list. Instead conda runs … modified during installation on ARM64 macOS (#10260) • Add __archspec virtual package to identify CPU microarchitecture (#9930) • Add __unix and __win virtual packages (#10214) • Add --no-capture-output … _to_friendly_bytes(input) _friendly_bytes_to_int(friendly_bytes) _parse_cpu_brand_string(cpu_string) _parse_cpu_brand_string_dx(cpu_string) _parse_dmesg_output(output) _parse_arch(arch_string_raw) _is_bit_set(reg
0 credits | 781 pages | 4.79 MB | 7 months ago
242 results in total