Working with Asynchrony Generically: A Tour of C++ Executors
extern unifex::static_thread_pool low_latency;
extern unifex::static_thread_pool workers;
ex::sender auto accept_and_process_requests() {
  return ex::on(low_latency.get_scheduler(), accept_request())
       […] request) { process_request(request); })
       | unifex::repeat_effect();
}
Accept requests on low-latency threads. Process the requests on the worker threads.
EXAMPLE: TRANSITIONING EXECUTION CONTEXT
unifex::static_thread_pool low_latency;
extern unifex::static_thread_pool workers;
unifex::task<…> accept_and_process_requests() {
  while (true) {
    auto request = co_await ex::on(low_latency.get_scheduler(), …
0 credits | 121 pages | 7.73 MB | 5 months ago
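For context, a minimal sketch of the coroutine variant that the truncated snippet appears to show, assuming libunifex (unifex::task, unifex::schedule, unifex::static_thread_pool) with `ex` as the talk's namespace alias. request_t, accept_request and process_request are stand-ins, and the hop to the worker pool via ex::schedule is an assumption about the elided part of the slide, not the talk's exact code.

#include <unifex/task.hpp>
#include <unifex/on.hpp>
#include <unifex/just.hpp>
#include <unifex/scheduler_concepts.hpp>   // unifex::schedule
#include <unifex/static_thread_pool.hpp>

namespace ex = unifex;

struct request_t {};                                          // stand-in request type
ex::static_thread_pool low_latency;                           // accepts connections
ex::static_thread_pool workers;                               // does the heavy processing

auto accept_request() { return ex::just(request_t{}); }       // placeholder sender
void process_request(request_t) { /* handle one request */ }

ex::task<void> accept_and_process_requests() {
  while (true) {
    // Accept the next request on the low-latency pool...
    auto request = co_await ex::on(low_latency.get_scheduler(), accept_request());
    // ...then transition execution context to the worker pool before processing it.
    co_await ex::schedule(workers.get_scheduler());
    process_request(request);
  }
}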
Lock-Free Atomic Shared Pointers Without a Split Reference Count? It Can Be Done!
multithreaded/lock-free code is hard… There are many factors to consider:
• Measurement: throughput vs latency?
• Workload: proportion of reads vs writes
• Hotness: does the data fit in cache?
• Contention: …
… reclamation to get performance that is always competitive with both? More work on optimizing for low latency (see the GitHub, there is some preliminary work!). Thanks to Guy Blelloch and Hao Wei for their collaboration on some of this work. (Daniel Anderson -- danielanderson.net)
Bonus content, Latency: void retire(T* p) indicates that an object has been removed…
0 credits | 45 pages | 5.12 MB | 5 months ago
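For orientation, the building block the title refers to is an atomically updatable shared pointer. Below is a minimal usage sketch with the standard C++20 std::atomic<std::shared_ptr<T>> specialization, which is typically implemented with a lock; the talk's point is that a lock-free version is possible without a split reference count. The names are illustrative, not from the slides.

#include <atomic>
#include <memory>
#include <thread>

struct Config { int max_connections = 8; };

// Readers and writers share one mutable pointer to immutable data.
std::atomic<std::shared_ptr<const Config>> current_config{std::make_shared<const Config>()};

void reader() {
    // load() takes a snapshot; the object stays alive for as long as the snapshot is held.
    std::shared_ptr<const Config> snapshot = current_config.load();
    (void)snapshot->max_connections;
}

void writer() {
    // store() publishes a new version; the old one is reclaimed when its last reader drops it.
    current_config.store(std::make_shared<const Config>(Config{16}));
}

int main() {
    std::thread r(reader), w(writer);
    r.join();
    w.join();
}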
Accelerating Tokio with Hardware - Dai Xiang (戴翔)
[Diagram] Software enqueue (one producer feeding multiple consumers) pays for synchronization latency, memory/cache latency, and CPU-cycle latency. DLB (Dynamic Load Balance) moves the enqueue logic, head/tail pointers, and load balancing into hardware: no synchronization latency, no memory/cache latency, no CPU cycles spent. DLB-assisted channel intro: hardware-assisted senders and receivers.
0 credits | 17 pages | 1.66 MB | 1 year ago
C++ High-Performance Parallel Programming and Optimization - Lecture Slides - 08 GPU Programming with CUDA
… bytes). https://developer.download.nvidia.cn/CUDA/training/register_spilling.pdf
Too few threads in a block: latency hiding breaks down.
• As noted earlier, an SM executes only one warp of a block at a time, i.e. 32 threads.
• When a warp stalls waiting on memory, the SM can switch to another warp and keep computing; this way a … (bank conflict)
Summary of GPU optimization techniques:
• Warp divergence: keep all 32 threads of a warp on the same branch whenever possible, otherwise both branches get executed.
• Latency hiding: blockDim must be large enough that the SM has other warps to schedule while one is stalled on memory.
• Register spill: if a kernel uses too many local variables (registers), then …
0 credits | 142 pages | 13.52 MB | 1 year ago
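To make the latency-hiding bullet concrete, a back-of-the-envelope C++ calculation; the per-SM limits (64 resident warps, 16 resident blocks) are illustrative assumptions, not numbers from the slides. With 32-thread blocks the SM caps out at 16 resident warps, leaving it little to switch to when a warp stalls, while 256-thread blocks fill all 64 warp slots.

#include <algorithm>
#include <cstdio>

// How many warps can one SM keep resident, given the block size?
// max_warps_per_sm and max_blocks_per_sm are assumed, illustrative limits.
int resident_warps(int block_dim, int max_warps_per_sm = 64, int max_blocks_per_sm = 16) {
    int warps_per_block = (block_dim + 31) / 32;
    int blocks = std::min(max_blocks_per_sm, max_warps_per_sm / warps_per_block);
    return blocks * warps_per_block;
}

int main() {
    std::printf("blockDim = 32  -> %d resident warps\n", resident_warps(32));   // 16: little latency hiding
    std::printf("blockDim = 256 -> %d resident warps\n", resident_warps(256));  // 64: plenty of warps to switch to
}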
Bypassing conntrack: Enhancing IPVS with eBPF to Optimize Kubernetes Network Performance
Measurement: test topology and test results.
Service type   Short-connection CPS   Short-connection P99 latency   Long-connection PPS
ClusterIP      +40%                   -31%                           not available
NodePort       +64%                   -47%                           +22%
0 credits | 24 pages | 1.90 MB | 1 year ago
hazard pointer synchronous reclamation
… Michael. Watch the CppCon 2021 talk on Concurrency TS2: "The Upcoming Concurrency TS Version 2 for Low-Latency and Lockless Synchronization" (with Paul McKenney and Michael Wong).
0 credits | 31 pages | 856.38 KB | 5 months ago
C++ High-Performance Parallel Programming and Optimization - Lecture Slides - 07 Memory Access Optimization Explained
… cache miss • false sharing • prefetching • write-through / streaming stores • latency hiding • stalling (spinning idle) while waiting on memory • loop tiling / loop blocking • register blocking (unroll-and-jam)
0 credits | 147 pages | 18.88 MB | 1 year ago
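As a concrete illustration of one of these terms, a small self-contained C++ sketch of the textbook fix for false sharing: giving each thread's counter its own cache line. The 64-byte line size is an assumption, and the example is illustrative rather than taken from the slides.

#include <cstdint>
#include <thread>

// Without padding, both counters would sit on one cache line and every increment
// would bounce that line between the two cores (false sharing). alignas(64) gives
// each counter its own line; 64 bytes is the assumed cache-line size.
struct PaddedCounter {
    alignas(64) std::uint64_t value = 0;
};

PaddedCounter counters[2];

void work(int id) {
    for (int i = 0; i < 10'000'000; ++i)
        counters[id].value += 1;   // touches only this thread's cache line
}

int main() {
    std::thread a(work, 0), b(work, 1);
    a.join();
    b.join();
}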
7 results in total