Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions, predict their effects, and decide which Action • Speculative: small changes at one operator at a time • Predictive: at-once for all operators 8 ??? Vasiliki Kalavri | Boston University 2020 Queuing theory models 9 • Metrics • service single-server queuing system • generalized Jackson networks • Action • predictive, at-once for all operators ??? Vasiliki Kalavri | Boston University 2020 Queuing theory models 9 • Metrics • service time0 码力 | 93 页 | 2.42 MB | 1 年前3Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
computation rather than the execution flow. • Imperative languages are used to describe plans of operators the streams must flow through. • Pattern-based languages specify conditions and actions to be from the matches. Language Types 3 Vasiliki Kalavri | Boston University 2020 Three classes of operators: • relation-to-relation: similar to standard SQL and define queries over tables. • stream-to-relation: relation-to-stream CQL Example 5 Vasiliki Kalavri | Boston University 2020 CQL relation-to-stream operators • Istream (for “insert stream”) applied to relation R contains a stream elementwhenever0 码力 | 53 页 | 532.37 KB | 1 年前3Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
basics 4 Dataflow graph • operators are nodes, data channels are edges • channels have FIFO semantics • streams of data elements flow continuously along edges Operators • receive one or more input <#Brexit, 521> <#WorldCup, 480> <#StarWars, 300> <#Brexit> <#Brexit, 521> Stateful operators 5 • Stateful operators maintain state that reflect part of the stream history they have seen • windows a query • There may exist several ways to execute a computation • query plans, e.g. order of operators • scheduling and placement decisions • different algorithms, e.g. hash-based vs. broadcast join0 码力 | 54 页 | 2.83 MB | 1 年前3Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
src_port, dest_IP, dest_port, size> • Derived stream: produced by a continuous query and its operators, e.g. total traffic from a source every minutepacket generation dataflow systems • Computations as Directed Acyclic Graphs (DAGs) • nodes are operators and edges are data channels • operators can accumulate state, have multiple inputs, express event- time custom window-based dataflows and iterations on streams • Operators are data-parallel • distributed workers (threads) execute one parallel instance of one of more operators on disjoint data partitions 36 Vasiliki 0 码力 | 45 页 | 1.22 MB | 1 年前3Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions, predict their effects, and decide which Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions, predict their effects, and decide which Detect environment changes: external workload and system performance • Identify bottleneck operators, straggler workers, skew • Enumerate scaling actions, predict their effects, and decide which0 码力 | 41 页 | 4.09 MB | 1 年前3High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Upstream Backup Upstream nodes act as backups for their downstream operators by logging tuples in their output queues until downstream operators have completely processed them. 15 Vasiliki Kalavri | Boston Upstream Backup Upstream nodes act as backups for their downstream operators by logging tuples in their output queues until downstream operators have completely processed them. 15 periodically acknowledge Upstream Backup Upstream nodes act as backups for their downstream operators by logging tuples in their output queues until downstream operators have completely processed them. 15 periodically acknowledge0 码力 | 49 页 | 2.08 MB | 1 年前3Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
input rates • Where in the query plan? • dropping at the sources vs. dropping at bottleneck operators • How much load to shed? • enough for the system to keep-up • Which tuples to drop? • improve many • The question of where is equivalent to placing special drop operators in the best positions in the query plan • Drop operators can be placed at any location in the query plan • Dropping near importance of tuples with respect to results quality • Drop at random: • Insert random sampling operators in the query plan, parametrized with a sampling rate • The rate defines the probability to discard0 码力 | 43 页 | 2.42 MB | 1 年前3监控Apache Flink应用程序(入门)
Flink应用程序(入门) 进度和吞吐量监控 – 10 3 https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/operators/#task-chaining-and-resource-groups 4 进度和吞吐量监控 知道您的应用程序正在运行并且检查点正常工作是件好事,但是它并不能告诉您应用程序是否正在实际取得进 recordsOutPerSecond = 0 (for a non-Sink operator) 请注意:目前由于metrics体系只考虑Flink的内部通信,所以source operators的输入记录数是0,而sink operators的输出记录数也是0. caolei – 监控Apache Flink应用程序(入门) 进度和吞吐量监控 – 12 4.5 进度 对于使用事件时间语义的应 那些时间戳小于t的operations将会被触发的触发。 例如,当watermarks超过30时,结束于t = 30的事件时间窗口将被关闭并计算。 因此,您应该在应用程序中对事件时间敏感的operators(如流程函数和窗口)上监控watermarks。如果当前处理 时间与被称为 even-time skew的watermarks之间的差异非常高,那么它通常意味着可能会出现两种情况。首 先,它可0 码力 | 23 页 | 148.62 KB | 1 年前3Scalable Stream Processing - Spark Streaming and Flink
in Spark • RDD re-computation ▶ Fault tolerance in Storm • Tracks records with unique IDs. • Operators send acks when a record has been processed. • Records are dropped from the backup when the have in Spark • RDD re-computation ▶ Fault tolerance in Storm • Tracks records with unique IDs. • Operators send acks when a record has been processed. • Records are dropped from the backup when the have in Spark • RDD re-computation ▶ Fault tolerance in Storm • Tracks records with unique IDs. • Operators send acks when a record has been processed. • Records are dropped from the backup when the have0 码力 | 113 页 | 1.22 MB | 1 年前3Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
things • last 5 sec • last 10 events • last 1h every 10 min • last user session Window operators 2 Vasiliki Kalavri | Boston University 2020 object MaxSensorReadings { def main(args: Array[String]) IngestionTime Vasiliki Kalavri | Boston University 2020 Window operators can be applied on a keyed or a non-keyed stream: • Window operators on keyed windows are evaluated in parallel • Non-keyed windows0 码力 | 35 页 | 444.84 KB | 1 年前3
共 17 条
- 1
- 2