Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/04: Streaming languages and operator semantics Vasiliki Kalavri | Boston University 2020 Vasiliki Kalavri | Boston University 2020 Kalavri | Boston University 2020 Streaming Operators 9 Vasiliki Kalavri | Boston University 2020 Operator types (I) • Single-Item Operators process stream elements one-by-one. • selection, filtering Consider events from stream S1 and stream S2 11 Vasiliki Kalavri | Boston University 2020 Operator types (II) • Sequence Operators capture the arrival of an ordered set of events. • common in0 码力 | 53 页 | 532.37 KB | 1 年前3Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
processing optimizations ??? Vasiliki Kalavri | Boston University 2020 2 • Costs of streaming operator execution • state, parallelism, selectivity • Dataflow optimizations • plan translation alternatives Vasiliki Kalavri | Boston University 2020 Operator selectivity 6 • The number of output elements produced per number of input elements • a map operator has a selectivity of 1, i.e. it produces one output element for each input element it processes • an operator that tokenizes sentences into words has selectivity > 1 • a filter operator typically has selectivity < 1 Is selectivity always known0 码力 | 54 页 | 2.83 MB | 1 年前3Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
dataflow with sources S1, S2, … Sn and rates λ1, λ2, … λn identify the minimum parallelism πi per operator i, such that the physical dataflow can sustain all source rates. S1 S2 λ1 λ2 S1 S2 π=2 => scale out • Analytical dataflow-based models Action • Speculative: small changes at one operator at a time • Predictive: at-once for all operators 8 ??? Vasiliki Kalavri | Boston University per task • total time spent processing a tuple and all its derived results • Policy • each operator as a single-server queuing system • generalized Jackson networks • Action • predictive, at-once0 码力 | 93 页 | 2.42 MB | 1 年前3监控Apache Flink应用程序(入门)
展并与上游系统保持同步。 4.1 吞吐量 Flink提供了多个metrics来衡量应用程序的吞吐量。对于每个operator或task(请记住:一个task可以包含多个 chained-task3),Flink会对进出系统的记录和字节进行计数。在这些metrics中,每个operator输出记录的速率 通常是最直观和最容易理解的。 4.2 关键指标 Metric Scope Description numRecordsOutPerSecond task The number of records this operator/task sends per second. numRecordsOutPerSecond operator The number of records this operator sends per second. caolei – 监控Apache Flink应用程序(入门) 进度和吞吐量监控 – 11 4.3 仪表盘示例 Figure 3: Mean Records Out per Second per Operator 4.4 可能的报警条件 • recordsOutPerSecond = 0 (for a non-Sink operator) 请注意:目前由于metrics体系只考虑Flink的内部通信,所以source operators的输入记录数是0,而sink0 码力 | 23 页 | 148.62 KB | 1 年前3State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• Users define state using arbitrary types • The system is unaware of which parts of an operator constitute state Streaming state 3 • Explicit state primitives including state types and interfaces results: a local or instance variable that is accessed by a task’s business logic Operator state is scoped to an operator task, i.e. records processed by the same parallel task have access to the same state scoped to a key defined in the operator’s input records • Flink maintains one state instance per key value and partitions all records with the same key to the operator task that maintains the state for0 码力 | 24 页 | 914.13 KB | 1 年前3Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
materialized views. • An operator outputs event streams that describe the changing view computed over the input stream according to the relational semantics of the operator. 19 Vasiliki Kalavri | Boston materialized views. • An operator outputs event streams that describe the changing view computed over the input stream according to the relational semantics of the operator. src dest bytes 1 2 20K materialized views. • An operator outputs event streams that describe the changing view computed over the input stream according to the relational semantics of the operator. Results as continuously0 码力 | 45 页 | 1.22 MB | 1 年前3High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020
t3 t’2 t’3 t4 … Rollback-divergent t1 t2 t3 t’2 t’3 t’4 … The output semantics depend on the operator type: • arbitrary: it depends on order, randomness, or external system • deterministic: it produces result semantics 11 sum 4 3 3 9 6 … 5 6 1 9 sum 6 5 3 10 6 … 7 8 1 10 Can you think of an operator that provides correct, possibly repeating, results even if it re-processes tuples after recovery … 7 8 1 10 Can you think of an operator that provides correct, possibly repeating, results even if it re-processes tuples after recovery? Can you think of an operator that will converge to the correct0 码力 | 49 页 | 2.08 MB | 1 年前3Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
input rates and periodically estimates operator selectivities. • The load shedder assigns a cost, ci, in cycles per tuple, and a selectivity, si, to each operator i. • The statistics manager collects records does the operator produce per record in its input? • map: 1 in 1 out • filter: 1 in, 1 or 0 out • flatMap, join: 1 in 0, 1, or more out • Cost: how many records can an operator process in a • The rate defines the probability to discard a tuple and is computed based on statistics and operator selectivity • The optimization objective is to achieve the highest possible accuracy given the0 码力 | 43 页 | 2.42 MB | 1 年前3Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020
in to save resources • Fix bugs or change business logic • Optimize execution plan • Change operator placement • skew and straggler mitigation • Migrate to a different cluster or software version are scaled by repartitioning keys • Operators with operator list state are scaled by redistributing the list entries. • Operators with operator broadcast state are scaled up by copying the state to range boundaries. • The maximum parallelism parameter of an operator defines the number of key groups into which the keyed state of the operator is split. • The number of key groups limits the maximum0 码力 | 41 页 | 4.09 MB | 1 年前3Apache Flink的过去、现在和未来
JVM Cloud GCE, EC2 Cluster Standalone, YARN DataStream Physical 统一 Operator 抽象 Pull-based operator Push-based operator 算子可自定义读取顺序 Table API & SQL 1.9 新特性 全新的 SQL 类型系统 DDL 初步支持 Table API0 码力 | 33 页 | 3.36 MB | 1 年前3
共 17 条
- 1
- 2