Scalable Stream Processing - Spark Streaming and Flink
Spark Streaming ▶ Run a streaming computation as a series of small, deterministic batch jobs. • Chops the live stream into batches of X seconds. • Treats each batch as RDDs and processes them using RDD operations. • Discretized Stream Processing (DStream). ▶ DStream: a sequence of RDDs.
0 points | 113 pages | 1.22 MB | 1 year ago
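The snippet above describes Spark Streaming's micro-batch model. A minimal sketch of that idea, assuming a PySpark installation and a hypothetical socket source on localhost:9999; each 5-second batch is handled with ordinary RDD-style operations:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WordCount")
# chop the live stream into batches of 5 seconds
ssc = StreamingContext(sc, batchDuration=5)

# hypothetical source: text lines arriving on a local socket
lines = ssc.socketTextStream("localhost", 9999)

# each batch is an RDD and is processed with RDD operations
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```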
PyFlink 1.15 Documentation
… users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & SQL [20]: from pyflink.table.udf import udf … sql_query("SELECT plus_one(id) FROM {}".format(table)).to_pandas(). Another example is UDFs used in Row-based Operations [23]: from pyflink.common.types import Row … @udf(result_type=DataTypes.ROW([DataTypes.FIELD("id", … QueryOperationConverter$SingleRelVisitor.visit(QueryOperationConverter.java:154) at org.apache.flink.table.operations.CatalogQueryOperation.accept(CatalogQueryOperation.java:68) at org.apache.flink.table.planner …
0 points | 36 pages | 266.77 KB | 1 year ago
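To make the fragmentary UDF snippet above concrete, here is a minimal, self-contained sketch of a scalar Python UDF registered and used from SQL. The `plus_one` function and the `sql_query(...).to_pandas()` call follow the snippet; the table name `tab` and the surrounding environment setup are assumptions based on the standard PyFlink Table API:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# scalar Python UDF: adds one to its argument
@udf(result_type=DataTypes.BIGINT())
def plus_one(i):
    return i + 1

t_env.create_temporary_function("plus_one", plus_one)

table = t_env.from_elements([(1,), (2,), (3,)], ['id'])
t_env.create_temporary_view("tab", table)

# call the Python UDF from SQL and pull the result into pandas
print(t_env.sql_query("SELECT plus_one(id) FROM tab").to_pandas())
```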
PyFlink 1.16 Documentation
… users to execute Python native functions. See also the latest User-defined Functions and Row-based Operations. The first example is UDFs used in Table API & SQL [20]: from pyflink.table.udf import udf … sql_query("SELECT plus_one(id) FROM {}".format(table)).to_pandas(). Another example is UDFs used in Row-based Operations [23]: from pyflink.common.types import Row … @udf(result_type=DataTypes.ROW([DataTypes.FIELD("id", … QueryOperationConverter$SingleRelVisitor.visit(QueryOperationConverter.java:154) at org.apache.flink.table.operations.CatalogQueryOperation.accept(CatalogQueryOperation.java:68) at org.apache.flink.table.planner …
0 points | 36 pages | 266.80 KB | 1 year ago
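The second fragment in both PyFlink entries refers to row-based operations, where a UDF returns a `Row`. A minimal sketch, assuming the Table API `map` operation and a hypothetical input table with `id` and `name` columns:

```python
from pyflink.common.types import Row
from pyflink.table import EnvironmentSettings, TableEnvironment, DataTypes
from pyflink.table.udf import udf

t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())
table = t_env.from_elements([(1, 'Hi'), (2, 'Hello')], ['id', 'name'])

# UDF returning a Row, as used in row-based operations such as map()
@udf(result_type=DataTypes.ROW(
    [DataTypes.FIELD("id", DataTypes.BIGINT()),
     DataTypes.FIELD("name", DataTypes.STRING())]))
def transform(r: Row) -> Row:
    return Row(r.id + 1, r.name.lower())

print(table.map(transform).to_pandas())
```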
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Use equivalence transformation rules if the language allows: • selection operations are commutative • theta-join operations are commutative • natural joins are associative • move projections early … differ on a combined stream vs. on separate streams. Redundancy elimination: eliminate redundant operations, aka subgraph sharing. … • remove a no-op, e.g. a projection that keeps all attributes • remove idempotent operations, e.g. two selections on the same predicate • remove a dead subgraph, i.e. one that never produces …
0 points | 54 pages | 2.83 MB | 1 year ago
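As an illustration of the equivalence rules listed above, the following plain-Python sketch (all names are illustrative only) shows that two selections commute: applying them in either order over the same event stream yields the same result, which is what allows an optimizer to reorder them:

```python
events = [
    {"sensor": "s1", "temp": 21.5},
    {"sensor": "s2", "temp": 35.0},
    {"sensor": "s1", "temp": 40.2},
]

# two selection predicates
def hot(e):       return e["temp"] > 30.0
def sensor_s1(e): return e["sensor"] == "s1"

# sigma_hot(sigma_s1(S)) == sigma_s1(sigma_hot(S)): selections commute
plan_a = [e for e in events if sensor_s1(e) if hot(e)]
plan_b = [e for e in events if hot(e) if sensor_s1(e)]
assert plan_a == plan_b
print(plan_a)  # [{'sensor': 's1', 'temp': 40.2}]
```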
State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
checkpoint, restore, merge, split, query, subscribe, … State operations and types: consider you are designing a state interface. What operations should state support? What state types can you think of? … a Flink program. • The keys are ordered according to a user-specified comparator function. Basic operations: • Get(key): fetch a single key-value pair from the DB • Put(key, val): insert a single key-value pair …
0 points | 24 pages | 914.13 KB | 1 year ago
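A toy sketch of the key-value interface described in that excerpt: a store that supports Get and Put and iterates keys in a user-supplied order. This is an in-memory stand-in for illustration, not the actual RocksDB-backed state store:

```python
from typing import Callable, Dict, Iterator, Optional, Tuple

class KVStateStore:
    """Minimal key-value store with Get/Put and ordered key iteration."""

    def __init__(self, comparator: Callable[[str], object] = lambda k: k):
        self._data: Dict[str, bytes] = {}
        self._comparator = comparator  # user-specified key ordering

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value        # insert a single key-value pair

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)     # fetch a single key-value pair

    def scan(self) -> Iterator[Tuple[str, bytes]]:
        for key in sorted(self._data, key=self._comparator):
            yield key, self._data[key]

store = KVStateStore()
store.put("user:2", b"7 clicks")
store.put("user:1", b"3 clicks")
print(store.get("user:1"))   # b'3 clicks'
print(list(store.scan()))    # pairs ordered by key
```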
Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• Windows are a practical way to perform operations on unbounded input, e.g. joins, holistic aggregates • compute on the most recent events only … onTimer() example … For low-level operations on two inputs: • one transformation method for each input, processElement1() and processElement2()
0 points | 35 pages | 444.84 KB | 1 year ago
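The onTimer() mention above refers to Flink's process functions. Below is a hedged sketch of a keyed process function that counts events per key and emits the count when a processing-time timer fires. The class and method names follow PyFlink's DataStream API; the concrete counting logic is illustrative:

```python
from pyflink.common import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class CountThenEmit(KeyedProcessFunction):

    def open(self, runtime_context: RuntimeContext):
        self.count = runtime_context.get_state(
            ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        current = self.count.value() or 0
        self.count.update(current + 1)
        # fire a timer 10 seconds from now (processing time)
        ctx.timer_service().register_processing_time_timer(
            ctx.timer_service().current_processing_time() + 10_000)

    def on_timer(self, timestamp, ctx):
        # emit the per-key count when the timer fires, then reset the state
        yield ctx.get_current_key(), self.count.value()
        self.count.clear()
```

It would be applied as `stream.key_by(...).process(CountThenEmit())`; the two-input variant mentioned in the excerpt (processElement1/processElement2) follows the same pattern with a CoProcessFunction.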
Monitoring Apache Flink Applications (Getting Started)
For applications that use event-time semantics, it is crucial that watermarks keep advancing over time. A watermark of time t tells the framework that it should no longer expect events with timestamps earlier than t; conversely, operations waiting for timestamps smaller than t will be triggered. For example, an event-time window ending at t = 30 is closed and evaluated once the watermark passes 30. You should therefore monitor watermarks at the event-time-sensitive operators in your application (such as process functions and windows) … providing more TaskManagers. In general, a system already running under very high load during normal operations will need much more time to catch up after recovering from a downtime. During this time you will …
0 points | 23 pages | 148.62 KB | 1 year ago
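The relationship between watermarks and event-time windows described above can be shown with a few lines of plain Python (illustrative only, not Flink code): a window covering [20, 30) may only be evaluated once the watermark reaches its end timestamp 30:

```python
window_start, window_end = 20, 30          # event-time window [20, 30)
buffered = []

def on_event(timestamp, value):
    # buffer events that fall into the window
    if window_start <= timestamp < window_end:
        buffered.append(value)

def on_watermark(watermark):
    # the watermark passing the window end triggers evaluation
    if watermark >= window_end:
        print(f"window [{window_start}, {window_end}) closed, sum = {sum(buffered)}")

on_event(21, 4)
on_event(27, 6)
on_watermark(25)   # nothing happens yet
on_watermark(31)   # window [20, 30) closed, sum = 10
```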
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• Transforming languages define transformations specifying operations that process input streams and produce output streams. • Declarative languages specify the … computed by a UDA that uses three local tables, IN, TAPE, and OUT, and performs the following operations for each new arriving tuple: 1. append the encoded new tuple to IN, 2. copy IN to TAPE, and …
0 points | 53 pages | 532.37 KB | 1 year ago
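A skeleton of a user-defined aggregate (UDA) with the three local tables named in the excerpt. Steps 1 and 2 follow the excerpt; the third step is not visible there, so writing a running count to OUT is purely an assumption made to complete the example:

```python
class WindowedUDA:
    """Sketch of a UDA with three local tables (IN, TAPE, OUT)."""

    def __init__(self):
        self.IN, self.TAPE, self.OUT = [], [], []

    def iterate(self, tuple_):
        encoded = repr(tuple_)
        self.IN.append(encoded)      # 1. append the encoded new tuple to IN
        self.TAPE = list(self.IN)    # 2. copy IN to TAPE
        self.OUT = [len(self.TAPE)]  # 3. (assumed) keep a running count in OUT

uda = WindowedUDA()
for t in [("a", 1), ("b", 2), ("c", 3)]:
    uda.iterate(t)
print(uda.OUT)  # [3]
```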
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… will cause. • Each row contains a plan with • expected cycle savings • locations for drop operations • drop amounts • QoS effects (provided that tuples can be associated with a utility metric)
0 points | 43 pages | 2.42 MB | 1 year ago
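The per-row plans described above can be represented as simple records. The sketch below uses illustrative field names and a naive selection rule (among plans that save enough cycles, pick the one with the smallest QoS loss); this rule is an assumption, not the algorithm from the slides:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DropPlan:
    cycle_savings: float         # expected cycle savings
    drop_locations: List[str]    # where drop operators are inserted
    drop_amounts: List[float]    # fraction of tuples dropped at each location
    qos_loss: float              # utility lost if the plan is applied

def choose_plan(plans: List[DropPlan], needed_savings: float) -> Optional[DropPlan]:
    feasible = [p for p in plans if p.cycle_savings >= needed_savings]
    return min(feasible, key=lambda p: p.qos_loss) if feasible else None

plans = [
    DropPlan(0.10, ["src->filter"], [0.05], qos_loss=0.02),
    DropPlan(0.25, ["filter->join"], [0.15], qos_loss=0.08),
]
print(choose_plan(plans, needed_savings=0.2))
```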
Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
… the programmer needs to design and maintain appropriate state synopses. • In order to parallelize operations, events must have associated keys. … Distributed dataflow model …
0 points | 45 pages | 1.22 MB | 1 year ago
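A short sketch of the "events must have associated keys" point, using PyFlink's DataStream API on a small hypothetical collection of sensor readings; keying the stream is what allows the reduce to run in parallel, one partition per key:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

readings = env.from_collection([
    ("sensor_1", 5), ("sensor_2", 3), ("sensor_1", 7),
])

# key by sensor id so all readings of one sensor go to the same parallel task
totals = readings.key_by(lambda r: r[0]) \
                 .reduce(lambda a, b: (a[0], a[1] + b[1]))

totals.print()
env.execute("keyed running sum")
```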
11 results in total.