Execution model - IT文库_程序员IT互联网编程电子书和文档免费下载，助您码力十足！

首页文库资料文章资讯上传文档发布文章登录账户

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

single-pass Updates arbitrary append-only Update rates relatively low high, bursty Processing Model query-driven / pull-based data-driven / push-based Queries ad-hoc continuous Latency relatively University 2020 Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams: sequence of measurements from a temperature sensor • the volume of NASDAQ stock trades over time This model poses a severe limitation on the stream: updates cannot change past entries in A. 11 Useful in

0 码力 | 45 页 | 1.22 MB | 1 年前
3
Scalable Stream Processing - Spark Streaming and Flink

Built on the Spark SQL engine. ▶ Perform database-like query optimizations. 56 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input table, as a static table. • Spark automatically converts this batch-like query to a streaming execution plan. ▶ 2. Specify triggers to control when to update the results. • Each time a trigger fires new data (new row in the input table), and incrementally updates the result. 57 / 79 Programming Model (1/2) ▶ Two main steps to develop a Spark stuctured streaming: ▶ 1. Defines a query on the input

0 码力 | 113 页 | 1.22 MB | 1 年前
3
Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

streams. • Declarative languages specify the expected results of the computation rather than the execution flow. • Imperative languages are used to describe plans of operators the streams must flow through be expressed using only non-blocking operators? 22 Vasiliki Kalavri | Boston University 2020 Model and formalization (I) A stream is a sequence of unbounded length, where tuples are ordered by their t ∈ S to denote that, for some 1 ≤ i ≤ n, ti = t. 23 Vasiliki Kalavri | Boston University 2020 Model and formalization (II) Pre-sequence (prefix): Let S = [t1, … ,tn] be a sequence and 0 < k ≤ n. Then

0 码力 | 53 页 | 532.37 KB | 1 年前
3
High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020

processed? Was mo delivered downstream? Vasiliki Kalavri | Boston University 2020 A simple system model stream sources N1 NK N2 … input queue output queue primary nodes secondary nodes other apps be the output stream produced by input e. In the event of a failure, let Of be the pre-failure execution of the primary and O’ the output produced by the secondary after recovery. • Precise recovery convergent-capable: it can re-build internal state in a way that it eventually converges to a non-failure execution output • repeatable: it produces identical duplicate tuples Vasiliki Kalavri | Boston University

0 码力 | 49 页 | 2.08 MB | 1 年前
3
Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020

HOk8+K8Ox+z1iWnmDmAP3A+fwCD9I4G We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination (liveness): Obtain a valid CfjxXg3PqatJaOY2QV/yvj8AfLTl3A= We need to retrieve a distributed cut in a system execution that yields a system configuration Validity (safety): Termination (liveness): Obtain a valid m m’ System Possible Execution ??? Vasiliki Kalavri | Boston University 2020 Validity Explained p1 p2 p3 p1 p2 p3 m m’ C events in cut System Possible Execution ??? Vasiliki Kalavri | Boston

0 码力 | 81 页 | 13.18 MB | 1 年前
3
PyFlink 1.15 Documentation

to set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires Python 3.6 or above with PyFlink given Python virtual environment at client side (for job compiling) and server side (for Python UDF execution) separately. 1.1. Getting Started 7 pyflink-docs, Release release-1.15 • Specify the Python virtual cluster nodes during job execution. It should be noted that option -pyexec is also required to specify the Python virtual environment to use at server side (for Python UDF execution). For the Python virtual

0 码力 | 36 页 | 266.77 KB | 1 年前
3
PyFlink 1.16 Documentation

to set up PyFlink development environment in your local machine. This is usually used for local execution or development in an IDE. Set up Python environment It requires Python 3.6 or above with PyFlink given Python virtual environment at client side (for job compiling) and server side (for Python UDF execution) separately. 1.1. Getting Started 7 pyflink-docs, Release release-1.16 • Specify the Python virtual cluster nodes during job execution. It should be noted that option -pyexec is also required to specify the Python virtual environment to use at server side (for Python UDF execution). For the Python virtual

0 码力 | 36 页 | 266.80 KB | 1 年前
3
Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

processing optimizations ??? Vasiliki Kalavri | Boston University 2020 2 • Costs of streaming operator execution • state, parallelism, selectivity • Dataflow optimizations • plan translation alternatives || B Task: B || C Data: A || A ??? Vasiliki Kalavri | Boston University 2020 8 Distributed execution in Flink ??? Vasiliki Kalavri | Boston University 2020 9 Identify the most efficient way to execute strategies? • before execution or during runtime Query optimization (I) ??? Vasiliki Kalavri | Boston University 2020 10 Optimization strategies • enumerate equivalent execution plans • minimize intermediate

0 码力 | 54 页 | 2.83 MB | 1 年前
3
Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

approximate answers … S1 S2 Sr Input Manager Scheduler QoS Monitor Load Shedder Query Execution Engine Qm Q2 Q1 Ad-hoc or continuous queries Input streams … ??? Vasiliki Kalavri | Boston unnecessary result degradation! • Load shedding components rely on statistics gathered during execution: • A statistics manager module monitors processing and input rates and periodically estimates continuously or by running the system for a designated period of time, prior to regular query execution. 10 ??? Vasiliki Kalavri | Boston University 2020 Estimating cost and selectivity 11 • Selectivity:

0 码力 | 43 页 | 2.42 MB | 1 年前
3
Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020

JobManager is a single point of failure Flink applications • It keeps metadata about application execution, such as pointers to completed checkpoints. • A high-availability mode migrates the responsibility increased load • scale in to save resources • Fix bugs or change business logic • Optimize execution plan • Change operator placement • skew and straggler mitigation • Migrate to a different

0 码力 | 41 页 | 4.09 MB | 1 年前
3

共 15 条前往

页

分类

语言

格式

Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Scalable Stream Processing - Spark Streaming and Flink

Streaming languages and operator semantics - CS 591 K1: Data Stream Processing and Analytics Spring 2020

High-availability, recovery semantics, and guarantees - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020

PyFlink 1.15 Documentation

PyFlink 1.16 Documentation

Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020

Fault-tolerance demo & reconfiguration - CS 591 K1: Data Stream Processing and Analytics Spring 2020