Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/23: Stream Processing Fundamentals Vasiliki Kalavri | Boston University University 2020 What is a stream? • In traditional data processing applications, we know the entire dataset in advance, e.g. tables stored in a database. A data stream is a data set that is produced incrementally incrementally over time, rather than being available in full before its processing begins. • Data streams are high-volume, real-time data that might be unbounded • we cannot store the entire stream0 码力 | 45 页 | 1.22 MB | 1 年前3Scalable Stream Processing - Spark Streaming and Flink
Scalable Stream Processing - Spark Streaming and Flink Amir H. Payberah payberah@kth.se 05/10/2018 The Course Web Page https://id2221kth.github.io 1 / 79 Where Are We? 2 / 79 Stream Processing Systems Design Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs 3 / 79 Outline ▶ Spark streaming ▶ Flink 4 / 79 Spark Streaming 5 / 79 Contribution ▶ Design issues Continuous vs. micro-batch processing • Record-at-a-Time vs. declarative APIs 6 / 79 Spark Streaming ▶ Run a streaming computation as a series of very small, deterministic batch jobs. • Chops up the0 码力 | 113 页 | 1.22 MB | 1 年前3Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
??? Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/16: Skew mitigation ??? Vasiliki Kalavri | Uddin Nasir et. al. The power of both choices: Practical load balancing for distributed stream processing engines. ICDE 2015. • Mitzenmacher, Michael. The power of two choices in randomized load balancing0 码力 | 31 页 | 1.47 MB | 1 年前3State management - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/25: State Management Vasiliki Kalavri | Boston operator. Keyed state can only be used by functions that are applied on a KeyedStream: • When the processing method of a function with keyed input is called, Flink’s runtime automatically puts all keyed fare, Collector> out) throws Exception { // similar logic for processing fare events } } } Java example (cont.) 21 Vasiliki Kalavri | Boston University 2020 0 码力 | 24 页 | 914.13 KB | 1 年前3Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/14: Stream processing optimizations ??? Vasiliki Kalavri | Boston University serialization cost • if operators are separate, throughput is bounded by either communication or processing cost • if fused, throughput is determined by operator cost only Operator fusion A B A B is statically configured with a certain number of processing slots that defines the maximum number of concurrent tasks it can execute. • A processing slot can execute one slice of an application, i.e0 码力 | 54 页 | 2.83 MB | 1 年前3Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 2/11: Windows and Triggers Vasiliki Kalavri | Boston windowing use cases: • They assign an element based on its event-time timestamp or the current processing time to windows. • Time windows have a start and an end timestamp. • All built-in window assigners assigners provide a default trigger that triggers the evaluation of a window once the (processing or event) time passes the end of the window. • A window is created when the first element is assigned0 码力 | 35 页 | 444.84 KB | 1 年前3Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 1/21: Introduction Vasiliki Kalavri | Boston University course, you will hopefully: • know when to use stream processing vs other technology • be able to comprehensively compare features and processing guarantees of streaming systems • be proficient in using end-to-end, scalable, and reliable streaming applications • have a solid understanding of how stream processing systems work and what factors affect their performance • be aware of the challenges and trade-offs0 码力 | 34 页 | 2.53 MB | 1 年前3Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
| Boston University 2020 Vasiliki (Vasia) Kalavri vkalavri@bu.edu CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/06: Notions of time and progress Vasiliki Kalavri | Boston University minute? 4 Vasiliki Kalavri | Boston University 2020 • Processing time • the time of the local clock where an event is being processed • a processing-time window wouldn’t account for game activity while while the train is in the tunnel • results depend on the processing speed and aren’t deterministic • Event time • the time when an event actually happened • an event-time window would give you the0 码力 | 22 页 | 2.22 MB | 1 年前3Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
??? Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/23: Cardinality and frequency estimation0 码力 | 69 页 | 630.01 KB | 1 年前3Graph streaming algorithms - CS 591 K1: Data Stream Processing and Analytics Spring 2020
??? Vasiliki Kalavri | Boston University 2020 CS 591 K1: Data Stream Processing and Analytics Vasiliki (Vasia) Kalavri vkalavri@bu.edu Spring 2020 4/28: Graph Streaming ??? Vasiliki Kalavri | Vertex streams (not today) ??? Vasiliki Kalavri | Boston University 2020 Batch Graph Processing 9 Batch graph processing systems, such as Apache Graph, GraphX, Pregel, operate offline. They are subgraph. Connected components 1 4 3 2 5 6 7 8 ??? Vasiliki Kalavri | Boston University 2020 Batch Connected Components • State: the graph and a component ID per vertex • initially equal to vertex0 码力 | 72 页 | 7.77 MB | 1 年前3
共 24 条
- 1
- 2
- 3