Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
vkalavri@bu.edu CS 591 K1: Data Stream Processing and Analytics Spring 2020 2/06: Notions of time and progress Vasiliki Kalavri | Boston University 2020 Mobile game application • input stream: Vasiliki Kalavri | Boston University 2020 • Processing time • the time of the local clock where an event is being processed • a processing-time window wouldn’t account for game activity while the train Event time • the time when an event actually happened • an event-time window would give you the extra life • results are deterministic and independent of the processing speed Notions of time 5 Vasiliki0 码力 | 22 页 | 2.22 MB | 1 年前3Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series? 3 var = ∑ (xi − μ)2 N ??? Vasiliki Kalavri | Boston synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series? 3 • the sum of all the values • the sum of the squares synopsis Suppose that our data consists of a large numeric time series. What summary would let us compute the statistical variance of this series? 3 • the sum of all the values • the sum of the squares0 码力 | 74 页 | 1.06 MB | 1 年前3Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
set that is produced incrementally over time, rather than being available in full before its processing begins. • Data streams are high-volume, real-time data that might be unbounded • we cannot University 2020 Time-Series Model: The jth update is (j, A[j]) and updates arrive in increasing order of j, i.e. we observe the entries of A by increasing index. This can model time-series data streams: • a sequence of measurements from a temperature sensor • the volume of NASDAQ stock trades over time This model poses a severe limitation on the stream: updates cannot change past entries in A. 110 码力 | 45 页 | 1.22 MB | 1 年前3Scalable Stream Processing - Spark Streaming and Flink
79 Stream Processing Systems Design Issues ▶ Continuous vs. micro-batch processing ▶ Record-at-a-Time vs. declarative APIs 3 / 79 Outline ▶ Spark streaming ▶ Flink 4 / 79 Spark Streaming 5 / 79 Continuous vs. micro-batch processing • Record-at-a-Time vs. declarative APIs 6 / 79 Spark Streaming ▶ Run a streaming computation as a series of very small, deterministic batch jobs. • Chops up the Discretized Stream Processing (DStream) 7 / 79 Spark Streaming ▶ Run a streaming computation as a series of very small, deterministic batch jobs. • Chops up the live stream into batches of X seconds.0 码力 | 113 页 | 1.22 MB | 1 年前3Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
input streams • perform tuple-at-a-time, window, logic, pattern matching transformations • output one or more streams of possibly different type A series of transformations on streams in Stream 1 • a filter operator typically has selectivity < 1 Is selectivity always known at development time? ??? Vasiliki Kalavri | Boston University 2020 Types of Parallelism 7 B A C A B D A A B Vasiliki Kalavri | Boston University 2020 37 • A TaskManager can execute several tasks at the same time. • It is statically configured with a certain number of processing slots that defines the maximum0 码力 | 54 页 | 2.83 MB | 1 年前3PyFlink 1.15 Documentation
API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines you have installed an old version of PyFlink before and multiple PyFlink versions exist at the same time for some reason. # List the jar packages under the lib directory ls -lh /path/to/python/site-packages/pyflink/lib Python UDF @udf(result_type=DataTypes.BIGINT(), func_type='pandas') def pandas_plus_one(series): return series + 1 table.select(pandas_plus_one(col('id'))).to_pandas() /Users/duanchen/sourcecode/fl0 码力 | 36 页 | 266.77 KB | 1 年前3PyFlink 1.16 Documentation
API for Apache Flink that allows you to build scalable batch and streaming workloads, such as real-time data processing pipelines, large-scale exploratory data analysis, Machine Learning (ML) pipelines you have installed an old version of PyFlink before and multiple PyFlink versions exist at the same time for some reason. # List the jar packages under the lib directory ls -lh /path/to/python/site-packages/pyflink/lib Python UDF @udf(result_type=DataTypes.BIGINT(), func_type='pandas') def pandas_plus_one(series): return series + 1 table.select(pandas_plus_one(col('id'))).to_pandas() /Users/duanchen/sourcecode/fl0 码力 | 36 页 | 266.80 KB | 1 年前3Windows and triggers - CS 591 K1: Data Stream Processing and Analytics Spring 2020
• e.g. joins, holistic aggregates • Compute on most recent events only • when providing real-time traffic information, you probably don't care about an accident that happened 2 hours ago • Recent val maxTemp = sensorData .map(r => Reading(r.id,r.time,(r.temp-32)*(5.0/9.0))) .keyBy(_.id) .timeWindow(Time.minutes(1)) .max("temp") } } 3 Example: Window sensor can use the time characteristic to tell Flink how to define time when you are creating windows. The time characteristic is a property of the StreamExecutionEnvironment: Configuring a time characteristic0 码力 | 35 页 | 444.84 KB | 1 年前3Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Conditions might change • State is accumulated over time 2 events/s time rate decrease events/s time throughput degradation events/s time rate increase : input rate : throughput ??? Vasiliki Kalavri | Boston University 2020 Scaling approaches Metrics • service time and waiting time per tuple and per task • total time spent processing a tuple and all its derived results • CPU utilization one operator at a time • Predictive: at-once for all operators 8 ??? Vasiliki Kalavri | Boston University 2020 Queuing theory models 9 • Metrics • service time and waiting time per tuple and per0 码力 | 93 页 | 2.42 MB | 1 年前3Streaming in Apache Flink
Flink programs • Implement streaming data processing pipelines • Flink managed state • Event time Streaming in Apache Flink • Streams are natural • Events of any type like sensors, click streams processing as a subset of stream processing Processing Data Dataflows Let's Talk About Time • Processing Time • Event Time • Events may arrive out of order! What Can Be Streamed? • Anything (if you write events, FALSE for ride end events startTime DateTime the start time of a ride endTime DateTime the end time of a ride, ""1970-01-01 00:00" for start events startLon Float0 码力 | 45 页 | 3.00 MB | 1 年前3
共 23 条
- 1
- 2
- 3