Stream ingestion and pub/sub systems - CS 591 K1: Data Stream Processing and Analytics Spring 2020
communication, UDP multicast, TCP • HTTP or RPC if the consumer exposes a service on the network • Failure handling: application needs to be aware of message loss, producers and consumers always online 5 Message architecture advantages • Multiple producers/consumers as concurrent clients • Effective failure handling, crashes or disconnects • Broker responsible for message durability • Asynchronous communication messages are totally ordered but there is no ordering guarantee across partitions 28 29 Failure handling • The broker does not need to wait for acknowledgements any more, but simply record consumers'0 码力 | 33 页 | 700.14 KB | 1 年前3Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Kalavri | Boston University 2020 Watermarks are essential to both event-time windows and operators handling out-of-order events: • When an operator receives a watermark with time T, it can assume that no0 码力 | 22 页 | 2.22 MB | 1 年前3Cardinality and frequency estimation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
15 The standard error of the LogLog algorithm is inversely related to the number of counters m: Standard error δ ≈ 1.3 m For m = 256, the error is about 8% For m = 1024, the error decreases to 4% Vasiliki Kalavri | Boston University 2020 26 • Query approximation error • Error probability Guarantee: The estimation error for frequencies will not exceed with probability • A higher number ⌈2.71828 ϵ ⌉ Error and space/time trade-offs ??? Vasiliki Kalavri | Boston University 2020 27 Space requirements ??? Vasiliki Kalavri | Boston University 2020 27 For a standard error of , we need0 码力 | 69 页 | 630.01 KB | 1 年前3PyFlink 1.15 Documentation
OverflowError: timeout value is too large . . . . . . . . . . . . . . . . . . . . 30 1.3.5.2 Q2: An error occurred while calling z:org.apache.flink.client.python.PythonEnvUtils.resetCallbackClient 30 1.3 factories.DynamicTableFactory’ in the classpath Exception Stack: py4j.protocol.Py4JJavaError: An error occurred while calling o13.execute. : org.apache.flink.table.api.ValidationException: Unable to create documentation. 1.3.4.2 O2: ClassNotFoundException: com.mysql.cj.jdbc.Driver py4j.protocol.Py4JJavaError: An error occurred while calling o13.execute. : org.apache.flink.runtime.client.JobExecutionException: Job execution0 码力 | 36 页 | 266.77 KB | 1 年前3PyFlink 1.16 Documentation
OverflowError: timeout value is too large . . . . . . . . . . . . . . . . . . . . 30 1.3.5.2 Q2: An error occurred while calling z:org.apache.flink.client.python.PythonEnvUtils.resetCallbackClient 30 1.3 factories.DynamicTableFactory’ in the classpath Exception Stack: py4j.protocol.Py4JJavaError: An error occurred while calling o13.execute. : org.apache.flink.table.api.ValidationException: Unable to create documentation. 1.3.4.2 O2: ClassNotFoundException: com.mysql.cj.jdbc.Driver py4j.protocol.Py4JJavaError: An error occurred while calling o13.execute. : org.apache.flink.runtime.client.JobExecutionException: Job execution0 码力 | 36 页 | 266.80 KB | 1 年前3Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
estimated frequency of item δ: user-defined threshold, so that freq(x)≥ δ*N,δ∈(0,1) ε: user-defined error Output: All items with frequency greater than or equal to δ*N. No item with frequency less than the current window id • We keep a list D of element frequencies and their maximum associated error. • Once a window fills up, we remove infrequent elements. 6 ??? Vasiliki Kalavri | Boston University in wcur: if x ∈ D, increase its frequency, fx = fx +1 else insert with frequency fx = 1 and error εx = wcur - 1 N = N + 1 Delete step Iterate over D and remove every element x with fx + εx0 码力 | 31 页 | 1.47 MB | 1 年前3Elasticity and state migration: Part I - CS 591 K1: Data Stream Processing and Analytics Spring 2020
actual ??? Vasiliki Kalavri | Boston University 2020 parallelism initial rate target actual error p0 p1 prediction x x x DS2 model properties 24 ??? Vasiliki Kalavri | Boston University 2020 Boston University 2020 parallelism initial rate target actual p0 p1 x error p1’ new prediction Gradually minimizes error DS2 model properties 24 ??? Vasiliki Kalavri | Boston University 2020 250 码力 | 93 页 | 2.42 MB | 1 年前3Flow control and load shedding - CS 591 K1: Data Stream Processing and Analytics Spring 2020
scaled using approximate query processing techniques, where accuracy is measured in terms of relative error in the computed query answers. 17 ??? Vasiliki Kalavri | Boston University 2020 Which tuples to Katsipoulakis, A. Labrinidis, and P. K. Chrysanthis. Concept-driven load shedding: Reducing size and error of voluminous and variable data streams. (IEEE Big Data ’18) • H. T. Kung, T. Blackwell, and A.0 码力 | 43 页 | 2.42 MB | 1 年前3Stream processing fundamentals - CS 591 K1: Data Stream Processing and Analytics Spring 2020
footprint for accuracy • Query results are approximate with either deterministic or probabilistic error bounds • There is no universal synopsis solution • They are purpose-built and query-specific0 码力 | 45 页 | 1.22 MB | 1 年前3Exactly-once fault-tolerance in Apache Flink - CS 591 K1: Data Stream Processing and Analytics Spring 2020
or are discarded cpConfig.setCheckpointTimeout(300000); // do not fail the job on a checkpointing error cpConfig.setFailOnCheckpointingErrors(false); ??? Vasiliki Kalavri | Boston University 2020 Lecture0 码力 | 81 页 | 13.18 MB | 1 年前3
共 10 条
- 1