Course introduction - CS 591 K1: Data Stream Processing and Analytics Spring 2020
2020 Software requirements • All assignments assume a UNIX-based setup. • If you are a Windows user, you are advised to use Windows subsystem for Linux (WSL), Cygwin, or a Linux virtual machine to impressions, clicks, transactions, likes, comments • Analytics on user activity • Filtering, aggregation, joins with static data (e.g. user profile data) Examples • online A/B testing • trending topics0 码力 | 34 页 | 2.53 MB | 1 年前3Filtering and sampling streams - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Synopsis: a lossy, compact summary of the input stream input stream synopsis maintenance component user queries approximate results ??? Vasiliki Kalavri | Boston University 2020 A simple and efficient fixed proportion of the stream, e.g. 1/10th 7 search engine <user, query, timestamp> query stream Example use-case: Web search user behavior study Q: How many queries did users repeat last month When a query arrives: • if the user is sampled: add the query to S • if we haven’t seen the user before: generate a random integer ru between 0 and 9 and add the user to the sample if ru = 0. ??? Vasiliki0 码力 | 74 页 | 1.06 MB | 1 年前3Streaming optimizations - CS 591 K1: Data Stream Processing and Analytics Spring 2020
uk Connection: keep-alive Accept: text/html,application/ xhtml+xml,application/ xml;q=0.9,*/*;q=0.8 User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25 uk Connection: keep-alive Accept: text/html,application/ xhtml+xml,application/ xml;q=0.9,*/*;q=0.8 User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25 uk Connection: keep-alive Accept: text/html,application/ xhtml+xml,application/ xml;q=0.9,*/*;q=0.8 User-Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/250 码力 | 54 页 | 2.83 MB | 1 年前3PyFlink 1.15 Documentation
1.2.2 QuickStart: DataStream API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.2 User Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table Creation, listing Tables, Conversion between Table and DataStream, etc. • User-defined function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations. The first example is UDFs used in Table API & SQL [20]:0 码力 | 36 页 | 266.77 KB | 1 年前3PyFlink 1.16 Documentation
1.2.2 QuickStart: DataStream API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 1.2 User Guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table Creation, listing Tables, Conversion between Table and DataStream, etc. • User-defined function management: User-defined function registration, dropping, listing, etc. • Executing SQL queries supports various UDFs and APIs to allow users to execute Python native functions. See also the latest User- defined Functions and Row-based Operations. The first example is UDFs used in Table API & SQL [20]:0 码力 | 36 页 | 266.80 KB | 1 年前3Introduction to Apache Flink and Apache Kafka - CS 591 K1: Data Stream Processing and Analytics Spring 2020
setParallelism() in your application. taskmanager.numberOfTaskSlots: The number of parallel operator or user function instances that a single TaskManager can run. This value is typically proportional to the run ./examples/batch/WordCount.jar \ --input file:///home/user/hamlet.txt --output file:///home/user/wordcount_out Run with a class entry point and arguments: ./bin/flink run -c ./examples/batch/WordCount.jar \ --input file:///home/user/hamlet.txt --output file:///home/user/wordcount_out 19 Flink commands Vasiliki Kalavri | Boston University 2020 Resources0 码力 | 26 页 | 3.33 MB | 1 年前3监控Apache Flink应用程序(入门)
程度。Flink可以从大多数source获得基本metrics。 4.10 关键指标 Metric Scope Description records-lag-max user applies to FlinkKafkaConsumer The maximum lag in terms of the number of records for any partition best indication that the consumer group is not keeping up with the producers. millisBehindLatest user applies to FlinkKinesisConsumer The number of milliseconds a consumer is behind the head of a time window) for functional reasons. 4. Each computation in your Flink topology (framework or user code), as well as each network shuffle, takes time and adds to latency. 5. If the application emits0 码力 | 23 页 | 148.62 KB | 1 年前3Apache Flink的过去、现在和未来
增量 snapshot 基于 credit 的流控机制 Streaming SQL ------------------------- | USER_SCORES | ------------------------- | User | Score | Time | ------------------------- | Julie | 7 | 12:01 | | Frank -- | ---------------------------- Stream Mode: 12:01> SELECT Name, SUM(Score), MAX(Time) FROM USER_SCORES GROUP BY Name; Flink 在阿里的服务情况 集群规模 超万台 状态数据 PetaBytes 事件处理 十万亿/天 峰值能力 17亿/秒 Flink 的过去0 码力 | 33 页 | 3.36 MB | 1 年前3Notions of time and progress - CS 591 K1: Data Stream Processing and Analytics Spring 2020
Kalavri | Boston University 2020 Mobile game application • input stream: user activity • output: rewards based on how fast the user meets goals • e.g. pop 500 bubbles within 1 minute and get extra life latency. Trade-offs 17 Vasiliki Kalavri | Boston University 2020 Periodic: periodically ask the user-defined function for the current watermark timestamp. Punctuated: check for a watermark in each0 码力 | 22 页 | 2.22 MB | 1 年前3Skew mitigation - CS 591 K1: Data Stream Processing and Analytics Spring 2020
elements • The solution will not contain any item y with frequency: • freq(y) < (δ - ε)*N, for a user-chosen value ε 4 (δ - ε)*Ν δ*Ν not included may be included included ??? Vasiliki Kalavri | Boston the item e in the input stream f: estimated frequency of item δ: user-defined threshold, so that freq(x)≥ δ*N,δ∈(0,1) ε: user-defined error Output: All items with frequency greater than or equal0 码力 | 31 页 | 1.47 MB | 1 年前3
共 19 条
- 1
- 2