5. ClickHouse at Ximalaya for Shanghai Meetup 2019 PDF
groupArray(timestamp) as timestamps, arrayEnumerate(pages) as index FROM (SELECT * FROM client_log_all ORDER BY timestamp) GROUP BY user … SELECT user, groupArray(page) as pages except Order] arrayFilter((i, p) -> (pages[i] = 'HomePage' AND pages[i+1]= 'Detail' AND pages[i+2]!='Order'), index, pages) as level_2, // In pages array, find a subarray of [HomePage, Detail, Order] arrayFilter((i (pages[i] = 'HomePage' AND pages[i+1]= 'Detail' AND pages[i+2]='Order'), index, pages) as level_3 FROM (SELECT * FROM client_log_all ORDER BY timestamp) GROUP BY user
0 credits | 28 pages | 6.87 MB | 1 year ago
ClickHouse in Production
SumShows, countIf(CounterType='Click') as SumClicks, BannerID FROM EventLogHDFS GROUP BY BannerID ORDER BY SumClicks desc LIMIT 3; 52 / 97 In ClickHouse: Most Clicked Banner SELECT countIf(CounterType='Show') SumShows, countIf(CounterType='Click') as SumClicks, BannerID FROM EventLogHDFS GROUP BY BannerID ORDER BY SumClicks desc LIMIT 3; ┌─SumShows─┬─SumClicks─┬───BannerID─┐ │ 6485 │ 1015 │ 6251269090 │ │ 97 In ClickHouse: Local Log Copy CREATE TABLE EventLogLocal AS EventLogHDFS ENGINE = MergeTree() ORDER BY BannerID; Ok. INSERT INTO EventLogLocal SELECT * FROM EventLogHDFS; Ok. 0 rows in set. Elapsed:
0 credits | 100 pages | 6.86 MB | 1 year ago
8. Continue to use ClickHouse as TSDB
`HeartRate` UInt8, `Humidity` Float32, ... ) ENGINE = MergeTree() PARTITION BY toYYYYMM(Time) ORDER BY (Name, Time, Age, ...); ► Column-Orient Model How we do CREATE TABLE demonstration.insert_view `HeartRate` UInt8, `Humidity` Float32, ... ) ENGINE = MergeTree() PARTITION BY toYYYYMM(Time) ORDER BY (Name, Time, Age, ...); ► Column-Orient Model How we do CPU : Intel Skylake 8 core Memory 'cpu-usage_user') AND ((created_at >= '2016-01-01 08:00:00') AND (created_at <= '2016-01-01 09:00:00')) ORDER BY toStartOfMinute(created_at) DESC LIMIT 5 ┌─value─┐ │ 4 │ │ 4 │ │ 4 │ │ 4 │ │
0 credits | 42 pages | 911.10 KB | 1 year ago
1. Machine Learning with ClickHouse
SAMPLE x OFFSET y CREATE TABLE trips_sample_time ( pickup_datetime DateTime ) ENGINE = MergeTree ORDER BY sipHash64(pickup_datetime) -- Primary Key SAMPLE BY sipHash64(pickup_datetime) -- expression for total_amount, trip_distance, (toYear(pickup_datetime) - 2009) * (trip_distance + 1)) FROM trips WHERE <...> ORDER BY sipHash64(trip_id) ASC [2.138706869701764,0.25152600248358253,4.5418692076782445] That’s better as aggregate function state in a separate table Example CREATE TABLE models ENGINE = MergeTree ORDER BY tuple() AS SELECT stochasticLinearRegressionState(total_amount, trip_distance) AS model FROM
0 credits | 64 pages | 1.38 MB | 1 year ago
0. Machine Learning with ClickHouse
SAMPLE x OFFSET y CREATE TABLE trips_sample_time ( pickup_datetime DateTime ) ENGINE = MergeTree ORDER BY sipHash64(pickup_datetime) -- Primary Key SAMPLE BY sipHash64(pickup_datetime) -- expression for total_amount, trip_distance, (toYear(pickup_datetime) - 2009) * (trip_distance + 1)) FROM trips WHERE <...> ORDER BY sipHash64(trip_id) ASC [2.138706869701764,0.25152600248358253,4.5418692076782445] That’s better as aggregate function state in a separate table Example CREATE TABLE models ENGINE = MergeTree ORDER BY tuple() AS SELECT stochasticLinearRegressionState(total_amount, trip_distance) AS model FROM
0 credits | 64 pages | 1.38 MB | 1 year ago
4. ClickHouse in Practice for User Profiling at Suning
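groupBitmapState Integer Aggregate functions: groupBitmapAnd groupBitmapOr groupBitmapXor 14 Bitmap usage example order_id order_date user_id product_id 1 2019-10-01 1 p1 2 2019-10-01 1 p2 3 2019-10-01 2 p1 2019-10-02 5 p1 8 2019-10-02 5 p2 A simple order detail table, detail_order: how do we compute daily user retention? 15 Tag SQL: joins on large tables and count distinct are both slow and prone to OOM! Bitmap usage example order_date uv_bitmap 2019-10-01 {1,2,3} 2019-10-02 {3 5] • New users: day2 ANDNOT day1 = [4,5] • Churned users: day1 ANDNOT day2 = [1,2] 16 Aggregate detail_order into a per-day table; SQL for retained users; Bitmap functions; tens of millions of users, results in seconds! Contents How Suning uses ClickHouse; ClickHouse integrates Bitmap
0 credits | 32 pages | 1.47 MB | 1 year ago
2. Mastering Hundreds of Billions of Records per Day with ClickHouse - Qutoutiao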
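1: 128 GB+ of memory per machine is recommended 2: Use symbolic links to spread different tables across different disks, so that one machine can mount more disks. The "hot/cold data separation" feature in the latest version: a roundabout workaround? Problems we ran into: order by (timestamp, eventType) or order by (eventType, timestamp) Business scenario 1: Reported data from Qutoutiao and Midu is distinguished by "event type" (eventType) 2: The metrics system has "time-sliced" and "cumulative" metrics table where dt='' and timestamp>='' and timestamp<='' and eventType='' We did not think deeply enough when creating the table: because of how time-sliced metrics work, our table is indexed with order by (timestamp, eventType), so computing cumulative metrics turned out to be very slow (60 billion+ rows) Analysis: for cumulative data the time index is essentially useless, because timestamp "… from table where column=value select column1, column2 from table where column=value Any SQL involving group by, order by, distinct, or join no longer uses O(1) memory Solution: 1: max_bytes_before_external_group_by 2: max_bytes_before_external_sort
0 credits | 14 pages | 1.10 MB | 1 year ago
2. Tencent ClickHouse Practice _2019 Ding Xiaokun & Xiong Feng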
GROUP BY key ORDER BY value DESC LIMIT 10 SELECT play_times_key AS key, sum(play_times_value) AS value FROM wegame ARRAY JOIN play_times_key, play_times_value GROUP BY key ORDER BY value DESC
0 credits | 26 pages | 3.58 MB | 1 year ago
2. ClickHouse MergeTree Internals Explained - Zhu Kai
expr] [ORDER BY expr] [PRIMARY KEY expr] [SAMPLE BY expr] [SETTINGS name=value, omitted...] Partition key, sorting key, primary key; index_granularity = 8192 (index granularity). MergeTree storage structure: data is organized into partitions (PARTITION BY); each column is stored independently, sorted by ORDER BY
0 credits | 35 pages | 13.25 MB | 1 year ago
What You Need to Know About ClickHouse Architecture to Use It Effectively
count(*) AS count FROM hits WHERE CounterID = 1234 AND Date >= today() - 7 GROUP BY Referer ORDER BY count DESC LIMIT 10 A typical query in a web analytics system. Read fast › Only the needed columns:
0 credits | 28 pages | 506.94 KB | 1 year ago
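The visible part of the web analytics query in the last excerpt (filter on CounterID and Date, group by Referer, order by count descending, limit 10) can be sketched in plain Python. This is only an illustration of the query's semantics, not of how ClickHouse executes it. The in-memory dictionary of columns, the function name `top_referers`, the sample rows, and the fixed date are all assumptions made up for this example.

```python
from collections import Counter
from datetime import date, timedelta


def top_referers(columns, counter_id, since, limit=10):
    """Count rows per Referer for one counter since a given date, highest first."""
    counts = Counter()
    # Only the three columns named by the query are read; other columns are untouched.
    rows = zip(columns["CounterID"], columns["Date"], columns["Referer"])
    for cid, day, referer in rows:
        if cid == counter_id and day >= since:  # WHERE CounterID = ... AND Date >= ...
            counts[referer] += 1                # GROUP BY Referer with count(*)
    return counts.most_common(limit)            # ORDER BY count DESC LIMIT ...


# Hypothetical sample data; a fixed "today" keeps the example reproducible.
today = date(2019, 10, 8)
hits = {
    "CounterID": [1234, 1234, 1234, 99, 1234],
    "Date": [
        today,
        today - timedelta(days=1),
        today - timedelta(days=30),  # too old, filtered out
        today,                       # other counter, filtered out
        today - timedelta(days=7),   # boundary day, kept because of >=
    ],
    "Referer": ["a.example", "b.example", "a.example", "a.example", "a.example"],
    "UserAgent": ["x"] * 5,  # present in the table, never read by this query
}

result = top_referers(hits, 1234, today - timedelta(days=7))
print(result)
```

Storing each column as its own list loosely mirrors the "only the needed columns" point in the excerpt: the `UserAgent` column exists but is never touched.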