Analyzing WebP Cloud Services Operational Data in Metabase, and Making It 5x Faster with Clickhouse
Intro
We analyze basic operational data, including daily traffic, which sites the traffic comes from (referer), average response latency of the service, and so on. This data tells us how the service is doing overall and also guides our follow-up optimizations.
The data source is actually very simple: every time a request hits our infrastructure, we record the relevant request information in desensitized form (our infrastructure sits in Germany and Finland, so we want to stay GDPR-compliant) and store it in a database, with a table schema roughly like this:
CREATE TABLE logs
(
    hashed_remote_identifier varchar(64),
    requested_hostname varchar(100),
    requested_path text,
    ...
    created_at DATETIME
);
This way we have a unique record for every request. Crude, but it works.
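For illustration, a single desensitized record might be written like this (purely a sketch: the values, the SHA-256 hashing of the client address, and the shortened column list are assumptions, not our exact schema):
-- Hypothetical example row; the real table has more columns than shown here.
INSERT INTO logs (hashed_remote_identifier, requested_hostname, requested_path, created_at)
VALUES (SHA2('203.0.113.7', 256), 'img.example.com', '/images/avatar.png', NOW());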
What? You're asking why we don't just put the data in ES?
Metabase
With these records piling up, we quickly wanted some charts, for example: how many requests we handle per day, and where those requests come from.
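A chart like that essentially boils down to a GROUP BY over the logs table. Here is a rough sketch of the kind of SQL involved (the referer chart would look the same, using whichever column stores the referer):
-- Requests handled per day:
SELECT DATE(created_at) AS day, COUNT(*) AS requests
FROM logs
GROUP BY day
ORDER BY day;
-- Requests per hostname over the last 7 days:
SELECT requested_hostname, COUNT(*) AS requests
FROM logs
WHERE created_at >= NOW() - INTERVAL 7 DAY
GROUP BY requested_hostname
ORDER BY requests DESC;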
We could write our own API to pull these numbers out of the database, but why bother when a ready-made tool exists? Here we went with a fairly mature open-source solution: Metabase.
Deploying Metabase is simple: a docker-compose.yml along the following lines is all it takes:
version: '3'
services:
  metabase:
    image: metabase/metabase
    restart: always
    volumes:
      - ./metabase-data:/metabase-data
    ports:
      - 127.0.0.1:3000:3000
    environment:
      TZ: Asia/Shanghai
Once it's up, point it at the database and you're ready to go:
A few clicks is all it takes to create all kinds of boards and Dashboards.
You can even set up a Pulse so Metabase automatically emails us a daily report. How considerate!
In the end, with some clicking and dragging, we get a rather decent-looking Dashboard.
Clickhouse kicked in
But we soon ran into a problem: as traffic to the sites grew, Metabase had to scan the whole table to fetch this data, so it got slower and slower, and page loads started taking whole seconds. To fix this, we decided to replicate the data to Clickhouse in real time and render the charts from the data in Clickhouse.
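For reference, on the Clickhouse side the same logs table ends up looking roughly like the sketch below (in our setup clickhouse-mysql creates the real table automatically later on, so the engine and types here are assumptions); keeping the data in a column-oriented MergeTree table ordered by created_at is what makes these aggregations cheap:
-- Rough sketch of the destination table; remaining columns elided as in the MySQL schema above.
CREATE TABLE webp_cloud.logs
(
    hashed_remote_identifier String,
    requested_hostname String,
    requested_path String,
    created_at DateTime
)
ENGINE = MergeTree
ORDER BY created_at;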
Our infrastructure has 4 servers. To get a reasonably highly available Clickhouse cluster, I (who knows next to nothing about running Clickhouse) used a cluster config generator I wrote myself, https://github.com/n0vad3v/simple-multinode-clickhouse-cluster , and described the cluster topology:
global:
  clickhouse_image: "yandex/clickhouse-server:21.3.2.5"
  zookeeper_image: "bitnami/zookeeper:3.6.1"
zookeeper_servers:
  - host: 10.1.0.10
  - host: 10.1.0.11
  - host: 10.1.0.12
  - host: 10.1.0.13
clickhouse_servers:
  - host: 10.1.0.10
  - host: 10.1.0.11
  - host: 10.1.0.12
  - host: 10.1.0.13
clickhouse_topology:
  - clusters:
      - name: "novakwok_cluster"
        shards:
          - name: "novakwok_shard"
            servers:
              - host: 10.1.0.10
              - host: 10.1.0.11
              - host: 10.1.0.12
              - host: 10.1.0.13
After generating the docker-compose.yml files, deploy them to the machines with Ansible and bring them up.
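Once the containers are up, one way to sanity-check the cluster is to query system.clusters on any of the nodes:
-- All four replicas of novakwok_cluster should show up here.
SELECT cluster, shard_num, replica_num, host_address
FROM system.clusters
WHERE cluster = 'novakwok_cluster';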
MySQL to Clickhouse replication
Now we need to migrate the existing MySQL data and keep it in sync with Clickhouse. The MySQL we run is Ubuntu's ubuntu/mysql:8.0-20.04_beta image, which has binlog enabled by default and server-id set to 1 (it's even Multi-Arch):
MySQL [(none)]> SELECT @@server_id;
+-------------+
| @@server_id |
+-------------+
|           1 |
+-------------+
1 row in set (0.000 sec)

MySQL [(none)]> show variables like '%bin%';
+----------------------------------+-----------------------------+
| Variable_name                    | Value                       |
+----------------------------------+-----------------------------+
| log_bin                          | ON                          |
| log_bin_basename                 | /var/lib/mysql/binlog       |
| log_bin_index                    | /var/lib/mysql/binlog.index |
| log_statements_unsafe_for_binlog | ON                          |
| mysqlx_bind_address              | *                           |
| sql_log_bin                      | ON                          |
| sync_binlog                      | 1                           |
+----------------------------------+-----------------------------+
Here we use the clickhouse-mysql tool for the migration. First, install the required packages and components:
apt install libmysqlclient-dev python3-pip -y
pip3 install clickhouse-driver
pip3 install mysql-replication
pip3 install clickhouse-mysql
Then start a tmux session and leave the program running in the background:
clickhouse-mysql \
--src-server-id=1 \
--src-resume \
--src-wait \
--nice-pause=1 \
--src-host=10.1.0.10 \
--src-user=root \
--src-password=password \
--src-tables=webp_cloud.logs \
--dst-host=10.1.0.11 \
--dst-schema webp_cloud \
--dst-table logs \
--log-level=info \
--csvpool \
--mempool-max-flush-interval=60 \
--mempool-max-events-num=1000 \
--pump-data \
--migrate-table \
--dst-create-table
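The command above simply uses root as the source user; a dedicated account for clickhouse-mysql would roughly need the privileges below (a sketch: the user name and password are placeholders):
-- Hypothetical dedicated replication user for clickhouse-mysql.
CREATE USER 'repl'@'%' IDENTIFIED BY 'change-me';
-- Reading the binlog requires the global replication privileges;
-- the initial table migration also needs SELECT on the source schema.
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%';
GRANT SELECT ON webp_cloud.* TO 'repl'@'%';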
Once the command is running, clickhouse-mysql migrates the existing data and then keeps replicating everything newly written to MySQL. At this point it's worth comparing the row COUNT on MySQL and on Clickhouse; if Clickhouse has fewer rows than MySQL, you may have lost some data. Don't worry, I ran into that too; see "A few notes on fixing missing rows after migrating data to Clickhouse with clickhouse-mysql" for the fix.
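The check itself is just the same COUNT run on both sides:
-- Run on MySQL and on Clickhouse; the two numbers should match
-- (or differ only by rows that are still in flight).
SELECT COUNT(*) FROM webp_cloud.logs;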
Metabase Patch
After all of the above, we went back to Metabase ready to add Clickhouse and feel that take-off speed. We confidently clicked "Admin" -> "Database", and then...
...there was no Clickhouse!
So someone clearly misread things at the very beginning (assuming Metabase supports Clickhouse out of the box; we wouldn't have chosen Clickhouse otherwise), and that someone was definitely not me.
What a trap!
Fortunately, we found a plugin: https://github.com/enqueue/metabase-clickhouse-driver . Next, Metabase needs a few small modifications:
- Put a plugins directory next to docker-compose.yml and drop https://github.com/enqueue/metabase-clickhouse-driver/releases/download/0.8.1/clickhouse.metabase-driver.jar into it
- Tweak docker-compose.yml: add the MB_PLUGINS_DIR environment variable and mount the plugins directory into the container:
version: '3'
services:
  metabase:
    image: metabase/metabase
    restart: always
    user: root
    volumes:
      - ./metabase-data:/metabase-data
      - ./plugins:/app/plugins
      - ./run_metabase.sh:/app/run_metabase.sh
    ports:
      - 127.0.0.1:3000:3000
    environment:
      MB_DB_FILE: /metabase-data/metabase.db
      MB_PLUGINS_DIR: /app/plugins
      TZ: Asia/Shanghai
- Copy /app/run_metabase.sh out of the container and tweak it so Metabase runs as root (avoiding permission problems when reading the plugins directory)
(Or you can just use my patched version here: https://github.com/n0vad3v/dockerfiles/tree/master/metabase-clickhouse )
Finally, Clickhouse shows up in Metabase!
After another round of dragging and clicking, we got an operations dashboard backed by the data in Clickhouse. To compare loading speed, let's look at the logs:
MySQL
About 4.1 s
GET /api/user/current 200 7.2 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/session/properties 200 30.0 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/database 200 8.6 ms (3 DB calls) App DB connections: 1/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/dashboard/33 200 39.8 ms (14 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/table/38/query_metadata 200 9.5 ms (9 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/collection/root 200 1.5 ms (2 DB calls) App DB connections: 2/15 Jetty threads: 9/50 (0 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
POST /api/dashboard/33/dashcard/33/card/33/query 202 [ASYNC: completed] 1.3 s (19 DB calls) App DB connections: 2/15 Jetty threads: 3/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 5 (0 queued); mysql DB 34 connections: 1/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/37/card/37/query 202 [ASYNC: completed] 1.4 s (19 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 4 (0 queued); mysql DB 34 connections: 0/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/33/card/38/query 202 [ASYNC: completed] 1.9 s (21 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 3 (0 queued); mysql DB 34 connections: 2/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/34/card/34/query 202 [ASYNC: completed] 3.6 s (22 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 2 (0 queued); mysql DB 34 connections: 5/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/36/card/36/query 202 [ASYNC: completed] 3.9 s (20 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 1 (0 queued); mysql DB 34 connections: 3/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/35/card/35/query 202 [ASYNC: completed] 4.1 s (20 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 0 (0 queued); mysql DB 34 connections: 4/6 (0 threads blocked)
Clickhouse
About 825 ms
GET /api/session/properties 200 15.5 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/user/current 200 5.4 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/database 200 9.1 ms (3 DB calls) App DB connections: 0/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/dashboard/34 200 43.0 ms (14 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/table/39/query_metadata 200 14.9 ms (9 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
POST /api/dashboard/34/dashcard/38/card/39/query 202 [ASYNC: completed] 126.1 ms (19 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 0/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/38/card/40/query 202 [ASYNC: completed] 137.7 ms (21 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued); clickhouse DB 35 connections: 1/4 (0 threads blocked)
GET /api/collection/root 200 1.5 ms (2 DB calls) App DB connections: 0/15 Jetty threads: 5/50 (4 idle, 0 queued) (111 total active threads) Queries in flight: 2 (0 queued)
POST /api/dashboard/34/dashcard/39/card/42/query 202 [ASYNC: completed] 296.7 ms (19 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 3 (0 queued); clickhouse DB 35 connections: 1/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/40/card/41/query 202 [ASYNC: completed] 429.3 ms (23 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 2 (0 queued); clickhouse DB 35 connections: 3/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/42/card/44/query 202 [ASYNC: completed] 467.4 ms (20 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 2/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/41/card/43/query 202 [ASYNC: completed] 825.7 ms (20 DB calls) App DB connections: 2/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 0/4 (0 threads blocked)
That's roughly a 5x speedup (4.1 s down to about 825 ms).
With the extra speed we no longer have to sit there watching Metabase spin, and we can build even more Dashboards to support our decisions.