Analyzing WebP Cloud Services operational data with Metabase, and making it 5x faster with ClickHouse

Intro

WebP Cloud Services

We analyze basic operational data: daily traffic, which sites the traffic comes from (the referer), average response latency, and so on. These numbers tell us how the service is doing overall, and they also guide later optimization work.

The data source itself is very simple: every time a request hits our infrastructure, we record an anonymized version of the request details (our infrastructure lives in Germany and Finland, so we want to stay GDPR-compliant) in a database. The table schema looks roughly like this:

CREATE TABLE logs
(
    hashed_remote_identifier varchar(64),
    requested_hostname varchar(100),
    requested_path text,
    ...
    created_at DATETIME
);

This gives us one stored row per request. Crude, but it works.
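
A chart like "requests per day" boils down to Metabase issuing an aggregate query like the sketch below (the column names follow the schema above; everything else is illustrative):

```sql
-- Daily request volume, the kind of query a Metabase chart runs.
-- Without an index on created_at this means a full table scan.
SELECT DATE(created_at) AS day,
       COUNT(*)         AS requests
FROM logs
GROUP BY day
ORDER BY day;
```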

What's that? You're asking why we don't store this data in ES?

Metabase

Once we have these rows, we naturally start wanting some charts: how many requests do we handle per day, and where do they come from?

We could write our own API to pull these charts out of the database, but why bother when a ready-made tool exists? Here we use a fairly mature open-source solution: Metabase.

Deploying Metabase is easy; a docker-compose.yml like the following is all it takes:

version: '3'

services:
  metabase:
    image: metabase/metabase
    restart: always
    volumes:
       - ./metabase-data:/metabase-data
    ports:
      - 127.0.0.1:3000:3000
    environment:
      TZ: Asia/Shanghai

After starting it, point it at the database and you are good to go:

A few clicks are all it takes to build plenty of charts and dashboards.

You can even configure a Pulse so that Metabase emails us a daily report automatically. How thoughtful!

After some more clicking and dragging, we ended up with a rather decent-looking dashboard.

ClickHouse kicked in

But soon we ran into a problem. As site traffic grew, Metabase had to scan the whole table to fetch this data, and it got slower and slower; page loads started being measured in seconds. To fix this, we decided to replicate the data to ClickHouse in real time and render the charts from ClickHouse instead.
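
For context on why this helps: ClickHouse stores data by column and is built for exactly these scan-heavy aggregations. The replicated table ends up looking roughly like the hand-written sketch below (in practice the replication tool we use later creates it for us, so the exact types and sort key here are assumptions):

```sql
-- Illustrative ClickHouse counterpart of the MySQL logs table.
CREATE TABLE webp_cloud.logs
(
    hashed_remote_identifier String,
    requested_hostname       String,
    requested_path           String,
    created_at               DateTime
)
ENGINE = MergeTree
ORDER BY created_at;  -- sorting by time keeps date-range charts fast
```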

Our infrastructure consists of 4 servers. To get a reasonably highly available ClickHouse cluster, I (who knew nothing about operating ClickHouse) used a cluster config generator I wrote myself: https://github.com/n0vad3v/simple-multinode-clickhouse-cluster , and described the cluster topology:

global:
  clickhouse_image: "yandex/clickhouse-server:21.3.2.5"
  zookeeper_image: "bitnami/zookeeper:3.6.1"

zookeeper_servers:
  - host: 10.1.0.10
  - host: 10.1.0.11
  - host: 10.1.0.12
  - host: 10.1.0.13

clickhouse_servers:
  - host: 10.1.0.10
  - host: 10.1.0.11
  - host: 10.1.0.12
  - host: 10.1.0.13

clickhouse_topology:
  - clusters:
      - name: "novakwok_cluster"
        shards:
          - name: "novakwok_shard"
            servers:
              - host: 10.1.0.10
              - host: 10.1.0.11
              - host: 10.1.0.12
              - host: 10.1.0.13

After generating the docker-compose.yml files, we deployed them to the machines with Ansible and started everything up.

MySQL to ClickHouse replication

Now we need to migrate the existing MySQL data and keep ClickHouse in sync. Our MySQL runs Ubuntu's ubuntu/mysql:8.0-20.04_beta image, which ships with the binlog enabled by default and server-id set to 1 (it is even multi-arch):

MySQL [(none)]> SELECT @@server_id;
+-------------+
| @@server_id |
+-------------+
|           1 |
+-------------+
1 row in set (0.000 sec)

MySQL [(none)]> show variables like '%bin%';
+------------------------------------------------+-----------------------------+
| Variable_name                                  | Value                       |
+------------------------------------------------+-----------------------------+
| log_bin                                        | ON                          |
| log_bin_basename                               | /var/lib/mysql/binlog       |
| log_bin_index                                  | /var/lib/mysql/binlog.index |
| log_statements_unsafe_for_binlog               | ON                          |
| mysqlx_bind_address                            | *                           |
| sql_log_bin                                    | ON                          |
| sync_binlog                                    | 1                           |
+------------------------------------------------+-----------------------------+
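
Before pointing a replication client at MySQL, it is worth double-checking the settings it depends on. In this image they are all defaults, and can be confirmed in one query:

```sql
-- All three must hold for binlog-based replication to work:
-- a non-zero server id, binlog enabled, and row-based format.
SELECT @@server_id, @@log_bin, @@binlog_format;
```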

We use the clickhouse-mysql tool for the migration. First, install the required packages:

apt install libmysqlclient-dev python3-pip -y
pip3 install clickhouse-driver
pip3 install mysql-replication
pip3 install clickhouse-mysql

Then start it inside a tmux session so it keeps running in the background:

clickhouse-mysql \
--src-server-id=1 \
--src-resume \
--src-wait \
--nice-pause=1 \
--src-host=10.1.0.10 \
--src-user=root \
--src-password=password \
--src-tables=webp_cloud.logs \
--dst-host=10.1.0.11 \
--dst-schema webp_cloud \
--dst-table logs \
--log-level=info \
--csvpool \
--mempool-max-flush-interval=60 \
--mempool-max-events-num=1000 \
--pump-data \
--migrate-table \
--dst-create-table 

At this point clickhouse-mysql migrates the existing data and keeps replicating everything newly written to MySQL. It is a good idea to compare the row COUNT on both MySQL and ClickHouse now: if ClickHouse has fewer rows, you may have lost data. Don't worry, I hit this too; see "Notes on fixing missing rows after migrating data to ClickHouse with clickhouse-mysql" for the fix.
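
The row-count sanity check is just the same query on both sides:

```sql
-- Run on MySQL (the --src-host above) and on ClickHouse (the --dst-host);
-- once the initial migration has caught up, the two numbers should match.
SELECT COUNT(*) FROM webp_cloud.logs;
```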

Metabase Patch

After all of the above we headed back to Metabase, ready to add ClickHouse and enjoy blazing speed. We confidently clicked "Admin" -> "Database", and then...

...found there is no ClickHouse option!

So clearly somebody misread at the start (assuming Metabase supported it natively; otherwise we wouldn't have picked ClickHouse), and that somebody was definitely not me.

What a trap!

Fortunately, there is a driver for this: https://github.com/enqueue/metabase-clickhouse-driver . It just takes a little surgery on our Metabase setup:

  1. Create a plugins directory next to docker-compose.yml and drop https://github.com/enqueue/metabase-clickhouse-driver/releases/download/0.8.1/clickhouse.metabase-driver.jar into it.
  2. Adjust docker-compose.yml to set the MB_PLUGINS_DIR environment variable and mount the plugins directory:
version: '3'

services:

  metabase:
    image: metabase/metabase
    restart: always
    user: root
    volumes:
       - ./metabase-data:/metabase-data
       - ./plugins:/app/plugins
       - ./run_metabase.sh:/app/run_metabase.sh
    ports:
      - 127.0.0.1:3000:3000
    environment:
      MB_DB_FILE: /metabase-data/metabase.db
      MB_PLUGINS_DIR: /app/plugins
      TZ: Asia/Shanghai
  3. Copy /app/run_metabase.sh out of the container and tweak it so Metabase runs as root (avoiding permission problems when reading the plugins directory).

(Or you can use my patched version directly: https://github.com/n0vad3v/dockerfiles/tree/master/metabase-clickhouse )

Finally, ClickHouse shows up in Metabase!

After another round of dragging and clicking, we had an operations dashboard backed by ClickHouse. To compare load times, let's look at the logs:

MySQL

Roughly 4.1 s in total:

GET /api/user/current 200 7.2 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/session/properties 200 30.0 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/database 200 8.6 ms (3 DB calls) App DB connections: 1/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/dashboard/33 200 39.8 ms (14 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/table/38/query_metadata 200 9.5 ms (9 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/collection/root 200 1.5 ms (2 DB calls) App DB connections: 2/15 Jetty threads: 9/50 (0 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
POST /api/dashboard/33/dashcard/33/card/33/query 202 [ASYNC: completed] 1.3 s (19 DB calls) App DB connections: 2/15 Jetty threads: 3/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 5 (0 queued); mysql DB 34 connections: 1/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/37/card/37/query 202 [ASYNC: completed] 1.4 s (19 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 4 (0 queued); mysql DB 34 connections: 0/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/33/card/38/query 202 [ASYNC: completed] 1.9 s (21 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (113 total active threads) Queries in flight: 3 (0 queued); mysql DB 34 connections: 2/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/34/card/34/query 202 [ASYNC: completed] 3.6 s (22 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 2 (0 queued); mysql DB 34 connections: 5/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/36/card/36/query 202 [ASYNC: completed] 3.9 s (20 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 1 (0 queued); mysql DB 34 connections: 3/6 (0 threads blocked)
POST /api/dashboard/33/dashcard/35/card/35/query 202 [ASYNC: completed] 4.1 s (20 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (8 idle, 0 queued) (112 total active threads) Queries in flight: 0 (0 queued); mysql DB 34 connections: 4/6 (0 threads blocked)

Clickhouse

Roughly 825 ms in total:

GET /api/session/properties 200 15.5 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/user/current 200 5.4 ms (4 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/database 200 9.1 ms (3 DB calls) App DB connections: 0/15 Jetty threads: 4/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/dashboard/34 200 43.0 ms (14 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (5 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
GET /api/table/39/query_metadata 200 14.9 ms (9 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued)
POST /api/dashboard/34/dashcard/38/card/39/query 202 [ASYNC: completed] 126.1 ms (19 DB calls) App DB connections: 1/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 0/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/38/card/40/query 202 [ASYNC: completed] 137.7 ms (21 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 0 (0 queued); clickhouse DB 35 connections: 1/4 (0 threads blocked)
GET /api/collection/root 200 1.5 ms (2 DB calls) App DB connections: 0/15 Jetty threads: 5/50 (4 idle, 0 queued) (111 total active threads) Queries in flight: 2 (0 queued)
POST /api/dashboard/34/dashcard/39/card/42/query 202 [ASYNC: completed] 296.7 ms (19 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 3 (0 queued); clickhouse DB 35 connections: 1/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/40/card/41/query 202 [ASYNC: completed] 429.3 ms (23 DB calls) App DB connections: 0/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 2 (0 queued); clickhouse DB 35 connections: 3/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/42/card/44/query 202 [ASYNC: completed] 467.4 ms (20 DB calls) App DB connections: 0/15 Jetty threads: 3/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 2/4 (0 threads blocked)
POST /api/dashboard/34/dashcard/41/card/43/query 202 [ASYNC: completed] 825.7 ms (20 DB calls) App DB connections: 2/15 Jetty threads: 2/50 (6 idle, 0 queued) (111 total active threads) Queries in flight: 1 (0 queued); clickhouse DB 35 connections: 0/4 (0 threads blocked)

That's about 5x faster (4.1 s vs. 825 ms for the slowest dashboard card).

With the extra speed, we no longer sit around watching Metabase's spinner, and we can build even more dashboards to support our decisions~
