解决用 clickhouse-mysql 迁移数据到 Clickhouse 后丢失部分数据的一点笔记

2022-03-24

在上一篇文章：「在 Metabase 上分析 WebP Cloud Services 运营数据，并使用 Clickhouse 让速度提升 5 倍」中，我们讲到了可以通过 clickhouse-mysql 很方便地将数据从 MySQL 迁移到 Clickhouse，但是其中我们遇到了部分数据丢失的问题，本文将记录一下整个过程，和对应的解决方案。

ReplacingMergeTree Data Loss

在使用 clickhouse-mysql 之后需要额外关注一下导入的数据库的结构（可以在连接到 Clickhouse 之后通过 show create table <DB_NAME>.<TABLE_NAME> 查看，默认可能会使用 ReplacingMergeTree，并且 SORTING KEY 会随机使用一个）

在这种情况下，如果你原有的表中有一个字段是 DateTime 类型的话，可能这个字段会被用作 SORTING KEY，ReplacingMergeTree 的特性是：

it removes duplicate entries with the same sorting key value。

我们通过观察 clickhouse-mysql 的日志可以看出：

Running with chclient CREATE DATABASE IF NOT EXISTS `webp_cloud`;
Running with chclient CREATE TABLE IF NOT EXISTS `webp_cloud`.`logs`  (
    `hashed_remote_identifier` Nullable(String),
    `requested_hostname` Nullable(String),
    `requested_path` Nullable(String),
    ...
    `created_at` DateTime
) 
ENGINE = ReplacingMergeTree() PARTITION BY toYYYYMM(created_at) ORDER BY (created_at)
;

这里用了 ReplacingMergeTree ，在这种情况下，如果有多条记录的 created_at 是一样的话（比如同一秒内的多个请求），那数据就会被丢到只剩一条了。

Change Engine

为了解决这个问题，我们需要改一下使用的 ENGINE，可以通过类似如下语句导出先建库建表的 SQL：

clickhouse-mysql --src-host=10.1.0.10 --src-user=root --src-password=password --create-table-sql-template --with-create-database --src-tables=webp_cloud.logs > logs.sql

SQL 类似如下：

CREATE DATABASE IF NOT EXISTS `webp_cloud`;
CREATE TABLE IF NOT EXISTS `webp_cloud`.`logs`  (
    `hashed_remote_identifier` Nullable(String),
    `requested_hostname` Nullable(String),
    `requested_path` Nullable(String),
    ...
    `created_at` Nullable(DateTime)
) 
ENGINE = MergeTree(<PRIMARY_DATE_FIELD>, (<COMMA_SEPARATED_INDEX_FIELDS_LIST>), 8192)
;

这就很奇怪了，这样导出的 ENGINE 就是 MergeTree 了.

这里我们需要改一下 SQL，首先 MergeTree 要求按照 PARTITION 来，我们决定用 created_at 来进行，由于原库中 created_at 不会为 Null，所以我们需要改一下，改为：

`created_at` DateTime

然后需要改一下 ENGINE 的部分，改为：

ENGINE = MergeTree() PARTITION BY toYYYYMM(created_at) ORDER BY (created_at)

改完之后类似如下：

CREATE DATABASE IF NOT EXISTS `webp_cloud`;
CREATE TABLE IF NOT EXISTS `webp_cloud`.`logs`  (
    `hashed_remote_identifier` Nullable(String),
    `requested_hostname` Nullable(String),
    `requested_path` Nullable(String),
    ...
    `created_at` DateTime
) 
ENGINE = MergeTree() PARTITION BY toYYYYMM(created_at) ORDER BY (created_at)
;

保存为一个 .sql 文件之后导入到 Clickhouse 中：

clickhouse-client --host=10.1.0.10 -mn < ./path/to/that.sql

此时表结构就已经完成建立了，我们可以继续用之前的命令进行导入并保持同步了：

clickhouse-mysql \
--src-server-id=1 \
--src-resume \
--src-wait \
--nice-pause=1 \
--src-host=10.1.0.10 \
--src-user=root \
--src-password=password \
--src-tables=webp_cloud.logs \
--dst-host=10.1.0.11 \
--dst-schema webp_cloud \
--dst-table logs \
--log-level=info \
--csvpool \
--mempool-max-flush-interval=60 \
--mempool-max-events-num=1000 \
--pump-data \
--migrate-table

这个事件充分说明了我对 Clickhouse 真的一窍不通（

#Chinese #ClickHouse