Quick notes on building a Qdrant cluster on Hetzner with Docker and migrating data into it

2025-05-17

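Anyone who has worked on RAG knows that as your data grows, poor recall is not the only problem: the vector database keeps growing until it approaches your memory limit. At that point a single large-memory server can no longer meet your availability and speed requirements. A single machine always has a memory ceiling, it cannot be upgraded smoothly the way a cluster can with a rolling upgrade, and one large-memory machine generally costs more than several small-memory ones. I recently had to migrate a fairly large standalone Qdrant instance to a Qdrant cluster deployment, hence this short note.

I just realized it has been almost a year since I last wrote a post with any technical content…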

The original single-node Qdrant deployment

The existing Qdrant deployment is very simple: a single docker-compose.yml file with the following content:

services:
  qdrant:
    image: qdrant/qdrant:v1.14.0
    restart: always
    ports:
      - 6333:6333
      - 6334:6334
    volumes:
      - ./volumes/qdrant:/qdrant/storage

This Qdrant instance runs on a Hetzner CCX53 machine with 32 cores and 128 GB of memory. The qdrant directory is about 200 GB and holds 21,000,000+ vectors in total.

{
  "result": {
    "status": "yellow",
    "optimizer_status": "ok",
    "indexed_vectors_count": 21297932,
    "points_count": 21337945,
    "segments_count": 10
  }
}
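
To put these numbers in perspective, here is a rough back-of-the-envelope estimate of the raw vector size. It assumes 1536-dimensional vectors (the size in the collection config shown later in this post) stored as 4-byte float32 values, and it ignores payloads, indexes, and every other overhead. The function name is made up for illustration.

```python
def raw_vector_bytes(points: int, dim: int, bytes_per_value: int = 4) -> int:
    """Size of the raw vector data alone, assuming fixed-width values (float32 by default)."""
    return points * dim * bytes_per_value


# points_count comes from the collection info above; 1536 is this collection's vector size.
total = raw_vector_bytes(21_337_945, 1536)
print(f"{total / 2**30:.1f} GiB of raw float32 vectors")
```

Under these assumptions the raw vectors alone come to about 122 GiB, which is already close to the 128 GB of memory on this machine.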

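Cluster environment

Basic configuration

First, the machines in the cluster. Qdrant uses Raft as its consensus protocol, so the deployment should have at least 3 machines. The initial plan starts with 3 machines. The test environment is on Hetzner, and the three new machines have the following IPs: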

10.0.0.6
10.0.0.7
10.0.0.9

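Note: this is only for ease of demonstration. If you are also on Hetzner, you can use cloud-init together with the private IP to deploy a Qdrant cluster quickly; the end of this post covers that in detail.

Since we deploy with Docker, we only need to create a docker-compose.yml file on 10.0.0.6 with the following content: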

services:
  qdrant_node1:
    image: qdrant/qdrant:v1.14.0
    restart: always
    volumes:
      - ./qdrant_storage:/qdrant/storage
      - ./qdrant_snapshots:/qdrant/snapshots
    ports:
      - "6333:6333"
      - "6334:6334"
      - "6335:6335"
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: "./qdrant --uri http://10.0.0.6:6335"

On 10.0.0.7, create a docker-compose.yml file with the following content:

services:
  qdrant_node2:
    image: qdrant/qdrant:v1.14.0
    ports:
      - "6333:6333"
      - "6334:6334"
      - "6335:6335"
    volumes:
      - ./qdrant_storage:/qdrant/storage
      - ./qdrant_snapshots:/qdrant/snapshots
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: "./qdrant --bootstrap http://10.0.0.6:6335 --uri http://10.0.0.7:6335"

On 10.0.0.9, create a docker-compose.yml file with the following content:

services:
  qdrant_node3:
    image: qdrant/qdrant:v1.14.0
    ports:
      - "6333:6333"
      - "6334:6334"
      - "6335:6335"
    volumes:
      - ./qdrant_storage:/qdrant/storage
      - ./qdrant_snapshots:/qdrant/snapshots
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: "./qdrant --bootstrap http://10.0.0.6:6335 --uri http://10.0.0.9:6335"

Then run docker-compose up -d on each machine to start the nodes.
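
The three compose files differ only in the command line, so the pattern is easy to capture in a few lines. The sketch below just builds the command strings shown above; the helper name and the constant are mine, not anything Qdrant ships.

```python
from typing import Optional

P2P_PORT = 6335  # the cluster-internal port exposed in the compose files above


def qdrant_command(node_ip: str, bootstrap_ip: Optional[str] = None) -> str:
    """Build the `command:` value for one node.

    The first node gets only --uri; every other node also gets --bootstrap
    pointing at a node that is already part of the cluster.
    """
    parts = ["./qdrant"]
    if bootstrap_ip is not None:
        parts += ["--bootstrap", f"http://{bootstrap_ip}:{P2P_PORT}"]
    parts += ["--uri", f"http://{node_ip}:{P2P_PORT}"]
    return " ".join(parts)


nodes = ["10.0.0.6", "10.0.0.7", "10.0.0.9"]
commands = [qdrant_command(nodes[0])] + [qdrant_command(ip, nodes[0]) for ip in nodes[1:]]
for cmd in commands:
    print(cmd)
```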

Once all nodes are up, visit http://localhost:6333/cluster on any machine to see the cluster status, for example:

{
  "result": {
    "status": "enabled",
    "peer_id": 5395257186314509,
    "peers": {
      "3095816753490206": {
        "uri": "http://10.0.0.9:6335/"
      },
      "5395257186314509": {
        "uri": "http://10.0.0.6:6335/"
      },
      "4182395837949771": {
        "uri": "http://10.0.0.7:6335/"
      }
    },
    "raft_info": {
      "term": 1,
      "commit": 41,
      "pending_operations": 0,
      "leader": 5395257186314509,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2025-05-17T02:31:10.703071457Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.000011782
}
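
If you check this often, it helps to boil the response down to the handful of fields that matter. A minimal sketch, assuming the response has the shape shown above; the function name and the choice of fields are mine.

```python
def summarize_cluster(cluster: dict) -> dict:
    """Pull the few fields worth checking out of a /cluster response."""
    result = cluster.get("result", {})
    peers = result.get("peers", {})
    raft = result.get("raft_info", {})
    leader = raft.get("leader")
    return {
        "enabled": result.get("status") == "enabled",
        "peer_count": len(peers),
        # Peer IDs are JSON object keys (strings), while the leader is a number.
        "leader_uri": peers.get(str(leader), {}).get("uri"),
        "pending_operations": raft.get("pending_operations"),
        "send_failures": bool(result.get("message_send_failures")),
    }


# Trimmed copy of the response above.
sample = {
    "result": {
        "status": "enabled",
        "peers": {
            "3095816753490206": {"uri": "http://10.0.0.9:6335/"},
            "5395257186314509": {"uri": "http://10.0.0.6:6335/"},
            "4182395837949771": {"uri": "http://10.0.0.7:6335/"},
        },
        "raft_info": {"pending_operations": 0, "leader": 5395257186314509},
        "message_send_failures": {},
    }
}
summary = summarize_cluster(sample)
print(summary)
```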

Creating the collection

Now that we have a distributed Qdrant cluster, the official documentation at https://qdrant.tech/documentation/guides/distributed_deployment/#making-use-of-a-new-distributed-qdrant-cluster tells us:

When you enable distributed mode and scale up to two or more nodes, your data does not move to the new node automatically; it starts out empty. To make use of your new empty node, do one of the following:

Create a new replicated collection by setting the replication_factor to 2 or more and setting the number of shards to a multiple of your number of nodes.
If you have an existing collection which does not contain enough shards for each node, you must create a new collection as described in the previous bullet point.
If you already have enough shards for each node and you merely need to replicate your data, follow the directions for creating new shard replicas.
If you already have enough shards for each node and your data is already replicated, you can move data (without replicating it) onto the new node(s) by moving shards.

Our existing Qdrant is a single node, and the existing collection on it has only a single shard. The collection config is as follows:

{
  "params": {
    "vectors": {
      "size": 1536,
      "distance": "Cosine",
      "on_disk": true
    },
    "shard_number": 1,
    "replication_factor": 1,
    "write_consistency_factor": 1,
    "on_disk_payload": true
  },
}
...

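So the first step is to create a collection on the new cluster with the same parameters as the current one, except that shard_number and replication_factor need to change.

As with ClickHouse, creating a cluster does not mean your data is automatically distributed and replicated; you have to specify that for your data explicitly.

For these two parameters, the official documentation recommends: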

If you anticipate a lot of growth, we recommend 12 shards since you can expand from 1 node up to 2, 3, 6, and 12 nodes without having to re-shard. Having more than 12 shards in a small cluster may not be worth the performance overhead.

"anticipate a lot of growth" sounds a lot like our situation, so I created the new collection with shard_number set to 12 and replication_factor set to 2, using the following script:

from qdrant_client import QdrantClient
import qdrant_client.http.models as models

collection_name = "new_collection"

client = QdrantClient(
    host="10.0.0.6", port=6333
)

vectors_config = models.VectorParams(
    size=1536, distance=models.Distance.COSINE, on_disk=True
)
hnsw_config = models.HnswConfigDiff(
    m=0,
    payload_m=16,
    ef_construct=100,
    full_scan_threshold=10000,
    max_indexing_threads=100,
    on_disk=True,
)
client.create_collection(
    collection_name=collection_name,
    vectors_config=vectors_config,
    shard_number=12,
    replication_factor= 2,
    hnsw_config=hnsw_config
)
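
The appeal of 12 shards is easy to see with a little arithmetic: with replication_factor 2 there are 24 shard replicas in total, and they divide evenly across the node counts the documentation mentions. A quick sketch (the helper name is mine):

```python
def replicas_per_node(shard_number: int, replication_factor: int, nodes: int):
    """Return (replicas per node, leftover) when all shard replicas are spread evenly."""
    return divmod(shard_number * replication_factor, nodes)


# A single node is left out: two replicas of a shard need two different nodes.
for nodes in (2, 3, 6, 12):
    per_node, leftover = replicas_per_node(12, 2, nodes)
    print(f"{nodes} nodes: {per_node} replicas each, {leftover} left over")
```

A leftover of 0 means every node can hold exactly the same number of replicas; with 5 nodes, for example, 4 replicas would be left over after giving each node 4.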

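Migrating the data

With the new cluster created, we need to import the data from the existing instance into it. I first tried creating a snapshot on the old instance and importing it into the new cluster, but that fails with: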

{"status":{"error":"Wrong input: Snapshot is not compatible with existing collection: Collection shard number: 3 Snapshot shard number: 1"},"time":1107.142566774}

Instead, we need one of Qdrant's tools, currently in beta, to do the migration: https://github.com/qdrant/migration/

Usage:

docker run --net=host --rm -it registry.cloud.qdrant.io/library/qdrant-migration qdrant \
    --source-url 'http://localhost:6334' \
    --source-collection 'new_collection' \
    --target-url 'http://10.0.0.6:6334' \
    --target-collection 'new_collection'

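Note: the registry.cloud.qdrant.io/library/qdrant-migration image is fairly old; I recommend building an image from the latest code and running that.
I also recommend manually patching the grpc.MaxCallRecvMsgSize parameter in the code and raising batch-size (the default is 50; it can be raised to 20000) for a faster import. Related issue: https://github.com/qdrant/migration/issues/30#issuecomment-2876456943

To maximize import speed, follow the article at https://qdrant.tech/articles/indexing-optimization/#2-disable-hnsw-for-dense-vectors-m0: disable HNSW in the new cluster's hnsw_config and set a reasonable indexing_threshold, for example: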

PATCH /collections/your_collection
{
  "hnsw_config": {
    "m": 0
  },
  "optimizers_config": {
    "indexing_threshold": 10000
  }
}

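The former ensures that no HNSW index is built during the import. The latter ensures imported vectors are flushed to disk once they reach the 10000 threshold, so that vectors do not all pile up in memory and cause an OOM.

Once the import is finished, you can re-enable HNSW to build the index.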
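
The before and after settings can be kept side by side as two small payload builders. This is only a sketch: the function names are mine, m=16 is an illustrative value rather than a recommendation, and the key is spelled optimizers_config here because that is the field name in the collection update API.

```python
import json


def bulk_import_settings(indexing_threshold: int = 10000) -> dict:
    """Body for PATCH /collections/<name> before the import: no HNSW graph is built."""
    return {
        "hnsw_config": {"m": 0},
        "optimizers_config": {"indexing_threshold": indexing_threshold},
    }


def restore_hnsw_settings(m: int = 16) -> dict:
    """Body for PATCH /collections/<name> after the import: build the HNSW graph again.

    m=16 is only an example; use whatever value your collection actually needs.
    """
    return {"hnsw_config": {"m": m}}


print(json.dumps(bulk_import_settings(), indent=2))
print(json.dumps(restore_hnsw_settings(), indent=2))
```

Either body can then be sent with something like requests.patch(f"{base_url}/collections/<collection_name>", json=body).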

Cluster information after migration

To inspect the shards on each node, we can use the /collections/<collection_name>/cluster API. At this point the response looks like this:

{"result":{"peer_id":5395257186314509,"shard_count":12,"local_shards":[{"shard_id":1,"points_count":1794606,"state":"Active"},{"shard_id":2,"points_count":1450924,"state":"Active"},{"shard_id":4,"points_count":1902963,"state":"Active"},{"shard_id":5,"points_count":1774613,"state":"Active"},{"shard_id":7,"points_count":1753521,"state":"Active"},{"shard_id":8,"points_count":1687892,"state":"Active"},{"shard_id":10,"points_count":1477543,"state":"Active"},{"shard_id":11,"points_count":2051536,"state":"Active"}],"remote_shards":[{"shard_id":0,"peer_id":4182395837949771,"state":"Active"},{"shard_id":0,"peer_id":3095816753490206,"state":"Active"},{"shard_id":1,"peer_id":4182395837949771,"state":"Active"},{"shard_id":2,"peer_id":3095816753490206,"state":"Active"},{"shard_id":3,"peer_id":3095816753490206,"state":"Active"},{"shard_id":3,"peer_id":4182395837949771,"state":"Active"},{"shard_id":4,"peer_id":4182395837949771,"state":"Active"},{"shard_id":5,"peer_id":3095816753490206,"state":"Active"},{"shard_id":6,"peer_id":3095816753490206,"state":"Active"},{"shard_id":6,"peer_id":4182395837949771,"state":"Active"},{"shard_id":7,"peer_id":4182395837949771,"state":"Active"},{"shard_id":8,"peer_id":3095816753490206,"state":"Active"},{"shard_id":9,"peer_id":3095816753490206,"state":"Active"},{"shard_id":9,"peer_id":4182395837949771,"state":"Active"},{"shard_id":10,"peer_id":4182395837949771,"state":"Active"},{"shard_id":11,"peer_id":3095816753490206,"state":"Active"}],"shard_transfers":[]},"status":"ok","time":0.00011113}

But… this is far too hard to read, so we need to write a small tool ourselves to check how shards are distributed across the nodes (peers).

Let's knock one out in Python!

import requests

base_url = "http://10.0.0.6:6333"

cluster_endpoint = "/cluster"
collections_endpoint = "/collections/<collection_name>/cluster"

def get_data_from_api(endpoint):
    response = requests.get(base_url + endpoint)
    return response.json()

def parse_cluster_peers(cluster_data):
    peers = cluster_data.get("result", {}).get("peers", {})
    ip_peer_map = {}
    for peer_id, peer_info in peers.items():
        uri = peer_info.get("uri", "")
        ip_address = uri.split("//")[-1].split(":")[0]
        ip_peer_map[ip_address] = int(peer_id)
    return ip_peer_map


def parse_shards(collections_data):
    local_shards = collections_data.get("result", {}).get("local_shards", [])
    remote_shards = collections_data.get("result", {}).get("remote_shards", [])

    peer_shard_map = {}

    for shard in local_shards:
        peer_id = collections_data.get("result", {}).get("peer_id")
        shard_id = shard.get("shard_id")
        peer_shard_map.setdefault(peer_id, []).append(shard_id)

    for shard in remote_shards:
        peer_id = shard.get("peer_id")
        shard_id = shard.get("shard_id")
        peer_shard_map.setdefault(peer_id, []).append(shard_id)

    return peer_shard_map

def main():
    cluster_data = get_data_from_api(cluster_endpoint)
    collections_data = get_data_from_api(collections_endpoint)

    ip_peer_map = parse_cluster_peers(cluster_data)

    peer_shard_map = parse_shards(collections_data)

    ip_shard_map = {}
    for ip, peer_id in ip_peer_map.items():
        if peer_id in peer_shard_map:
            ip_shard_map[ip] = peer_shard_map[peer_id]
        else:
            ip_shard_map[ip] = []

    for ip, shard_ids in ip_shard_map.items():
        peer_id = ip_peer_map[ip]
        print(f"IP: {ip}, Peer ID: {peer_id}, Shard IDs: {shard_ids}")

if __name__ == "__main__":
    main()

Running the script makes the shard distribution across nodes easy to see:

IP: 10.0.0.7, Peer ID: 4182395837949771, Shard IDs: [0, 1, 3, 4, 6, 7, 9, 10]
IP: 10.0.0.6, Peer ID: 5395257186314509, Shard IDs: [1, 2, 4, 5, 7, 8, 10, 11]
IP: 10.0.0.9, Peer ID: 3095816753490206, Shard IDs: [0, 2, 3, 5, 6, 8, 9, 11]

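As you can see, the shards are evenly distributed across the 3 machines. At this point, no single machine going offline or breaking takes out every replica of any shard, so no data is lost.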
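
That claim is easy to verify mechanically instead of by eye. A small sketch that takes the distribution printed above (keyed by IP here for readability) and checks two things: every shard has the expected number of replicas on distinct peers, and losing any single peer still leaves a copy of every shard. Both helper names are mine.

```python
def check_placement(peer_shards: dict, replication_factor: int) -> list:
    """Return a list of problems; an empty list means the placement looks sane."""
    holders = {}
    for peer, shards in peer_shards.items():
        for shard in shards:
            holders.setdefault(shard, []).append(peer)
    problems = []
    for shard, peers in sorted(holders.items()):
        if len(set(peers)) != len(peers):
            problems.append(f"shard {shard}: two replicas on the same peer")
        if len(set(peers)) < replication_factor:
            problems.append(f"shard {shard}: only {len(set(peers))} distinct replica(s)")
    return problems


def survives_loss_of(peer_shards: dict, offline: set) -> bool:
    """True if every shard still has a replica on a peer outside `offline`."""
    all_shards = {s for shards in peer_shards.values() for s in shards}
    online = {s for peer, shards in peer_shards.items() if peer not in offline for s in shards}
    return all_shards == online


placement = {
    "10.0.0.7": [0, 1, 3, 4, 6, 7, 9, 10],
    "10.0.0.6": [1, 2, 4, 5, 7, 8, 10, 11],
    "10.0.0.9": [0, 2, 3, 5, 6, 8, 9, 11],
}
print(check_placement(placement, replication_factor=2))
print(all(survives_loss_of(placement, {peer}) for peer in placement))
```

Losing two of the three machines at once is a different story: 10.0.0.6 and 10.0.0.7 together hold both replicas of shards 1, 4, 7, and 10.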

Scaling out

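Suppose the business keeps growing and 3 nodes gradually stop being enough, so we need to add nodes. Because we used shard_number 12 above, we can scale out in multiples of 3. We have 3 nodes now, so let's add 3 more, with the following IPs: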

10.0.0.10
10.0.0.11
10.0.0.12

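They are created the same way as at the start of this post: each node just needs its own --uri plus --bootstrap http://10.0.0.6:6335. After the nodes have joined, the /cluster endpoint responds as follows: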

{
  "result": {
    "status": "enabled",
    "peer_id": 5395257186314509,
    "peers": {
      "3095816753490206": {
        "uri": "http://10.0.0.9:6335/"
      },
      "4182395837949771": {
        "uri": "http://10.0.0.7:6335/"
      },
      "3841618339255269": {
        "uri": "http://10.0.0.10:6335/"
      },
      "3658649898688837": {
        "uri": "http://10.0.0.12:6335/"
      },
      "5395257186314509": {
        "uri": "http://10.0.0.6:6335/"
      },
      "8689864553665627": {
        "uri": "http://10.0.0.11:6335/"
      }
    },
    "raft_info": {
      "term": 1,
      "commit": 50,
      "pending_operations": 0,
      "leader": 5395257186314509,
      "role": "Leader",
      "is_voter": true
    },
    "consensus_thread_status": {
      "consensus_thread_status": "working",
      "last_update": "2025-05-17T02:49:29.351230053Z"
    },
    "message_send_failures": {}
  },
  "status": "ok",
  "time": 0.000011121
}

If you are an experienced GlusterFS user, your first instinct is surely to run a rebalance with the following command to even out the data across nodes:

gluster volume rebalance VOLNAME start

Unfortunately, the open-source version of Qdrant has no such feature (though their Cloud does):

It’s worth mentioning that Qdrant only provides the necessary building blocks to create an automated failure recovery. Building a completely automatic process of collection scaling would require control over the cluster machines themself. Check out our cloud solution, where we made exactly that.
Shards are evenly distributed across all existing nodes when a collection is first created, but Qdrant does not automatically rebalance shards if your cluster size or replication factor changes (since this is an expensive operation on large clusters). See the next section for how to move shards after scaling operations.
Source: https://qdrant.tech/documentation/guides/distributed_deployment/#choosing-the-right-number-of-shard

If we run the script above again at this point, we find:

IP: 10.0.0.6, Peer ID: 5395257186314509, Shard IDs: [1, 2, 4, 5, 7, 8, 10, 11]
IP: 10.0.0.9, Peer ID: 3095816753490206, Shard IDs: [0, 2, 3, 5, 6, 8, 9, 11]
IP: 10.0.0.12, Peer ID: 3658649898688837, Shard IDs: []
IP: 10.0.0.7, Peer ID: 4182395837949771, Shard IDs: [0, 1, 3, 4, 6, 7, 9, 10]
IP: 10.0.0.11, Peer ID: 8689864553665627, Shard IDs: []
IP: 10.0.0.10, Peer ID: 3841618339255269, Shard IDs: []

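The newly added nodes are just standing around idle; all the shards are still on the old peers. What now?

Rebalance: trapping rain water?

The official documentation says an API for moving shards is already provided: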

curl -X POST http://localhost:6333/collections/collection_name/cluster \
     -H "api-key: <apiKey>" \
     -H "Content-Type: application/json" \
     -d '{
  "move_shard": {
    "shard_id": 1,
    "to_peer_id": 1000000,
    "from_peer_id": 1000000
  }
}'

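So a natural solution comes to mind: rebalance the shards manually. First work out how many shards each node (peer) should hold, which in this scenario is:

(number of shards * number of replicas) / number of machines

That is, (12*2)/6 = 4

Then we can work out which nodes hold more shards than this and which hold fewer, and compute a set of moves (robbing the rich to feed the poor, so to speak):

⚠️ Be careful not to place two copies of the same shard on one peer; if that peer dies, the data for that shard is gone.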

# Fragment continuing the script above. It assumes peer_shard_map maps every peer_id
# in the cluster to its list of shard_ids, with an empty list for the new peers.
average_shards_per_peer = 4  # (12 shards * 2 replicas) / 6 peers

# Peers holding fewer shards than the target
underfilled_peers = []
for peer_id, shard_ids in peer_shard_map.items():
    if len(shard_ids) < average_shards_per_peer:
        underfilled_peers.append(peer_id)

# Peers holding more shards than the target
overfilled_peers = []
for peer_id, shard_ids in peer_shard_map.items():
    if len(shard_ids) > average_shards_per_peer:
        overfilled_peers.append(peer_id)

print("underfilled_peers")
print(underfilled_peers)
print("overfilled_peers")
print(overfilled_peers)

rebalance_operations = []
for overfilled_peer in overfilled_peers:
    # Check if overfilled_peer_shard is not already in underfilled_peer, do not move two shards to the same peer
    for underfilled_peer in underfilled_peers:
        # Note: this list is mutated by remove() below while it is being iterated,
        # so after each move the loop skips the next shard in the list.
        for overfilled_peer_shard in peer_shard_map[overfilled_peer]:
            if len(peer_shard_map[underfilled_peer]) < average_shards_per_peer and overfilled_peer_shard not in peer_shard_map[underfilled_peer] and len(peer_shard_map[overfilled_peer]) > average_shards_per_peer:
                print(f"Move shard_id {overfilled_peer_shard} from peer_id {overfilled_peer} to peer_id {underfilled_peer}")
                rebalance_operations.append((overfilled_peer, underfilled_peer, overfilled_peer_shard))
                peer_shard_map[underfilled_peer].append(overfilled_peer_shard)
                peer_shard_map[overfilled_peer].remove(overfilled_peer_shard)
            else:
                # Target is full, source is no longer overfilled, or the shard is already on the target: skip
                continue

Of course, the logic above is simple and crude; I'm sure you, the reader, can write something better.

With that, barring bugs, we get a list of moves:

Shard IDs per peer_id:
{5395257186314509: [1, 2, 4, 5, 7, 8, 10, 11], 4182395837949771: [0, 1, 3, 4, 6, 7, 9, 10], 3095816753490206: [0, 2, 3, 5, 6, 8, 9, 11]}
Underfilled peer_ids:
[3658649898688837, 3841618339255269, 8689864553665627]
Overfilled peer_ids:
[5395257186314509, 4182395837949771, 3095816753490206]
Move shard_id 1 from peer_id 5395257186314509 to peer_id 3658649898688837
Move shard_id 4 from peer_id 5395257186314509 to peer_id 3658649898688837
Move shard_id 7 from peer_id 5395257186314509 to peer_id 3658649898688837
Move shard_id 10 from peer_id 5395257186314509 to peer_id 3658649898688837
Move shard_id 0 from peer_id 4182395837949771 to peer_id 3841618339255269
Move shard_id 3 from peer_id 4182395837949771 to peer_id 3841618339255269
Move shard_id 6 from peer_id 4182395837949771 to peer_id 3841618339255269
Move shard_id 9 from peer_id 4182395837949771 to peer_id 3841618339255269
Move shard_id 0 from peer_id 3095816753490206 to peer_id 8689864553665627
Move shard_id 3 from peer_id 3095816753490206 to peer_id 8689864553665627
Move shard_id 6 from peer_id 3095816753490206 to peer_id 8689864553665627
Move shard_id 9 from peer_id 3095816753490206 to peer_id 8689864553665627
Shard IDs per peer_id after rebalance:
Peer ID: 5395257186314509, Shard IDs: [2, 5, 8, 11]
Peer ID: 4182395837949771, Shard IDs: [1, 4, 7, 10]
Peer ID: 3095816753490206, Shard IDs: [2, 5, 8, 11]
Peer ID: 3658649898688837, Shard IDs: [1, 4, 7, 10]
Peer ID: 3841618339255269, Shard IDs: [0, 3, 6, 9]
Peer ID: 8689864553665627, Shard IDs: [0, 3, 6, 9]
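
For comparison, here is one way the planning step could be written without mutating a list while iterating over it. It is a sketch of a simple greedy approach, not the script that produced the output above, so the moves it picks differ. The function name and the short peer labels are made up for illustration.

```python
def plan_rebalance(peer_shards: dict, target: int):
    """Greedy plan: fill each underfilled peer from the currently most loaded peers.

    peer_shards maps peer -> list of shard ids and must include empty peers.
    Returns (moves, resulting layout); each move is (from_peer, to_peer, shard_id).
    """
    state = {peer: set(shards) for peer, shards in peer_shards.items()}
    moves = []
    for dst in sorted(state, key=lambda p: len(state[p])):
        while len(state[dst]) < target:
            move = None
            # Try donors from most to least loaded; stop once they are not overfilled.
            for src in sorted(state, key=lambda p: len(state[p]), reverse=True):
                if len(state[src]) <= target:
                    break
                # Never pick a shard the destination already holds a replica of.
                candidates = sorted(state[src] - state[dst])
                if candidates:
                    move = (src, dst, candidates[0])
                    break
            if move is None:
                break  # nothing can be moved without co-locating replicas
            src, _, shard = move
            state[src].remove(shard)
            state[dst].add(shard)
            moves.append(move)
    return moves, {peer: sorted(shards) for peer, shards in state.items()}


layout = {
    "A": [1, 2, 4, 5, 7, 8, 10, 11],
    "B": [0, 1, 3, 4, 6, 7, 9, 10],
    "C": [0, 2, 3, 5, 6, 8, 9, 11],
    "D": [], "E": [], "F": [],
}
moves, after = plan_rebalance(layout, target=4)
for peer, shards in after.items():
    print(peer, shards)
```

Because the state is kept as sets, a peer can never end up with two replicas of the same shard, which is the one rule from the warning above that must not be broken.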

Now we just need to wrap shard moving in a function:

def rebalance_shards(from_peer, to_peer, shard_id):
    url = f"{base_url}/collections/new_collection/cluster"
    payload = {
        "move_shard": {
            "shard_id": shard_id,
            "from_peer_id": from_peer,
            "to_peer_id": to_peer
        }
    }
    r = requests.post(url, json=payload)

Then feed it the list of moves and let it rip:

for from_peer, to_peer, shard_id in rebalance_operations:
    rebalance_shards(from_peer, to_peer, shard_id)

Always keep in mind: this is an expensive operation on large clusters
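
Since the moves are expensive, it is worth watching for them to finish rather than guessing. The collection cluster response shown earlier has a shard_transfers list, which was empty when nothing was moving. The sketch below polls until it is empty again, assuming that list reflects the transfers in progress. The function names are mine, and the demo uses canned responses instead of a real cluster.

```python
import time


def active_transfers(collection_cluster: dict) -> list:
    """Shard transfers still listed in a /collections/<name>/cluster response."""
    return collection_cluster.get("result", {}).get("shard_transfers", [])


def wait_for_transfers(fetch, poll_seconds: float = 5.0, max_polls: int = 720) -> bool:
    """Poll until no transfer is listed; return False if max_polls runs out first.

    `fetch` is any zero-argument callable returning the parsed JSON response.
    """
    for _ in range(max_polls):
        if not active_transfers(fetch()):
            return True
        time.sleep(poll_seconds)
    return False


# Offline demonstration: one poll with a stub transfer entry, then an empty list.
fake_responses = iter([
    {"result": {"shard_transfers": [{"shard_id": 1}]}},
    {"result": {"shard_transfers": []}},
])
done = wait_for_transfers(lambda: next(fake_responses), poll_seconds=0)
print(done)
```

Against a real cluster, the fetch argument could simply be lambda: get_data_from_api(collections_endpoint) from the script earlier in this post.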

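After the rebalance has run for a while, we end up with a rebalanced 6-node cluster. You can now point your load balancer at these machines' IPs, have your application connect to any node or to the load balancer address, and carry on hammering it!

To scale in, move the shards off the peer that is about to be laid off (similar to kubectl drain), then remove that peer through the API.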
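
Draining a peer is the same planning problem in reverse. A minimal sketch, assuming each shard on the leaving peer should go to the least loaded peer that does not already hold a replica of it; the function name and the short peer labels are made up, and the layout is the rebalanced one from above.

```python
def plan_drain(peer_shards: dict, leaving) -> list:
    """Return (from_peer, to_peer, shard_id) moves that empty the `leaving` peer."""
    state = {p: set(s) for p, s in peer_shards.items() if p != leaving}
    moves = []
    for shard in sorted(peer_shards[leaving]):
        # Only peers that do not already hold this shard are valid targets.
        targets = [p for p in state if shard not in state[p]]
        if not targets:
            raise ValueError(f"no peer can take shard {shard}")
        dst = min(targets, key=lambda p: len(state[p]))
        state[dst].add(shard)
        moves.append((leaving, dst, shard))
    return moves


layout = {
    "A": [2, 5, 8, 11], "B": [1, 4, 7, 10], "C": [2, 5, 8, 11],
    "D": [1, 4, 7, 10], "E": [0, 3, 6, 9], "F": [0, 3, 6, 9],
}
drain_moves = plan_drain(layout, "F")
for move in drain_moves:
    print(move)
```

The resulting tuples have the same shape as rebalance_operations, so they can be fed to the same move function before the peer is removed.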

Disaster recovery

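Now that we have a cluster of 3+ nodes, Raft tells us that as long as more than 50% of the nodes are online and, in our setup, the offline nodes do not together hold every replica of the same shard, the data is complete and no operation is affected.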
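
The "more than 50%" rule translates into very little arithmetic. This is the generic majority calculation for Raft-style consensus, not something read out of Qdrant's source, and the function names are mine.

```python
def raft_quorum(nodes: int) -> int:
    """Smallest number of nodes that is a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be offline while a majority is still online."""
    return nodes - raft_quorum(nodes)


for n in (3, 4, 5, 6):
    print(f"{n} nodes: quorum {raft_quorum(n)}, tolerates {tolerated_failures(n)} offline")
```

Note that going from 3 to 4 nodes does not buy any extra tolerance for consensus; whether the data itself survives still depends on where the shard replicas live.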

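Oddly, the Qdrant documentation does not mention the split-brain case.

So:

If some of your machines go offline for whatever reason, and no shard has all of its replicas on the offline machines:
- no operation is affected
- if you can recover the machine in place, do that
- if you cannot recover it, remove the peer through the API, create a new machine, join it to the cluster, and then run the rebalance
If some shard(s) have all of their replicas on the offline machines:
- you had better be able to fix the machine 🤣
- ~~or start planning your escape as soon as possible~~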

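The good news is that in the cloud, the odds of a machine going offline and being unrecoverable are actually low (unless the provider catches fire). Machines mostly go offline because of provider network problems or an OOM on the machine, so as long as you keep backups, your data can almost always be recovered.

One more case to watch out for: if a machine somehow goes offline and cannot be recovered, and you restore a new VM from a snapshot, and that VM gets a different IP from the machine that went offline, then when the new VM joins the cluster it may notify the other nodes to update its IP: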

2025-05-10T07:52:31.601762Z  WARN storage::content_manager::consensus::persistent: Replaced address of peer 3994356516252114 from http://10.0.0.5:6335/ to http://10.0.0.9:6335/

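It may also fail to notify the other nodes. In that case we need to manually edit the /qdrant_storage/raft_state.json file on every machine, change the IP of the affected machine, and do a rolling restart of all nodes.

I hope you never have to get this far 😇

If you are on Hetzner Cloud, here are some additional suggestions:

Put the machines in a Placement Group when creating them. Hetzner then guarantees the VMs are not on the same host, reducing the chance of a host failure taking down several VMs at once. See the documentation: https://docs.hetzner.com/cloud/placement-groups/overview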
- In spread Placement Groups, all virtual servers are running on different physical servers. This decreases the probability that some instances might fail together.
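- Few other VPS providers seem to offer this option; perhaps they assume everything is spread by default?
Enable Backup on the machines
- Backups of a Hetzner machine are stored in a different availability zone from the machine, reducing the chance of a zone failure taking down the VM and its backup together. See https://docs.hetzner.com/cloud/servers/backups-snapshots/faq#what-location-are-backupssnapshots-stored-in
- Note: US West (Hillsboro), US East (Ashburn), and Singapore each have only one availability zone, so backups are stored in the same data center as the machine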

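Automated deployment

Remember the quick deployment method for Hetzner machines mentioned at the start of the post?

Hetzner provides an internal API: https://docs.hetzner.cloud/#server-metadata

For example, running cURL against http://169.254.169.254/hetzner/v1/metadata/private-networks on a Hetzner VM returns the machine's private IP.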

curl http://169.254.169.254/hetzner/v1/metadata/private-networks
- ip: 10.0.0.3
  alias_ips: []
  interface_num: 1
  mac_address: 86:00:00:c3:bf:16
  network_id: 3493377
  network_name: us-west-network
  network: 10.0.0.0/16
  subnet: 10.0.0.0/24
  gateway: 10.0.0.1

Combine these into a cloud-init config and you can install Docker and deploy a Qdrant node automatically, with --uri and the other parameters filled in. For reference:

#cloud-config
write_files:
  - path: /root/create_docker_compose.sh
    permissions: "0755"
    owner: root:root
    content: |
      #!/bin/bash
      # Fetch the private network metadata
      METADATA=$(curl -s http://169.254.169.254/hetzner/v1/metadata/private-networks)

      # Extract the IP address from the metadata
      PRIVATE_IP=$(echo "$METADATA" | awk -F': ' '/ip:/ {print $2}' | tr -d ' ')
      
      # Generate the docker-compose.yml file
      cat <<EOF > /root/docker-compose.yml
      services:
        qdrant:
          image: qdrant/qdrant:v1.14.0
          restart: always
          volumes:
            - ./qdrant_storage:/qdrant/storage
          ports:
            - "6333:6333"
            - "6334:6334"
            - "6335:6335"
          environment:
            QDRANT__CLUSTER__ENABLED: "true"
          command: "./qdrant --bootstrap http://10.0.0.6:6335 --uri http://$PRIVATE_IP:6335"
      EOF

  - path: /root/install_docker.sh
    permissions: "0755"
    owner: root:root
    content: |
      #!/bin/bash
      # Install Docker
      curl -fsSL https://get.docker.com -o install-docker.sh
      bash install-docker.sh

      # Install Docker Compose
      wget https://github.com/docker/compose/releases/download/v2.36.0/docker-compose-linux-x86_64 -O /usr/bin/docker-compose
      chmod +x /usr/bin/docker-compose      

runcmd:
  # Update package lists
  - apt-get update
  - apt-get install -y curl wget 

  # Execute the install docker script
  - /root/install_docker.sh

  # Execute the script to create the docker-compose.yml file
  - /root/create_docker_compose.sh
  
  # Start Docker Compose
  - cd /root && docker-compose up -d