Running GitHub Actions Self-hosted Runners on Kubernetes

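GitHub Actions is quite nice. Compared with Travis CI, queueing is much less of a problem, and besides CI/CD you can also extract its internal Docker Hub credential and use it locally for docker pull to get around Docker Hub's 429 rate limit (see "Syncing Docker Hub library images to a local registry"). For small projects, the Standard_DS2_v2 virtual machines that GitHub Actions provides perform reasonably well. For the following needs, however, the GitHub-hosted machines may not be a good fit: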

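  • Compiling TiKV (on the 2-core, 7 GB Standard_DS2_v2 machine, build dist_release either takes forever or gets OOM-killed)
  • Jobs that need internal images or access to intranet resources
  • Private repositories with heavy build workloads (GitHub-hosted Actions only include 2000 minutes of usage for private repositories)
  • Jobs that need more storage (GitHub-hosted runners have a bit less than 15 GB of free space)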

In these cases we need a self-hosted runner. What is a self-hosted runner? GitHub's documentation describes it as follows:

Self-hosted runners offer more control of hardware, operating system, and software tools than GitHub-hosted runners provide. With self-hosted runners, you can choose to create a custom hardware configuration with more processing power or memory to run larger jobs, install software available on your local network, and choose an operating system not offered by GitHub-hosted runners. Self-hosted runners can be physical, virtual, in a container, on-premises, or in a cloud.

For an organization, adding an org-level runner (shared across the whole org) is straightforward. All you need is:

mkdir actions-runner && cd actions-runner
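curl -o actions-runner-linux-x64-2.278.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.278.0/actions-runner-linux-x64-2.278.0.tar.gz
# Extract the archive, which contains config.sh and run.sh
tar xzf ./actions-runner-linux-x64-2.278.0.tar.gz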
./config.sh --url https://github.com/some-github-org --token AF5TxxxxxxxxxxxA6PRRS
./run.sh

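and you have a self-hosted runner. This approach has some limitations, though:

  • No elastic scaling; runners have to be deployed by hand, one at a time
  • Running directly on bare metal leads to inconsistent environments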

Runner in Container

Simple Docker

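To solve these problems, we need to containerize the GitHub runner. Below is an example Dockerfile (heavily modified from https://github.com/SanderKnape/github-runner). Because I need a dind-like environment (using Docker commands directly inside Actions), I added the docker binary to the image. The runner refuses to run as root by default, so to avoid permission problems when mounting the host's Docker socket later, the image uses a modified runner package that allows running as root. The script used to produce it is: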

wget https://github.com/actions/runner/releases/download/v2.278.0/actions-runner-linux-x64-2.278.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.278.0.tar.gz && rm -f actions-runner-linux-x64-2.278.0.tar.gz

# Remove the root-user check from both files (these line ranges match v2.278.0)
sed -i '3,9d' ./config.sh
sed -i '3,8d' ./run.sh
# End

# Repack
tar -czf actions-runner-linux-x64-2.278.0.tar.gz *

# Remove the extracted files that are no longer needed
rm -rf bin config.sh env.sh externals run.sh

The Dockerfile can then be written like this:

FROM ubuntu:18.04

ENV GITHUB_PAT ""
ENV GITHUB_ORG_NAME ""
ENV RUNNER_WORKDIR "_work"
ENV RUNNER_LABELS ""

RUN apt-get update \
    && apt-get install -y curl sudo git jq iputils-ping zip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && curl https://download.docker.com/linux/static/stable/x86_64/docker-20.10.7.tgz --output docker-20.10.7.tgz \
    && tar xvfz docker-20.10.7.tgz \
    && cp docker/* /usr/bin/

USER root
WORKDIR /root/

RUN GITHUB_RUNNER_VERSION="2.278.0" \
    && curl -Ls https://internal.knat.network/action-runner/actions-runner-linux-x64-${GITHUB_RUNNER_VERSION}.tar.gz | tar xz \
    && ./bin/installdependencies.sh

COPY entrypoint.sh runsvc.sh ./
RUN sudo chmod u+x ./entrypoint.sh ./runsvc.sh

ENTRYPOINT ["./entrypoint.sh"]

The contents of entrypoint.sh are as follows:

#!/bin/sh

# Registering directly with
#   ./config.sh --url https://github.com/some-github-org --token AF5TxxxxxxxxxxxA6PRRS
# is fragile: the token changes over time, which can make it impossible to
# remove the runner after registration. Instead, following
# https://docs.github.com/en/rest/reference/actions#list-self-hosted-runners-for-an-organization
# we use a Personal Access Token to fetch the runner token dynamically.
registration_url="https://github.com/${GITHUB_ORG_NAME}"
token_url="https://api.github.com/orgs/${GITHUB_ORG_NAME}/actions/runners/registration-token"
payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url})
export RUNNER_TOKEN=$(echo $payload | jq .token --raw-output)

if [ -z "${RUNNER_NAME}" ]; then
    RUNNER_NAME=$(hostname)
fi

./config.sh --unattended --url https://github.com/${GITHUB_ORG_NAME} --token ${RUNNER_TOKEN} --labels "${RUNNER_LABELS}"

# Automatically deregister the runner from GitHub when the container is killed
remove() {
    if [ -n "${GITHUB_RUNNER_TOKEN}" ]; then
        export REMOVE_TOKEN=$GITHUB_RUNNER_TOKEN
    else
        payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url%/registration-token}/remove-token)
        export REMOVE_TOKEN=$(echo $payload | jq .token --raw-output)
    fi

    # Deregister with the remove token obtained above, not the registration token
    ./config.sh remove --unattended --token "${REMOVE_TOKEN}"
}

trap 'remove; exit 130' INT
trap 'remove; exit 143' TERM

./runsvc.sh "$*" &

wait $!
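
For readers who would rather drive the token exchange from a script than from curl and jq, the same two calls can be sketched in Python. This is only an illustration of what the entrypoint does: the function names (token_url, parse_token, fetch_token) are made up for this example, and the endpoints are the same registration-token and remove-token URLs used in the shell script above.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def token_url(org_name: str, kind: str = "registration") -> str:
    """Build the org-level token endpoint; kind is 'registration' or 'remove'."""
    if kind not in ("registration", "remove"):
        raise ValueError("kind must be 'registration' or 'remove'")
    return f"{API_ROOT}/orgs/{org_name}/actions/runners/{kind}-token"


def parse_token(payload: str) -> str:
    """Extract the 'token' field from the JSON response (KeyError if absent)."""
    return json.loads(payload)["token"]


def fetch_token(org_name: str, github_pat: str, kind: str = "registration") -> str:
    """POST to the token endpoint with the PAT and return the short-lived token."""
    request = urllib.request.Request(
        token_url(org_name, kind),
        method="POST",
        headers={"Authorization": f"token {github_pat}"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_token(response.read().decode("utf-8"))
```

Calling fetch_token("some-github-org", pat) would correspond to the RUNNER_TOKEN step, and fetch_token("some-github-org", pat, "remove") to the REMOVE_TOKEN step in remove().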

Build and run:

docker build . -t n0vad3v/github-runner
docker run -v /var/run/docker.sock:/var/run/docker.sock -e GITHUB_PAT="ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT" -e GITHUB_ORG_NAME="some-github-org" -it n0vad3v/github-runner

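You should now see a brand-new runner under your org. You can finally use your own machines to run jobs quickly without queueing, and with better performance than the GitHub-hosted runners.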

Scale with Kubernetes

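But this does not scale: every runner has to be managed by hand. Moreover, if a workflow defines several jobs and there are fewer runners than jobs, some of the jobs will sit in the queue. Regarding queue time, the documentation says: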

Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete.

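That is clearly unacceptable. I happen to have a k8s cluster at hand, and for a mostly stateless service like this, letting k8s manage the runners automatically is the obvious choice. So we can write a Deployment, for example: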

apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner-some-github-org
  labels:
    app: githubrunner
spec:
  replicas: 10
  selector:
    matchLabels:
      app: githubrunner
  template:
    metadata:
      labels:
        app: githubrunner
    spec:
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: File
      containers:
        - name: github-runner-some-github-org
          imagePullPolicy: Always
          image: 'n0vad3v/github-runner'
          env:
            - name: GITHUB_PAT
              value: "ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT"
            - name: GITHUB_ORG_NAME
              value: "some-github-org"
            - name: RUNNER_LABELS
              value: "docker,internal-k8s"

          volumeMounts:
            - mountPath: /var/run/docker.sock
              name: docker-sock
              readOnly: false

Run kubectl apply -f action.yml -n novakwok, with the labels attached, and off we go!

$ kubectl get po -n novakwok
NAME                                                    READY   STATUS    RESTARTS   AGE
github-runner-some-github-org-deployment-9cfb598d9-4shrk   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-5rnj4   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-cvkr9   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-dmbnp   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-ggl24   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-gkgzx   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-jcscq   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-lrrxh   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-pn9cn   1/1     Running   0          26m
github-runner-some-github-org-deployment-9cfb598d9-wj2tj   1/1     Running   0          26m

Demo on Docker

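My needs are a bit special: I need to run Docker commands inside the runner (for example docker build/push on the runner). Let's test whether the runner works properly. First create a workflow with several jobs, like this: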

name: Test
on:
  push:
    branches: [ main ]

jobs:
  test-1:
    runs-on: [self-hosted,X64]

    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          docker pull redis
          
  test-2:
    runs-on: [self-hosted,X64]
    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          docker pull redis
          
  test-3:
    runs-on: [self-hosted,X64]
    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          pwd
          docker pull redis
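
As an aside on how runs-on selects a runner: a job is only dispatched to a self-hosted runner that carries every label listed in runs-on. A Linux x64 runner gets the default labels self-hosted, Linux and X64, plus whatever is passed via RUNNER_LABELS. The matching rule can be sketched as below; the function name runner_matches is hypothetical, and the case-insensitive comparison reflects my understanding of how GitHub treats labels rather than anything shown in this post.

```python
def runner_matches(runs_on, runner_labels):
    """Return True if the runner carries every label the job asks for.

    Labels are compared case-insensitively (an assumption, see the text above).
    """
    wanted = {label.lower() for label in runs_on}
    available = {label.lower() for label in runner_labels}
    return wanted.issubset(available)


# Labels of a runner started from the Deployment above:
# the defaults plus RUNNER_LABELS="docker,internal-k8s".
k8s_runner = ["self-hosted", "Linux", "X64", "docker", "internal-k8s"]

print(runner_matches(["self-hosted", "X64"], k8s_runner))
print(runner_matches(["self-hosted", "ARM64"], k8s_runner))
```

With this rule, a job that should only land on the k8s runners could use runs-on: [self-hosted, internal-k8s] instead of the generic X64 label.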

Then run it to see whether it works. First, confirm that the jobs were scheduled onto the Docker runners:

Then check whether the Docker-related operations work:

Nice!

GC

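Sometimes a runner goes offline for odd reasons (for example, the network drops while it is being removed). Over time the org accumulates a pile of offline runners. To deal with this, we can write a simple script to garbage-collect them: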

import requests
import argparse

parser = argparse.ArgumentParser(description='GC Dead Self-hosted runners')
parser.add_argument('--github_pat', help='GitHub Personal Access Token')
parser.add_argument('--org_name', help='GitHub Org Name')
args = parser.parse_args()


def list_runners(org_name,github_pat):
    # List the self-hosted runners registered for the org (first page of results only)
    list_runner_url = 'https://api.github.com/orgs/{}/actions/runners'.format(org_name)
    headers = {"Authorization": "token {}".format(github_pat)}
    r = requests.get(list_runner_url,headers=headers)
    r.raise_for_status()  # fail loudly on a bad token or org name
    runner_list = r.json()['runners']
    return runner_list

def delete_offline_runners(org_name,github_pat,runner_list):
    # Delete every runner that GitHub reports as offline
    headers = {"Authorization": "token {}".format(github_pat)}
    for runner in runner_list:
        if runner['status'] == "offline":
            runner_id = runner['id']
            delete_runner_url = 'https://api.github.com/orgs/{}/actions/runners/{}'.format(org_name,runner_id)
            print("Deleting runner " + str(runner_id) + ", with name of " + runner['name'])
            r = requests.delete(delete_runner_url,headers=headers)
            r.raise_for_status()

if __name__ == '__main__':
    runner_list = list_runners(args.org_name,args.github_pat)
    delete_offline_runners(args.org_name,args.github_pat,runner_list)

Usage: python3 gc_runner.py --github_pat "ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT" --org_name "some-github-org"
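
One caveat with a script like this is that the runner list endpoint is paginated, so a single GET only returns the first page of runners. A minimal sketch of walking all pages follows. The helper names (list_all_runners, make_fetch_page) are made up for illustration, and it uses urllib from the standard library instead of requests so that it has no third-party dependency; per_page and page are the standard GitHub REST pagination parameters.

```python
import json
import urllib.request


def list_all_runners(fetch_page, per_page=100):
    """Collect runners from every page.

    fetch_page(page, per_page) must return the decoded JSON of one page,
    i.e. a dict containing a 'runners' list.
    """
    runners = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)["runners"]
        runners.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < per_page:
            return runners
        page += 1


def make_fetch_page(org_name, github_pat):
    """Build a fetch_page callable backed by the GitHub REST API."""
    def fetch_page(page, per_page):
        url = ("https://api.github.com/orgs/{}/actions/runners"
               "?per_page={}&page={}").format(org_name, per_page, page)
        request = urllib.request.Request(
            url, headers={"Authorization": "token {}".format(github_pat)})
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_page
```

The result of list_all_runners(make_fetch_page(org_name, github_pat)) has the same shape as the list returned by list_runners, so it could be passed to delete_offline_runners unchanged.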

Some limitations

Beyond our own hardware limits, GitHub Actions itself imposes some limits, for example:

  • Workflow run time - Each workflow run is limited to 72 hours. If a workflow run reaches this limit, the workflow run is cancelled.
  • Job queue time - Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete.
  • API requests - You can execute up to 1000 API requests in an hour across all actions within a repository. If exceeded, additional API calls will fail, which might cause jobs to fail.
  • Job matrix - A job matrix can generate a maximum of 256 jobs per workflow run. This limit also applies to self-hosted runners.
  • Workflow run queue - No more than 100 workflow runs can be queued in a 10 second interval per repository. If a workflow run reaches this limit, the workflow run is terminated and fails to complete.

The API requests limit is the murky one. The official description of how a self-hosted runner works is:

The self-hosted runner polls GitHub to retrieve application updates and to check if any jobs are queued for processing. The self-hosted runner uses a HTTPS long poll that opens a connection to GitHub for 50 seconds, and if no response is received, it then times out and creates a new long poll.

So it is not easy to tell what exactly counts as an API request. This will probably only surface as a problem under heavy use.
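
To get a feel for the scale involved, here is a back-of-the-envelope calculation based only on the 50 second long-poll timeout quoted above. It is not a statement about how GitHub counts requests: whether long polls count toward the 1000 requests per hour limit at all is exactly the open question, and the function names are made up for this sketch.

```python
SECONDS_PER_HOUR = 3600


def idle_polls_per_hour(poll_timeout_seconds=50):
    """Long polls an idle runner completes per hour if every poll times out."""
    return SECONDS_PER_HOUR // poll_timeout_seconds


def fleet_polls_per_hour(runner_count, poll_timeout_seconds=50):
    """Total long polls per hour for a fleet of idle runners."""
    return runner_count * idle_polls_per_hour(poll_timeout_seconds)


print(idle_polls_per_hour())       # one idle runner
print(fleet_polls_per_hour(10))    # the 10-replica Deployment above
```

An idle runner completes 72 polls per hour, so the 10 replicas from the Deployment make 720. If long polls did count toward the limit, a fleet of 14 idle runners (1008 polls per hour) would already exceed 1000, which is why this is worth watching once the fleet grows.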

Git Version

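There is a small pitfall here: the Git version inside the container should be 2.18 or later. Ubuntu 18.04 is fine (the default is 2.22.5), but the Git in the official package repository for arm64v8/ubuntu:18.04 is 2.17. With that version you will run into this problem: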

So you need to build a newer Git from source. For example, the Dockerfile can run the following commands:

apt install -y gcc libssl-dev libcurl4-gnutls-dev zlib1g-dev make gettext wget
wget https://www.kernel.org/pub/software/scm/git/git-2.28.0.tar.gz && tar -xvzf git-2.28.0.tar.gz && cd git-2.28.0 && ./configure --prefix=/usr/ && make && make install
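
To catch this before a job fails, the image build or the entrypoint could check the installed Git version against the 2.18 threshold mentioned above. A minimal sketch, with hypothetical function names, that parses the output of git --version:

```python
import re
import subprocess

MIN_GIT = (2, 18)  # minimum (major, minor) recommended above


def parse_git_version(output):
    """Parse 'git version 2.17.1' into a (major, minor) tuple."""
    match = re.search(r"git version (\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unrecognized git --version output: {!r}".format(output))
    return int(match.group(1)), int(match.group(2))


def git_is_new_enough(output, minimum=MIN_GIT):
    """Compare the parsed version against the minimum using tuple ordering."""
    return parse_git_version(output) >= minimum


def check_installed_git():
    """Run `git --version` on this machine and check the result."""
    result = subprocess.run(["git", "--version"],
                            capture_output=True, text=True, check=True)
    return git_is_new_enough(result.stdout)
```

With the versions from this section, git_is_new_enough("git version 2.17.1") is False and git_is_new_enough("git version 2.28.0") is True.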

Summary

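So far we have packaged the runner into a Docker container and can scale it horizontally with k8s when needed. We also have a simple GC program that cleans up runners that went offline unexpectedly. This seems to cover the basic needs.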

But there are still some problems, for example:

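  1. Running the container as root is a potential risk, especially with the host's Docker socket exposed, so ordinary jobs should still run in a non-root container
  2. There is still no automatic scaling; scaling relies on editing replicas by hand, and this should be automated (for example, keep 20 idle runners in reserve and add more whenever fewer than 20 are idle)
  3. Label management: GitHub Actions schedules jobs by label, so how labels are assigned needs long-term thought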
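
The idle-reserve idea in item 2 can be sketched as a small calculation over the runner list returned by the GitHub API, where each runner object has a status field ("online" or "offline") and a boolean busy field. This is only one possible policy, not something implemented in this post: the function names and the max_replicas cap are assumptions added for illustration.

```python
def count_idle(runners):
    """Count runners that are online and not currently running a job."""
    return sum(1 for r in runners if r["status"] == "online" and not r["busy"])


def desired_replicas(current_replicas, runners, target_idle=20, max_replicas=100):
    """Replica count that restores the idle reserve.

    Adds the shortfall when fewer than target_idle runners are idle, removes
    the surplus when more are idle, and clamps the result between
    target_idle and max_replicas (max_replicas is a hypothetical safety cap).
    """
    idle = count_idle(runners)
    wanted = current_replicas + (target_idle - idle)
    return min(max(wanted, target_idle), max_replicas)


# 15 replicas, 10 of them busy and 5 idle: 15 more are needed for 20 idle.
runners = ([{"status": "online", "busy": True}] * 10
           + [{"status": "online", "busy": False}] * 5)
print(desired_replicas(15, runners))
```

The resulting number could then be applied with kubectl scale or the Kubernetes API. Note that scaling a Deployment down gives no control over which pods are removed, so a busy runner might be terminated mid-job; a real autoscaler would need to handle that case.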

References

  1. Running self-hosted GitHub Actions runners in your Kubernetes cluster
  2. About GitHub-hosted runners
  3. Actions
