Running GitHub Actions Self-hosted Runners on Kubernetes
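GitHub Actions is quite good. Compared with Travis CI, queueing is much less of a problem, and besides CI/CD it can also be used to extract the built-in DockerHub credential and use it locally for docker pull, which avoids Docker Hub's 429 rate limit (see "Syncing Docker Hub library images to a local registry"). For small projects, the Standard_DS2_v2 virtual machines that GitHub Actions provides perform well enough, but the hosted machines are a poor fit for needs such as the following: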
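- Compiling TiKV (on the 2-core, 7 GB Standard_DS2_v2 machine, build dist_release either takes forever or gets OOM-killed)
- Jobs that need to work with internal images or access intranet resources
- Private repositories that need a lot of build time (the official Actions quota for private repositories is only 2000 minutes)
- Jobs that need more storage (the official GitHub Actions machines have less than 15 GB of usable space)
In these cases we need a Self-hosted Runner. What is a Self-hosted Runner?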
Self-hosted runners offer more control of hardware, operating system, and software tools than GitHub-hosted runners provide. With self-hosted runners, you can choose to create a custom hardware configuration with more processing power or memory to run larger jobs, install software available on your local network, and choose an operating system not offered by GitHub-hosted runners. Self-hosted runners can be physical, virtual, in a container, on-premises, or in a cloud.
For an Org, adding an Org-level Runner (shared across the whole Org) is fairly simple. All you need is:
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.278.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.278.0/actions-runner-linux-x64-2.278.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.278.0.tar.gz
./config.sh --url https://github.com/some-github-org --token AF5TxxxxxxxxxxxA6PRRS
./run.sh
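That gives you a Self-hosted Runner, but this approach has some limitations:
- It cannot scale elastically; each Runner has to be deployed by hand
- Deploying directly on bare metal leads to inconsistent environments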
Runner in Container
Simple Docker
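To solve these problems we need to containerize the GitHub Runner. Here is an example Dockerfile (heavily modified from https://github.com/SanderKnape/github-runner). Because I need a dind-like environment (using Docker commands directly inside Actions), I added the docker binary to the image. By default the Runner refuses to run as root, so to avoid permission problems when mounting the host's Docker sock later, the GitHub Runner used here is a modified build that allows running as root. The script that makes the modification is: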
wget https://github.com/actions/runner/releases/download/v2.278.0/actions-runner-linux-x64-2.278.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.278.0.tar.gz && rm -f actions-runner-linux-x64-2.278.0.tar.gz
# Remove the root-user check from both files
sed -i '3,9d' ./config.sh
sed -i '3,8d' ./run.sh
# End
# Repackage
tar -czf actions-runner-linux-x64-2.278.0.tar.gz *
# Delete the extracted files that are no longer needed
rm -rf bin config.sh env.sh externals run.sh
The Dockerfile can then be written like this:
FROM ubuntu:18.04
ENV GITHUB_PAT ""
ENV GITHUB_ORG_NAME ""
ENV RUNNER_WORKDIR "_work"
ENV RUNNER_LABELS ""
RUN apt-get update \
&& apt-get install -y curl sudo git jq iputils-ping zip \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& curl https://download.docker.com/linux/static/stable/x86_64/docker-20.10.7.tgz --output docker-20.10.7.tgz \
&& tar xvfz docker-20.10.7.tgz \
&& cp docker/* /usr/bin/
USER root
WORKDIR /root/
RUN GITHUB_RUNNER_VERSION="2.278.0" \
&& curl -Ls https://internal.knat.network/action-runner/actions-runner-linux-x64-${GITHUB_RUNNER_VERSION}.tar.gz | tar xz \
&& ./bin/installdependencies.sh
COPY entrypoint.sh runsvc.sh ./
RUN sudo chmod u+x ./entrypoint.sh ./runsvc.sh
ENTRYPOINT ["./entrypoint.sh"]
The contents of entrypoint.sh are as follows:
#!/bin/sh
# Registering directly with
#   ./config.sh --url https://github.com/some-github-org --token AF5TxxxxxxxxxxxA6PRRS
# is fragile: the token changes over time, so a Runner registered that way may be
# impossible to remove later. Instead, following
# https://docs.github.com/en/rest/reference/actions#list-self-hosted-runners-for-an-organization
# we use a Personal Access Token to fetch the Runner token dynamically.
registration_url="https://github.com/${GITHUB_ORG_NAME}"
token_url="https://api.github.com/orgs/${GITHUB_ORG_NAME}/actions/runners/registration-token"
payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url})
export RUNNER_TOKEN=$(echo $payload | jq .token --raw-output)
if [ -z "${RUNNER_NAME}" ]; then
RUNNER_NAME=$(hostname)
fi
./config.sh --unattended --url https://github.com/${GITHUB_ORG_NAME} --token ${RUNNER_TOKEN} --labels "${RUNNER_LABELS}"
# Automatically deregister the Runner from GitHub when the container is killed
remove() {
if [ -n "${GITHUB_RUNNER_TOKEN}" ]; then
export REMOVE_TOKEN=$GITHUB_RUNNER_TOKEN
else
payload=$(curl -sX POST -H "Authorization: token ${GITHUB_PAT}" ${token_url%/registration-token}/remove-token)
export REMOVE_TOKEN=$(echo $payload | jq .token --raw-output)
fi
./config.sh remove --unattended --token "${REMOVE_TOKEN}"
}
trap 'remove; exit 130' INT
trap 'remove; exit 143' TERM
./runsvc.sh "$*" &
wait $!
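The token handling in entrypoint.sh can also be sketched in Python, which makes the two endpoints easier to see side by side. This is only an illustrative sketch, not part of the image above: the function names token_url, extract_token and fetch_token are made up for this example, and it assumes the PAT has sufficient permissions to manage Org-level runners.

```python
import json
from urllib import request

API_ROOT = "https://api.github.com"


def token_url(org_name, kind):
    """Build the Org-level runner token endpoint.

    kind is "registration" (used by ./config.sh) or "remove"
    (used by ./config.sh remove).
    """
    if kind not in ("registration", "remove"):
        raise ValueError("kind must be 'registration' or 'remove'")
    return "{}/orgs/{}/actions/runners/{}-token".format(API_ROOT, org_name, kind)


def extract_token(payload):
    """Pull the short-lived token out of the JSON response body."""
    return json.loads(payload)["token"]


def fetch_token(org_name, github_pat, kind):
    """POST to the endpoint with the PAT and return the token string."""
    req = request.Request(
        token_url(org_name, kind),
        method="POST",
        headers={"Authorization": "token {}".format(github_pat)},
    )
    with request.urlopen(req) as resp:
        return extract_token(resp.read().decode("utf-8"))
```

Calling fetch_token("some-github-org", pat, "registration") corresponds to the curl call at the top of the script, and the "remove" variant corresponds to the one inside remove(). Both tokens are short-lived, which is why the script fetches a fresh one at removal time.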
Build and run:
docker build . -t n0vad3v/github-runner
docker run -v /var/run/docker.sock:/var/run/docker.sock -e GITHUB_PAT="ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT" -e GITHUB_ORG_NAME="some-github-org" -it n0vad3v/github-runner
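You should now see a brand-new Runner under your Org. At last we can use our own machines to run jobs quickly without queueing, and with better performance than the GitHub-hosted machines~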
Scale with Kubernetes
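But this does not scale: every Runner has to be managed by hand. Moreover, if a workflow defines several Jobs and there are fewer Runners than Jobs, some Jobs will simply sit in the queue. As for how long a Job can queue: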
Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete.
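That is clearly unacceptable. I happen to have a k8s cluster at hand, and for a basically stateless service like this, letting k8s manage the Runners automatically is the obvious choice. So we can write a Deployment, for example: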
apiVersion: apps/v1
kind: Deployment
metadata:
  name: github-runner-some-github-org
  labels:
    app: githubrunner
spec:
  replicas: 10
  selector:
    matchLabels:
      app: githubrunner
  template:
    metadata:
      labels:
        app: githubrunner
    spec:
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: File
      containers:
        - name: github-runner-some-github-org
          imagePullPolicy: Always
          image: 'n0vad3v/github-runner'
          env:
            - name: GITHUB_PAT
              value: "ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT"
            - name: GITHUB_ORG_NAME
              value: "some-github-org"
            - name: RUNNER_LABELS
              value: "docker,internal-k8s"
          volumeMounts:
            - mountPath: /var/run/docker.sock
              name: docker-sock
              readOnly: false
Run kubectl apply -f action.yml -n novakwok, with the labels in place, and off we go!
[root@dev action]# kubectl get po -n novakwok
NAME READY STATUS RESTARTS AGE
github-runner-some-github-org-deployment-9cfb598d9-4shrk 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-5rnj4 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-cvkr9 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-dmbnp 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-ggl24 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-gkgzx 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-jcscq 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-lrrxh 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-pn9cn 1/1 Running 0 26m
github-runner-some-github-org-deployment-9cfb598d9-wj2tj 1/1 Running 0 26m
Demo on Docker
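Because my needs are a bit unusual, I have to use Docker commands inside the Runner (for example, docker build/push on the Runner). Let's test whether the Runner works properly. First, create a workflow with multiple Jobs, like this: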
name: Test
on:
  push:
    branches: [ main ]
jobs:
  test-1:
    runs-on: [self-hosted,X64]
    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          docker pull redis
  test-2:
    runs-on: [self-hosted,X64]
    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          docker pull redis
  test-3:
    runs-on: [self-hosted,X64]
    steps:
      - uses: actions/checkout@v2
      - name: Run a one-line script
        run: |
          curl ip.sb
          df -h
          lscpu
          pwd
          docker pull redis
Then run it and see whether it works. First, confirm that the Jobs were scheduled onto the Docker Runners, then check that the Docker-related operations work.
Nice!
GC
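Sometimes odd problems cause a Runner to drop offline (for example, the network going down during removal). When that happens, a pile of Offline Runners accumulates under the Org. To deal with this, we can write a simple script to do GC: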
import requests
import argparse

parser = argparse.ArgumentParser(description='GC Dead Self-hosted runners')
parser.add_argument('--github_pat', help='GitHub Personal Access Token')
parser.add_argument('--org_name', help='GitHub Org Name')
args = parser.parse_args()


def list_runners(org_name, github_pat):
    # List the self-hosted runners registered at the Org level
    list_runner_url = 'https://api.github.com/orgs/{}/actions/runners'.format(org_name)
    headers = {"Authorization": "token {}".format(github_pat)}
    r = requests.get(list_runner_url, headers=headers)
    runner_list = r.json()['runners']
    return runner_list


def delete_offline_runners(org_name, github_pat, runner_list):
    # Delete every runner that GitHub reports as offline
    headers = {"Authorization": "token {}".format(github_pat)}
    for runner in runner_list:
        if runner['status'] == "offline":
            runner_id = runner['id']
            delete_runner_url = 'https://api.github.com/orgs/{}/actions/runners/{}'.format(org_name, runner_id)
            print("Deleting runner " + str(runner_id) + ", with name of " + runner['name'])
            r = requests.delete(delete_runner_url, headers=headers)


if __name__ == '__main__':
    runner_list = list_runners(args.org_name, args.github_pat)
    delete_offline_runners(args.org_name, args.github_pat, runner_list)
Usage: python3 gc_runner.py --github_pat "ghp_bhxxxxxxxxxxxxx7xxxxxxxdONDT" --org_name "some-github-org"
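One thing to watch once the Org has many Runners: the list endpoint is paginated, and a single GET only returns the first page (30 entries by default, up to 100 with per_page). The sketch below shows one way to walk all pages before filtering. It is a hedged illustration, not part of the script above: collect_runners, offline_runners and the fetch_page callable are hypothetical names introduced here.

```python
def collect_runners(fetch_page, per_page=100):
    """Gather runners from every page of the list endpoint.

    fetch_page(page, per_page) is a hypothetical callable that returns the
    decoded JSON body of GET /orgs/{org}/actions/runners for that page, e.g.
    requests.get(url, headers=headers,
                 params={"per_page": per_page, "page": page}).json()
    """
    runners = []
    page = 1
    while True:
        batch = fetch_page(page, per_page).get("runners", [])
        runners.extend(batch)
        # A short (or empty) page means there is nothing left to fetch.
        if len(batch) < per_page:
            break
        page += 1
    return runners


def offline_runners(runners):
    """Keep only the runners GitHub reports as offline."""
    return [r for r in runners if r.get("status") == "offline"]
```

Passing the fetcher in as a parameter keeps the paging logic testable without network access; the result of offline_runners can then be fed to the delete loop.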
Some limitations
Apart from the limits of our own hardware, GitHub Actions itself imposes some limits, for example:
- Workflow run time - Each workflow run is limited to 72 hours. If a workflow run reaches this limit, the workflow run is cancelled.
- Job queue time - Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete.
- API requests - You can execute up to 1000 API requests in an hour across all actions within a repository. If exceeded, additional API calls will fail, which might cause jobs to fail.
- Job matrix - A job matrix can generate a maximum of 256 jobs per workflow run. This limit also applies to self-hosted runners.
- Workflow run queue - No more than 100 workflow runs can be queued in a 10 second interval per repository. If a workflow run reaches this limit, the workflow run is terminated and fails to complete.
Of these, the API requests limit is the murkiest. GitHub's documentation describes how self-hosted runners communicate as follows:
The self-hosted runner polls GitHub to retrieve application updates and to check if any jobs are queued for processing. The self-hosted runner uses a HTTPS long poll that opens a connection to GitHub for 50 seconds, and if no response is received, it then times out and creates a new long poll.
So it is hard to tell what exactly counts as an API request; problems here would probably only surface under heavy use.
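As a rough back-of-the-envelope check, the snippet below computes how many long polls a pool of idle Runners would issue per hour, using the 50-second figure from the quoted documentation. Whether these polls count against the 1000-requests limit at all is exactly the open question, so treat this purely as a worst-case estimate under that unconfirmed assumption; max_polls_per_hour is a name made up for this example.

```python
import math

LONG_POLL_SECONDS = 50  # long-poll duration from the quoted documentation


def max_polls_per_hour(idle_runners, poll_seconds=LONG_POLL_SECONDS):
    """Upper bound on long polls per hour, assuming every poll times out."""
    return idle_runners * math.ceil(3600 / poll_seconds)


if __name__ == "__main__":
    for count in (1, 10, 14):
        print(count, "idle runners ->", max_polls_per_hour(count), "polls/hour")
```

Under this assumption, each idle Runner makes at most 72 polls per hour, so the 10 replicas from the Deployment above would make at most 720.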
Git Version
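There is a small pitfall here: the Git version inside the container should be 2.18 or later. Ubuntu 18.04 is fine (the default is 2.22.5), but on arm64v8/ubuntu:18.04 the Git from the official package repository is 2.17, and using that version leads to errors.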
So a newer Git has to be compiled from source. For example, the Dockerfile can add something like this:
apt install -y gcc libssl-dev libcurl4-gnutls-dev zlib1g-dev make gettext wget
wget https://www.kernel.org/pub/software/scm/git/git-2.28.0.tar.gz && tar -xvzf git-2.28.0.tar.gz && cd git-2.28.0 && ./configure --prefix=/usr/ && make && make install
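To catch this before a Job fails, the image build or the entrypoint could verify the Git version up front. The following is a minimal sketch under that assumption; parse_git_version, git_is_new_enough and installed_git_output are hypothetical helper names, and the 2.18 threshold is the one stated above.

```python
import re
import subprocess

MIN_GIT = (2, 18)  # minimum version recommended above


def parse_git_version(output):
    """Turn 'git version 2.17.1' into the tuple (2, 17, 1)."""
    match = re.search(r"git version (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("unrecognised git --version output: {!r}".format(output))
    return tuple(int(part) for part in match.groups(default="0"))


def git_is_new_enough(output, minimum=MIN_GIT):
    """Compare major.minor against the minimum."""
    return parse_git_version(output)[:len(minimum)] >= minimum


def installed_git_output():
    """Run the real git binary; raises if git is missing or fails."""
    return subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    ).stdout
```

A build step could call git_is_new_enough(installed_git_output()) and abort the build when it returns False, rather than discovering the problem in the middle of a workflow run.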
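Summary
At this point we have packaged the Runner into a Docker container and can scale it horizontally with k8s when needed. We also have a simple GC program that cleans up Runners that dropped offline abnormally. That seems to cover the initial requirements~
There are still some problems, though:
- Running the container as root carries potential risk, especially with the host's Docker sock exposed, so ordinary jobs should still run in a non-root container
- Scaling is still not automatic; it depends on editing the replica count by hand. This should be automated (for example, keep 20 idle Runners in reserve and add more whenever the number of idle Runners drops below 20)
- Label management: GitHub Actions schedules Jobs by Label, so how to assign Labels needs long-term thought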
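The "keep N idle Runners in reserve" idea from the list above can be sketched as a small function over the runner list returned by the API, where each runner object carries status and busy fields. This is only an illustration of the scale-up half of the rule; count_idle, desired_replicas and the max_replicas cap are hypothetical and not something the setup above implements.

```python
def count_idle(runners):
    """A runner is idle when it is online and not running a job."""
    return sum(
        1 for r in runners
        if r.get("status") == "online" and not r.get("busy", False)
    )


def desired_replicas(runners, current_replicas, min_idle=20, max_replicas=100):
    """Return the replica count needed to restore min_idle idle runners.

    Only scales up; max_replicas is a hypothetical safety cap.
    """
    shortfall = max(0, min_idle - count_idle(runners))
    return min(max_replicas, current_replicas + shortfall)


if __name__ == "__main__":
    sample = (
        [{"status": "online", "busy": True}] * 7
        + [{"status": "online", "busy": False}] * 3
    )
    print(desired_replicas(sample, current_replicas=10, min_idle=5))
```

The result could then be applied with something like kubectl scale deployment github-runner-some-github-org --replicas=<n> -n novakwok from a periodic job. Scaling down is deliberately left out, since removing a Pod that is in the middle of a Job needs more care.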