Analyzing Google Load Balancer (and CDN) Logs in ELK

2020-09-23

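As a fake CDN provider, I don't have much traffic, but I would still like to collect statistics on my CDN's performance, outbound traffic, visitor origins, and so on. An earlier post, "Making the Blog Faster: Notes on Using Google Load Balancer and Google CDN", described the current CDN architecture: all TLS traffic is terminated at the Google Load Balancer before it reaches the hosts behind it, as shown in the figure below:

Since the goal is to analyze performance, logs from the Load Balancer are clearly more reliable and complete than Nginx logs from the hosts. This post therefore briefly describes how to export Google Load Balancer logs and analyze them with the E(lasticsearch) and K(ibana) of ELK.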

GCP Log

GCP itself provides a very easy-to-use Logs Viewer, as shown in the figure:

Exporting one of the requests as JSON gives something like this:

{
  "httpRequest": {
    "cacheLookup": true,
    "latency": "1.216810s",
    "remoteIp": "123.123.123.123",
    "requestMethod": "GET",
    "requestSize": "351",
    "requestUrl": "https://website.test/test-quick-start.html",
    "responseSize": "6415",
    "serverIp": "10.174.0.5",
    "status": 200,
    "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  },
  "insertId": "5u564gg12v7bsq",
  "jsonPayload": {
    "@type": "type.googleapis.com/google.cloud.loadbalancing.type.LoadBalancerLogEntry",
    "cacheId": "CHS-45f89f72",
    "statusDetails": "response_sent_by_backend"
  },
  "logName": "projects/xxxxxx-223411/logs/requests",
  "receiveTimestamp": "2020-09-21T13:00:13.604551689Z",
  "resource": {
    "labels": {
      "backend_service_name": "xxxx-xxxx-xxxx",
      "forwarding_rule_name": "xxxxx",
      "project_id": "xxxxxxxx-xxxx",
      "target_proxy_name": "xxxx-lb-target-proxy-3",
      "url_map_name": "xxxxx-lb",
      "zone": "global"
    },
    "type": "http_load_balancer"
  },
  "severity": "INFO",
  "spanId": "209153a5bc3264e0",
  "timestamp": "2020-09-21T13:00:11.297154Z",
  "trace": "projects/nova-blog-266907/traces/f97f9d3c5f0d71ea22e13d85e0b65f16"
}

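As you can see, every request has a corresponding JSON record, and the fields are comprehensive enough to help troubleshoot many problems. However, the built-in Logs Viewer is fairly basic, and the default retention period is 30 days (meaning only 30 days of logs are kept there). We also should not rely too heavily on this one platform, so I decided to export the logs to another platform for analysis.

For log analysis, the simplest and crudest approach might be to write everything into MySQL, er, no, MongoDB. But to make later analysis easier, I went with the fairly common ELK architecture here.

The traditional ELK architecture looks like this:

In this post, however, the data is imported into Elasticsearch directly by Python, so it only counts as PEK (just kidding).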

Export Log

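There are two ways to export the logs:

Stream them out directly, which gives fairly real-time hot data
Store them in a bucket for offline analysis, which gives cold data

For exporting hot data, GCP officially recommends pubsubbeat. However, that repository (https://github.com/googlearchive/pubsubbeat) has been archived by Google and is marked "This project is no longer actively maintained by Google." Besides, my log analysis does not need to be that real-time, so I chose the second option, offline analysis.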

GCP Log to GCS

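Because this is offline analysis plus archival storage (and it all lives on GCP), the first step is to create a GCS bucket and then create a Sink on the Logs Viewer page:

Then specify the bucket created earlier as the place to store the logs:

Don't be impatient after creating it: logs are exported roughly once an hour, and the timing is not very consistent.

If the steps above went well, the files in GCS look something like this:

Logs of type http_load_balancer are all saved as a bunch of JSON files under /requests/<YYYY>/<MM>/<DD>/. Each file contains one JSON object per line, in the format shown above.

Once the logs are there, Google's gsutil can sync the files down from GCS periodically, for example from a crontab: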

0 * * * * gsutil rsync -d -r gs://bucket_name/ /mnt/your_logs_location

That way, all of your logs end up under /mnt/your_logs_location.
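Because each synced file holds one JSON object per line, it can be read line by line without loading a whole file at once. The following is a minimal sketch, not the author's code: the helper name iter_log_entries is hypothetical, and it assumes the exported files carry a .json extension.

```python
import json
from pathlib import Path


def iter_log_entries(log_dir):
    """Yield one dict per non-empty line from every *.json file under log_dir.

    Hypothetical helper; assumes the synced files end in ".json".
    """
    for path in sorted(Path(log_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_log_entries("/mnt/your_logs_location"))` would count the log entries synced so far.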

Aggregate Logs

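As noted above, Google scatters each day's logs across many small files, so if, like me, you import the previous day's data every day, the first thing to do is merge a day's logs into one file. This is simple and can be done by wrapping a bit of shell in Python, roughly like this: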

import os

RAW_LOG_LOC = "/mnt/your_logs_location/"
AGG_LOG_LOC = "/mnt/your_BIG_logs_location/"

# current_date -> "2020-09-02"
def aggregate_logs(current_date):
    # Convert "2020-09-02" to "2020/09/02" to match the directory layout
    date_path = current_date.replace("-", "/")
    # Concatenate every file for that day into a single <date>.json file
    cmd = "cat " + RAW_LOG_LOC + date_path + "/* > " + AGG_LOG_LOC + current_date + ".json"
    os.system(cmd)

This merges each day's scattered data and stores it somewhere like /mnt/your_BIG_logs_location/2020-09-22.json.
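If shelling out to cat is undesirable (for example, when paths might contain spaces), the same merge can be done in pure Python. This is only a sketch of an alternative, not the approach used in this post; the function name aggregate_logs_py and its raw_dir and agg_dir parameters are hypothetical.

```python
import glob
import os
import shutil


def aggregate_logs_py(current_date, raw_dir, agg_dir):
    """Concatenate one day's log files into <agg_dir>/<current_date>.json.

    Hypothetical shell-free variant of the merge step.
    """
    # "2020-09-02" -> "2020/09/02" to match the directory layout
    date_path = current_date.replace("-", "/")
    sources = sorted(glob.glob(os.path.join(raw_dir, date_path, "*")))
    target = os.path.join(agg_dir, current_date + ".json")
    with open(target, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                shutil.copyfileobj(f, out)  # append the file as-is
    return target
```

Like cat, this copies bytes verbatim, so it relies on every source file ending with a newline.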

Process Logs

The default log contains too much information; the fields that are actually useful for our analysis are mainly these:

{
  "httpRequest": {
    "latency": "1.216810s",
    "remoteIp": "123.123.123.123",
    "requestMethod": "GET",
    "requestSize": "351",
    "requestUrl": "https://website.test/test-quick-start.html",
    "responseSize": "6415",
    "serverIp": "10.174.0.5",
    "status": 200,
    "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
  }
}

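To make it easier to analyze where visitors come from, we also want the city and latitude/longitude for each IP, as well as whether the request went back to the origin. And to make total traffic easy to sum up, requestSize and responseSize should be converted to numbers (rather than strings). The final log entry should therefore look something like this: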

{
  "httpRequest": {
    "latency": "1.216810s",
    "remoteIp": "123.123.123.123",
    "requestMethod": "GET",
    "requestSize": 351,
    "requestUrl": "https://website.test/test-quick-start.html",
    "responseSize": 6415,
    "serverIp": "10.174.0.5",
    "status": 200,
    "userAgent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",

    "severity": "INFO",

    "country": "Japan",
    "city": "Heiwajima",
    "latitude": 35.5819,
    "longitude": 139.7663,

    "statusDetails": "response_sent_by_backend",

    "timestamp": "2020-09-21T13:00:11.297154Z"
  }
}

So next we add the GeoIP information:

import geoip2.database

def get_geo_info(ip_addr):
    # Close the database reader when done instead of leaking it on every call
    with geoip2.database.Reader('/path/to/GeoIP2-City.mmdb') as reader:
        geo_data = reader.city(ip_addr)

    lat = geo_data.location.latitude
    longi = geo_data.location.longitude
    country = geo_data.country.name
    city = geo_data.city.name

    return [lat, longi, country, city]

Parsing needs some extra logic at the end. For example, statusDetails may be client_disconnected_before_any_response, in which case responseSize is empty and has to be handled separately.
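One way to handle the empty responseSize case is a small conversion helper that falls back to a default. The sketch below is an assumption about how this could be done, not the author's actual logic: the names to_int and extract_sizes are hypothetical, and treating a missing size as 0 is a choice made here for illustration.

```python
def to_int(value, default=0):
    """Convert a size field such as "6415" to int.

    Falls back to `default` when the field is missing, empty, or not numeric.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return default


def extract_sizes(entry):
    """Return numeric requestSize/responseSize from one raw log entry (a dict)."""
    request = entry.get("httpRequest", {})
    return {
        "requestSize": to_int(request.get("requestSize")),
        "responseSize": to_int(request.get("responseSize")),
    }
```

Defaulting to 0 keeps sum aggregations in Kibana simple; if it matters whether the size was unknown rather than zero, passing `default=None` would preserve that distinction.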

The final import function looks roughly like this:

from elasticsearch import Elasticsearch, helpers
import os,uuid
import json
import datetime

def parse(current_date):
    real_path = AGG_LOG_LOC + current_date + ".json"
    with open(real_path) as f:
        for json_data in f:
            json_data = json.loads(json_data)
            new_object = {}

            new_object['statusDetails'] = json_data['jsonPayload']['statusDetails']

            # (omitted here)

            new_object['timestamp'] = json_data['timestamp']


            if '{"index"' not in new_object:
                yield {
                    "_index": "<index_name>",
                    "_type": "<doc_type_name>",
                    "_id": uuid.uuid4(),
                    "_source": new_object
                }
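# Connect to the Elasticsearch server
es_instance = Elasticsearch([{'host':'<Elastic_IP>','port':'9200'}])
# Import yesterday's data; dates are formatted like "2020-09-02"
yesterday = (datetime.date.today() - datetime.timedelta(days=1)).isoformat()
response = helpers.bulk(es_instance, parse(yesterday))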

EK Kicks in

Since Python imports the data directly into Elasticsearch and there is no Logstash, this part is just EK.

Map

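Having spent quite some time looking up Geo information for every IP, the natural next step is to use Kibana's Map to visualize the location of every request on a map. And then you run into "Couldn’t find any index patterns with geospatial fields":

Check the mapping of the index: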

curl <Elastic_IP>:9200/<index_name>/_mapping | jq .

You will find that the Geo data we imported directly has these types in ES:

"latitude": {
  "type": "float"
},
"longitude": {
  "type": "float"
},

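Kibana is clearly not that smart. To make sure the location data has the geo_point type it expects, we have to write the mapping by hand (the equivalent of a schema in other databases). Note that Elasticsearch cannot change the mapping of a field that already holds data, so the data has to be imported again. In Python, the mapping can be written like this:

es_instance.indices.create(index='<index_name>') # Create an index first (think of it as a database)

# Then set the mapping (the schema) for <doc_type_name> (think of it as a table) in <index_name>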
es_instance.indices.put_mapping(
    index="<index_name>",
    doc_type="<doc_type_name>",
    body={
        "properties": {  
            
            "latency": {"type": "text"},

            "requestSize": {"type": "long"},
            "responseSize": {"type": "long"},

            "userAgent": {"type": "text"},
            "statusDetails": {"type": "text"},
            "remoteIp": {"type": "ip"},
            "serverIp": {"type": "ip"},
            "severity": {"type": "text"},

            "timestamp": {"type": "date"},

            "country": {"type": "text"},
            "city": {"type": "text"},

            "geo": {"type": "geo_point"},
        }
    },
    include_type_name=True
)
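Mapping types are deprecated in Elasticsearch 7 and removed in 8, so the doc_type and include_type_name arguments above only work on older versions. A typeless variant might look like the sketch below. It assumes a 7.x Python client, where indices.create accepts a body argument; the function names build_index_body and create_index are hypothetical, and the field list simply mirrors the mapping above.

```python
def build_index_body():
    """Return a typeless index definition mirroring the mapping above."""
    return {
        "mappings": {
            "properties": {
                "latency": {"type": "text"},
                "requestSize": {"type": "long"},
                "responseSize": {"type": "long"},
                "userAgent": {"type": "text"},
                "statusDetails": {"type": "text"},
                "remoteIp": {"type": "ip"},
                "serverIp": {"type": "ip"},
                "severity": {"type": "text"},
                "timestamp": {"type": "date"},
                "country": {"type": "text"},
                "city": {"type": "text"},
                "geo": {"type": "geo_point"},
            }
        }
    }


def create_index(es, index_name):
    """Create the index and its mapping in one call.

    `es` is expected to be an elasticsearch.Elasticsearch instance
    (7.x client assumed).
    """
    return es.indices.create(index=index_name, body=build_index_body())
```

With this variant, the "_type" key in the bulk actions would be dropped as well.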

In the mapping above, the geo field we need is of type geo_point. Its value can be built by concatenating the latitude and longitude into a string:

new_object['geo'] = str(latitude) + "," + str(longitude)
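A geo_point given as a string must be in "lat,lon" order, and a GeoIP lookup can return None for either coordinate, in which case the concatenation above would produce an invalid value such as "None,None". The following is a minimal sketch of a guard; the helper name build_geo_point is hypothetical.

```python
def build_geo_point(latitude, longitude):
    """Return a "lat,lon" string for a geo_point field.

    Returns None when either coordinate is missing, so the caller can
    leave the field out instead of indexing an invalid value.
    """
    if latitude is None or longitude is None:
        return None
    return f"{latitude},{longitude}"
```

A caller could then set `new_object['geo']` only when the helper returns a value.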

Kibana

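Once the import is done, the data shows up directly in Kibana.

Check whether the map now renders properly:

You can also try summing the sent bytes for each URL:

Set up a scheduled task so the script imports data automatically every day. The next step is to start building Dashboards and Visualizations, and to team up with some front-end folks to analyze the data along various dimensions. To keep this post short and focused, that part is not covered here.

Happy Hacking!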
