Fast, Large-Scale Document Typo Correction with GPT-Assisted Judgement: A Case Study of Kong's Documentation Site

2024-04-27

A Simplified Chinese version of this article is available: 「带 GPT 辅助判定的快速大规模修文档 Typo——以 Kong 文档站的实践」

Once, someone jokingly said that the fastest way to participate in open source software is to fix typos in the repository, which is not wrong.

However, some people look down on those who fix typos, thinking that they are just padding their numbers. And typos in code do not always need to be fixed: a small typo in code or comments does no harm as long as the code runs.

When I first joined PingCAP, the first PR I submitted to @pingcap was https://github.com/pingcap/docs/pull/2058, changing http to https in the document.

Apart from code, there is one kind of repository whose typos I personally feel are worth fixing: the major documentation repositories.

Documentation is the face of a product. If you were a user, what would you think on seeing documents like these?

Source: https://docs.pingcap.com/zh/tidb/stable/system-variables#tidb_skip_missing_partition_stats-%E4%BB%8E-v730-%E7%89%88%E6%9C%AC%E5%BC%80%E5%A7%8B%E5%BC%95%E5%85%A5

Source: https://docs.pingcap.com/zh/tidb/stable/dashboard-profiling#%E6%94%AF%E6%8C%81%E7%9A%84%E6%80%A7%E8%83%BD%E6%95%B0%E6%8D%AE

Source: https://docs.konghq.com/gateway/changelog/#features-24. On this Kong page, availability is written as availibilty throughout.

How can this be? I have to reoprt an abouse!

How to fix typos

We can fix typos the same way Redis expires cached keys, in two modes:

Passive way
- When we read a document and find a typo, we, filled with a sense of justice, find the “Edit this docs” button, log into our GitHub account and submit a PR.
Active way
- Find a way to actively discover typos and submit them.

The former is slow: to find typos, you have to read all the documents carefully. Much as the order of Chinese characters does not necessarily affect reading comprehension, misordered letters are easy to read past. In the Kong document above, for example, you won’t easily notice that availibilty is actually a typo.

So the main purpose of this article is to introduce the latter, providing a “static check -> GPT judgment -> quick manual processing” method.

The active way

To fix typos as fast as possible, this article proposes a “static check -> GPT judgment -> quick manual processing” method. We will introduce the steps in this order.

Since there are quite a few typos in https://github.com/Kong/docs.konghq.com, and the place where I live is very close to Kong’s Shanghai office, we will use this repository as an example in this article.

Taken downstairs at Kong’s Shanghai office

Static check + preliminary screening

Originally, I wanted to handcraft a tool for this step, but I found a very useful one on GitHub called typos, https://github.com/crate-ci/typos. After installing it, you can go to the project directory and run typos to mark all potential typos. For example, in the docs.konghq.com/app directory:

find . -name "*.md" | xargs -I {} typos {}

You can see quite a few typos marked:

error: `hexidecimal` should be `hexadecimal`
  --> ./_src/gateway/plugin-development/pdk/kong.request.md:335:61
    |
335 |  * Percent-encoded values of reserved characters have their hexidecimal
    |                                                             ^^^^^^^^^^^
    |
error: `Hashi` should be `Hash`
  --> ./_src/gateway/reference/configuration/configuration-3.4.x.md:2223:58
     |
2223 | resurrected for when they cannot be refreshed (e.g., the HashiCorp vault is
     |                                                          ^^^^^
     |
error: `mis` should be `miss`, `mist`
  --> ./_src/gateway/reference/configuration/configuration-3.4.x.md:4138:11
     |
4138 | note that mis-management of keyring data may result in irrecoverable data loss.
     |           ^^^

But as you can see, there are quite a few false positives here: for HashiCorp, it thinks Hashi should be changed to Hash, and for mis-management, it thinks mis should be changed to miss or mist.

For this situation, we can write a typos.toml for a simple preliminary screening. Mapping a word to itself marks it as valid. The content is as follows:

[default.extend-words]
Hashi = "Hashi"
mis = "mis"

Then change the command to:

find . -name "*.md" | xargs -I {} typos {} --config /path/to/typos.toml

But typos marked this way still have to be judged by hand, and the corresponding files found and edited by hand, which is time-consuming and laborious. So we need typos to produce output that a program can consume in the next step. Fortunately, typos supports the --format json parameter. With this parameter added, the output looks like this:

{"type":"typo","path":"/path/to/workspace/docs.konghq.com/app/_src/gateway/how-kong-works/routing-traffic.md","line_num":685,"byte_offset":81,"typo":"fo","corrections":["of","for","do","go","to"]}
{"type":"typo","path":"/path/to/workspace/docs.konghq.com/app/_src/gateway/breaking-changes/30x.md","line_num":124,"byte_offset":6,"typo":"fuction","corrections":["function"]}
{"type":"typo","path":"/path/to/workspace/docs.konghq.com/app/_src/gateway/production/tracing/api.md","line_num":2,"byte_offset":19,"typo":"Referenece","corrections":["Reference"]}

For now, we call this file the “dirty Typo JSON”.
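As a rough sketch of how this step could be scripted (not the author's actual code), the following Python builds the typos command line and parses its JSON-lines output into records. The function names `build_typos_command`, `parse_typos_output` and `collect_dirty_typos` are hypothetical; only the `--format json` and `--config` flags and the record fields shown above come from the article, and typos is assumed to be installed and on the PATH.

```python
import json
import subprocess


def build_typos_command(paths, config_path=None):
    """Build the argument list for a typos run that emits one JSON object per line."""
    cmd = ["typos", "--format", "json"]
    if config_path:
        cmd += ["--config", config_path]
    return cmd + list(paths)


def parse_typos_output(text):
    """Keep only records whose type is "typo"; skip blank or non-JSON lines."""
    records = []
    for raw in text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and obj.get("type") == "typo":
            records.append(obj)
    return records


def collect_dirty_typos(paths, config_path=None):
    """Run typos (assumed to be on PATH) and return the parsed records."""
    # The return code is deliberately not checked: finding typos is the expected case.
    result = subprocess.run(
        build_typos_command(paths, config_path),
        capture_output=True,
        text=True,
    )
    return parse_typos_output(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to re-run the later steps on a saved output file without invoking typos again.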

GPT marking

In the previous section, we recorded potential typos as one JSON object per line. The next step is to:

- Read each line of the “dirty Typo JSON”
- Find the corresponding line content
- Find the potential typo
- Replace the potential typo
- Ask GPT whether it thinks the replacement is reasonable

The difficulty here lies in the Prompt. My Prompt is as follows:

{
  "messages": [
    {
      "role": "system",
      "content": """
      You are a judge who is familiar with the names and words and spelling of internet companies. I will give you a sentence and the word in the sentence that needs to be replaced. You need to tell me whether this word should be replaced with a number between 0-100. Under any circumstances, you can only answer a number between 0 and 100. The larger the number, the more likely it is to be replaced. If you are not sure, please answer the probability number. You can't have any additional explanations or comments, just answer the number.
      At the same time, you need to judge whether it is a specific company name (for example, Hashicorp is a company name and should not be replaced), or whether it is a meaningless string to decide. If it is a specific company name or a meaningless string, you need to answer 0. If it is a common word, you need to answer 100.
      Please first judge whether the sentence that needs to be rewritten is a meaningful sentence. If it is not a meaningful sentence, you need to answer 0.
      For example, in the sentence "Time-to-live (in seconds) of a HashiCorp vault miss (no secret).", Hashi is part of HashiCorp and is not a typo, so it should not be replaced. You need to answer 0.
      For example, in "02:21:00:86:ce:d0:fc:ba:92:e9:59:16:1c:c3:b2:11:11:ed:", ba is part of an example string and not any meaningful sentence, so it should not be replaced. You need to answer 0.
      For example, in "X-Kong-Admin-Request-ID: ZuUfPfnxNn7D2OTU6Xi4zCnQkavzMUNM", OTU is part of an example string and not any meaningful sentence, so it should not be replaced. You need to answer 0.
      """
    },
    {
      "role": "user",
      "content": "{} \n {} change to {}"
    }
  ],
  "stream": False,
  "model": "gpt-4",
  "temperature": 0.5,
  "presence_penalty": 0,
  "frequency_penalty": 0,
  "top_p": 1
}
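The steps listed earlier, combined with the user-message template in this prompt, might look like the following in Python. This is a minimal sketch rather than the author's script: the names `build_user_message` and `iter_user_messages` are hypothetical, and it assumes the three `{}` placeholders stand for the original line, the typo, and the first suggested correction.

```python
import json

# User message template copied from the prompt above.
USER_TEMPLATE = "{} \n {} change to {}"


def build_user_message(record):
    """Turn one dirty Typo JSON record into the user message sent to GPT.

    Returns None when the typo cannot be found on the recorded line.
    """
    with open(record["path"], "r", encoding="utf-8") as f:
        lines = f.readlines()
    index = record["line_num"] - 1  # line_num is 1-based
    if not 0 <= index < len(lines):
        return None
    line = lines[index].rstrip("\n")
    typo = record["typo"]
    if typo not in line or not record["corrections"]:
        return None
    # Assumption: only the first suggested correction is tried.
    correction = record["corrections"][0]
    return USER_TEMPLATE.format(line, typo, correction)


def iter_user_messages(dirty_json_path):
    """Yield (record, user_message) pairs for every usable line of the file."""
    with open(dirty_json_path, "r", encoding="utf-8") as f:
        for raw in f:
            if not raw.strip():
                continue
            record = json.loads(raw)
            message = build_user_message(record)
            if message is not None:
                yield record, message
```

Records whose line no longer contains the typo (for example because the file changed since the scan) are skipped instead of being sent to the model.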

After the Prompt is written, it’s time to pay for an OpenAI API key and wire it up. Since this is mainly a PoC, it is done in Python. The key code is as follows:

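# Assumption: the official openai Python package (v1+); OpenAI() reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

# formatted_message is the prompt dict above with the user content filled in.
gpt_rate_response = client.chat.completions.create(
    messages=formatted_message['messages'],
    model=formatted_message['model'],
    stream=formatted_message['stream'],
    top_p=formatted_message['top_p'],
    temperature=formatted_message['temperature'],
    presence_penalty=formatted_message['presence_penalty'],
    frequency_penalty=formatted_message['frequency_penalty']
)
# The reply text; per the prompt it should be a bare number between 0 and 100.
gpt_rate = gpt_rate_response.choices[0].message.content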

In the above code, gpt_rate is the model's reply, a string that should contain a number from 0 to 100. In practice, perhaps because my Prompt is not written very well, it only ever outputs 0 or 100. So we just need to discard the candidates with a score of 0.

After GPT marking, we remove the likely false positives from the “dirty Typo JSON” file and write the rest back. We call this GPT-cleaned file the “clean Typo JSON”.
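The filtering step could be sketched as follows. This is an illustration under assumptions, not the author's code: `parse_score` and `keep_record` are hypothetical names, the default threshold of 1 simply mirrors "discard the ones with a score of 0", and replies that cannot be parsed as a number from 0 to 100 are treated as rejections.

```python
import re


def parse_score(reply):
    """Extract an integer score in [0, 100] from the model reply, or None."""
    match = re.search(r"\d+", reply or "")
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def keep_record(reply, threshold=1):
    """Keep a candidate only when the reply parses and reaches the threshold."""
    score = parse_score(reply)
    return score is not None and score >= threshold
```

Parsing defensively matters because the model is only asked, not guaranteed, to answer with a bare number. Raising the threshold would make the filter stricter if the model ever returned intermediate scores.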

Quick marking replacement

Now we finally have the “clean Typo JSON”, and we need a way to apply the replacements to the files quickly. For ergonomic reasons, we use the following interaction: two lines appear on the screen each time, the first being the original line and the second the line after replacement; the user presses Y to confirm the replacement or N to skip it. The interface is as follows:

Part of the code implementation is as follows:

import sys
import termios
import tty

def do_replace(file_path, line_num, typo, correction):
    # Rewrite one line of the file, replacing every occurrence of the typo on it
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    lines[line_num - 1] = lines[line_num - 1].replace(typo, correction)  # line_num is 1-based
    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(lines)

def get_ch():
    # Read a single keypress without waiting for Enter (Unix terminals only)
    fd = sys.stdin.fileno()
    old_settings = termios.tcgetattr(fd)
    try:
        tty.setraw(fd)
        ch = sys.stdin.read(1)
    finally:
        # Always restore the original terminal settings
        termios.tcsetattr(fd, termios.TCSADRAIN, old_settings)
    return ch

...
# Display the original and corrected lines, and give color to the corrected word
typo_word_index = orignal_line.find(typo_word)
print("Path: ", json_obj['path'])
print(f"{orignal_line[:typo_word_index]}{Back.RED}{orignal_line[typo_word_index:typo_word_index + len(typo_word)]}{Style.RESET_ALL}{orignal_line[typo_word_index + len(typo_word):]}")
print(f"{corrected_line.replace(json_obj['corrections'][0], Back.GREEN + json_obj['corrections'][0] + Style.RESET_ALL)}")

print("Do you want to continue? [y/n]: ")
ch = get_ch()
if ch == 'y':
    do_replace(json_obj['path'], json_obj['line_num'], typo_word, json_obj['corrections'][0])
    print("Replaced!")
elif ch == 'n':
    continue
else:
    print("Exiting...")
    exit(0)
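Note that do_replace above replaces every occurrence of the typo on the line. If only the flagged occurrence should change, one option is to use the byte_offset field from the JSON record. The sketch below is a hypothetical helper, not part of the original script, and it assumes that byte_offset counts UTF-8 bytes from the start of the line, which the article does not state.

```python
def replace_at_offset(line, byte_offset, typo, correction):
    """Replace the typo only at the given UTF-8 byte offset within the line.

    Returns the line unchanged if the bytes at that offset are not the typo.
    Assumption: byte_offset is relative to the start of the line.
    """
    raw = line.encode("utf-8")
    typo_bytes = typo.encode("utf-8")
    end = byte_offset + len(typo_bytes)
    if raw[byte_offset:end] != typo_bytes:
        return line
    fixed = raw[:byte_offset] + correction.encode("utf-8") + raw[end:]
    return fixed.decode("utf-8")
```

Working on bytes rather than characters keeps the offset correct on lines that contain non-ASCII text, and the equality check guards against stale records.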

Of course, if there are enough typos, even with quick Y/N keypresses, hammering the keyboard for 10+ minutes is no different from mindless grunt work.

Further increase ROI

Using the “static check -> GPT judgment -> quick manual processing” method proposed in this article, I fixed most of the typos in the https://github.com/Kong/docs.konghq.com repository (GPT may produce a small number of false negatives, so a few are missed) and submitted a total of 3 PRs:

At the same time, I also practiced on the repositories of Cloudflare and Halo:

https://github.com/cloudflare/cloudflare-docs/pull/14263 (Merged)
https://github.com/halo-dev/docs/pull/343 (Merged)

The changes involved 68 files scattered throughout the repository, with a total time (clone repository + run script + manual submission) of about 30 minutes, a feat commonly known as “Typo Immortal”.

Currently, the most time-consuming part of this process is undoubtedly the “quick manual processing”. Even with the above TUI program and its quick Y/N keypresses, the final manual judgment is still bottlenecked by the operator (me). So here is one possible idea:

Assume that after “static check -> GPT judgment” there are 100 candidate typos, and that “GPT judgment” still lets some false positives through, say 10 of them. So we have 90 real typos and 10 fake ones. If we follow the “quick manual processing” above, we still have to press Y/N 100 times, which is quite laborious. It would be easier to apply the modifications to the files automatically after “GPT judgment” and then submit the PR. Then:

Because the maintainers of the repository have to review the PR manually anyway, they should notice the 10 false typos, at which point they have the following choices:

- Close the PR directly. But for large open source projects, this might not look very friendly, and others might question the motive for doing so.
- Close it and then open a new PR themselves. But because the majority are real typos, redoing the PR by hand would be very time-consuming.
- Fix the 10 false typos themselves and then merge it, which is quite likely.

With this approach, only one person needs to review the changes manually in the end. Since the responsible maintainer has to review the PR once anyway, this distributes the bottlenecked work among the maintainers of the various repositories, greatly increasing the efficiency of fixing typos, of course…
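A minimal sketch of the “apply everything automatically” idea, assuming the records have the same fields as the typos JSON output and that the first suggested correction is always taken. The function name `apply_all` is hypothetical, and like the line-based replacement used earlier, it replaces every occurrence of the typo on the recorded line.

```python
from collections import defaultdict


def apply_all(records):
    """Apply the first suggested correction of every record, file by file.

    Returns the number of lines that were actually changed.
    """
    # Group records by file so that each file is read and written once.
    by_path = defaultdict(list)
    for record in records:
        by_path[record["path"]].append(record)

    changed = 0
    for path, items in by_path.items():
        with open(path, "r", encoding="utf-8") as f:
            lines = f.readlines()
        for item in items:
            index = item["line_num"] - 1  # line_num is 1-based
            if not item["corrections"] or not 0 <= index < len(lines):
                continue
            new_line = lines[index].replace(item["typo"], item["corrections"][0])
            if new_line != lines[index]:
                lines[index] = new_line
                changed += 1
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
    return changed
```

The returned count gives a quick sanity check against the number of records before a PR is opened.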

Postscript

~~I wonder what the Kong maintainers thought when they saw so many file changes contained within these three PRs~~

In an era where AI/GPT is so prevalent, as a “developer” who doesn’t have much confidence in their abilities, doesn’t write code very well, and doesn’t chase trends very well, I found a seemingly quite practical AI scenario, besides using it for daily Q&A and Copilot. I hope this article can bring some inspiration to the readers~
