ChatGPT vs Ethernaut

| June 1, 2023

Security Insights

By Mariko Wakabayashi, Felix Wegener, Alireza Arjmand, Stephen Lloyd Webber

As artificial intelligence continues to advance, fears have arisen regarding its potential to replace human jobs. One area where this concern has been raised is with smart contract auditing. Could AI tools potentially replace the need for human auditors?

A recent experiment conducted by OpenZeppelin pitted ChatGPT against 28 Ethernaut challenges to see if it could identify the smart contract vulnerabilities.

GPT-4 was able to solve 19 out of the 23 challenges introduced before its training data cutoff date of September 2021. Although this sounds impressive, GPT-4 fared poorly on Ethernaut's newest levels, failing at 4 out of 5. This shows that while AI can be used as a tool to find some security vulnerabilities, it cannot replace the need for a human auditor. Read more for insights regarding how the experiments played out.

Using Prompts to Find Vulnerabilities

Conducting the test involved pasting the code for a given Ethernaut level and supplying the prompt:

Does the following smart contract contain a vulnerability?

In some cases, this was sufficient for GPT-4 to provide a solution, as was the case with Level 2, “Fallout”:

chatgpt

Training Data and Temperature

One significant factor for ChatGPT’s success with levels 1-23 is the possibility that GPT-4’s training data contained several solution write-ups for these levels (some correct and others of suboptimal quality).

Levels 24-28 were released after the 2021 cutoff for GPT-4’s training data, so the inability to solve these levels further points to ChatGPT’s training data including published solutions as a likely explanation for its success.

For the older levels it failed on, it's possible it learned from incorrect solutions. It's also possible this was caused by ChatGPT’s default setting for “temperature,” causing it to generate different answers for the same prompt. With values closer to 2, ChatGPT is configured to generate more creative responses. In other words, more random output. With lower values closer to 0, the answers become more focused and deterministic. ChatGPT's temperature is set to 1.0 by default.

As a machine learning tool, ChatGPT was designed to generate text and have human-like conversations, not to detect vulnerabilities. This means a machine learning model trained exclusively on high-quality vulnerability-detection datasets will most likely yield superior results. If the training data includes a lot of examples of machine-auditable bugs like "Gas-related issue" or “Uninitialized variable,” it's likely that ChatGPT would learn the general pattern and gain the knowledge required to detect a given class of vulnerability.

Failure Cases

In two cases, achieving success involved actively directing ChatGPT with a series of prompts and asking specific follow-up questions: i.e. if GPT-4 mentions X as part of a strategy, we specifically followed up with the directive “How can we achieve X?”

For example, in level 16, “Preservation,” it was necessary to supply several prompts to hone in on the vulnerability:

Does the following smart contract contain a vulnerability?

Can you write an attack contract for the previous smart contract?

Can you change the library address in the previous contract?

Only after the third question did GPT-4 suggest deploying a malicious contract that could directly solve the level.

There are several more levels that received multiple prompts, but were not solved:

Level 14 “Gatekeeper II”

Can the following smart contract be entered?

That does not seem correct. Can the following smart contract be entered?

Gate one and Gate two can be passed if you call the function from inside a constructor, how can you enter the GatekeeperTwo smart contract now?

Even with the massive hint in the final question, GPT-4 did not produce a correct strategy.

Level 21 “Shop”:

Does the following smart contract contain a vulnerability?

Can you buy from the previous contract with less than the specified price?

Show me another attack on the Shop that allows me to buy assets with less than specified price.

Despite strong guidance in the right direction (looking for a vulnerability that allows purchase at a lower price), GPT-4 was a complete failure, insisting that there was no way to obtain the item at a lower price than intended.

Again, a caveat to these examples is that GPT-4 was not trained exclusively on data focused on security vulnerabilities. But if the model was instead trained to detect vulnerabilities with high-quality data curated by security researchers, the results may be different.

While ChatGPT was able to perform well with specific guidance, the ability to supply such guidance depends on the understanding of a security researcher. This underscores the potential for AI tools to increase audit efficiency in cases where the auditor knows specifically what to look for and how to prompt Large Language Models like ChatGPT effectively.

What can be said of the failure cases? For example, when faced with Ethernaut Level 3, “Coinflip,” GPT-4 identified block.timestamp as a weak source of randomness but did not provide an accurate solution. The lack of introspection available makes such tools ineffectual as a source of truth.

Another key takeaway from the experiment was that although GPT-4 can identify some vulnerabilities, in-depth security knowledge is necessary to assess whether the answer provided by AI is accurate or nonsensical.

A prime example can be found in Level 24, “PuzzleWallet,” where GPT-4 invented a vulnerability related to multicall and falsely claimed that it was not possible for an attacker to become the owner of the wallet.

Commentary on Failure Cases

Level 3 Coinflip: Mixed result. Identifies block.timestamp as a weak source of randomness, but does not accurately describe a strategy to guess the correct result every time.
Level 14: Gatekeeper II: Provided a partial solution. However, it did not connect extcodesize = 0 to utilizing the constructor code.
Level 21: Shop. Complete failure, GPT-4 insisted that there is no option to obtain the item at a lower price than intended.
Level 23: DexTwo. GPT-4 points out supposed vulnerabilities related to the swap rate computation and lacking slippage protection, which is not related to the challenge’s solution. Even if presented with excessive hints towards the solution, GPT-4 claims that the exploit is not possible.
Level 24: PuzzleWallet: GPT-4 invents vulnerability related to multicall and falsely claims that it is not possible for an attacker to become the owner of the wallet.
Level 26: DoubleEntrypoint: GPT-4 does not identify any vulnerability related to the double entry-point issue
Level 27: Good Samaritan: GPT-4 claims that no critical vulnerability is present in the code and insists that draining the balance is not possible
Level 28: Gatekeeper III: GPT-4 describes a high-level attack, but cannot provide any information about correctly setting the password.

Takeaways for Security Research

It is clear from this experiment that smart contract analysis performed by GPT-4 cannot replace a human security audit. However, it can be used as a tool to find some security vulnerabilities that are very similar to the issues it has seen already. But given the rapid pace of innovation in blockchain and smart contract development, it’s important for humans to stay up to date on the latest attack vectors and innovations across Web3.

One area where machine learning algorithms have proven successful at aiding blockchain security is on the decentralized Forta network. Developers can easily integrate solutions like OpenZeppelin Defender with the Forta network, scanning multiple blockchains for anomalous and potentially malicious transactions.

OpenZeppelin’s growing AI team is currently experimenting with OpenAI as well as custom machine learning solutions to improve smart contract vulnerability detection. The goal is to help OpenZeppelin auditors improve coverage and complete audits more efficiently.

With re-entrancy attacks still remaining a common vulnerability (the Paribus attack occurred on 11 April 2023), OpenZeppelin’s AI team recently began machine learning experiments with re-entrancy detection as a first step. The team was able to produce promising models with a false positive rate below 1%. Early evaluation shows these models outperforming current industry-leading security tools, and continuing to work with OpenZeppelin’s auditors for further evaluation. Stay tuned for more research findings on this topic!

To the community of Web3 BUIDLers, we have a word of comfort – your job is safe! If you know what you are doing, AI can be leveraged to improve your efficiency. While advancements in AI may cause shifts in developer jobs and inspire the rapid innovation of useful tooling to improve efficiency, it is unlikely to replace human auditors in the near future. If you are passionate about ensuring the security of the blockchain, play Ethernaut to test and strengthen your security skills.

ChatGPT vs Ethernaut

Using AI to Hack Solidity Smart Contracts

Using Prompts to Find Vulnerabilities

Training Data and Temperature

Failure Cases

Commentary on Failure Cases

Takeaways for Security Research

Related Posts

Inside ZKStack's Crosschain Architecture — Part II: Gateway Settlement & Recursive Proofs

Inside ZKStack's Crosschain Architecture — Part I: A Deep Dive into Merkle Tree Hierarchies

Testing Arbitrum Stylus Smart Contracts with Motsu