Deconstructing a Solidity Contract — Part VI: The Metadata Hash

By Alejandro Santander in collaboration with Leo Arias

Note: This article is part of a series. If you haven’t read the previous article, please have a look at it first. We’re deconstructing the EVM bytecode of a simple Solidity contract.

  1. Deconstructing a Solidity Contract — Part I: Introduction
  2. Deconstructing a Solidity Contract — Part II: Creation vs. Runtime
  3. Deconstructing a Solidity Contract — Part III: The Function Selector
  4. Deconstructing a Solidity Contract — Part IV: Function Wrappers
  5. Deconstructing a Solidity Contract — Part V: Function Bodies
  6. Deconstructing a Solidity Contract — Part VI: The Metadata Hash ⬅

Update: This article initially contained mistakes and missed a few important points about the design of the metadata hash, which were pointed out by chriseth from the Solidity team. Thank you! Among the corrections made by Chris is the fact that the structure is known as the “metadata hash” rather than the “Swarm hash”, and that the Solidity compiler is agnostic about the system used to store contract metadata.

In the last article, we noticed that the runtime bytecode generated by the Solidity compiler appends a strange structure after the function bodies block. You can see it in the deconstruction diagram, or in the image below, where it is labeled the “metadata hash”:

Figure 1: The metadata hash can be found in the last few opcodes of the runtime bytecode of a contract.

What exactly are these opcodes doing?

Notice the STOP opcode at instruction 421. You may think that if there’s a STOP opcode there, whatever comes after it is basically unreachable bytecode, right? Well, not exactly: the code could be JUMP-ing over the STOP opcode. However, there are no JUMPDEST instructions after it, so that rules out the possibility of any execution reaching this part of the bytecode via a JUMP.
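The “no JUMPDEST after the STOP” check can be sketched in a few lines of Python. This is only an illustration: the function name `jumpdest_offsets` is made up for this example, and the input is the trailing block of bytes quoted later in this article, starting at the byte right after the STOP. The walk skips the immediate data of PUSH instructions, because a 0x5b byte inside push data is not a valid jump destination.

```python
from typing import List

# Trailing bytes of the runtime bytecode, as quoted later in this article.
TAIL_HEX = "a165627a7a723058202c27c1ef4be478b21f663f0d0ecdd1c73638730ffebbff1e3c7a234db7df6fd10029"

JUMPDEST = 0x5B
PUSH1, PUSH32 = 0x60, 0x7F


def jumpdest_offsets(code: bytes) -> List[int]:
    """Return the offsets of JUMPDEST opcodes, ignoring bytes that are PUSH data."""
    offsets = []
    i = 0
    while i < len(code):
        op = code[i]
        if op == JUMPDEST:
            offsets.append(i)
        if PUSH1 <= op <= PUSH32:
            # PUSHn carries n bytes of immediate data; skip over them.
            i += op - PUSH1 + 1
        i += 1
    return offsets


print(jumpdest_offsets(bytes.fromhex(TAIL_HEX)))  # []
```

An empty list means no JUMP or JUMPI can legally land anywhere in this block.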

In fact, if you analyze all the possible execution flows in this contract (which, believe it or not, is what we have done in this series!), you’ll see that indeed this code is totally unreachable.

So why would the Solidity compiler append non-executable code to its generated output? Actually, this isn’t the first time we’ve run into this. We saw it in Part II of the series, where a constructor’s arguments were appended to the end of the creation bytecode. That code wasn’t supposed to be executed by the EVM either; it was just there as a sort of hack to store the initialization values of a contract for consumption in constructors.

Alright, we still haven’t answered the question of what this block of code is. To do that, let’s walk through the opcodes and try to make some sense out of them, shall we?

The first thing we see is a LOG1. If we look this opcode up in the Yellow Paper or in the Solidity documentation, we can see that LOG0 to LOG4 opcodes are used for logging events in the Ethereum blockchain. Which…makes no sense, since we won’t be executing any of this code…

After that, we can see a PUSH6 of 0x627a7a723058, a SHA3, a couple of INVALIDs, a SWAP10, a DELEGATECALL, etc. Wait, INVALIDs? What does that even mean?! Yup, total nonsense in terms of EVM interpretation. Clearly, looking at these bytes as EVM opcode representation is absolutely pointless. We need to look at this as raw byte data, which as you remember can be found in Remix’s Compile tab > Details panel > Runtime Bytecode section > object property. The LOG1 opcode is really an 0xa1 byte, so the whole block of code at the end of the contract looks like this:

// …a165627a7a723058202c27c1ef4be478b21f663f0d0ecdd1c73638730ffebbff1e3c7a234db7df6fd10029
// END OF CONTRACT

The answer to this riddle can be found in Solidity’s documentation, in the Encoding of the Metadata Hash in the Bytecode section. The documentation is brief, but it gives us exactly what we need. The compiler hashes the contract’s metadata (a JSON file describing the contract: its sources, the compiler version and settings used, its ABI, and so on) and injects this hash into the contract’s own bytecode! This metadata can also be seen in Remix: Remix’s Compile tab > Details panel > Metadata section.

This hash can be used in Swarm as a lookup URL to find the contract’s metadata. Swarm is a decentralized storage system, similar to IPFS. The idea is that a platform like Etherscan identifies this structure in the bytecode and provides the location of the contract’s metadata within a decentralized storage system. A user can fetch that metadata and use it to prove that the bytecode they are looking at is the deterministic product of a given Solidity source code, compiled with a specific compiler version and configuration. The hash acts as a digital signature of sorts, tying a piece of compiled bytecode to its origins. To verify that the bytecode is legit, you would hash the metadata yourself and check that you get the same hash.

And that’s not all. Wallet applications can use the metadata hash to fetch the contract’s metadata, extract its source, recompile it with the compiler settings used originally, verify that the resulting bytecode matches the contract’s bytecode, then fetch the contract’s JSON ABI and look at the NatSpec documentation of the function being called.

This end-to-end authentication path, built into the bytecode generated by the Solidity compiler, can be used not only to inform a user about the action about to be performed, but also to validate the legitimacy of that action.

For example, if we look at the CryptoKitties contract on Etherscan, we can see that at the bottom of the page, Etherscan gives us the contract’s metadata address in Swarm, which was extracted from the bytecode in the way we’ve just seen: bzzr://a6465fc1ce7ab1a92906ff7206b23d80a21bbd50b85b4bde6a91f8e6b2e3edde. You can look into Swarm’s documentation to better understand this URL scheme.

Let’s go back to our BasicToken’s bytecode and understand how Etherscan (or any other similar utility for that matter) actually finds the hash in the bytecode.

As the Solidity documentation states, an 0xa1 and an 0x65 will be injected into the bytecode. Read as EVM opcodes, these two bytes translate to LOG1 and PUSH6. The bytes that follow are text: the letter “b” encoded as ASCII (or UTF-8) is 0x62, “z” is 0x7a, and so on. You can see the whole thing decoded in the following diagram:
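If you want to check the character-to-byte mapping yourself, a couple of lines of Python are enough. This is just an illustration of the encoding step; the variable names are arbitrary.

```python
key = "bzzr0"

# Each character maps to one ASCII byte: b -> 0x62, z -> 0x7a, r -> 0x72, 0 -> 0x30.
per_char = [hex(ord(c)) for c in key]
print(per_char)  # ['0x62', '0x7a', '0x7a', '0x72', '0x30']

encoded = key.encode("ascii").hex()
print(encoded)  # 627a7a7230

# And back again, from the bytes found in the bytecode to readable text.
decoded = bytes.fromhex("627a7a7230").decode("ascii")
print(decoded)  # bzzr0
```

These five bytes are exactly what sits between the 0x65 and the 0x58 in the block at the end of the contract.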

Figure 2: The metadata hash decoded as a Swarm URL.

So, any application trying to find the metadata hash would look for this particular byte pattern at the end of the contract’s bytecode and extract the URL from it.
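Here is a minimal sketch of how such an application might do it, in Python. It is not how Etherscan actually implements this; the function name `extract_swarm_hash` is hypothetical. It relies on one detail from the Solidity documentation that we haven’t discussed: the final two bytes of the bytecode (0x0029 here) hold the length of the encoded structure that precedes them. It also assumes the exact “bzzr0” layout used by the compiler version in this series.

```python
from typing import Optional

# Expected start of the structure: 0xa1, 0x65, "bzzr0", 0x58, 0x20.
PREFIX = bytes.fromhex("a165627a7a72305820")
HASH_LEN = 32  # 0x20 bytes of hash follow the prefix


def extract_swarm_hash(runtime_hex: str) -> Optional[str]:
    """Return the 32-byte metadata hash as hex, or None if the pattern is absent."""
    if runtime_hex.startswith("0x"):
        runtime_hex = runtime_hex[2:]
    code = bytes.fromhex(runtime_hex)
    if len(code) < 2:
        return None
    # The last two bytes give the big-endian length of the structure before them.
    payload_len = int.from_bytes(code[-2:], "big")
    if payload_len + 2 > len(code):
        return None
    payload = code[-2 - payload_len:-2]
    if len(payload) != len(PREFIX) + HASH_LEN or not payload.startswith(PREFIX):
        return None
    return payload[len(PREFIX):].hex()


# The trailing bytes of our contract, as shown earlier in this article.
tail = "a165627a7a723058202c27c1ef4be478b21f663f0d0ecdd1c73638730ffebbff1e3c7a234db7df6fd10029"
swarm_hash = extract_swarm_hash(tail)
print("bzzr://" + swarm_hash)
# bzzr://2c27c1ef4be478b21f663f0d0ecdd1c73638730ffebbff1e3c7a234db7df6fd1
```

Passing the full runtime bytecode instead of just the tail works the same way, since only the end of the input is inspected.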

Solidity uses an encoding called CBOR, which stores not only the hash but also the decentralized storage system it belongs to and the version of its scheme. In this case, it’s using Swarm’s version zero bzz:// URL scheme and that’s why the structure contains the chars “b”, “z”, “z”, “r”, “0”. Alternatively, it could use something like “i”, “p”, “f”, “s”, “r”, “0”, indicating that the structure encodes an IPFS URL scheme. This makes it agnostic in terms of which storage system is used. It could be changed in the future, or we could even get to choose which storage system we want the bytecode to reference upon compilation.
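To see why the structure starts with 0xa1 and 0x65, it helps to know how CBOR packs its headers: the top three bits of a header byte are the “major type” and the low five bits carry a small length or count. The sketch below decodes only the three header bytes that appear in our contract. The helper `cbor_header` and the `MAJOR_TYPES` table are made up for this example and cover just the types we need, not the whole CBOR specification.

```python
from typing import Tuple

# Only the CBOR major types that appear in the metadata structure.
MAJOR_TYPES = {2: "byte string", 3: "text string", 5: "map"}


def cbor_header(byte: int) -> Tuple[int, int]:
    """Split a CBOR header byte into (major type, additional info)."""
    return byte >> 5, byte & 0x1F


for b in (0xA1, 0x65, 0x58):
    major, info = cbor_header(b)
    print(hex(b), MAJOR_TYPES[major], info)
# 0xa1 map 1
# 0x65 text string 5
# 0x58 byte string 24
```

So 0xa1 announces a map with one key-value pair, 0x65 a five-character text key (“bzzr0”), and 0x58 a byte string whose length is in the next byte (additional info 24 means “one length byte follows”), which is the 0x20, that is, the 32 bytes of the hash.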

To retrieve the metadata file, we would have to connect to the same Swarm network to which the file was uploaded, using something like swarm-gateways.net or a local Swarm node. Right now this is quite difficult to do, because Swarm is still under heavy development and has not yet stabilized its hashing scheme, something the Solidity team is addressing in issue #4092.

The hash itself is the result of running a specific hashing algorithm on the contract’s metadata file, which the Solidity compiler calculates after it has run all its other tasks (mainly, compiling =D). When we were trying to interpret these bytes as EVM opcodes, some of the hash’s bytes didn’t have a corresponding opcode, and that’s why we were getting INVALIDs. In fact, the bytes of a hash can decode to just about any sequence of opcodes by chance, so every contract’s hash will disassemble differently and look weird in tools like Remix. What I do is ignore the bytecode when I see a LOG1 followed by a PUSH6 and a few INVALIDs, understanding that what I am seeing is the metadata hash that the Solidity compiler injects.

And this concludes our analysis of this bizarre structure found at the end of every contract produced by Solidity.

Better yet, this concludes the entire series. Yay! If you followed along and digested this considerable amount of highly technical, horrendously boring material (at least to most people), then I salute you! I hope that by now you feel right at home when you see EVM bytecode in the wild, and that you add this skill to your toolset when analyzing and developing smart contracts for Ethereum.

Thanks for reading!
