SHA-256 in 43 lines of Python

A concise implementation of mathematically designed chaos

kuco23
6 min readJul 26, 2022

SHA-256 is a one-way hash function. It’s hard to stress the importance of those functions in cryptography and if you want to get some idea, check out this post. Because crypto is cool, I’ll just say that Bitcoin uses SHA-256 in its proof of work, to make blocks hard to mine and thus give time for consensus to take place over the decentralized peer-to-peer network. The proof of work basically requires partially inverting SHA-256, which is hard as SHA-256 is believed to be unpredictable and hides its input behind the chaotic output very well. It’s a mathematical representation of chaos.

It also has a bunch of other useful properties. For example, it provides a computationally unique 256-bit identifier of the input, meaning it’s computationally impossible to find two inputs that give the same output. So, you can hash 1TB of data into just 256-bits that uniquely identify that whole data. BitTorrent uses this to verify large downloaded parts were not tampered with, by matching their hashes with the ones specified in a trusted document (torrent file).

There’s already a good post on implementing SHA-256 in Python, but I found it slightly over-complicated, so this is an attempt to simplify things by using the bitarray library. The point of the post is to make an implementation of SHA-256 that’s short, concise, and easy to understand.

For the just-give-me-the-code type of people, here’s the gist (constants K, H, and the standard function _binsum are not included in the line count).

Preliminaries

If you’ve ever used SHA-256, you’re probably used to inputting any data and getting back some hexadecimal number, e.g.

e2e7dcf1d4f040269156068f4806cbd5fb054092cc96e5b882a6de0477e2c4b7

But internally SHA-256 works only with binary/bit arrays. (e.g. 010101). As they can (and do by our computers) represent all other data, we will just assume input is a binary array and produce another binary array, which can then be converted to hexadecimal.

To understand SHA-256, we need to explain a few basic operations used to chaotically mess up the input. They are only ever used on bit arrays of length 32 and the length is never affected by them!

AND, OR and XOR

These operations act on bits (zeros and ones), though we’re going to apply them to bit-arrays bit-by-bit. The following table describes how they operate on bits.

Some examples:

1010101010 AND 0101010101 = 0000000000

1111001111 OR 1111100000 = 1111101111

1111000001 XOR 0011000001 = 1100000000

Note that in Python, AND is denoted by &, OR by | and XOR by ^ (though more commonly known as ).

RIGHT SHIFT and ROTR

Right-shift is denoted by >> and shifts every bit in an array to the right, by the given number. This means, we glue that number of zeros to the left and remove that number of bits from the right (imagine pushing the bits right, having the right-overflow fall off and the left void replaced by zeros):

1111000001 >> 4 = 0000111100

Similarly, ROTR is just shifting the array cyclicly, meaning we now don’t add zeros to the left, but instead the bits we removed from the right (imagine pushing the bits right and having the right-overflow appear on the left):

ROTR(111100100101, 4) = 010111110010

It’s called ROTR, as the implementation kind of rotates the bits from the left side to the right.

Addition modulo 2³²

We want to add binary numbers, represented by bit-arrays. But adding two 32-bit numbers can produce a 33-bit number, whereas we can’t change the bit-array size. This is solved by removing the overflow bits on the left (or more mathematically, calculating the remainder of the division by 2³²). As for addition itself, this is a basic high-school carry algorithm.

This is so general it doesn’t count as SHA code :|

We’re stopping the algorithm when it would have overflown our 32-bits.

SHA-256

Sha-256 uses some initial values that pretty much act as a random seed.

This doesn’t count as code :|

The values are 32-bit numbers, expressed in hexadecimal; later, we’ll convert them to bit-arrays. The derivation of the above constants won’t matter to us.

We’ll also use some more specific functions that are just combinations of our previously explained operators:

~ is bit negation (0 to 1 and 1 to 0) acting bit-by-bit

SHA-256 takes a binary array and pads it in a very specific way. Remember that the given bit array is of any size, but SHA-256 wants to work with 32-bit arrays. This will be achieved in two steps.

First, it will pad the given array to be a multiple of 512, so it can then uniformly split it into 512-bit parts. It also wants to encode the information about the given size into the padding, so it pads the n-bit array in the following way:

  • append the bit 1 to the given array,
  • append k zeros (k is specified later),
  • append the 64-bit binary encoding of n.

Now, k takes care of us being able to nicely split the given padding into 512-bit blocks. So, k is the smallest number making this padding have a length that is a multiple of 512. There are various ways of calculating this. I took a more mathematical approach.

Now we’ll just post the whole SHA-256 function and explain the steps after.

First, Kb and Hb are bitarray representations of the previous hexadecimal constants and binsum is applying the previously defined 32-bit summation to sum any (not just two) number of arguments. The chunks function splits an array into an array of fixed-length arrays (in our case 32 and 512). After that:

  • [1] We set the compressor variable to our initial “random seed” Hb. Then we iterate over 512-bit chunks of the padded input. Each chunk will later affect the compressor value in some chaotic way.
  • [2] The 512-bit chunks are further split into 32-bit pieces, which now act as random seeds on which we apply our previously defined chaotic functions. They are applied 48 (64 16) times to produce a nice mess inside variable W.
  • [3] Now we merge the mess with the compressor by applying our chaotic functions (64 times) to its 32-bit parts (the letter variables), while also shuffling them around at each turn.
  • [4] Each processed (messed up) 32-bit array is added (binsumed) to the appropriate 32-bit part of the compressor.
  • [5] After using all 512-bit chunks of our padded input to mess up with the compressor, it has its 8 arrays of length 32 glued into one 256-bit array, which is our return value.

And that’s it for the low-level bit-array SHA-256 implementation!

General input and hex output

It’s completely up to you how you encode your data into bit-arrays. Here we’ll see one way of doing it and construct a more general SHA-256 that can take strings or integers as input.

Let’s demonstrate the chaotic nature of SHA-256 by hashing some large string (e.g. the first page of the Bible) and comparing it to a hash of that same string, but replacing just one letter (e.g. the second one - I to i):

1520227ea79705d034bcfe0cc5e22dfdcb7a0959a4185b842c3aa64daa2e2929

915d2b84debdc0352970d2d40bd6bae5b91d11b5ccd66eb61117ad8840efa4be

A note on the differences in implementations

We’ve implemented SHA-256 as described in this official-looking document, but there are varying versions. E.g. Python’s hashlib accepts the bitarray object, but the outputs are different. The implementation seems to have changed with Python 3.10.0. Anyway, I got into SHA-256 internal workings to match this implementation, done in circom (a rather obscure language for building arithmetic circuits… that’s another story).

Conclusion

The point here really was to nicely modularize the SHA-256 function and make it concise in order to easier understand the big picture. I could make it shorter, but it was a sort-of min-max process, where my OCD maximized the code pythonicity and readability.

--

--

kuco23
kuco23

Written by kuco23

Math MSc | Smart contract dev @ Flare network | https://kuco23.github.io

No responses yet