Building a hash function, especially one that accepts inputs of arbitrary length, seems like a challenging task. In this section, we'll see one approach for constructing hash functions, called the Merkle-Damgård construction. 

Instead of a full-fledged hash function, imagine that we had a collision-resistant function whose inputs were of a single fixed length, but longer than its outputs. In other words, h : {0, 1}^{n+t} → {0, 1}^n , where t > 0. We call such an h a compression function. This is not compression in the usual sense of the word - we are not concerned about recovering the input from the output. We call it a compression function because it "compresses" its input by t bits (analogous to how a pseudorandom generator "stretches" its input by some amount). 

The following construction is one way to build a full-fledged hash function (supporting inputs of arbitrary length) out of such a compression function: 

Construction 11.2 (Merkle-Damgård) 

Let h : {0, 1}^{n+t} → {0, 1}^n be a compression function. Then the Merkle-Damgård transformation of h is MD_h : {0, 1}^∗ → {0, 1}^n, where:

Merkle-Damgard Construction

The idea of the Merkle-Damgård construction is to split the input x into blocks of size t. The end of the string is filled out with 0s if necessary. A final block called the "padding block" is added, which encodes the (original) length of x in binary.


Example 

Suppose we have a compression function h : {0, 1}^{48} → {0, 1}^{32}, so that t = 16. We build a Merkle-Damgård hash function out of this compression function and wish to compute the hash of the following 5-byte (40-bit) string: 

x = 01100011 11001101 01000011 10010111 01010000 

We must first pad x appropriately (MDPAD(X)): 

  • Since x is not a multiple of t = 16 bits, we need to add 8 bits to make it so. 
  • Since |x| = 40, we need to add an extra 16-bit block that encodes the number 40 in binary (101000). 

After this padding, and splitting the result into blocks of length 16, we have the following: 

 \underbrace{01100011 \; 11001101}_{x_1} \; \; \underbrace{01000011 \; 10010111}_{x_2} \; \; \underbrace{0101000 \; 000000}_{x_3} \; \; \underbrace{00000000 \; 00101000}_{x_4}

The final hash of x is computed as follows:

Merkle-Damgard Construction

We are presenting a simplified version, in which MD_h accepts inputs whose maxi- mum length is 2^t − 1 bits (the length of the input must fit into t bits). By using multiple padding blocks (when necessary) and a suitable encoding of the original string length, the construction can be made to accommodate inputs of arbitrary length (see the exercises). 

The value y0 is called the initialization vector (IV), and it is a hard-coded part of the algorithm. 

As discussed above, we will not be making provable security claims using the library- style definitions. However, we can justify the Merkle-Damgård construction with the following claim: 

Claim 11.3 

Suppose h is a compression function and MD_h is the Merkle-Damgård construction applied to h. Given a collision x, x′ in MD_h , it is easy to find a collision in h. In other words, if it is hard to find a collision in h, then it must also be hard to find a collision in MD_h

Proof 

Suppose that x, x′ are a collision under MD_h. Define the values x_1, . . . , x_{k+1} and y_1, . . . , y_{k+1} as in the computation of MD_h (x). Similarly, define x′_1, . . . , x′_{k′+1} and y'_1, . . . , y′_{k′+1} as in the computation of MD_h (x′). Note that, in general, k may not equal k'

Recall that: 

MD_h (x) = y_{k+1} = h(y_k ‖ x_{k+1})

MD_h (x′) = y′_{k′+1} = h(y′_{k′} ‖ x′_{k′+1})

Since we are assuming MD_h (x) = MD_h (x'), we have y_{k+1} = y′_{k′+1}. We consider two cases: 

Case 1: If |x| \neq |x'|, then the padding blocks x_{k+1} and x'_{k′+1} which encode |x| and |x'| are not equal. Hence we have y_k ‖ x_{k+1} \neq y′_{k′} ‖ x'_{k′+1}, so y_k ‖ x_{k+1} and y′_{k′} ‖ x'_{k′+1} are a collision under h and we are done. 

Case 2: If |x| = |x'|, then x and x' are broken into the same number of blocks, so k = k'. Let us work backwards from the final step in the computations of MD_h(x) and MD_h (x'). We know that: 

y_{k+1} = h(y_k ‖ x_{k+1})

y′_{k+1} = h(y′_k ‖ x'_{k+1})

If y_k ‖ x_{k+1} and y′_k ‖ x'_{k+1} are not equal, then they are a collision under h and we are done. Otherwise, we can apply the same logic again to y_k and y′_k , which are equal by our as- sumption. 

More generally, if y_i = y′_i , then either y_{i−1} ‖ x_i and y′_{i−1} ‖ x'_i are a collision under h (and we say we are "lucky"), or else y_{i−1} = y'_{i−1} (and we say we are "unlucky"). We start with the premise that y_k = y′_k . Can we ever get "unlucky" every time, and not encounter a collision when propagating this logic back through the computations of MD_h (x) and MD_h (x')? The answer is no, because encountering the unlucky case every time would imply that x_i = x'_i for all i. That is, x = x'. But this contradicts our original assumption that x \neq x'. Hence we must encounter some "lucky" case and therefore a collision in h.


Source: Mike Rosulek, https://joyofcryptography.com/pdf/chap11.pdf
Creative Commons License This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License.