In this column, let’s learn about hash functions as a prelude to learning about ways to save and compare passwords.

First, I’ll discuss hashing.

A hash is a function that calculates a fixed-length value from data.

This”fixed-length calculated value” is called a”hash value.”

The concept of hashing itself is relatively simple. There are various benefits and limitations due to this quality. Hash functions are used for a variety of purposes.

### “Make haste slowly.”

This way of learning means that when you have”basic peripheral knowledge,”you effectively understand the subject faster. So we should try to understand the material below as basic knowledge for understanding hashing.

First, let’s understand the implementation of a pseudo hash function as preparation.

Because a hash value is often binary, let’s implement it that way.

With the implementation below, the hash value is 8-bit (1-byte).

` // Pseudo hash function`

function provisional_hash (string $base_data)

{

//

$hash = 0;

// Calculate hash values

for [$i = 0; $i < strlen ($base_data); ++$i] {

$hash = [$hash + ord ($base_data[$i]) ]% 256;

}

// Convert int to binary and return

return pack ("c”, $hash);

}

// test

var_dump [bin2 hex (provisional_hash (’あ’) )];

var_dump [bin2 hex (provisional_hash (’い’) )];

var_dump [bin2 hex (provisional_hash (’abb’) )];

var_dump [bin2 hex (provisional_hash (’abc’) )];

var_dump [bin2 hex (provisional_hash (’abcd’) )];

Because the return value is binary, it is converted with bin2 hex for output.

The function shown above is absolute no-good as a practical hash. But it shows the first part of hashing, the process of reducing a data record of variable length to fixed length, as mentioned above.

From the act of turning data into fixed length values, we see the following principles and qualities in a hash function.

### “Pigeonhole principle” or”Dirichlet’s box principle, Dirichlet’s drawer principle”

This principle states that if you try to place n+1 balls into n boxes, at least one box will have two or more items.

For example, if there are four boxes and five balls, at one extreme you can put all five balls into one box. Or, you can try to average out the placement as much as possible and put one ball in three boxes and two balls in one box. In any case, at the very least one box will have two or more balls.

This principle seems obvious. But to understand hash functions (or more precisely,”collision” described below), this principle is surprisingly important.

### Collision

This is due to the pigeonhole principle when reducing a long strength to a short, fixed length.

Basically, a hash function usually produces different hash values depending on the input data. But on occasion different values may result in the same hash value. This is called a collision.

Why does a collision occur? It is because following the pigeonhole principle, there are more items (the size of the original data) than the number of boxes (the size of the hash value).

A basic characteristic of a hash function is that”collisions will always occur.”(Because a „perfect hash function” has strict preconditions, we exclude it from consideration. ) Dealing with collisions is critical for using a hash function, and there are many approaches.

### Birthday paradox

This phenomenon must also be explained. But first I’ll describe it.

This”paradox” states that there is a divergence (paradox) between human intuition and reality when it comes to the question”How many people must there to reach the probability of 50% or more that two (or more) of them have the same birthday? “

First, using the pigeonhole principle, if there are 367 persons, then the probability that two (or more) of them have the same birthday is 100% (if it is a leap year, there are 366 days, and +1 is 367).

That being the case, many of you may then think,”Well, for 50%, that’s about half of the persons, or 183, right? “

But, according to calculations, you just need 23 persons for a 50% or greater probability that any two of them have the same birthday.

This problem is called a paradox because of the great divergence between the”intuitive number” and the”actual calculated number.”

The detailed calculations are omitted here, but if you’re interested, trying searching this topic on the Internet.

### Collision resistance (property)

Hash functions treat the set of original data and hash values in various ways. Many of them are premised on the basic approach that if the data differ, then their hash values will differ, and that if the hash values are the same, then the original data are the same.

As a result, if this premise can be exploited, then an attack will become easier. To defend against this, it is important to know how firmly this premise should be protected.

**Second pre-image resistance or weak collision resistance**

Suppose you have original data message_1. You are asked to calculate message_hash_1, the hash value of the original data.

If you wish to attack, it would be helpful if you can create a separate original data, message_2, that is different from message_1 but produces the same hash value, message_hash_1. The ease or difficulty of creating such a data is called”second pre-image resistance” or”weak collision resistance.”A function without this characteristic is vulnerable against second pre-image attacks (weak collision attacks).

As the precondition of second pre-image resistance, there is something called”pre-image resistance.”This is an indicator of how hard it is, when given hash value h, to find the original message that produced the hash value.

**Strong collision resistance (collision resistance) **

Whereas in second pre-image resistance or weak collision resistance there is”original data message_1,”strong collision resistance means the ease or difficulty of finding „original data 1 message_10″ and „original data 2 message_11″ that produce the same hash value message_hash_1 under the condition that the original data are not known.

If the messages are found, the method of finding them becomes useful for an attack.

This is related to the birthday paradox.Considering the size of the hash value space (that is, the number of bits of a hash value; using the birthday paradox as an example, one year or 366), an attack is successful with a surprisingly small number of patterns (using the birthday paradox as an example, 23 persons).

The hash function to use changes depending on how firmly to hold onto resistance and on its purpose, especially the question of what algorithm should be used.

From these considerations, at present the hash functions md5 and SHA-1 are often avoided, and functions like SHA-2 should be used. This is the consensus right now. Who knows, in the future SHA-2 may also be considered insecure.

One more thing. There is an attack method different from collision.

### Rainbow table

Because a hash function is irreversible, in principle it is not possible to return the hash value to the original data.

On the other hand, naturally the same hash value is required from the same original data.

As a result, if you prepare a huge database with hash values as the keys and original data as the values, then in principle you can infer the original data from a hash value.

In the past, it was believed while theoretically possible, such a table was impossible in practice. However, recent advances in computing have also brought power to cracking. A rainbow table may work well enough for cracking passwords. So you must be careful.

So as you can see, hash functions have a variety of characteristics and both advantages and disadvantages.

Still, hash functions have been used for a variety of purposes from relatively a long time ago. Roughly they are

- Manipulation detection (confirming message integrity / data check)
- Hash table (search performance)
- Making deciphering difficult due to irreversibility (passwords)

### Manipulation detection (confirming message integrity / data check)

Functions called fingerprint and message digest belong to this category.

Hash functions and values are also used here for the purpose of checking if data is entered correctly, rather than whether they have been tampered with. A checksum or check digit serves this purposes. For example, the final digit of a credit card is the number for checking. So a calculation can determine if there was a mistake in entering credit card numbers.

### Hash table (search performance)

There are various implementations. For example, PHP arrays are often used for this purpose.

According to the PHP manual, an array is actually an ordered map.”Tt can be treated as an array, list (vector), hash table (an implementation of a map), dictionary, collection, stack, queue, and probably more.”So we see that hash is used here.

There is an attack called HashDoS.It uses collisions to attack. In short, it is a denial-of-service (DOS) attack that exhausts CPU resource by biasing the hash values of keys with the hash function used for an associative array, so as to cause collisions.

Incidentally, to prevent HashDoS, setting max_input_vars in ini serves as a major brake on this attack. So be sure to set this variable.

### Making deciphering difficult due to irreversibility (passwords)

When saving passwords, to make sure that even internal personnel cannot view the passwords and that, if there is a data breach, the passwords can’t be understood immediately, an irreversible hash function is used to save the passwords. This use of the hash function is quite common.

We learned basic knowledge about hashing in this column.

In the next column, let’s put this knowledge to practical use, with a focus on passwords.