Hash Collisions: Why Your 'Unique' Fingerprints Aren't (And Why That's Usually OK)

Figure: the SHAttered attack produced two different PDFs with identical SHA-1 hashes.

In 2017, Google researchers generated two different PDF files with identical SHA-1 hashes, finally proving what cryptographers had warned about for years: hash functions don’t create truly unique fingerprints. This “SHAttered” attack required 9 quintillion SHA-1 computations—equivalent to 6,500 years of single-CPU computation.

Yet despite this proof, we still trust hash functions for everything from Git commits to blockchain transactions to password storage. Why? Because the full story of hash collisions is more nuanced than “unique” versus “not unique.”


“In cryptography, ‘secure’ has always meant ‘secure for now’.”


Historical Note: This post updates my 2005 article on hash collisions. Back then, MD5 was still considered "acceptable with caveats" and adding salt to passwords seemed cutting-edge. It's fascinating (and somewhat alarming) to see which predictions came true and which assumptions had to be completely reconsidered.

Twenty years ago, I wrote about hash collisions when the biggest concern was spyware-removal applications getting false positives. Today, MD5 can be cracked on a laptop, simple salting is woefully inadequate, and billions of dollars in cryptocurrency depend on hash collision resistance. But the fundamental questions remain: How unique are hash codes really? And when does it matter?


The Pigeonhole Principle in Action

Hash algorithms generate fixed-length outputs regardless of input size. SHA-256 always produces 256 bits (32 bytes). SHA-3-512 always produces 512 bits. This creates an inherent mathematical constraint that’s easy to visualize but hard to grasp in practice.

import hashlib

# These vastly different inputs...
short_text = "Hi"
medium_text = "A" * 1000  # 1KB
long_text = "B" * 1_000_000  # 1MB

# ...all produce the same length output
print(len(hashlib.sha256(short_text.encode()).digest()))   # 32 bytes
print(len(hashlib.sha256(medium_text.encode()).digest()))  # 32 bytes  
print(len(hashlib.sha256(long_text.encode()).digest()))    # 32 bytes

The math is unforgiving: if you have \(n\) possible hash values, you only need \(n + 1\) distinct inputs to guarantee at least one collision. SHA-256 has \(2^{256}\) possible values, a number so large (roughly \(10^{77}\)) that it exceeds the estimated number of atoms in the observable universe.
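The \(n + 1\) guarantee is only the worst case. The birthday bound bites far earlier, at roughly \(\sqrt{n}\) inputs. A quick sketch of the standard approximation \(p \approx 1 - e^{-k^2 / 2n}\) makes the scale difference concrete:

```python
import math

def collision_probability(num_items, bits):
    """Birthday-bound approximation: probability of at least one collision
    among num_items values drawn uniformly from a 2**bits space."""
    space = 2.0 ** bits
    return 1.0 - math.exp(-(num_items ** 2) / (2.0 * space))

# A 64-bit hash collides surprisingly early: ~2**32 items give ~39% odds
p64 = collision_probability(2 ** 32, 64)
print(f"64-bit hash, 2^32 items: {p64:.2f}")

# SHA-256's space is so vast that even 10**18 items round to zero in floats
p256 = collision_probability(10 ** 18, 256)
print(f"SHA-256, 10^18 items: {p256:.2e}")
```

This is why a 64-bit checksum is fine for a hash table but reckless for content addressing, while SHA-256's 256 bits leave an effective security margin of about \(2^{128}\) operations against collision search.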

But here’s what most discussions miss: not all collisions are created equal.


The Semantic Collision Problem

Figure: random bytes almost never accidentally form valid JSON, code, or readable text.

Finding two inputs that hash to the same value is one thing. Finding two meaningful inputs that hash to the same value is exponentially harder.

Consider trying to generate a collision for a classified document. Even if you could generate billions of documents with matching hash values, the probability of any making coherent sense, let alone containing plausible intelligence, approaches zero.

import random
import string
import json

def generate_random_text(length):
    """Generate a random printable string (a stand-in for collision-attempt data)."""
    return ''.join(random.choices(string.printable, k=length))

# Generate 1 million random attempts
attempts = [generate_random_text(100) for _ in range(1_000_000)]

# How many are valid JSON?
valid_json_count = 0
for attempt in attempts:
    try:
        json.loads(attempt)
        valid_json_count += 1
    except json.JSONDecodeError:
        pass

print(f"Valid JSON found: {valid_json_count}/1,000,000")
# Typical output: Valid JSON found: 0/1,000,000

# How many are valid Python code?
valid_python_count = 0
for attempt in attempts:
    try:
        compile(attempt, '<string>', 'exec')
        valid_python_count += 1
    except (SyntaxError, ValueError):
        pass

print(f"Valid Python found: {valid_python_count}/1,000,000")
# Typical output: 0, or a small handful - strings that happen to
# start with '#' compile as comments

Language has structure. Random bytes don’t accidentally form valid JSON, compilable code, or readable prose. This is why Git’s reliance on SHA-1 (with an optional SHA-256 object format now available) remained practically secure despite theoretical vulnerabilities. Creating two valid source files that compile, run correctly, and contain malicious logic while matching an existing hash? That’s not just hard; it’s effectively impossible with current technology.
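For concreteness, a Git object ID is just a hash over a typed header plus the content. A minimal sketch of how Git hashes a blob:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Git blob ID: SHA-1 over b'blob <size>\\0' followed by the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a - the same ID that
# `git hash-object` reports for a file containing "hello\n"
```

An attacker who wants to swap a malicious file into your history must produce content that matches this hash *and* still parses, builds, and behaves plausibly, which is the semantic barrier described above.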


What I Got Wrong and Right in 2005

Looking back at my original 2005 post, it’s instructive to see what held up and what didn’t:

What I Got Right ✅

  • Semantic collisions are nearly impossible - Still true. Random data doesn’t accidentally become meaningful.
  • Salt is essential for passwords - Though what seemed “foolproof” then is now the bare minimum.
  • Hash algorithm strength matters - I recommended SHA-256 for sensitive applications. Good call.
  • Context determines security needs - Hash tables vs. cryptographic uses require different guarantees.

What I Underestimated ❌

  • GPU acceleration - Modern GPUs test 10+ billion SHA-256 hashes/second, not the thousands I imagined.
  • Rainbow table sophistication - They grew from gigabytes to terabytes, covering far more than “common passwords.”
  • MD5’s rapid demise - I thought it had years left. It was fully broken within 12 months.
  • Blockchain’s hash dependence - Didn’t see cryptocurrency coming, where hash collisions = economic catastrophe.

What Changed Everything 🔄

  • Cloud computing - Renting 1000 GPUs for an hour costs less than my 2005 laptop.
  • Cryptocurrency - Created billion-dollar bounties for breaking hash functions.
  • Quantum computing - From science fiction to IBM offering quantum cloud access.

Real-World Collision Attacks

What’s Broken

MD5: Completely Compromised

# Don't do this anymore!
import hashlib

password = "user_password"
md5_hash = hashlib.md5(password.encode()).hexdigest()
print(f"MD5 (INSECURE): {md5_hash}")

# Collisions can be generated in seconds
# Real attack: 2008 rogue CA certificate using MD5 collisions
# Modern GPUs: ~25 billion MD5 hashes/second
# Status: NEVER use for anything security-critical

In 2008, researchers created a rogue CA certificate using MD5 collisions, allowing them to impersonate any website. Today, you can generate MD5 collisions on a laptop in seconds using tools like HashClash.

SHA-1: Officially Deprecated

# Also deprecated for security use
sha1_hash = hashlib.sha1(password.encode()).hexdigest()
print(f"SHA-1 (DEPRECATED): {sha1_hash}")

# SHAttered attack stats:
# - 9,223,372,036,854,775,808 SHA-1 computations
# - 6,500 CPU years (or 110 GPU years)
# - ~$45,000 in cloud computing costs (2020)
# Modern GPUs: ~15 billion SHA-1 hashes/second

What’s Still Secure

# Modern, secure hash functions
sha256_hash = hashlib.sha256(password.encode()).hexdigest()
sha3_hash = hashlib.sha3_256(password.encode()).hexdigest()

print(f"SHA-256 (SECURE): {sha256_hash}")
print(f"SHA-3 (SECURE): {sha3_hash}")

# Also consider BLAKE2 for performance
blake2_hash = hashlib.blake2b(password.encode()).hexdigest()
print(f"BLAKE2b (SECURE & FAST): {blake2_hash}")

Current Status:

  • SHA-256/SHA-3: No practical collision attacks known
  • BLAKE2/BLAKE3: Modern alternatives, faster than SHA-256
  • Energy requirement for SHA-256 collision: by some estimates, more than the sun outputs in a year

Modern Password Hashing: Beyond Salt

Figure: password hashing evolved from plain MD5 to memory-hard, time-expensive algorithms like Argon2.

Twenty years ago, I wrote that adding salt to passwords before hashing would produce “an almost fool-proof system.” Today, it’s the absolute minimum, and often insufficient.

The Evolution: Then vs. Now

Then (2005)

import hashlib
import os

# Old way (DON'T DO THIS)
password = "user_password"
salt = os.urandom(16)
old_hash = hashlib.sha1(
    password.encode() + salt
).hexdigest()

# Seemed secure at the time!

Now (2025)

# pip install argon2-cffi
import argon2

# Modern way with Argon2
hasher = argon2.PasswordHasher(
    memory_cost=65536,  # 64 MB
    time_cost=3,        # iterations  
    parallelism=4,      # threads
)

password = "user_password"
password_hash = hasher.hash(password)

Why Simple Hashing Failed

Rainbow tables (precomputed hash databases) destroyed simple hash-based password storage:

# Demonstrating why unsalted hashes are vulnerable
import hashlib

common_passwords = ["password", "123456", "password123", "admin"]
rainbow_table = {}

# Attacker precomputes common password hashes
for pwd in common_passwords:
    rainbow_table[hashlib.sha256(pwd.encode()).hexdigest()] = pwd

# Now they can instantly reverse any matching hash
target_hash = hashlib.sha256("password".encode()).hexdigest()
if target_hash in rainbow_table:
    print(f"Password found: {rainbow_table[target_hash]}")  
    # Output: Password found: password
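A per-user random salt is exactly what breaks this precomputation: the same password maps to a different hash for every user, so one table can't cover them all. A minimal sketch:

```python
import hashlib
import os

def salted_hash(password: str, salt: bytes) -> str:
    """Hash a password with a per-user random salt prepended."""
    return hashlib.sha256(salt + password.encode()).hexdigest()

# Two users pick the identical password...
salt_a, salt_b = os.urandom(16), os.urandom(16)
hash_a = salted_hash("password", salt_a)
hash_b = salted_hash("password", salt_b)

# ...but their stored hashes differ, so a precomputed table matches neither
print(hash_a != hash_b)  # True
```

Note that salting only defeats precomputation; it does nothing about raw per-guess hashing speed, which is why memory-hard algorithms are needed on top.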

Modern GPUs can test billions of SHA-256 hashes per second. Argon2, winner of the 2015 Password Hashing Competition, requires substantial memory for each guess, neutralizing the GPU advantage:

import time
import hashlib
import argon2

password = "test_password"

# Time SHA-256 (fast - bad for passwords)
start = time.time()
for _ in range(10000):
    hashlib.sha256(password.encode()).hexdigest()
sha_time = time.time() - start

# Time Argon2 (slow by design - good for passwords)
hasher = argon2.PasswordHasher()
start = time.time()
for _ in range(10):  # Only 10 iterations!
    hasher.hash(password)
argon_time = time.time() - start

print(f"SHA-256 (10,000 hashes): {sha_time:.2f} seconds")
print(f"Argon2 (10 hashes): {argon_time:.2f} seconds")
# SHA-256 is orders of magnitude faster per hash - and for
# password storage, being slow is exactly the point!

Alternative Modern Algorithms

While Argon2 is the gold standard, other algorithms serve specific needs:

# bcrypt - Battle-tested, widely supported
import bcrypt
hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=12))

# scrypt - Memory-hard, used by some cryptocurrencies
import hashlib
import os
hashed = hashlib.scrypt(
    password.encode(),
    salt=os.urandom(16),
    n=16384, r=8, p=1
)

# PBKDF2 - NIST approved, works everywhere
import hashlib
hashed = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode(),
    salt=os.urandom(16),
    iterations=100000
)

Blockchain and Hash Collisions

Cryptocurrencies bet billions on hash collision resistance. Bitcoin’s approach is particularly paranoid:

import hashlib

def bitcoin_hash(data):
    """Bitcoin uses double SHA-256 for extra security."""
    first_hash = hashlib.sha256(data.encode()).digest()
    second_hash = hashlib.sha256(first_hash).digest()
    return second_hash.hex()

# Example Bitcoin block header hashing
block_header = "version|prev_block|merkle_root|timestamp|bits|nonce"
block_hash = bitcoin_hash(block_header)
print(f"Bitcoin block hash: {block_hash}")
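Hash behavior is also what makes proof-of-work meaningful: miners search for a nonce whose double-SHA-256 hash falls below a target. A toy version of that search (the header string and the hex-zero difficulty check are simplifications of Bitcoin's real below-target comparison):

```python
import hashlib

def double_sha256(data: bytes) -> str:
    """Bitcoin-style double SHA-256, hex-encoded."""
    return hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()

def mine(header: str, difficulty: int) -> int:
    """Brute-force a nonce until the hash starts with `difficulty` hex zeros."""
    nonce = 0
    while not double_sha256(f"{header}|{nonce}".encode()).startswith("0" * difficulty):
        nonce += 1
    return nonce

nonce = mine("toy_block_header", 4)
print(nonce, double_sha256(f"toy_block_header|{nonce}".encode())[:12])
```

Each extra hex zero multiplies the expected work by 16, yet verifying a claimed nonce takes one hash. That asymmetry only holds if no one can steer the hash output, which is precisely what collision and preimage resistance guarantee.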

If SHA-256 falls, the implications are staggering:

  • Transaction history could be rewritten
  • Digital signatures could be forged
  • Proof-of-work becomes meaningless
  • Billions in value evaporates

This is why Ethereum is already planning post-quantum cryptography migrations. NIST’s Post-Quantum Cryptography competition selected four algorithms in 2022, with implementation beginning across critical infrastructure:

# Simulating crypto-agility in smart contracts
import hashlib

class HashAlgorithm:
    SHA256 = 1
    SHA3_256 = 2
    BLAKE3 = 3
    CRYSTALS_DILITHIUM = 4  # Post-quantum signature
    SPHINCS_PLUS = 5         # Stateless post-quantum

class FutureProofContract:
    def __init__(self):
        self.hash_version = HashAlgorithm.SHA256
        
    def hash_data(self, data):
        if self.hash_version == HashAlgorithm.SHA256:
            return hashlib.sha256(data.encode()).hexdigest()
        elif self.hash_version == HashAlgorithm.SHA3_256:
            return hashlib.sha3_256(data.encode()).hexdigest()
        # Ready to upgrade when quantum computers arrive
        raise NotImplementedError("Algorithm not wired up yet")
        
    def upgrade_hash_algorithm(self, new_version):
        """Allows migration to stronger algorithms."""
        self.hash_version = new_version

When Collisions Don’t Matter

Not every use of hashing requires collision resistance:

# Hash table - collisions are expected and handled
class SimpleHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]
    
    def _hash(self, key):
        # Simple hash - collisions are fine here
        return hash(key) % self.size
    
    def insert(self, key, value):
        index = self._hash(key)
        # Collision handling via chaining
        self.table[index].append((key, value))
    
    def get(self, key):
        index = self._hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

# Collisions just mean slight performance degradation
table = SimpleHashTable(size=5)
table.insert("apple", 1)
table.insert("banana", 2)  # Might collide - that's OK!

Use cases where collisions are acceptable:

  • Hash tables/dictionaries (handled via chaining)
  • Cache keys (collision = cache miss, not security breach)
  • Load balancing (collision = slightly uneven distribution)
  • Data deduplication (with a cryptographic hash, collision probability is negligible; a real collision would silently merge distinct data, so weak hashes are off the table here)

Best Practices for 2025

1. Choose the Right Tool

import hashlib
import argon2
import hmac

# File integrity (collision resistance matters)
def file_checksum(filepath):
    sha256_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256_hash.update(chunk)
    return sha256_hash.hexdigest()

# Password storage (use specialized algorithm)
def hash_password(password):
    return argon2.PasswordHasher().hash(password)

# Message authentication (use HMAC)
def create_mac(message, secret_key):
    return hmac.new(
        secret_key.encode(),
        message.encode(),
        hashlib.sha256
    ).hexdigest()

# Content addressing (collision resistance critical)
def content_address(data):
    return hashlib.sha3_256(data.encode()).hexdigest()

2. Plan for Crypto-Agility

import hashlib
import json
from enum import Enum

class HashVersion(Enum):
    SHA256_V1 = "sha256_v1"
    SHA3_256_V2 = "sha3_256_v2"
    BLAKE3_V3 = "blake3_v3"

class VersionedHasher:
    def __init__(self):
        self.current_version = HashVersion.SHA3_256_V2
        
    def hash_with_version(self, data):
        """Store algorithm version with hash."""
        if self.current_version == HashVersion.SHA256_V1:
            hash_value = hashlib.sha256(data.encode()).hexdigest()
        elif self.current_version == HashVersion.SHA3_256_V2:
            hash_value = hashlib.sha3_256(data.encode()).hexdigest()
        else:
            raise NotImplementedError(f"{self.current_version} not wired up yet")
        
        return {
            "version": self.current_version.value,
            "hash": hash_value
        }
    
    def verify(self, data, versioned_hash):
        """Verify using the original algorithm."""
        version = HashVersion(versioned_hash["version"])
        
        if version == HashVersion.SHA256_V1:
            computed = hashlib.sha256(data.encode()).hexdigest()
        elif version == HashVersion.SHA3_256_V2:
            computed = hashlib.sha3_256(data.encode()).hexdigest()
        else:
            raise NotImplementedError(f"{version} not wired up yet")
        
        return computed == versioned_hash["hash"]

# Usage
hasher = VersionedHasher()
stored_hash = hasher.hash_with_version("important data")
print(json.dumps(stored_hash, indent=2))

3. Security Checklist

Never use for security:

  • ❌ MD5 (broken since 2004)
  • ❌ SHA-1 (broken since 2017)
  • ❌ Plain SHA-256 for passwords (too fast)

Current best practices:

  • ✅ SHA-256/SHA-3 for file integrity
  • ✅ Argon2id for password storage
  • ✅ BLAKE3 for performance-critical hashing
  • ✅ HMAC for message authentication
  • ✅ Store algorithm version with hashes
  • ✅ bcrypt/scrypt as Argon2 alternatives

Closing Thoughts

Hash functions don’t produce unique values. They produce values so astronomically unlikely to collide that we can pretend they’re unique. For SHA-256, finding a collision would require more computational power than humanity will likely ever possess. But as history has shown, “unfeasibly difficult” is not “impossible.”

The history of cryptography is littered with “unbreakable” systems that fell to advancing mathematics and computing power. MD5 went from secure to broken in 13 years. SHA-1 lasted 22 years. How long will SHA-256 survive? Quantum computers may shorten that timeline, although the known quantum speedups for collision-finding are far more modest than Shor’s algorithm’s impact on public-key cryptography.

The lesson isn’t to panic about hash collisions. It’s to design systems that can evolve. Store your algorithm versions. Plan your migration paths. Understand when collision resistance matters and when it doesn’t.

Twenty years from now, someone might be updating this post again, explaining why SHA-256 is broken and why we should have seen it coming. The fundamentals won’t change. Finite outputs for infinite inputs guarantee collisions. But hopefully, we’ll have learned to build systems that gracefully handle the inevitable transition to whatever comes next.

See you in 2045 when we’re explaining why SHA-256 wasn’t quantum-resistant and how we should have known all along.



Try It Yourself

Download the enhanced code samples on GitHub

The repository includes runnable versions with timing comparisons, detailed output, and additional examples that demonstrate each concept in action.