Hash Collisions: Why Your 'Unique' Fingerprints Aren't (And Why That's Usually OK)
In 2017, Google researchers generated two different PDF files with identical SHA-1 hashes, finally proving what cryptographers had warned about for years: hash functions don’t create truly unique fingerprints. This “SHAttered” attack required 9 quintillion SHA-1 computations—equivalent to 6,500 years of single-CPU computation.
Yet despite this proof, we still trust hash functions for everything from Git commits to blockchain transactions to password storage. Why? Because the full story of hash collisions is more nuanced than “unique” versus “not unique.”
*“In cryptography, ‘secure’ has always meant ‘secure for now’.”*
Twenty years ago, I wrote about hash collisions when the biggest concern was spyware-removal applications getting false positives. Today, MD5 can be cracked on a laptop, simple salting is woefully inadequate, and billions of dollars in cryptocurrency depend on hash collision resistance. But the fundamental questions remain: How unique are hash codes really? And when does it matter?
The Pigeonhole Principle in Action
Hash algorithms generate fixed-length outputs regardless of input size. SHA-256 always produces 256 bits (32 bytes). SHA-3-512 always produces 512 bits. This creates an inherent mathematical constraint that’s easy to visualize but hard to grasp in practice.
```python
import hashlib

# These vastly different inputs...
short_text = "Hi"
medium_text = "A" * 1000       # 1 KB
long_text = "B" * 1_000_000    # 1 MB

# ...all produce the same length output
print(len(hashlib.sha256(short_text.encode()).digest()))   # 32 bytes
print(len(hashlib.sha256(medium_text.encode()).digest()))  # 32 bytes
print(len(hashlib.sha256(long_text.encode()).digest()))    # 32 bytes
```
The math is unforgiving: if you have \(n\) possible hash values, you only need \(n + 1\) distinct inputs to guarantee at least one collision. For SHA-256, that's \(2^{256}\) possible values, a number so large (roughly \(10^{77}\)) that it exceeds the estimated number of atoms in the observable universe.
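The pigeonhole bound is also looser than it sounds: by the birthday paradox, you *expect* a collision after only about \(\sqrt{n}\) random inputs, not \(n + 1\). A sketch that makes this visible by truncating SHA-256 to 32 bits (the truncation is purely for demonstration; it is not a weakness of SHA-256 itself):

```python
import hashlib

def truncated_sha256(data: bytes, nbytes: int = 4) -> bytes:
    """First nbytes of SHA-256 -- a deliberately tiny 32-bit hash."""
    return hashlib.sha256(data).digest()[:nbytes]

seen = {}
i = 0
while True:
    msg = f"input-{i}".encode()
    h = truncated_sha256(msg)
    if h in seen:
        print(f"Collision after {i + 1} distinct inputs:")
        print(f"  {seen[h]!r} and {msg!r} -> {h.hex()}")
        break
    seen[h] = msg
    i += 1

# A 32-bit hash has 2**32 values, yet the birthday bound predicts a
# collision after only ~2**16 (about 65,000) inputs -- which is what
# this search typically finds in a second or two.
```

Scaled up, the same math gives the \(2^{128}\)-work birthday attack that defines SHA-256's practical security margin.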
But here’s what most discussions miss: not all collisions are created equal.
The Semantic Collision Problem
Finding two inputs that hash to the same value is one thing. Finding two meaningful inputs that hash to the same value is exponentially harder.
Consider trying to generate a collision for a classified document. Even if you could generate billions of documents with matching hash values, the probability of any making coherent sense, let alone containing plausible intelligence, approaches zero.
```python
import json
import random
import string

def generate_random_text(length):
    """Generate a random printable string, simulating a collision candidate."""
    return ''.join(random.choices(string.printable, k=length))

# Generate 1 million random attempts
attempts = [generate_random_text(100) for _ in range(1_000_000)]

# How many are valid JSON?
valid_json_count = 0
for attempt in attempts:
    try:
        json.loads(attempt)
        valid_json_count += 1
    except json.JSONDecodeError:
        pass
print(f"Valid JSON found: {valid_json_count}/1,000,000")
# Output: Valid JSON found: 0/1,000,000

# How many are valid Python code?
valid_python_count = 0
for attempt in attempts:
    try:
        compile(attempt, '<string>', 'exec')
        valid_python_count += 1
    except (SyntaxError, ValueError):
        pass
print(f"Valid Python found: {valid_python_count}/1,000,000")
# Output: Valid Python found: 0/1,000,000
```
Language has structure. Random bytes don't accidentally form valid JSON, compilable code, or readable prose. This is why Git's reliance on SHA-1 (with a SHA-256 transition now underway) remained practically secure despite theoretical vulnerabilities. Creating two valid source code files that compile, run correctly, and contain malicious logic while matching an existing hash? That's not just hard; it's effectively impossible with current technology.
What I Got Wrong and Right in 2005
Looking back at my original 2005 post, it's instructive to see what held up and what didn't:
What I Got Right ✅
- Semantic collisions are nearly impossible - Still true. Random data doesn’t accidentally become meaningful.
- Salt is essential for passwords - Though what seemed “foolproof” then is now the bare minimum.
- Hash algorithm strength matters - I recommended SHA-256 for sensitive applications. Good call.
- Context determines security needs - Hash tables vs. cryptographic uses require different guarantees.
What I Underestimated ❌
- GPU acceleration - Modern GPUs test 10+ billion SHA-256 hashes/second, not the thousands I imagined.
- Rainbow table sophistication - They grew from gigabytes to terabytes, covering far more than “common passwords.”
- MD5’s rapid demise - I thought it had years left. It was fully broken within 12 months.
- Blockchain’s hash dependence - Didn’t see cryptocurrency coming, where hash collisions = economic catastrophe.
What Changed Everything 🔄
- Cloud computing - Renting 1000 GPUs for an hour costs less than my 2005 laptop.
- Cryptocurrency - Created billion-dollar bounties for breaking hash functions.
- Quantum computing - From science fiction to IBM offering quantum cloud access.
Real-World Collision Attacks
What’s Broken
MD5: Completely Compromised
```python
# Don't do this anymore!
import hashlib

password = "user_password"
md5_hash = hashlib.md5(password.encode()).hexdigest()
print(f"MD5 (INSECURE): {md5_hash}")

# Collisions can be generated in seconds
# Real attack: 2008 rogue CA certificate using MD5 collisions
# Modern GPUs: ~25 billion MD5 hashes/second
# Status: NEVER use for anything security-critical
```
In 2008, researchers created a rogue CA certificate using MD5 collisions, allowing them to impersonate any website. Today, you can generate MD5 collisions on a laptop in seconds using tools like HashClash.
SHA-1: Officially Deprecated
```python
# Also deprecated for security use
import hashlib

password = "user_password"
sha1_hash = hashlib.sha1(password.encode()).hexdigest()
print(f"SHA-1 (DEPRECATED): {sha1_hash}")

# SHAttered attack stats:
# - 9,223,372,036,854,775,808 SHA-1 computations
# - 6,500 CPU years (or 110 GPU years)
# - ~$45,000 in cloud computing costs (2020)
# Modern GPUs: ~15 billion SHA-1 hashes/second
```
What’s Still Secure
```python
# Modern, secure hash functions
import hashlib

password = "user_password"
sha256_hash = hashlib.sha256(password.encode()).hexdigest()
sha3_hash = hashlib.sha3_256(password.encode()).hexdigest()
print(f"SHA-256 (SECURE): {sha256_hash}")
print(f"SHA-3 (SECURE): {sha3_hash}")

# Also consider BLAKE2 for performance
blake2_hash = hashlib.blake2b(password.encode()).hexdigest()
print(f"BLAKE2b (SECURE & FAST): {blake2_hash}")
```
Current Status:
- SHA-256/SHA-3: No practical collision attacks known
- BLAKE2/BLAKE3: Modern alternatives, faster than SHA-256
- Energy requirement for SHA-256 collision: More than the sun outputs in a year
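To put those status claims in rough numbers: a birthday attack on SHA-256 needs on the order of \(2^{128}\) hash evaluations. A back-of-envelope sketch, assuming a single GPU at the ~10 billion hashes/second figure cited above (an illustrative rate, not a benchmark):

```python
# Back-of-envelope: time for one GPU to run a 2**128-hash birthday
# attack against SHA-256, assuming ~1e10 hashes/second.
hashes_needed = 2 ** 128
hashes_per_second = 10 ** 10

seconds = hashes_needed / hashes_per_second
years = seconds / (60 * 60 * 24 * 365)

print(f"{years:.2e} years")  # on the order of 10**21 years
# For comparison, the universe is roughly 1.4e10 years old.
```

Even renting a billion such GPUs only shaves nine orders of magnitude off a twenty-one-order-of-magnitude problem.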
Modern Password Hashing: Beyond Salt
Twenty years ago, I wrote that adding salt to passwords before hashing would produce “an almost fool-proof system.” Today, it’s the absolute minimum, and often insufficient.
The Evolution: Then vs. Now
Then (2005)
```python
import hashlib
import os

# Old way (DON'T DO THIS)
password = "user_password"
salt = os.urandom(16)
old_hash = hashlib.sha1(password.encode() + salt).hexdigest()
# Seemed secure at the time!
```
Now (2025)
```python
# pip install argon2-cffi
import argon2

# Modern way with Argon2
hasher = argon2.PasswordHasher(
    memory_cost=65536,  # 64 MB
    time_cost=3,        # iterations
    parallelism=4,      # threads
)
password = "user_password"
password_hash = hasher.hash(password)
```
Why Simple Hashing Failed
Rainbow tables (precomputed hash databases) destroyed simple hash-based password storage:
```python
import hashlib

# Demonstrating why unsalted hashes are vulnerable
common_passwords = ["password", "123456", "password123", "admin"]
rainbow_table = {}

# Attacker precomputes common password hashes
for pwd in common_passwords:
    rainbow_table[hashlib.sha256(pwd.encode()).hexdigest()] = pwd

# Now they can instantly reverse any matching hash
target_hash = hashlib.sha256("password".encode()).hexdigest()
if target_hash in rainbow_table:
    print(f"Password found: {rainbow_table[target_hash]}")
# Output: Password found: password
```
Modern GPUs can test billions of SHA-256 hashes per second. Argon2, winner of the 2015 Password Hashing Competition, requires substantial memory for each guess, neutralizing the GPU advantage:
```python
import time
import hashlib
import argon2

password = "test_password"

# Time SHA-256 (fast - bad for passwords)
start = time.time()
for _ in range(10000):
    hashlib.sha256(password.encode()).hexdigest()
sha_time = time.time() - start

# Time Argon2 (slow by design - good for passwords)
hasher = argon2.PasswordHasher()
start = time.time()
for _ in range(10):  # Only 10 iterations!
    hasher.hash(password)
argon_time = time.time() - start

print(f"SHA-256 (10,000 hashes): {sha_time:.2f} seconds")
print(f"Argon2 (10 hashes): {argon_time:.2f} seconds")
# SHA-256 is ~10,000x faster per hash - that's the point!
```
Alternative Modern Algorithms
While Argon2 is the gold standard, other algorithms serve specific needs:
```python
import hashlib
import os

import bcrypt  # pip install bcrypt

password = "user_password"

# bcrypt - Battle-tested, widely supported
hashed = bcrypt.hashpw(password.encode(), bcrypt.gensalt(rounds=12))

# scrypt - Memory-hard, used by some cryptocurrencies
hashed = hashlib.scrypt(
    password.encode(),
    salt=os.urandom(16),
    n=16384, r=8, p=1
)

# PBKDF2 - NIST approved, works everywhere
hashed = hashlib.pbkdf2_hmac(
    'sha256',
    password.encode(),
    salt=os.urandom(16),
    iterations=100000
)
```
Blockchain and Hash Collisions
Cryptocurrencies bet billions on hash collision resistance. Bitcoin’s approach is particularly paranoid:
```python
import hashlib

def bitcoin_hash(data):
    """Bitcoin uses double SHA-256 for extra security."""
    first_hash = hashlib.sha256(data.encode()).digest()
    second_hash = hashlib.sha256(first_hash).digest()
    return second_hash.hex()

# Example Bitcoin block header hashing
block_header = "version|prev_block|merkle_root|timestamp|bits|nonce"
block_hash = bitcoin_hash(block_header)
print(f"Bitcoin block hash: {block_hash}")
```
If SHA-256 falls, the implications are staggering:
- Transaction history could be rewritten
- Digital signatures could be forged
- Proof-of-work becomes meaningless
- Billions in value evaporates
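Proof-of-work is where that collision resistance earns its economic teeth: miners search for a nonce whose double-SHA-256 hash falls below a difficulty target. A toy sketch with a trivially low difficulty (real Bitcoin headers are 80 binary bytes and the real target is astronomically harder to hit):

```python
import hashlib

def double_sha256(data: bytes) -> str:
    """Bitcoin-style double SHA-256, hex-encoded."""
    return hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()

def mine(header: str, difficulty: int = 4):
    """Find a nonce whose hash starts with `difficulty` hex zeros."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = double_sha256(f"{header}|{nonce}".encode())
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = mine("version|prev_block|merkle_root|timestamp|bits")
print(f"nonce={nonce}, hash={digest}")
# Each extra hex zero multiplies the expected work by 16. The whole
# scheme rests on SHA-256 behaving like a random function: if collisions
# or preimages became cheap, so would block production.
```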
This is why Ethereum is already planning post-quantum cryptography migrations. NIST’s Post-Quantum Cryptography competition selected four algorithms in 2022, with implementation beginning across critical infrastructure:
```python
import hashlib

# Simulating crypto-agility in smart contracts
class HashAlgorithm:
    SHA256 = 1
    SHA3_256 = 2
    BLAKE3 = 3
    CRYSTALS_DILITHIUM = 4  # Post-quantum signature
    SPHINCS_PLUS = 5        # Stateless post-quantum

class FutureProofContract:
    def __init__(self):
        self.hash_version = HashAlgorithm.SHA256

    def hash_data(self, data):
        if self.hash_version == HashAlgorithm.SHA256:
            return hashlib.sha256(data.encode()).hexdigest()
        elif self.hash_version == HashAlgorithm.SHA3_256:
            return hashlib.sha3_256(data.encode()).hexdigest()
        raise NotImplementedError("algorithm not wired up yet")

    # Ready to upgrade when quantum computers arrive
    def upgrade_hash_algorithm(self, new_version):
        """Allows migration to stronger algorithms."""
        self.hash_version = new_version
```
When Collisions Don’t Matter
Not every use of hashing requires collision resistance:
```python
# Hash table - collisions are expected and handled
class SimpleHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]

    def _hash(self, key):
        # Simple hash - collisions are fine here
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        # Collision handling via chaining
        self.table[index].append((key, value))

    def get(self, key):
        index = self._hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None

# Collisions just mean slight performance degradation
table = SimpleHashTable(size=5)
table.insert("apple", 1)
table.insert("banana", 2)  # Might collide - that's OK!
```
Use cases where collisions are acceptable:
- Hash tables/dictionaries (handled via chaining)
- Cache keys (collision = cache miss, not security breach)
- Load balancing (collision = slightly uneven distribution)
- Data deduplication (extremely rare collision = minor storage inefficiency)
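For the deduplication case, the birthday approximation \(p \approx k^2 / 2^{n+1}\) makes "extremely rare" concrete. A quick check (the one-trillion-file figure is an arbitrary example):

```python
# Birthday approximation: probability of any SHA-256 collision among k
# distinct items is roughly k**2 / 2**257 (i.e. k^2 / (2 * 2^256)).
k = 10 ** 12           # one trillion deduplicated files, as an example
p = k ** 2 / 2 ** 257

print(f"Collision probability: {p:.2e}")  # ~4.3e-54
# A stray cosmic-ray bit flip in RAM is incomparably more likely than
# two distinct files sharing a SHA-256 digest.
```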
Best Practices for 2025
1. Choose the Right Tool
```python
import hashlib
import hmac

import argon2  # pip install argon2-cffi

# File integrity (collision resistance matters)
def file_checksum(filepath):
    sha256_hash = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            sha256_hash.update(chunk)
    return sha256_hash.hexdigest()

# Password storage (use specialized algorithm)
def hash_password(password):
    return argon2.PasswordHasher().hash(password)

# Message authentication (use HMAC)
def create_mac(message, secret_key):
    return hmac.new(
        secret_key.encode(),
        message.encode(),
        hashlib.sha256
    ).hexdigest()

# Content addressing (collision resistance critical)
def content_address(data):
    return hashlib.sha3_256(data.encode()).hexdigest()
```
2. Plan for Crypto-Agility
```python
import hashlib
import json
from enum import Enum

class HashVersion(Enum):
    SHA256_V1 = "sha256_v1"
    SHA3_256_V2 = "sha3_256_v2"
    BLAKE3_V3 = "blake3_v3"

class VersionedHasher:
    def __init__(self):
        self.current_version = HashVersion.SHA3_256_V2

    def hash_with_version(self, data):
        """Store algorithm version with hash."""
        if self.current_version == HashVersion.SHA256_V1:
            hash_value = hashlib.sha256(data.encode()).hexdigest()
        elif self.current_version == HashVersion.SHA3_256_V2:
            hash_value = hashlib.sha3_256(data.encode()).hexdigest()
        return {
            "version": self.current_version.value,
            "hash": hash_value
        }

    def verify(self, data, versioned_hash):
        """Verify using the original algorithm."""
        version = HashVersion(versioned_hash["version"])
        if version == HashVersion.SHA256_V1:
            computed = hashlib.sha256(data.encode()).hexdigest()
        elif version == HashVersion.SHA3_256_V2:
            computed = hashlib.sha3_256(data.encode()).hexdigest()
        return computed == versioned_hash["hash"]

# Usage
hasher = VersionedHasher()
stored_hash = hasher.hash_with_version("important data")
print(json.dumps(stored_hash, indent=2))
```
3. Security Checklist
Never use for security:
- ❌ MD5 (broken since 2004)
- ❌ SHA-1 (broken since 2017)
- ❌ Plain SHA-256 for passwords (too fast)
Current best practices:
- ✅ SHA-256/SHA-3 for file integrity
- ✅ Argon2id for password storage
- ✅ BLAKE3 for performance-critical hashing
- ✅ HMAC for message authentication
- ✅ Store algorithm version with hashes
- ✅ bcrypt/scrypt as Argon2 alternatives
Closing Thoughts
Hash functions don’t produce unique values. They produce values so astronomically unlikely to collide that we can pretend they’re unique. For SHA-256, finding a collision would require more computational power than humanity will likely ever possess. But as history has shown, “unfeasibly difficult” is not “impossible.”
The history of cryptography is littered with “unbreakable” systems that fell to advancing mathematics and computing power. MD5 went from secure to broken in 13 years. SHA-1 lasted 22 years. How long will SHA-256 survive? Quantum computers may accelerate this timeline dramatically.
The lesson isn’t to panic about hash collisions. It’s to design systems that can evolve. Store your algorithm versions. Plan your migration paths. Understand when collision resistance matters and when it doesn’t.
Twenty years from now, someone might be updating this post again, explaining why SHA-256 is broken and why we should have seen it coming. The fundamentals won’t change. Finite outputs for infinite inputs guarantee collisions. But hopefully, we’ll have learned to build systems that gracefully handle the inevitable transition to whatever comes next.
See you in 2045 when we’re explaining why SHA-256 wasn’t quantum-resistant and how we should have known all along.
Further Reading
- SHAttered - The attack that broke SHA-1 with full technical details
- OWASP Password Storage Cheat Sheet - Current best practices for password hashing
- Argon2 Specification - Winner of the Password Hashing Competition
- Bitcoin’s Double SHA-256 - Why Bitcoin hashes twice
- NIST Post-Quantum Cryptography - Selected algorithms and implementation timeline
- My Original 2005 Post - For historical perspective on how much has changed
Try It Yourself
Download the enhanced code samples on GitHub
The repository includes runnable versions with timing comparisons, detailed output, and additional examples that demonstrate each concept in action.