Discover how Git's content-addressable object storage model works under the hood. Learn about blobs, trees, commits, and tags to master Git internals and understand why Git is fundamentally a database.
Introduction
Every day, millions of developers run git commit, git push, and git pull without understanding what's actually happening under the hood. We interact with Git through high-level commands that hide its true nature: a content-addressable filesystem dressed up as a version control tool.
This abstraction works perfectly until it doesn't. When you encounter corruption, need to recover lost data, or want to understand why your repository is 5GB when your project is only 100MB, you're left helpless. The commands you know don't help because they weren't designed for these problems.
Understanding Git as a content-addressable database transforms it from a mysterious black box into a comprehensible system. Instead of memorizing commands, you'll understand the data structures they manipulate. Instead of fearing repository corruption, you'll know how to inspect and repair it. This knowledge unlocks Git's full potential and gives you the confidence to handle any Git-related situation.
In this deep dive, we'll explore Git's object model, from blobs and trees to commits and tags. You'll learn how content addressing works, how to inspect the object database directly, and why this architecture makes Git so powerful. Whether you're debugging a corrupted repository or just curious about what makes Git tick, this guide will give you the mental model you need.
Background: Why Content Addressability?
Before Git, most version control systems used delta-based storage. They stored the differences between file versions, which required calculating complex patch sequences to reconstruct any version. This approach made certain operations expensive and made data integrity verification difficult.
Git's creator, Linus Torvalds, chose a radically different approach: content-addressable storage. Instead of storing changes, Git stores complete snapshots of files. Each file's content, prefixed with a short header giving the object type and size, is hashed using SHA-1, and that hash becomes the object's identifier. This design decision has profound implications:
- Deduplication: Identical content stored once, referenced multiple times
- Integrity: Any corruption immediately detectable (hash won't match)
- Efficiency: Most operations become simple key-value lookups
- Distributed: No central authority needed to assign identifiers
This architecture makes Git more database than version control tool. It's a key-value store where keys are cryptographic hashes and values are compressed data objects.
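To make the key-value idea concrete, here is a minimal sketch that recomputes a blob's key by hand. It assumes GNU `sha1sum` is available (on macOS, `shasum` would be the substitute) and relies on the fact that Git hashes a `blob <size>` header, a NUL byte, and then the content. The variable names are illustrative only.

```shell
#!/bin/bash
# Reproduce a blob ID by hand: SHA-1 over "blob <size>\0<content>".
content='test content'

# Size in bytes of the content, including the trailing newline.
size=$(( $(printf '%s\n' "$content" | wc -c) ))

# Hash header + content ourselves (sha1sum is assumed to be installed).
manual=$( { printf 'blob %d\0' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

# Ask Git for the ID of the same content (nothing is written to disk).
via_git=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "manual:  $manual"
echo "via git: $via_git"
```

Both lines should print the same 40-character ID, which is why any two repositories agree on the key for the same content without coordinating.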
Core Concepts: The Four Git Objects
Git's object database contains four types of objects, each serving a specific purpose. Understanding these objects is fundamental to understanding Git itself.
Blob (Binary Large Object)
Blobs store file contents: nothing more, nothing less. They don't track filenames, permissions, or directory structure. A blob represents the raw bytes of a file at a specific point in time.
# Create a blob from standard input
echo "Hello, World!" | git hash-object -w --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Retrieve blob content
git cat-file -p 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Output: Hello, World!
Notice that the same content produces the same hash, regardless of filename or location. This property enables Git's efficient deduplication.
Tree
Trees represent directory snapshots. They contain references to blobs (files) and other trees (subdirectories), along with each entry's name and mode. Trees are what give structure to your repository.
# Inspect a tree
git cat-file -p HEAD^{tree}
# Output (the tree hash is an illustrative placeholder):
# 100644 blob 8ab686eafeb1f44702738c8b0f24f2567c36da6d README.md
# 040000 tree 0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b src
Each tree entry contains:
- Mode: Entry kind and permission bits (100644 for regular file, 100755 for executable, 040000 for directory, 120000 for symbolic link)
- Type: blob or tree
- Hash: SHA-1 of the referenced object
- Name: Filename or directory name
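The entry fields listed above can be seen by building a tree from scratch. The sketch below works in a throwaway repository created with `mktemp`; the file names (`README.md`, `src/run.sh`) are made up for illustration.

```shell
#!/bin/bash
# Build a tree from the index and list its entries.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

echo "Hello, World!" > README.md
mkdir src
echo 'echo hi' > src/run.sh
chmod +x src/run.sh
git add README.md src/run.sh

# write-tree stores the staged snapshot as tree objects and prints the root ID.
root_tree=$(git write-tree)

# Top level: one blob entry (README.md) and one tree entry (src).
git ls-tree "$root_tree"

# The subdirectory is its own tree object; list what it contains.
git ls-tree "$root_tree" src/
```

Each output line follows the mode, type, hash, name layout, and the `src` entry points at a second tree rather than embedding its files.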
Commit
Commits tie everything together. Each commit contains:
- Tree hash: Reference to the top-level tree (project snapshot)
- Parent hashes: References to preceding commits (none for a root commit, one for an ordinary commit, two or more for a merge)
- Author info: Name, email, timestamp
- Committer info: Name, email, timestamp
- Message: Commit message
# Inspect a commit
git cat-file -p HEAD
# Output (hashes are illustrative placeholders):
# tree 0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b
# parent 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b
# author John Doe <john.doe@example.com> 1735689600 +0000
# committer John Doe <john.doe@example.com> 1735689600 +0000
#
# Update README
Commits form the directed acyclic graph (DAG) that represents your project's history. Each commit points to its parent(s), creating a lineage of snapshots.
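As a rough illustration of how commits link into a graph, the following sketch creates two commits using only plumbing commands (`write-tree` and `commit-tree`) in a temporary repository. The identity and file name are placeholders chosen for the example.

```shell
#!/bin/bash
# Create two commits with plumbing only and follow the parent link.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > notes.txt
git add notes.txt
first=$(git commit-tree "$(git write-tree)" -m "First snapshot")

echo "v2" > notes.txt
git add notes.txt
second=$(git commit-tree "$(git write-tree)" -p "$first" -m "Second snapshot")

# Raw commit object: tree, parent, author, committer, blank line, message.
git cat-file -p "$second"

# The parent recorded in the second commit is the first commit.
parent=$(git rev-parse "$second^")
echo "parent of second: $parent"
```

Note that no branch was updated here: commits exist in the object database independently of the refs that later point at them.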
Tag
Tags provide human-readable references to specific commits, typically for releases. There are two types:
- Lightweight tags: A plain ref pointing at a commit (like a branch that never moves); no tag object is created
- Annotated tags: Complete Git objects stored in the object database, with tagger info, message, and timestamp
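# Lightweight tag (abc1234 stands in for a commit hash)
git tag v1.0.0-rc1 abc1234
# Annotated tag (a different name, since a tag name can only be created once)
git tag -a v1.0.0 -m "Release version 1.0.0"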
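The difference between the two kinds of tag shows up when you ask for the object type behind each name. This is a small sketch in a temporary repository; the tag names are arbitrary.

```shell
#!/bin/bash
# Compare what a lightweight tag and an annotated tag resolve to.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "release" > file.txt
git add file.txt
git commit -q -m "Release commit"

git tag v1.0.0-light                          # lightweight: only a ref
git tag -a v1.0.0 -m "Release version 1.0.0"  # annotated: a tag object

light_type=$(git cat-file -t v1.0.0-light)    # resolves straight to the commit
annot_type=$(git cat-file -t v1.0.0)          # the tag object itself

# Peel the annotated tag to the commit it points at.
peeled=$(git rev-parse 'v1.0.0^{commit}')
head_commit=$(git rev-parse HEAD)

echo "lightweight -> $light_type, annotated -> $annot_type"
```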
Implementation Guide: Exploring the Object Database
Now let's explore the .git/objects directory directly to see how Git stores data.
Step 1: Navigate to the Objects Directory
cd /path/to/your/repository
ls -la .git/objects/
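You'll see subdirectories named with the first 2 characters of each object's SHA-1 hash (up to 256 of them, 00 through ff), plus the info and pack directories. A subdirectory appears only once a loose object needs it. This divides objects into buckets for efficient filesystem access.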
Step 2: Inspect a Specific Object
# List objects reachable from any ref (--all covers every branch and tag)
git rev-list --objects --all | head -10
# Pick an object hash and inspect it
git cat-file -t <hash> # Show type
git cat-file -p <hash> # Show content
git cat-file -s <hash> # Show size
Step 3: Explore Loose vs Packed Objects
# List loose objects (skip pack files and the info directory)
find .git/objects -type f -not -path '*/pack/*' -not -path '*/info/*' | head -10
# Check for pack files
ls -la .git/objects/pack/
Git initially stores objects as "loose" files (one file per object). When loose objects accumulate, Git packs them into a single file with an index for efficient access and transfer.
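One way to watch the loose-to-packed transition is `git count-objects -v`, which reports loose objects as `count` and packed objects as `in-pack`. The sketch below uses a temporary repository with three small commits; the numbers it prints depend on that setup.

```shell
#!/bin/bash
# Watch loose objects move into a pack file.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

for i in 1 2 3; do
  echo "revision $i" > file.txt
  git add file.txt
  git commit -q -m "Revision $i"
done

# Before packing: every blob, tree, and commit is a loose file.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q

# After packing: reachable objects live in the pack instead.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before: $loose_before, loose after: $loose_after, packed: $packed_after"
```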
Step 4: Create Objects Manually
# Hash a file without storing it
echo "test content" | git hash-object --stdin
# Output: d670460b4b4aece5915caf5c68d12f560a9fe3e4
# Store it in the database
echo "test content" | git hash-object -w --stdin
# Output: d670460b4b4aece5915caf5c68d12f560a9fe3e4
# Verify it's stored
ls .git/objects/d6/
# Output: 70460b4b4aece5915caf5c68d12f560a9fe3e4
Code Examples: Production-Ready Git Operations
Listing All Objects
#!/bin/bash
# List all objects with their types and sizes, smallest first
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' --batch-all-objects | sort -k3 -n
Finding Large Blobs
#!/bin/bash
# Find the 10 largest blobs in your repository
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' |
awk '/^blob/ {print}' |
sort -nk3 -r |
head -10
Recovering Lost Content
#!/bin/bash
# Find unreachable objects (dangling commits, blobs)
git fsck --full --unreachable --no-reflogs
# Recover dangling blobs (quote variables so unusual paths cannot break the loop)
for obj in $(git fsck --full --no-reflogs | grep "dangling blob" | awk '{print $3}'); do
  echo "Found dangling blob: $obj"
  git cat-file -p "$obj" > "/tmp/recovered_$obj.txt"
done
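To see where a dangling blob comes from in the first place, this sketch writes content into the object database without ever referencing it, then finds it again with `git fsck`. It runs in a temporary repository, and the content string is invented for the example.

```shell
#!/bin/bash
# Create an unreferenced blob, then locate and read it back.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "tracked" > kept.txt
git add kept.txt
git commit -q -m "Baseline"

# Written to the object database, but no tree or index entry points at it.
lost_id=$(echo "unsaved draft" | git hash-object -w --stdin)

# fsck reports objects that nothing references as "dangling".
found_id=$(git fsck --no-reflogs 2>/dev/null | awk '/^dangling blob/ {print $3}')

recovered=$(git cat-file -p "$found_id")
echo "recovered content: $recovered"
```

The same situation arises when you `git add` a file and then overwrite or unstage it before committing, which is why staged work is often recoverable.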
Checking Object Integrity
#!/bin/bash
# Verify all objects are reachable and valid
git fsck --full --strict
# Check for corrupted objects
git verify-pack -v .git/objects/pack/*.idx
Best Practices
DO: Use Plumbing Commands for Inspection
Git has two command layers:
- Porcelain: High-level user-friendly commands (commit, push, pull)
- Plumbing: Low-level commands for manipulating the database
Use porcelain commands for daily work, but use plumbing commands for inspection and debugging.
# Good: cat-file prints the object exactly as it is stored
git cat-file -p HEAD
# Less suited to deep inspection: show reformats the commit and appends a diff
git show HEAD
DO: Understand Hash Object Content
Remember that Git hashes content, not filenames. The same content in different files produces the same blob hash:
echo "hello" > file1.txt
echo "hello" > file2.txt
git add .
# Both files reference the same blob object!
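A quick way to confirm the shared blob is to read the staged entries back with `git ls-files --stage`. This sketch repeats the two-file setup in a temporary repository and counts how many blob objects were actually written.

```shell
#!/bin/bash
# Two paths, identical content: check that they share one blob.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

echo "hello" > file1.txt
echo "hello" > file2.txt
git add .

# Each line is: <mode> <blob-id> <stage> <path>
git ls-files --stage

id1=$(git ls-files --stage file1.txt | awk '{print $2}')
id2=$(git ls-files --stage file2.txt | awk '{print $2}')

# Count the blob objects in the database: one, despite two paths.
blob_count=$(git cat-file --batch-all-objects --batch-check='%(objecttype)' | grep -c '^blob$')

echo "file1: $id1"
echo "file2: $id2"
echo "blobs stored: $blob_count"
```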
DON'T: Manipulate .git Directory Directly
Never manually modify files in .git/objects. Use Git commands to ensure proper indexing and packing.
DO: Monitor Repository Size
# Check repository size
du -sh .git
# Sum the uncompressed size of all blobs in history
git rev-list --objects --all |
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' |
awk '/^blob/ {sum+=$3} END {print "Total size:", sum/1024/1024 "MB"}'
Common Pitfalls
Pitfall 1: Assuming Filenames Affect Hashes
Problem: Thinking renaming a file changes its blob hash.
Reality: Blobs hash only content. Filenames live in trees, not blobs.
Solution: Understand that renaming is a tree operation, not a blob operation.
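The claim can be checked directly by comparing IDs before and after a rename. This is a sketch in a temporary repository with made-up file names.

```shell
#!/bin/bash
# Rename a file and compare blob and tree IDs before and after.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "same bytes" > old-name.txt
git add old-name.txt
git commit -q -m "Add file"
blob_before=$(git rev-parse HEAD:old-name.txt)
tree_before=$(git rev-parse 'HEAD^{tree}')

git mv old-name.txt new-name.txt
git commit -q -m "Rename file"
blob_after=$(git rev-parse HEAD:new-name.txt)
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "blob: $blob_before -> $blob_after"
echo "tree: $tree_before -> $tree_after"
```

The blob ID is unchanged while the tree ID differs, because only the tree records the name.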
Pitfall 2: Ignoring Pack Files
Problem: Wondering why .git/objects is empty except for the pack/ directory.
Reality: Git packs objects for efficiency. Most objects live in pack files, not as loose files.
Solution: Use git verify-pack to inspect pack contents.
Pitfall 3: Forgetting About Reflogs
Problem: Thinking deleted commits are gone forever.
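Reality: Git records every HEAD movement in the reflog. By default, entries are kept for 90 days, or 30 days when the commit is no longer reachable from the current branch tip.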
Solution: Check git reflog to recover lost commits.
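A recovery along those lines might look like the sketch below, which deliberately "loses" a commit with a hard reset and then brings it back through the reflog. The branch name `recovered` and the file are placeholders.

```shell
#!/bin/bash
# Lose a commit with reset --hard, then recover it via the reflog.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .
git config user.name "Example User"
git config user.email "user@example.com"

echo "one" > f.txt
git add f.txt
git commit -q -m "Keep me"

echo "two" > f.txt
git commit -q -am "About to be lost"
lost=$(git rev-parse HEAD)

git reset -q --hard HEAD~1    # the branch no longer reaches $lost

# The reflog still records where HEAD was before the reset.
git reflog -n 3
previous=$(git rev-parse 'HEAD@{1}')

# Point a new branch at the old position to make it reachable again.
git branch recovered "$previous"
echo "recovered branch points at: $(git rev-parse recovered)"
```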
Performance Considerations
Compression
Git uses zlib compression for all objects. A 1MB text file might compress to 100KB in the object database.
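# Check compression ratio
echo "uncompressed size: $(wc -c < file.txt) bytes"
git add file.txt
# cat-file -s would report the uncompressed size; objectsize:disk is the stored size
git hash-object file.txt | git cat-file --batch-check='%(objectsize) bytes uncompressed, %(objectsize:disk) bytes on disk'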
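For a self-contained measurement, the sketch below generates a deliberately repetitive text file, which is close to a best case for zlib, and compares the logical object size with the bytes Git actually stores. Real files will usually compress less dramatically.

```shell
#!/bin/bash
# Compare an object's logical size with its compressed on-disk size.
repo=$(mktemp -d) && cd "$repo" || exit 1
git init -q .

# Repetitive text: a best case for zlib, chosen to make the effect obvious.
yes "the quick brown fox jumps over the lazy dog" | head -n 2000 > file.txt
git add file.txt

sizes=$(git hash-object file.txt |
  git cat-file --batch-check='%(objectsize) %(objectsize:disk)')
logical=${sizes% *}    # uncompressed object size in bytes
on_disk=${sizes#* }    # bytes occupied in .git/objects

echo "logical: $logical bytes, on disk: $on_disk bytes"
```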
Packing
Git packs loose objects when:
- You run git gc (garbage collection)
- You run git repack (manual packing)
- Loose objects pile up past the gc.auto threshold (6,700 by default), which triggers an automatic git gc
Packing provides:
- Space savings through delta compression
- Faster transfers and more efficient networking (one pack file instead of thousands of loose files)
Shallow Clones
Shallow clones download only recent history, saving bandwidth:
# Clone only last commit
git clone --depth 1 https://github.com/user/repo.git
# Clone last 5 commits
git clone --depth 5 https://github.com/user/repo.git
Trade-off: You can't access full history, which limits some operations.
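The trade-off can be observed locally without a network. The sketch below builds a small origin repository in a temporary directory and clones it with `--depth 1`; it uses a `file://` URL because Git ignores `--depth` for plain local paths. Directory names are arbitrary.

```shell
#!/bin/bash
# Compare history depth in a full repository and a shallow clone of it.
origin=$(mktemp -d)
work=$(mktemp -d)

git init -q "$origin"
git -C "$origin" config user.name "Example User"
git -C "$origin" config user.email "user@example.com"

for i in 1 2 3; do
  echo "revision $i" > "$origin/file.txt"
  git -C "$origin" add file.txt
  git -C "$origin" commit -q -m "Revision $i"
done

# file:// forces the normal transport, so --depth takes effect.
git clone -q --depth 1 "file://$origin" "$work/shallow"

full_count=$(git -C "$origin" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
is_shallow=$(git -C "$work/shallow" rev-parse --is-shallow-repository)

echo "origin commits: $full_count, shallow commits: $shallow_count, shallow: $is_shallow"
```

If the missing history is needed later, `git fetch --unshallow` downloads the rest.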
Real-World Example: Finding Duplicate Content
Let's solve a practical problem: Your repository is bloated with duplicate assets.
#!/bin/bash
# Find duplicate blobs across all branch tips
# Step 1: List every "<hash> <path>" pair reachable from a branch tip.
# (git rev-list --objects prints each object only once, so it cannot reveal
# a blob that appears under several paths; git ls-tree -r lists every path.)
git for-each-ref --format='%(refname)' refs/heads |
while read -r ref; do git ls-tree -r "$ref"; done |
awk -F'\t' '{split($1, meta, " "); print meta[3], $2}' |
sort -u > /tmp/all-blobs.txt
# Step 2: Find duplicates (same hash, different paths)
awk '{print $1}' /tmp/all-blobs.txt |
uniq -d |
while read -r hash; do
echo "Duplicate blob: $hash"
grep "^$hash " /tmp/all-blobs.txt | cut -d' ' -f2-
echo "---"
done
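This script reveals files with identical content. Git already stores each of these blobs only once, so removing the duplicates mainly slims down your working tree and checkouts rather than the object database.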
Next Steps
Further Learning
- Git Internals Book: Read the "Git Internals" chapter of "Pro Git" (Chapter 10 in the second edition)
- Source Code: Study Git's C source code on GitHub
- Plumbing Commands: Master git cat-file, git hash-object, git ls-files
- Git Hooks: Write custom hooks leveraging object inspection
Advanced Topics
- Git Filter Branch: Rewrite history to remove large files (the Git documentation now recommends git filter-repo for this)
- Git LFS: Store large files outside the object database
- Submodules: How Git handles nested repositories
- Worktrees: Multiple working directories sharing one object database
FAQ
Q: Can I manually create Git objects?
Yes, using git hash-object -w. This is useful for understanding or scripting Git operations, but rarely needed for daily work.
Q: What happens if two files have different content but the same hash?
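This is called a collision. Accidental SHA-1 collisions are astronomically unlikely, but SHA-1 is no longer considered collision-resistant: researchers produced a deliberate collision in 2017. Recent Git versions use a hardened SHA-1 implementation that detects the known attack patterns and rejects such objects.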
Q: How does Git handle binary files?
Git treats binary files identically to text files: the content is hashed and compressed. However, already-compressed formats (images, video, archives) gain little from zlib and usually delta poorly between versions, so consider Git LFS for large binaries.
Q: Can I edit the object database directly?
Never manually modify .git/objects. Use Git commands to maintain database integrity. Manual corruption can make your repository unrecoverable.
Q: Why does Git use SHA-1 instead of SHA-256?
Git was created in 2005, when SHA-1 was the standard. The practical SHA-1 collision demonstrated in 2017 prompted Git to add SHA-256 support, but SHA-1 remains the default for backward compatibility.
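If you want to try the SHA-256 format, a minimal sketch looks like this. It assumes a Git version that supports `git init --object-format` (added in Git 2.29); on older versions the `init` call will fail. The repository is created in a temporary directory.

```shell
#!/bin/bash
# Create a SHA-256 repository and look at the length of an object ID.
# Assumption: the installed Git supports --object-format (Git 2.29 or newer).
repo=$(mktemp -d)
git init -q --object-format=sha256 "$repo" || exit 1

id=$(echo "test content" | git -C "$repo" hash-object --stdin)
format=$(git -C "$repo" rev-parse --show-object-format)

echo "format: $format"
echo "id:     $id"
echo "length: ${#id} hex characters"
```

Object IDs in such a repository are 64 hex characters instead of 40, and the repository cannot exchange objects directly with SHA-1 repositories.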
Key Takeaways
- Git stores content, not changes: Every file version is stored as a complete blob, enabling efficient deduplication and integrity verification.
- Hashes are addresses: The SHA-1 hash of content determines its location and identity. Same content = same hash, regardless of filename or location.
- Four object types rule all: Blobs (file content), trees (directory structure), commits (history), and tags (references) form the complete object model.
- Trees provide structure: Unlike filesystems that use paths to locate files, Git uses trees that map filenames to blob hashes, so unchanged files and directories are shared between snapshots.
- Understanding unlocks power: Knowing Git's database structure lets you inspect, debug, and optimize repositories in ways impossible with porcelain commands alone.
- Integrity is automatic: Content addressing means any corruption breaks hash verification when the object is read or checked, so damage cannot go unnoticed.
Git's content-addressable architecture isn't an implementation detail; it's the foundation the rest of Git is built on. By understanding this database model, you gain the ability to work with Git at a deeper level, diagnose complex issues, and appreciate the elegance of Linus Torvalds' design. Whether you're recovering lost data or optimizing a bloated repository, knowing how Git stores and retrieves data transforms it from a mysterious tool into a comprehensible system.