Data Model
Understanding Git's fundamental data model: content-addressable filesystem, objects, and integrity mechanisms.
Content-Addressable Filesystem
What is Content-Addressable Storage?
Git stores all data as objects in a content-addressable filesystem, where:
- Content determines the address - The hash of the content becomes its identifier
- Identical content has same address - Duplicate data is automatically deduplicated
- Immutable objects - Once created, objects cannot be changed
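- Cryptographic integrity - An object whose content no longer matches its hash is detected whenever Git verifies it (for example, during git fsck or when objects are received from a remote)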
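The properties above can be illustrated with a toy store. This is a minimal sketch, not Git's implementation: the class name ContentStore and its methods are hypothetical, and it hashes the raw bytes directly, whereas Git also prepends a type and size header.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the SHA-1 of the data is its key."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"same content")
second = store.put(b"same content")
other = store.put(b"different content")
print(first == second, len(store))  # True 2
```

Because the key is derived from the content, there is no way to "update" an entry in place: changed content simply produces a new key.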
Benefits of Content-Addressable Storage
- Deduplication - Identical files stored only once
- Integrity - Corruption detection through checksums
- Immutability - Objects cannot be modified after creation
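- Efficiency - Two objects (or entire directory trees) are equal when their hashes are equal, so comparison is cheap
- Distribution - Identical content has the same ID in every clone, so objects can be exchanged between repositories without coordination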
Git Objects
The Four Object Types
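Git stores all versioned content as four types of objects:
- Blob - File contents
- Tree - Directory listing (names, modes, and pointers to blobs and subtrees)
- Commit - Project snapshot with metadata
- Tag - Annotated tag: a named reference with metadata that points to another object (usually a commit)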
Object Identification
# Every object has a unique SHA-1 hash
# Example: a1b2c3d4e5f6789012345678901234567890abcd
# Hash calculation
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Same content = same hash (always)
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
Blob Objects
Understanding Blobs
Blobs (Binary Large Objects) store file contents:
# Create blob from file
echo "hello world" > hello.txt
git hash-object -w hello.txt
# Output: 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# View blob content
git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: hello world
# Check object type
git cat-file -t 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: blob
Blob Characteristics
- No filename - Blobs only contain file contents
- No metadata - No timestamps, permissions, or other attributes
- Binary safe - Can store any type of file
- Immutable - Content cannot be changed after creation
Tree Objects
Understanding Trees
Trees represent directory structures and contain:
- File entries - Pointers to blobs with filenames and permissions
- Directory entries - Pointers to other trees
- Metadata - File permissions and types
# View tree object
git cat-file -p tree-hash
# Example tree output (entries are sorted by name; directories have no trailing slash):
# 100644 blob a1b2c3d4... README.md
# 100755 blob 12345678... build.sh
# 040000 tree e5f6789... src
Tree Entry Format
Each tree entry contains:
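- Mode - Entry type and permissions (e.g., 100644 regular file, 100755 executable file, 040000 directory, 120000 symbolic link)
- Type - Object type (blob or tree; commit for submodule entries)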
- Hash - SHA-1 hash of the object
- Name - Filename or directory name
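The entry format above can be made concrete with a short sketch of how a tree is serialized and hashed. It assumes the commonly documented binary layout (mode, space, name, NUL byte, then the 20 raw hash bytes) and SHA-1 object IDs; the function names are hypothetical. Note that the stored mode for a directory is 40000, which git cat-file -p displays as 040000.

```python
import hashlib


def serialize_tree(entries):
    """entries: iterable of (mode, name, hex_object_id) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name; directories compare as if they ended in "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, oid in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body


def tree_id(entries):
    body = serialize_tree(entries)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def parse_tree(body):
    """Inverse of serialize_tree: split the binary body back into entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        oid = body[nul + 1:nul + 21].hex()  # 20 raw bytes for SHA-1
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
example = [("100644", "README.md", EMPTY_BLOB), ("100755", "build.sh", EMPTY_BLOB)]
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
print(parse_tree(serialize_tree(example)))
```

The object type is not stored per entry; tools derive it from the mode, which is why the sketch carries only mode, name, and hash.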
Directory Representation
# Directory structure:
project/
├── README.md
├── src/
│ ├── main.js
│ └── utils.js
└── build.sh
# Becomes tree objects:
# Root tree: points to README.md blob, src tree, build.sh blob
# src tree: points to main.js blob, utils.js blob
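The mapping from a directory layout to tree objects can be sketched as a grouping step: every directory becomes one tree whose entries are either blobs (files) or trees (subdirectories). The helper name build_trees is hypothetical, and hashing is left out to keep the structure visible.

```python
def build_trees(paths):
    """Group file paths into one listing per directory ("" is the root tree)."""
    trees = {"": set()}
    for path in paths:
        parts = path.split("/")
        parent = ""
        for depth, part in enumerate(parts):
            is_file = depth == len(parts) - 1
            kind = "blob" if is_file else "tree"
            trees.setdefault(parent, set()).add((kind, part))
            if not is_file:
                # Descend into the subdirectory, creating its tree if needed.
                parent = f"{parent}/{part}" if parent else part
                trees.setdefault(parent, set())
    return {directory: sorted(entries) for directory, entries in trees.items()}


layout = build_trees(["README.md", "src/main.js", "src/utils.js", "build.sh"])
for directory, entries in layout.items():
    print(directory or "(root)", entries)
```

For the project above this yields two trees: the root (two blobs and one tree entry for src) and src (two blobs).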
Commit Objects
Understanding Commits
Commits are snapshots of the entire project at a specific point in time:
# View commit object
git cat-file -p commit-hash
# Example commit output:
# tree e5f6789...
# parent a1b2c3d...
# author John Doe <john@example.com> 1234567890 +0000
# committer John Doe <john@example.com> 1234567890 +0000
#
# Add user authentication system
Commit Components
- Tree - Root tree object representing project state
- Parent(s) - Previous commit(s) in history
- Author - Who wrote the changes
- Committer - Who applied the changes
- Timestamps - The author date and the committer date, each stored with its timezone offset
- Message - Description of the changes
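To show how these components fit together, here is a minimal sketch that assembles a commit payload, hashes it, and parses it back. It assumes SHA-1 object IDs and the simple header layout shown in the example output above; real commits can carry additional or multi-line headers (such as signatures), which this sketch ignores. All function names are hypothetical.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries


def build_commit(tree, parents, author, committer, message):
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


def commit_id(payload):
    header = b"commit " + str(len(payload)).encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def parse_commit(payload):
    # Headers and message are separated by the first blank line.
    head, _, message = payload.decode().partition("\n\n")
    fields = {"parent": []}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    fields["message"] = message
    return fields


who = "John Doe <john@example.com> 1234567890 +0000"
root = build_commit(EMPTY_TREE, [], who, who, "Initial commit")
child = build_commit(EMPTY_TREE, [commit_id(root)], who, who,
                     "Add user authentication system")
print(commit_id(child))
print(parse_commit(child)["parent"])
```

Since the parent ID is part of the hashed payload, changing any ancestor changes the ID of every descendant commit.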
Commit Relationships
# Linear history
A ← B ← C ← D
# Branching history
        D ← E
      ↗
A ← B ← C ← F
# Merge commit (two parents)
A ← B ← C ←←← G
      ↘      ↗
        D ← E
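The merge diagram can be modeled as a mapping from each commit to its parents. The sketch below assumes that D branches off from B and that G merges C and E; the one-letter IDs stand in for real hashes, and the function name ancestors is hypothetical.

```python
# Each commit maps to the list of its parents.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "G": ["C", "E"],  # merge commit: two parents
}


def ancestors(commit, parents):
    """Return every commit reachable from `commit` by following parent links."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("G", parents)))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(ancestors("C", parents)))  # ['A', 'B']
```

Links only go from child to parent, so history is walked backwards from a commit, never forwards.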
Tag Objects
Understanding Tags
Tags provide named references, usually to commits. An annotated tag is stored as its own object:
# View tag object
git cat-file -p tag-hash
# Example tag output:
# object a1b2c3d...
# type commit
# tag v1.0.0
# tagger John Doe <john@example.com> 1234567890 +0000
#
# Release version 1.0.0
Tag Types
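- Lightweight Tags - A ref that points directly at a commit; no tag object is created
- Annotated Tags - A tag object with tagger, date, and message, which a ref then points to
# Lightweight tag (just a reference)
git tag v1.0.0
# Annotated tag (creates tag object); an alternative to the command above, since a tag name can only be used once
git tag -a v1.0.0 -m "Release version 1.0.0"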
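The difference between the two tag types shows up when a tag is resolved to a commit. The sketch below uses hypothetical in-memory dictionaries in place of refs and the object database, with made-up short IDs; the name v1.0.0-light is likewise invented for illustration.

```python
# Hypothetical stand-ins for the object database and the refs.
objects = {
    "c0ffee": {"type": "commit", "message": "Release commit"},
    "7a9000": {"type": "tag", "object": "c0ffee", "tag": "v1.0.0",
               "message": "Release version 1.0.0"},
}
refs = {
    "refs/tags/v1.0.0-light": "c0ffee",  # lightweight: ref -> commit
    "refs/tags/v1.0.0": "7a9000",        # annotated: ref -> tag object -> commit
}


def peel(ref_name):
    """Follow a tag ref until a non-tag object is reached."""
    oid = refs[ref_name]
    while objects[oid]["type"] == "tag":
        oid = objects[oid]["object"]
    return oid


print(peel("refs/tags/v1.0.0-light"), peel("refs/tags/v1.0.0"))  # c0ffee c0ffee
```

Both tags resolve to the same commit; the annotated one simply passes through an extra object that carries the tagger and message.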
SHA-1 Hashing
Hash Function Properties
SHA-1 provides:
- Deterministic - Same input always produces same output
- Fixed length - Always 160 bits, written as 40 hexadecimal characters
- Avalanche effect - Small changes cause large hash changes
- One-way - Computationally infeasible to reverse
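These properties are easy to observe with Python's standard hashlib module. The helper names below are hypothetical; the snippet hashes plain byte strings rather than Git objects.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


a = sha1_hex(b"Hello, World!")
b = sha1_hex(b"Hello, World?")  # one character changed

print(a == sha1_hex(b"Hello, World!"))  # deterministic: True
print(len(a), len(b))                   # fixed length: 40 40
print(differing_bits(a, b), "of 160 bits differ")
```

For a well-behaved hash function, roughly half of the output bits are expected to flip after a one-character change.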
Hash Calculation
# Git calculates hash from:
# 1. Object type and size
# 2. Null byte separator
# 3. Object content
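# Example for blob (echo appends a newline, so the content is 14 bytes):
# "blob 14\0Hello, World!\n"
# Results in: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
# Verify hash calculation (printf interprets \0 and \n; plain echo does not do so portably)
printf 'blob 14\0Hello, World!\n' | sha1sum
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d  -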
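The same calculation can be sketched in Python, assuming a SHA-1 repository. The function name git_object_id is hypothetical. Note that the size in the header counts every byte of the content, including the trailing newline that echo adds, so "Hello, World!" followed by a newline is 14 bytes.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    # Header is "<type> <size in bytes>" followed by a NUL byte.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("blob", b"Hello, World!\n"))  # same bytes that echo writes
```

The first line prints the well-known ID of the empty blob; the second should agree with what git hash-object reports for the same bytes.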
Hash Collision Handling
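SHA-1 collisions are possible, and the risk differs for accidents and attacks:
- Accidental collisions - Negligible in practice; a generic birthday search needs on the order of 2^80 hash computations
- Deliberate collisions - Demonstrated in 2017 (the SHAttered attack) at a cost of roughly 2^63 SHA-1 computations
- Detection - Since version 2.13, Git uses a collision-detecting SHA-1 implementation by default and rejects input that shows signs of the known attacks
- Migration path - Git supports SHA-256 as an alternative object format
- Practical impact - No collision attack against a real-world Git repository is publicly known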
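How unlikely an accidental collision is can be estimated with the birthday-bound approximation p ≈ n(n-1)/2 / 2^bits, which holds while p is small. This sketch covers random collisions only, not deliberate attacks, and the function name is hypothetical.

```python
def accidental_collision_probability(num_objects: int, hash_bits: int = 160) -> float:
    """Birthday-bound approximation for n randomly distributed hash values."""
    pairs = num_objects * (num_objects - 1) // 2
    return pairs / 2 ** hash_bits


for n in (10**6, 10**9, 10**12):
    print(f"{n:>14} objects: p ~ {accidental_collision_probability(n):.3e}")
```

Even with a billion objects the estimate stays below 1e-30, which is why accidental collisions are not a practical concern for repositories.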
Data Integrity
Integrity Mechanisms
- Content addressing - Hash verifies content integrity
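- Checksum validation - Object hashes are recomputed when objects are received from a remote and when git fsck runs, not necessarily on every read
- Tamper detection - Any modification changes the object's hash, so it is caught the next time the object is verified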
- Corruption recovery - Damaged objects easily identified
Integrity Verification
# Check repository integrity
git fsck --full
# Verify object integrity
git verify-pack -v .git/objects/pack/pack-*.idx
# Check connectivity
git fsck --connectivity-only
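The core of such a check is to recompute each object's hash and compare it with the ID under which the object is stored. The sketch below is a simplified model of that idea, not how git fsck is implemented; the database is a hypothetical in-memory dictionary and SHA-1 IDs are assumed.

```python
import hashlib


def git_object_id(obj_type, content):
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_corrupt(objects):
    """objects maps object ID -> (type, content); return IDs that no longer match."""
    return [oid for oid, (obj_type, content) in objects.items()
            if git_object_id(obj_type, content) != oid]


content = b"important data\n"
oid = git_object_id("blob", content)
database = {oid: ("blob", content)}
print(find_corrupt(database))  # []

database[oid] = ("blob", b"important dat4\n")  # simulate a damaged byte on disk
print(find_corrupt(database) == [oid])  # True
```

The check needs no separate checksum file: the object's name is itself the expected hash.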
Object Storage
Storage Location
# Objects stored in .git/objects/
ls -la .git/objects/
# Hash-based directory structure:
# .git/objects/ab/cdef1234567890... (first 2 chars = dir, rest = file)
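The path of a loose object follows directly from its ID. A minimal sketch, with the hypothetical helper name loose_object_path:

```python
from pathlib import PurePosixPath


def loose_object_path(oid: str, git_dir: str = ".git") -> PurePosixPath:
    # First two hex characters name the directory, the rest names the file.
    return PurePosixPath(git_dir, "objects", oid[:2], oid[2:])


print(loose_object_path("8ab686eafeb1f44702738c8b0f24f2567c36da6d"))
# .git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
```

Splitting on the first two characters spreads loose objects over at most 256 subdirectories. Objects that have been packed no longer exist at this path.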
Storage Formats
- Loose objects - Individual files for each object
- Pack files - Compressed collections of objects
- Multi-pack index - Efficient access to multiple packs
Object Compression
# Loose objects are zlib-compressed
# Pack files use additional delta compression
# Similar objects stored as deltas
# View object counts and on-disk sizes for loose and packed objects
git count-objects -vH
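The loose-object format can be sketched as zlib compression applied to the header plus content. This assumes SHA-1 IDs and covers loose objects only; delta compression inside pack files is a separate mechanism not shown here. The function names are hypothetical.

```python
import hashlib
import zlib


def encode_loose(obj_type: str, content: bytes):
    """Return (object ID, compressed bytes as they would be written to disk)."""
    raw = f"{obj_type} {len(content)}".encode() + b"\0" + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def decode_loose(compressed: bytes):
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


oid, stored = encode_loose("blob", b"Hello, World!\n" * 100)
print(oid, len(stored), "bytes stored for 1400 bytes of content")
print(decode_loose(stored)[0])  # blob
```

The ID is computed over the uncompressed bytes, so the compression level never affects an object's name.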
Object Relationships
Parent-Child Relationships
# Commits point to:
# - Tree (project state)
# - Parent commit(s) (history)
# Trees point to:
# - Blobs (file contents)
# - Other trees (subdirectories)
# Blobs are leaf nodes:
# - Contain only file contents
# - No pointers to other objects
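Reachability
Git's garbage collection is based on reachability, not reference counting:
- Reachable objects - Found by walking from branches, tags, and other refs (plus reflog entries and the index)
- Unreachable objects - Not found by that walk; eligible for deletion
- Garbage collection - git gc prunes unreachable objects once they are older than a grace period (two weeks by default)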
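Deciding which objects can be removed amounts to a graph traversal that starts at the refs. The sketch below models only that traversal: the object IDs and the graph are hypothetical, and real Git also treats reflog entries and the index as starting points and keeps recent unreachable objects for a grace period.

```python
# Hypothetical object graph: each object ID maps to the IDs it points to.
graph = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "orphan_commit": ["orphan_tree"],  # e.g. left behind by a deleted branch
    "orphan_tree": ["blob_a"],
}
refs = {"refs/heads/main": "commit2"}


def reachable(graph, roots):
    """Return every object that can be reached from the given root IDs."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid not in seen:
            seen.add(oid)
            stack.extend(graph[oid])
    return seen


live = reachable(graph, refs.values())
garbage = sorted(set(graph) - live)
print(garbage)  # ['orphan_commit', 'orphan_tree']
```

Note that blob_a survives even though the orphaned tree also points to it: what matters is whether any path from a ref reaches the object, not how many objects point to it.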
Data Model Benefits
Immutability Advantages
- Thread safety - Multiple processes can read safely
- Caching - Objects never change, aggressive caching possible
- Replication - Perfect for distributed systems
- Rollback - Any previous state remains available as long as its commit is still reachable or has not yet been pruned
Efficiency Features
- Deduplication - Identical content stored once
- Delta compression - Similar objects stored efficiently
- Lazy loading - Objects loaded on demand
- Parallel operations - Independent object access
Practical Implications
Development Workflow
# Understanding the data model helps with:
# 1. Repository size management
# 2. Conflict resolution
# 3. History analysis
# 4. Performance optimization
Troubleshooting
# Data model knowledge helps debug:
# - Corruption issues
# - Performance problems
# - Storage efficiency
# - Integrity verification
Advanced Topics
Object Database Optimization
# Repack all objects into a single pack, recomputing deltas
git repack -adf
# Prune unreachable objects past the grace period and repack more thoroughly (slow on large repositories)
git gc --aggressive
# Verify object database
git fsck --full --strict
Content Tracking
# Track content across renames
git log --follow filename.txt
# Search file contents in every commit (slow on large histories)
git grep "search term" $(git rev-list --all)
# Object history analysis
git log --all --full-history -- path/to/file
Security Considerations
Cryptographic Security
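- Hash integrity - Detects accidental corruption and most tampering
- Content verification - A hash confirms that content matches its ID; it does not by itself prove who created it
- Collision resistance - Weakened for SHA-1; collision-detecting SHA-1 and the SHA-256 object format are the mitigations
- Signed commits and tags - GPG (or SSH/X.509) signatures provide authenticity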
Best Practices
- Verify repository integrity - Regular fsck checks
- Use signed commits - For critical repositories
- Monitor object count - Detect unusual growth
- Backup object database - Protect against corruption
Migration Considerations
SHA-256 Transition
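Git supports SHA-256 as an alternative to SHA-1 for object IDs (git init --object-format=sha256, available since Git 2.29):
- Stronger security - 256-bit hash vs 160-bit
- Collision resistance - No practical collision attack is known, and accidental collisions are astronomically unlikely
- Backward compatibility - A repository uses one object format; interoperability between SHA-1 and SHA-256 repositories is still being developed
- Timeline - Gradual adoption across the Git ecosystem, including hosting services and tooling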
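As a rough illustration of what changes for object IDs, the sketch below hashes the same header and content with both algorithms. It assumes that SHA-256 repositories keep the same "type size NUL content" layout and only swap the hash function; the function name object_id is hypothetical.

```python
import hashlib


def object_id(obj_type: str, content: bytes, algorithm: str = "sha1") -> str:
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algorithm, header + content).hexdigest()


sha1_id = object_id("blob", b"Hello, World!\n", "sha1")
sha256_id = object_id("blob", b"Hello, World!\n", "sha256")
print(len(sha1_id), len(sha256_id))  # 40 64
```

Because every tree and commit embeds the IDs of other objects, switching the algorithm changes the ID of every object in a repository, which is why the two formats cannot simply be mixed.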
Summary
Git's data model provides:
- Integrity - Cryptographic verification of all data
- Efficiency - Deduplication and compression
- Immutability - Objects never change once created
- Distribution - Perfect for distributed development
- Simplicity - Four object types handle all data
Understanding this model is crucial for:
- Effective Git usage
- Repository optimization
- Troubleshooting issues
- Advanced Git operations
See Repository Structure for how these objects are organized and Git Internals for deeper implementation details.