Data Model

Understanding Git's fundamental data model: content-addressable filesystem, objects, and integrity mechanisms.

Content-Addressable Filesystem

What is Content-Addressable Storage?

Git stores all data as objects in a content-addressable filesystem, where:

  • Content determines the address - The hash of the content becomes its identifier
  • Identical content has same address - Duplicate data is automatically deduplicated
  • Immutable objects - Once created, objects cannot be changed
  • Cryptographic integrity - Corruption is detectable because altered content no longer matches its hash

Benefits of Content-Addressable Storage

  1. Deduplication - Identical files stored only once
  2. Integrity - Corruption detection through checksums
  3. Immutability - Objects cannot be modified after creation
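  4. Efficiency - Two objects are identical exactly when their hashes match, so comparing files or whole directories is a hash check
  5. Distribution - Objects can be copied between repositories and verified independently by their hash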
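
Deduplication can be observed directly. The following is a minimal sketch, not taken from any real project: the file names and the throwaway repository are assumptions made for illustration. It writes two files with identical content and shows that Git stores a single object for both.

```shell
# Illustrative sketch: two files with identical content share one object.
repo=$(mktemp -d)            # throwaway repository (assumed location)
cd "$repo"
git init -q .

printf 'same content\n' > a.txt
printf 'same content\n' > b.txt

# -w writes the blob into .git/objects and prints its ID
id_a=$(git hash-object -w a.txt)
id_b=$(git hash-object -w b.txt)
echo "a.txt -> $id_a"
echo "b.txt -> $id_b"

# Count loose object files (two-character directories under .git/objects)
loose_count=$(find .git/objects -type f -path '.git/objects/??/*' | wc -l | tr -d ' ')
echo "loose objects: $loose_count"
```

Both IDs are the same and only one object file exists, because the file name is not part of the hashed content.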

Git Objects

The Four Object Types

Git stores everything as four types of objects:

  1. Blob - File contents
  2. Tree - Directory structure
  3. Commit - Project snapshot with metadata
  4. Tag - Annotated tag: a named pointer to another object (usually a commit) with its own metadata
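
To see all four types side by side, the sketch below creates one commit and one annotated tag in a throwaway repository and lists the type of every object in the database. The identity, file name, and tag name are placeholders chosen for this example.

```shell
# Illustrative sketch: one commit plus one annotated tag yields all four types.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Placeholder identity so the example does not depend on global config
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

echo 'hello world' > hello.txt
git add hello.txt
git commit -q -m 'Add hello.txt'
git tag -a v0.1 -m 'Example tag'

# Print the type of every object in the repository, sorted by type name
types=$(git cat-file --batch-all-objects --batch-check='%(objecttype)' | sort | paste -sd ' ' -)
echo "$types"
```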

Object Identification

# Every object has a unique SHA-1 hash
# Example: a1b2c3d4e5f6789012345678901234567890abcd

# Hash calculation
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# Same content = same hash (always)
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

Blob Objects

Understanding Blobs

Blobs (Binary Large Objects) store file contents:

# Create blob from file
echo "hello world" > hello.txt
git hash-object -w hello.txt
# Output: 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# View blob content
git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: hello world

# Check object type
git cat-file -t 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: blob

Blob Characteristics

  • No filename - Blobs only contain file contents
  • No metadata - No timestamps, permissions, or other attributes
  • Binary safe - Can store any type of file
  • Immutable - Content cannot be changed after creation

Tree Objects

Understanding Trees

Trees represent directory structures and contain:

  • File entries - Pointers to blobs with filenames and permissions
  • Directory entries - Pointers to other trees
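  • Metadata - Entry mode (file type plus the executable bit; Git does not record full Unix permissions)

# View tree object
git cat-file -p tree-hash

# Example tree output (entries are sorted by name; directories have no trailing slash):
# 100644 blob a1b2c3d4... README.md
# 100755 blob 12345678... build.sh
# 040000 tree e5f6789... src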

Tree Entry Format

Each tree entry contains:

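  • Mode - Entry type and permission bits (100644 regular file, 100755 executable file, 120000 symbolic link, 040000 directory, 160000 submodule)
  • Type - Object type (blob or tree, or commit for a submodule entry); it is implied by the mode and shown by git cat-file -p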
  • Hash - SHA-1 hash of the object
  • Name - Filename or directory name

Directory Representation

# Directory structure:
project/
├── README.md
├── src/
│ ├── main.js
│ └── utils.js
└── build.sh

# Becomes tree objects:
# Root tree: points to README.md blob, src tree, build.sh blob
# src tree: points to main.js blob, utils.js blob
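
The mapping from directories to tree objects can be reproduced with a few plumbing commands. This is a sketch under stated assumptions: the file contents are invented, and only the layout follows the example above.

```shell
# Illustrative sketch: rebuild the example layout and inspect its trees.
repo=$(mktemp -d)
cd "$repo"
git init -q .

mkdir src
echo '# Project' > README.md
echo 'console.log("main");' > src/main.js
echo 'console.log("utils");' > src/utils.js
printf '#!/bin/sh\necho build\n' > build.sh
chmod +x build.sh

git add .
root=$(git write-tree)       # writes tree objects for the index, prints the root ID

# Root tree: two blobs and one subtree
git ls-tree "$root"

# The src entry is itself a tree with its own entries
src_type=$(git ls-tree "$root" src | awk '{print $2}')
src_entries=$(git ls-tree "$root:src" | wc -l | tr -d ' ')
echo "src is a $src_type with $src_entries entries"
```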

Commit Objects

Understanding Commits

Commits are snapshots of the entire project at a specific point in time:

# View commit object
git cat-file -p commit-hash

# Example commit output:
# tree e5f6789...
# parent a1b2c3d...
# author John Doe <john@example.com> 1234567890 +0000
# committer John Doe <john@example.com> 1234567890 +0000
#
# Add user authentication system

Commit Components

  1. Tree - Root tree object representing project state
  2. Parent(s) - Previous commit(s) in history
  3. Author - Who wrote the changes
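  4. Committer - Who created the commit object (differs from the author when, for example, someone else applies a patch or rebases the commit)
  5. Timestamps - Author date and committer date, each stored with a timezone offset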
  6. Message - Description of the changes
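
These components can be assembled by hand with plumbing commands, which makes the structure of a commit explicit. A minimal sketch, assuming a throwaway repository and a placeholder identity:

```shell
# Illustrative sketch: build two commits from their parts.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Placeholder identity for author and committer
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

echo 'version 1' > file.txt
git add file.txt
tree1=$(git write-tree)
c1=$(git commit-tree "$tree1" -m 'First commit')             # root commit: no parent

echo 'version 2' > file.txt
git add file.txt
tree2=$(git write-tree)
c2=$(git commit-tree "$tree2" -p "$c1" -m 'Second commit')   # parent is c1

git cat-file -p "$c2"        # prints tree, parent, author, committer, message

c2_tree=$(git rev-parse "$c2^{tree}")
c2_parent=$(git rev-parse "$c2^")
```

Note that git commit-tree does not move any branch; the commits exist only as objects until a ref points at them.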

Commit Relationships

# Linear history
A ← B ← C ← D

# Branching history (side line D ← E assumed to start at B)
      D ← E
     ↙
A ← B ← C ← F

# Merge commit (two parents)
A ← B ← C ←←← G
     ↘      ↗
      D ← E
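
A merge commit is simply a commit that records more than one parent. The sketch below builds a small graph with plumbing commands; the commit letters are borrowed from the diagrams, and the point where the side line starts (B) is an assumption made for this example.

```shell
# Illustrative sketch: a merge commit with two parents.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

tree=$(git write-tree)       # empty index gives the empty tree; content is irrelevant here

a=$(git commit-tree "$tree" -m 'A')
b=$(git commit-tree "$tree" -p "$a" -m 'B')
c=$(git commit-tree "$tree" -p "$b" -m 'C')             # main line
d=$(git commit-tree "$tree" -p "$b" -m 'D')             # side line, also based on B
g=$(git commit-tree "$tree" -p "$c" -p "$d" -m 'G')     # merge: two parents

# Print G followed by the IDs of its parents
parents=$(git rev-list --parents -n 1 "$g")
echo "$parents"
parent_count=$(git cat-file -p "$g" | grep -c '^parent ')
echo "G has $parent_count parents"
```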

Tag Objects

Understanding Tags

Tag objects (annotated tags) attach a name and metadata to another object, almost always a commit:

# View tag object
git cat-file -p tag-hash

# Example tag output:
# object a1b2c3d...
# type commit
# tag v1.0.0
# tagger John Doe <john@example.com> 1234567890 +0000
#
# Release version 1.0.0

Tag Types

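  1. Lightweight Tags - A ref that points directly at a commit; no tag object is created
  2. Annotated Tags - A tag object with tagger, date, message, and optional signature

# Lightweight tag (just a reference)
git tag v1.0.0-rc1

# Annotated tag (creates tag object; the name must differ from any existing tag)
git tag -a v1.0.0 -m "Release version 1.0.0"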
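
The difference between the two kinds of tag shows up when asking Git what each name resolves to. A minimal sketch, with hypothetical tag names and a placeholder identity:

```shell
# Illustrative sketch: what a lightweight and an annotated tag resolve to.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

echo 'release' > notes.txt
git add notes.txt
git commit -q -m 'Initial commit'

git tag light-example                                   # lightweight: only a ref
git tag -a annotated-example -m 'Example annotated tag' # creates a tag object

light_type=$(git cat-file -t light-example)             # resolves straight to the commit
annotated_type=$(git cat-file -t annotated-example)     # resolves to a tag object
echo "lightweight -> $light_type, annotated -> $annotated_type"

# Peel the annotated tag to the commit it points at
peeled=$(git rev-parse 'annotated-example^{commit}')
head=$(git rev-parse HEAD)
```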

SHA-1 Hashing

Hash Function Properties

SHA-1 provides:

  • Deterministic - Same input always produces same output
  • Fixed length - Always 160 bits, written as 40 hexadecimal characters
  • Avalanche effect - Small changes cause large hash changes
  • One-way - Computationally infeasible to reverse

Hash Calculation

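# Git calculates hash from:
# 1. Header: object type, a space, and the content size in bytes (decimal)
# 2. Null byte separator
# 3. Object content

# Example for blob (echo appends a newline, so the content is 14 bytes):
# "blob 14\0Hello, World!\n"
# Results in: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# Verify hash calculation (printf interprets \0 and \n; plain echo -n does not)
printf 'blob 14\0Hello, World!\n' | sha1sum
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d  -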
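
The same calculation works for any file once the size is taken from the file itself. The sketch below assumes a sha1sum binary is available (on macOS, shasum would be the usual substitute) and uses a hypothetical file name. The size counts every byte, including the trailing newline.

```shell
# Illustrative sketch: recompute a blob ID for a file without Git's help.
workdir=$(mktemp -d)
cd "$workdir"
printf 'Hello, World!\n' > greeting.txt        # 14 bytes including the newline

size=$(wc -c < greeting.txt | tr -d ' ')

# Header "blob <size>\0" followed by the raw file bytes
manual=$( { printf 'blob %s\0' "$size"; cat greeting.txt; } | sha1sum | cut -d' ' -f1 )

echo "size:   $size"
echo "manual: $manual"
echo "git:    $(git hash-object greeting.txt)"  # works outside a repository
```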

Hash Collision Handling

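SHA-1 collisions are no longer only theoretical:

  • Generic bound - A brute-force (birthday) collision takes about 2^80 hash computations; the 2017 SHAttered attack produced a real collision with far less work (roughly 2^63)
  • Git mitigations - Since version 2.13, Git uses a collision-detecting SHA-1 implementation by default that rejects inputs matching known attack patterns, and it never overwrites an existing object with a new one that has the same ID
  • Migration path - Git supports SHA-256 repositories (git init --object-format=sha256)
  • Practical impact - No successful collision attack on a real-world Git repository has been publicly reported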

Data Integrity

Integrity Mechanisms

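  1. Content addressing - An object's ID is the hash of its content, so any object can be re-verified at any time
  2. Checksum validation - Objects are hashed when they are written or received, and re-hashed by git fsck
  3. Tamper detection - Any modification changes the hash, so the object no longer matches the ID that other objects reference
  4. Corruption detection - Damaged objects are identified by ID, so they can be restored from another clone or a backup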

Integrity Verification

# Check repository integrity
git fsck --full

# Verify object integrity
git verify-pack -v .git/objects/pack/pack-*.idx

# Check connectivity
git fsck --connectivity-only
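
To see fsck catch a damaged object, the sketch below deliberately overwrites a loose object in a throwaway repository. The file name, identity, and the way the damage is simulated are assumptions for illustration; never do this in a repository that matters.

```shell
# Illustrative sketch: git fsck detects a corrupted loose object.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

echo 'important data' > data.txt
git add data.txt
git commit -q -m 'Add data.txt'

# Locate the loose object file that holds the blob for data.txt
id=$(git rev-parse HEAD:data.txt)
path=".git/objects/$(printf '%s' "$id" | cut -c1-2)/$(printf '%s' "$id" | cut -c3-)"

if git fsck --full >/dev/null 2>&1; then before=clean; else before=corrupt; fi

# Simulate disk corruption: loose objects are read-only, so allow writing first
chmod u+w "$path"
printf 'garbage' > "$path"

if git fsck --full >/dev/null 2>&1; then after=clean; else after=corrupt; fi
echo "before: $before, after: $after"
```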

Object Storage

Storage Location

# Objects stored in .git/objects/
ls -la .git/objects/

# Hash-based directory structure:
# .git/objects/ab/cdef1234567890... (first 2 chars = dir, rest = file)
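
The path of a loose object can be derived from its ID with plain string slicing. A minimal sketch, assuming a fresh throwaway repository in which the object has not been packed:

```shell
# Illustrative sketch: derive the loose object path from an object ID.
repo=$(mktemp -d)
cd "$repo"
git init -q .

id=$(printf 'hello world\n' | git hash-object -w --stdin)
dir=$(printf '%s' "$id" | cut -c1-2)     # first two hex characters
file=$(printf '%s' "$id" | cut -c3-)     # remaining 38 characters
path=".git/objects/$dir/$file"

echo "$id"
echo "$path"
ls -l "$path"
```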

Storage Formats

  1. Loose objects - Individual files for each object
  2. Pack files - Compressed collections of objects
  3. Multi-pack index - Efficient access to multiple packs

Object Compression

# Loose objects are zlib-compressed
# Pack files use additional delta compression
# Similar objects stored as deltas

# View object counts and disk usage (loose and packed)
git count-objects -v
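
The move from loose objects to a pack file is visible in the count-objects output before and after a garbage collection. The sketch below uses a throwaway repository with three small commits; the file name and identity are placeholders.

```shell
# Illustrative sketch: loose objects become packed objects after git gc.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

for i in 1 2 3; do
  echo "line $i" >> log.txt
  git add log.txt
  git commit -q -m "Commit $i"
done

echo 'Before gc:'
git count-objects -v
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc -q                    # packs reachable objects and removes their loose copies

echo 'After gc:'
git count-objects -v
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
```

In the output, count and size describe loose objects, while in-pack and size-pack describe packed ones.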

Object Relationships

Parent-Child Relationships

# Commits point to:
# - Tree (project state)
# - Parent commit(s) (history)

# Trees point to:
# - Blobs (file contents)
# - Other trees (subdirectories)

# Blobs are leaf nodes:
# - Contain only file contents
# - No pointers to other objects

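Reachability and Garbage Collection

Git does not count references. Garbage collection walks the object graph from every ref (plus the reflogs and the index) and keeps whatever that walk reaches:

  • Reachable objects - Found by following branches, tags, reflog entries, or the index
  • Unreachable objects - Not found by that walk, eligible for deletion
  • Garbage collection - git gc packs reachable objects and prunes unreachable ones once they are older than a grace period (two weeks by default)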
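
An unreachable object is easy to create and remove on purpose. In the sketch below, which assumes a throwaway repository and invented contents, a blob is written straight into the object database so that nothing points at it, and git prune then deletes it while the committed blob survives.

```shell
# Illustrative sketch: an unreachable blob is pruned, a reachable one is kept.
repo=$(mktemp -d)
cd "$repo"
git init -q .
export GIT_AUTHOR_NAME='Example User' GIT_AUTHOR_EMAIL='user@example.com'
export GIT_COMMITTER_NAME='Example User' GIT_COMMITTER_EMAIL='user@example.com'

echo 'kept' > kept.txt
git add kept.txt
git commit -q -m 'Reachable commit'

# A blob written directly to the object database, referenced by nothing
orphan=$(printf 'nobody points at me\n' | git hash-object -w --stdin)

git fsck --unreachable       # lists the orphan as an unreachable blob

if git cat-file -e "$orphan" 2>/dev/null; then before=present; else before=gone; fi
git prune --expire=now       # delete unreachable loose objects immediately
if git cat-file -e "$orphan" 2>/dev/null; then after=present; else after=gone; fi

kept_blob=$(git rev-parse HEAD:kept.txt)
if git cat-file -e "$kept_blob"; then kept=present; else kept=gone; fi
echo "orphan: $before -> $after, committed blob: $kept"
```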

Data Model Benefits

Immutability Advantages

  1. Thread safety - Multiple processes can read safely
  2. Caching - Objects never change, aggressive caching possible
  3. Replication - Perfect for distributed systems
  4. Rollback - Previous states remain available for as long as they are reachable

Efficiency Features

  1. Deduplication - Identical content stored once
  2. Delta compression - Similar objects stored efficiently
  3. Lazy loading - Objects loaded on demand
  4. Parallel operations - Independent object access

Practical Implications

Development Workflow

# Understanding the data model helps with:
# 1. Repository size management
# 2. Conflict resolution
# 3. History analysis
# 4. Performance optimization

Troubleshooting

# Data model knowledge helps debug:
# - Corruption issues
# - Performance problems
# - Storage efficiency
# - Integrity verification

Advanced Topics

Object Database Optimization

# Repack objects for efficiency
git repack -adf

# Garbage collect unreachable objects (--aggressive spends extra time recomputing deltas and is rarely needed)
git gc --aggressive

# Verify object database
git fsck --full --strict

Content Tracking

# Track content across renames
git log --follow -- filename.txt

# Search file contents in every commit (slow on large histories)
git grep "search term" $(git rev-list --all)

# Object history analysis
git log --all --full-history -- path/to/file

Security Considerations

Cryptographic Security

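  1. Hash integrity - Detects accidental corruption and casual tampering
  2. Content verification - A commit ID pins the exact content of the entire history behind it; it says nothing about who created that history
  3. Collision resistance - Weakened for SHA-1; Git's collision detection and SHA-256 support address this
  4. Signed commits and tags - GPG, SSH, or X.509 signatures establish authorship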

Best Practices

  1. Verify repository integrity - Regular fsck checks
  2. Use signed commits - For critical repositories
  3. Monitor object count - Detect unusual growth
  4. Backup object database - Protect against corruption

Migration Considerations

SHA-256 Transition

Git is transitioning from SHA-1 to SHA-256:

  • Stronger security - 256-bit hash vs 160-bit
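  • Collision resistance - No practical collision attack on SHA-256 is known
  • Backward compatibility - SHA-1 is still the default in Git 2.x; interoperability between SHA-1 and SHA-256 repositories is still under development
  • Timeline - Gradual adoption across the Git ecosystem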
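
A SHA-256 repository can be tried out locally. This sketch assumes a Git version that accepts --object-format (2.29 or newer) and falls back to a message otherwise; hosting services and other tools may not accept such repositories yet.

```shell
# Illustrative sketch: object IDs in a SHA-256 repository are 64 hex characters.
repo=$(mktemp -d)
cd "$repo"

if git init -q --object-format=sha256 . 2>/dev/null; then
  id=$(printf 'Hello, World!\n' | git hash-object --stdin)
  id_len=${#id}
  echo "SHA-256 object ID ($id_len hex characters): $id"
else
  id_len=unsupported
  echo 'This Git build does not support --object-format=sha256'
fi
```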

Summary

Git's data model provides:

  • Integrity - Cryptographic verification of all data
  • Efficiency - Deduplication and compression
  • Immutability - Objects never change once created
  • Distribution - Perfect for distributed development
  • Simplicity - Four object types handle all data

Understanding this model is crucial for:

  • Effective Git usage
  • Repository optimization
  • Troubleshooting issues
  • Advanced Git operations

See Repository Structure for how these objects are organized and Git Internals for deeper implementation details.