Data Model

This page explains Git's fundamental data model: the content-addressable filesystem, the four object types, and the integrity mechanisms built on them.

Content-Addressable Filesystem

What is Content-Addressable Storage?

Git stores all data as objects in a content-addressable filesystem, where:

Content determines the address - The hash of the content becomes its identifier
Identical content has same address - Duplicate data is automatically deduplicated
Immutable objects - Once created, objects cannot be changed
Cryptographic integrity - Corrupted data no longer matches its hash, so corruption is detectable
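
The four properties above can be seen in a toy store. This is a minimal sketch, not Git's implementation: the `ContentStore` class and its method names are hypothetical, and it hashes the raw bytes directly, whereas Git hashes a small header plus the content.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, not Git's code)."""

    def __init__(self):
        self._objects = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        """Store data under the SHA-1 of its content and return that address."""
        address = hashlib.sha1(data).hexdigest()
        # Identical content maps to the same key, so a duplicate is stored once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        """Fetch data and confirm that it still matches its address."""
        data = self._objects[address]
        if hashlib.sha1(data).hexdigest() != address:
            raise ValueError(f"object {address} is corrupt")
        return data

    def __len__(self):
        return len(self._objects)


store = ContentStore()
first = store.put(b"same bytes")
second = store.put(b"same bytes")
other = store.put(b"different bytes")
print(first == second, len(store))  # True 2
```

Storing the same bytes twice returns the same address and leaves one copy in the store, which is the deduplication described above.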

Benefits of Content-Addressable Storage

Deduplication - Identical files stored only once
Integrity - Corruption detection through checksums
Immutability - Objects cannot be modified after creation
Efficiency - Fast comparison and retrieval
Distribution - Perfect for distributed systems

Git Objects

The Four Object Types

Git stores everything as four types of objects:

Blob - File contents
Tree - Directory structure
Commit - Project snapshot with metadata
Tag - Named, annotated reference to another object (usually a commit)

Object Identification

# Every object is identified by the SHA-1 hash of its type, size, and content
# Example: a1b2c3d4e5f6789012345678901234567890abcd

# Hash calculation
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# Same content = same hash (always)
echo "Hello, World!" | git hash-object --stdin
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

Blob Objects

Understanding Blobs

Blobs (Binary Large Objects) store file contents:

# Create blob from file
echo "hello world" > hello.txt
git hash-object -w hello.txt
# Output: 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

# View blob content
git cat-file -p 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: hello world

# Check object type
git cat-file -t 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
# Output: blob

Blob Characteristics

No filename - Blobs only contain file contents
No metadata - No timestamps, permissions, or other attributes
Binary safe - Can store any type of file
Immutable - Content cannot be changed after creation

Tree Objects

Understanding Trees

Trees represent directory structures and contain:

File entries - Pointers to blobs with filenames and permissions
Directory entries - Pointers to other trees
Metadata - A mode per entry that records its type and, for files, the executable bit

# View tree object
git cat-file -p tree-hash

# Example tree output (entries are sorted by name):
# 100644 blob a1b2c3d4... README.md
# 100755 blob 12345678... build.sh
# 040000 tree e5f6789... src

Tree Entry Format

Each tree entry contains:

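Mode - Entry type and executable bit (100644 regular file, 100755 executable file, 040000 directory, 120000 symbolic link, 160000 submodule)
Type - Object type (blob or tree), as displayed by git cat-file -p; it follows from the mode
Hash - SHA-1 hash of the referenced object
Name - Filename or directory name, without any leading path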
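
On disk, a tree body is a sequence of binary entries of the form mode, space, name, NUL byte, then the 20 raw bytes of the SHA-1 hash. The sketch below round-trips that layout; the function names are hypothetical, the entry hashes are just the well-known empty-blob and empty-tree IDs used as placeholders, and the caller is assumed to pass entries already in Git's sort order. Note that the stored directory mode is `40000`; the leading zero in `040000` is added when the tree is printed.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # id of a zero-byte blob
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries


def encode_tree(entries):
    """Serialize (mode, name, hex_hash) tuples into a raw tree body."""
    body = b""
    for mode, name, hex_hash in entries:
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_hash)
    return body


def parse_tree(body):
    """Yield (mode, name, hex_hash) tuples from a raw SHA-1 tree body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        raw_hash = body[nul + 1:nul + 21]  # 20 raw bytes, not 40 hex characters
        yield mode, name, raw_hash.hex()
        pos = nul + 21


entries = [
    ("100644", "README.md", EMPTY_BLOB),
    ("100755", "build.sh", EMPTY_BLOB),
    ("40000", "src", EMPTY_TREE),
]
body = encode_tree(entries)
for mode, name, hex_hash in parse_tree(body):
    print(mode, name, hex_hash)
```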

Directory Representation

# Directory structure:
project/
├── README.md
├── src/
│   ├── main.js
│   └── utils.js
└── build.sh

# Becomes tree objects:
# Root tree: points to README.md blob, src tree, build.sh blob
# src tree: points to main.js blob, utils.js blob

Commit Objects

Understanding Commits

Commits are snapshots of the entire project at a specific point in time:

# View commit object
git cat-file -p commit-hash

# Example commit output:
# tree e5f6789...
# parent a1b2c3d...
# author John Doe <john@example.com> 1234567890 +0000
# committer John Doe <john@example.com> 1234567890 +0000
#
# Add user authentication system

Commit Components

Tree - Root tree object representing project state
Parent(s) - Previous commit(s) in history
Author - Who wrote the changes
Committer - Who applied the changes
Timestamps - Author date and committer date, each stored as seconds since the Unix epoch plus a timezone offset
Message - Description of the changes
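
A commit body is plain text built from these components, and its ID is the hash of that text plus the usual object header. The following is a minimal sketch under that assumption; `build_commit` and `object_id` are hypothetical helper names, and the tree ID used is the well-known empty-tree hash rather than a real project tree.

```python
import hashlib


def build_commit(tree, parents, author, committer, message):
    """Assemble the text of a commit object from its components."""
    lines = [f"tree {tree}"]
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def object_id(obj_type, body):
    """Hash '<type> <size>\\0<body>' the way Git names its objects."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


identity = "John Doe <john@example.com> 1234567890 +0000"
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # empty tree, as a placeholder

root = build_commit(tree, [], identity, identity, "Initial commit")
root_id = object_id("commit", root)

child = build_commit(tree, [root_id], identity, identity,
                     "Add user authentication system")
child_id = object_id("commit", child)

print(root_id)
print(child_id)
```

Because the parent ID is part of the hashed text, changing any ancestor changes the ID of every commit that descends from it.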

Commit Relationships

# Arrows point from each commit to its parent

# Linear history
A ← B ← C ← D

# Branching history (C and D both have B as their parent)
      D ← E
     ↙
A ← B ← C ← F

# Merge commit (G has two parents, C and E)
A ← B ← C ← G
     ↖     ↙
      D ← E
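
Because each commit stores only its parents, history questions become graph walks. The sketch below models the merge diagram above as a plain dictionary (commit names A to G stand in for real hashes) and uses a hypothetical `ancestors` helper to collect everything reachable through parent pointers.

```python
def ancestors(commit, parents):
    """Return the set of commits reachable from `commit` via parent pointers."""
    seen = set()
    stack = list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


# The merge history from the diagram: G merges C and E, which both descend from B.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["D"],
    "G": ["C", "E"],
}

print(sorted(ancestors("G", parents)))  # ['A', 'B', 'C', 'D', 'E']

# Commits shared by both sides of the merge
common = (ancestors("C", parents) | {"C"}) & (ancestors("E", parents) | {"E"})
print(sorted(common))  # ['A', 'B']
```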

Tag Objects

Understanding Tags

Annotated tags are stored as tag objects that name another object, usually a commit:

# View tag object
git cat-file -p tag-hash

# Example tag output:
# object a1b2c3d...
# type commit
# tag v1.0.0
# tagger John Doe <john@example.com> 1234567890 +0000
#
# Release version 1.0.0

Tag Types

Lightweight Tags - Direct reference to commit
Annotated Tags - Tag object with metadata

# Lightweight tag (just a reference)
git tag v1.0.0

# Annotated tag (creates tag object)
git tag -a v1.0.0 -m "Release version 1.0.0"

SHA-1 Hashing

Hash Function Properties

SHA-1 provides:

Deterministic - Same input always produces same output
Fixed length - Always 160 bits, written as 40 hexadecimal characters
Avalanche effect - Small changes cause large hash changes
One-way - Computationally infeasible to reverse

Hash Calculation

# Git calculates the hash from:
# 1. Object type and content size in bytes
# 2. Null byte separator
# 3. Object content

# Example for a blob (echo appends a newline, so the content is 14 bytes):
# "blob 14\0Hello, World!\n"
# Results in: 8ab686eafeb1f44702738c8b0f24f2567c36da6d

# Verify hash calculation (printf interprets \0 and \n; echo -n does not do so portably)
printf 'blob 14\0Hello, World!\n' | sha1sum
# Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d  -
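
The same calculation can be reproduced outside Git. This is a minimal sketch using Python's standard `hashlib`; the function name `git_object_id` is hypothetical.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for the given object type and content."""
    # Header: type, space, content length in bytes, then a NUL separator.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The trailing newline is part of the content, so these two IDs differ.
with_newline = git_object_id("blob", b"Hello, World!\n")
without_newline = git_object_id("blob", b"Hello, World!")
print(with_newline != without_newline)  # True
```

The size in the header is a byte count, which is why the newline that `echo` appends changes both the header and the resulting hash.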

Hash Collision Handling

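While SHA-1 collisions are possible, they are hard to produce:

Accidental collisions - Extremely unlikely; a brute-force search needs about 2^80 hash computations
Crafted collisions - Demonstrated in 2017 (the SHAttered attack) at a cost of roughly 2^63 computations
Git detects known attacks - Since version 2.13, Git uses a collision-detecting SHA-1 implementation by default and rejects inputs that match known attack patterns
Migration path - Git supports SHA-256 as an alternative object format
Practical impact - No successful collision attack on a real-world Git repository is publicly known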

Data Integrity

Integrity Mechanisms

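Content addressing - An object's name is the hash of its content, so the name doubles as a checksum
Checksum validation - Objects are re-hashed and checked by git fsck and when they are received from another repository
Tamper detection - Any modification changes the hash, so the object no longer matches its name or the objects that reference it
Corruption recovery - Damaged objects are easy to identify and can be restored from another clone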

Integrity Verification

# Check repository integrity
git fsck --full

# Verify object integrity
git verify-pack -v .git/objects/pack/pack-*.idx

# Check connectivity
git fsck --connectivity-only

Object Storage

Storage Location

# Objects stored in .git/objects/
ls -la .git/objects/

# Hash-based directory structure:
# .git/objects/ab/cdef1234567890... (first 2 chars = dir, rest = file)

Storage Formats

Loose objects - Individual files for each object
Pack files - Compressed collections of objects
Multi-pack index - Efficient access to multiple packs

Object Compression

# Loose objects are zlib-compressed
# Pack files use additional delta compression
# Similar objects stored as deltas

# View object counts and disk usage (loose vs. packed)
git count-objects -v
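
A loose object is the header plus content, zlib-compressed and written to a path derived from its hash. The sketch below writes and reads one in a temporary directory rather than a real repository; the helper names are hypothetical, and real Git adds details such as temporary files and read-only permissions that are left out here.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Compress '<type> <size>\\0<content>' and store it under its hash."""
    store = f"{obj_type} {len(content)}".encode() + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    # First 2 hex characters name the directory, the remaining 38 the file.
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: Path, object_id: str):
    """Decompress a loose object, verify its hash, and return (type, content)."""
    path = objects_dir / object_id[:2] / object_id[2:]
    store = zlib.decompress(path.read_bytes())
    if hashlib.sha1(store).hexdigest() != object_id:
        raise ValueError(f"object {object_id} does not match its hash")
    header, _, content = store.partition(b"\0")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    oid = write_loose_object(objects_dir, "blob", b"test content\n")
    print(oid)
    print(read_loose_object(objects_dir, oid))
```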

Object Relationships

Parent-Child Relationships

# Commits point to:
# - Tree (project state)
# - Parent commit(s) (history)

# Trees point to:
# - Blobs (file contents)
# - Other trees (subdirectories)

# Blobs are leaf nodes:
# - Contain only file contents
# - No pointers to other objects

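Reachability and Garbage Collection

Git uses reachability analysis, not reference counting, for garbage collection:

Reachable objects - Can be reached by following pointers from a branch, tag, or other ref (including reflog entries)
Unreachable objects - Cannot be reached from any ref, eligible for deletion
Garbage collection - git gc removes unreachable objects once they are older than a grace period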
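
The distinction between reachable and unreachable objects can be sketched as a walk from the refs over the object graph. This is an illustration only: the object names and the `reachable_objects` helper are hypothetical, and it ignores the reflog and grace period that real garbage collection also honors.

```python
def reachable_objects(refs, edges):
    """Return every object ID reachable from the ref targets."""
    seen = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(edges.get(oid, []))
    return seen


# Each object maps to the objects it points to (commit -> tree and parents,
# tree -> blobs and subtrees, blob -> nothing).
edges = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "orphan_commit": ["orphan_tree"],
    "orphan_tree": ["blob_c"],
    "blob_a": [],
    "blob_b": [],
    "blob_c": [],
}
refs = {"refs/heads/main": "commit2"}

live = reachable_objects(refs, edges)
garbage = sorted(set(edges) - live)
print(garbage)  # ['blob_c', 'orphan_commit', 'orphan_tree']
```

Note that `blob_a` is shared by two trees yet needs no counter: it stays alive simply because at least one path from a ref still leads to it.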

Data Model Benefits

Immutability Advantages

Thread safety - Multiple processes can read safely
Caching - Objects never change, aggressive caching possible
Replication - Perfect for distributed systems
Rollback - Previous states always available

Efficiency Features

Deduplication - Identical content stored once
Delta compression - Similar objects stored efficiently
Lazy loading - Objects loaded on demand
Parallel operations - Independent object access

Practical Implications

Development Workflow

# Understanding the data model helps with:
# 1. Repository size management
# 2. Conflict resolution
# 3. History analysis
# 4. Performance optimization

Troubleshooting

# Data model knowledge helps debug:
# - Corruption issues
# - Performance problems
# - Storage efficiency
# - Integrity verification

Advanced Topics

Object Database Optimization

# Repack objects for efficiency
git repack -adf

# Garbage collect unreachable objects and repack more thoroughly (slow)
git gc --aggressive

# Verify object database
git fsck --full --strict

Content Tracking

# Track a file's history across renames
git log --follow -- filename.txt

# Search file contents in every commit (slow on large histories)
git grep "search term" $(git rev-list --all)

# Object history analysis
git log --all --full-history -- path/to/file

Security Considerations

Cryptographic Security

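Hash integrity - Detects accidental corruption and naive tampering
Content verification - A commit hash covers its tree and its entire history, so one trusted hash verifies everything it references
Collision resistance - Weakened for SHA-1; Git mitigates this with collision detection and SHA-256 support
Signed commits - GPG, SSH, or X.509 signatures for authenticity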

Best Practices

Verify repository integrity - Regular fsck checks
Use signed commits - For critical repositories
Monitor object count - Detect unusual growth
Backup object database - Protect against corruption

Migration Considerations

SHA-256 Transition

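Git is adding SHA-256 as an alternative to SHA-1:

Stronger security - 256-bit hash vs 160-bit
Collision resistance - No practical collision attacks are known
Availability - New repositories can opt in with git init --object-format=sha256
Backward compatibility - Existing SHA-1 repositories keep working; interoperability between the two formats is still under development
Timeline - Gradual adoption across the Git ecosystem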
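
To make the difference concrete, the sketch below names the same blob with both hash functions. It assumes that SHA-256 repositories hash the same header-plus-content layout as SHA-1 repositories, only with a different function; the `object_id` helper is hypothetical.

```python
import hashlib


def object_id(obj_type: str, content: bytes, algorithm: str = "sha1") -> str:
    """Hash '<type> <size>\\0<content>' with the chosen algorithm."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algorithm, header + content).hexdigest()


content = b"Hello, World!\n"
sha1_id = object_id("blob", content, "sha1")
sha256_id = object_id("blob", content, "sha256")

print(len(sha1_id), sha1_id)      # 40 hex characters
print(len(sha256_id), sha256_id)  # 64 hex characters
```

The object model itself is unchanged; only the length and strength of the identifiers differ, which is why the two formats cannot share object IDs.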

Summary

Git's data model provides:

Integrity - Cryptographic verification of all data
Efficiency - Deduplication and compression
Immutability - Objects never change once created
Distribution - Perfect for distributed development
Simplicity - Four object types handle all data

Understanding this model is crucial for:

Effective Git usage
Repository optimization
Troubleshooting issues
Advanced Git operations

See Repository Structure for how these objects are organized and Git Internals for deeper implementation details.
