14 min read
Write your git - Part 2: Blobs

In Part 1, we laid the groundwork for our Git implementation by creating the basic repository structure. We implemented the .gitgo directory setup with all necessary subdirectories (objects, refs, heads) and initialized the HEAD file to track our current branch.

Now let’s dive into how Git actually stores file contents using blobs.

What are Blobs?

Think of blobs as snapshots of your file contents. When you stage a file in Git, its content gets stored as a blob. When storing blobs, Git doesn’t care about the file name or file permissions - a blob stores only the raw content of your file.

For example, if you have a file hello.txt containing:

Hello, World!

When you stage this file, Git will do the following:

  • Take this content
  • Add a header
  • Calculate SHA-1 hash
  • Compress it
  • Store it in the objects directory (the one we created in Part 1)

The actual stored format will look something like this:

blob 13\0Hello, World!

Let’s examine this:

  • We put blob first so we can identify the object type when we need to
  • “13” is the length of the content “Hello, World!”
  • \0 is a null-byte separator
  • The actual content follows
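The whole recipe - header, null byte, content, SHA-1 - fits in a tiny standalone sketch. The hashBlob helper below is just for illustration; we’ll build the real New function later in this post:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hashBlob builds the "blob <length>\x00<content>" record and returns
// its SHA-1 digest as a 40-character hex string.
func hashBlob(content []byte) string {
	header := fmt.Sprintf("blob %d\x00", len(content))
	combined := append([]byte(header), content...)
	sum := sha1.Sum(combined)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashBlob([]byte("Hello, World!")))
}
```

If you have Git installed, `echo -n 'Hello, World!' | git hash-object --stdin` should print the same value, since real Git hashes exactly this header-plus-content record.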

Content-Adressable Storage

Personally, I think one of Git’s cleverest features (there are many) is how it stores these blobs. Instead of using filenames, Git uses the SHA-1 hash of the blob’s content as its identifier. This means:

  1. The same content always gets the same hash
  2. Different content gets different hashes (collisions are astronomically unlikely)
  3. You can find any content if you know its hash
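The first two properties are easy to check directly with crypto/sha1 (sameHash is a throwaway helper for this demo):

```go
package main

import (
	"crypto/sha1"
	"fmt"
)

// sameHash reports whether two byte slices produce the same SHA-1 digest.
// sha1.Sum returns a [20]byte array, which Go compares by value.
func sameHash(a, b []byte) bool {
	return sha1.Sum(a) == sha1.Sum(b)
}

func main() {
	fmt.Println(sameHash([]byte("same content"), []byte("same content")))      // prints true
	fmt.Println(sameHash([]byte("same content"), []byte("different content"))) // prints false
}
```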

Understanding blob storage

Before we dive into code, let’s understand how Git organizes blobs in the objects directory:

.gitgo/objects/
├── e9/
│   └── 65047ad7c7a7de...  # Our "Hello, World!" blob
├── a4/
│   └── 2f243...           # Another blob
└── 8b/
    └── 137891...          # Yet another blob

Now, how did we get those directories, you may ask? SHA-1 always produces a 20-byte hash, written as 40 hexadecimal characters. We use the first two characters as the directory name, and the remaining 38 as the file name.

So if hash of our blob is 07c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e, we should store it like this on the disk:

.gitgo/objects/
└── 07/
    └── c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e

This layout seems strange at first, but it exists for several reasons:

  • Prevents too many files in one directory
  • Makes file lookups faster
  • Keeps the filesystem efficient

Instead of storing thousands of files in a single directory, which would require searching through every file to find one, Git creates 256 possible directories (from 00 to ff) and spreads files among them.

This means when looking for a file, the filesystem only needs to search through a few dozen files in the specific directory rather than thousands in one place.

This is particularly efficient because most filesystems slow down significantly when handling directories with too many files, and smaller directories are easier to cache and search through.

Project structure

Let’s start adding files to our gitgo project!

├── go.mod
└── internal
    ├── blob // NEW PART!
    │   ├── blob.go
    │   └── blob_test.go
    ├── config // from part one
    │   └── config.go
    └── repository // from part one
        ├── repository.go
        └── repository_test.go

Let’s start with some tests which will ensure our blob implementation behaves properly. We will test:

  1. Creating and storing blobs
    • Creating a new blob from content
    • Verifying the hash is valid (40-character SHA-1)
    • Storing the blob in the correct location (based on hash)
    • Validating file exists in objects directory
  2. Reading blob content
    • Retrieving a stored blob using its hash
    • Verifying retrieved content matches original
    • Ensuring data integrity through the write/read cycle

// internal/blob/blob_test.go
package blob

import (
	"bytes"
	"github.com/HalilFocic/gitgo/internal/config"
	"os"
	"path/filepath"
	"testing"
)

func TestBlobOperations(t *testing.T) {
	objectsPath := filepath.Join(config.GitDirName, "objects")
	t.Run("1.2: Create and store blob", func(t *testing.T) {

		content := []byte("test content")

		// Create new blob
		b, err := New(content)
		if err != nil {
			t.Fatalf("Failed to create blob: %v", err)
		}

		// Verify hash format
		if len(b.Hash()) != 40 {
			t.Errorf("Invalid hash length: got %d, want 40", len(b.Hash()))
		}

		// Store the blob
		err = b.Store(objectsPath)
		if err != nil {
			t.Fatalf("Failed to store blob: %v", err)
		}
		defer os.RemoveAll(config.GitDirName)

		// Check if file exists in correct location
		hash := b.Hash()
		objectPath := filepath.Join(objectsPath, hash[:2], hash[2:])
		if _, err := os.Stat(objectPath); os.IsNotExist(err) {
			t.Error("Blob file was not created in correct location")
		}
	})

	t.Run("1.3: Read blob content", func(t *testing.T) {
		content := []byte("test content")
		originalBlob, _ := New(content)
		originalBlob.Store(objectsPath)
		defer os.RemoveAll(config.GitDirName)
		// Read blob back
		readBlob, err := Read(objectsPath, originalBlob.Hash())
		if err != nil {
			t.Fatalf("Failed to read blob: %v", err)
		}
		// Verify content matches
		if !bytes.Equal(readBlob.Content(), content) {
			t.Error("Retrieved content doesn't match original")
		}
	})
}

blob.go

Here is the skeleton for our blob.go file, with the struct and function signatures we’ll fill in below (the placeholder bodies panic until then):

//internal/blob/blob.go
package blob

import (
	"bytes"
	"compress/zlib"
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

type Blob struct {
	hash    string
	content []byte
}

// New creates a new Blob based on byte content
func New(content []byte) (*Blob, error) { panic("not implemented") }

// Hash returns the Blob's hash
func (b *Blob) Hash() string { panic("not implemented") }

// Content returns the content of the Blob
func (b *Blob) Content() []byte { panic("not implemented") }

// Store writes the compressed blob to the objects directory
func (b *Blob) Store(objectsDir string) error { panic("not implemented") }

// Read reads a blob from the objects directory by its hash
func Read(objectsDir, hash string) (*Blob, error) { panic("not implemented") }

Now let’s start with the easy ones to warm up. We will implement Hash and Content, which are quite simple:

func (b *Blob) Hash() string {
	return b.hash
}

func (b *Blob) Content() []byte {
	return b.content
}

Implementing the New Function

The New function is the key part of our blob implementation. Let’s break down what it needs to do:

  1. Create a header containing:
    • The word “blob” (identifies the object type)
    • A space
    • The length of the content
    • A null byte (\0)
  2. Combine this header with the actual content
  3. Calculate a SHA-1 hash of the combined data
  4. Create and return a new Blob struct

Pseudocode

Here is the pseudocode if you want to try implementing it yourself:

# Step 1: Create header
header = format("blob {content_length}\0")

# Step 2: Combine header and content
combined_data = header + content

hash = SHA1(combined_data)
hash_string = convert_to_hex(hash)

return new Blob(
    hash: hash_string,
    content: content
)

Golang implementation

func New(content []byte) (*Blob, error) {
	header := fmt.Sprintf("blob %d%c", len(content), 0)
	combined := append([]byte(header), content...)
	sumResult := sha1.Sum(combined)
	hash := hex.EncodeToString(sumResult[:])
	b := Blob{
		hash:    hash,
		content: content,
	}
	return &b, nil
}

Let’s go through each step and explain what it does:

1. Header creation

header := fmt.Sprintf("blob %d%c", len(content), 0)

This creates our header string. For example, if content is “Hello, World!” (13 bytes):

  • Input: “Hello, World!”
  • Output: “blob 13\0”

2. Combining Header and Content

combined := append([]byte(header), content...)

This joins our header and content. Using the same example:

  • Input: “blob 13\0” + “Hello, World!”
  • Output: “blob 13\0Hello, World!”

3. Hash Calculation

sumResult := sha1.Sum(combined)
hash := hex.EncodeToString(sumResult[:])

This generates our SHA-1 hash we talked about earlier:

  • Takes all the combined bytes from header and content
  • Produces a 20-byte digest, which we convert to a 40-character hex string using the EncodeToString function

4. Blob Creation

b := Blob{
    hash:    hash,
    content: content,
}
return &b, nil

This one is simple: we just create and return the blob that contains:

  • The original, unmodified content
  • The calculated hash

Implementing the Store function

The Store function is responsible for saving our blob in Git’s object format. Here’s what it needs to do:

  1. Extract directory and filename from the hash
    • First two characters for directory
    • Remaining characters for filename
  2. Create the directory if it doesn’t exist
  3. Create a new file for storing the blob
  4. Compress the content using zlib
  5. Write the compressed data to the file

Pseudocode

Here is the pseudocode if you want to try implementing it yourself:

# Step 1: Create paths from hash
directory = first_two_chars_of_hash
filename = remaining_chars_of_hash
directory_path = join(objects_dir, directory)

# Step 2: Create directory structure
create_directory_if_not_exists(directory_path)

# Step 3: Create and open file
file_path = join(directory_path, filename)
file = create_file(file_path)

# Step 4: Setup compression and write
compressor = new_zlib_writer(file)
header = format("blob {content_length}\0")
combined_data = header + content
write_compressed(combined_data)

# Step 5: Cleanup
close_compressor
close_file

NOTE: I used zlib for my Go implementation. If you are using another language, find a library that provides an equivalent compressed writer.

Golang implementation

func (b *Blob) Store(objectsDir string) error {
    directory := b.hash[:2]
    fileName := b.hash[2:]
    directoryPath := filepath.Join(objectsDir, directory)

    err := os.MkdirAll(directoryPath, 0755)
    if err != nil {
        return err
    }

    filePath := filepath.Join(directoryPath, fileName)
    file, err := os.Create(filePath)
    if err != nil {
        return err
    }
    defer file.Close()

    writer := zlib.NewWriter(file)
    defer writer.Close()

    header := fmt.Sprintf("blob %d%c", len(b.content), 0)
    combined := append([]byte(header), b.content...)

    _, err = writer.Write(combined)
    return err
}

Let’s go through each step and explain what it does:

1. Path creation

directory := b.hash[:2]
fileName := b.hash[2:]
directoryPath := filepath.Join(objectsDir, directory)

For example, if our hash is “a1b2c3…”:

  • Directory name: “a1”
  • File name: “b2c3…”
  • Full directory path: “.gitgo/objects/a1”

2. Directory Creation

err := os.MkdirAll(directoryPath, 0755)

This ensures our directory exists:

  • Creates the directory and any missing parent directories
  • Sets permissions to 0755 (rwxr-xr-x)
  • Returns error if something goes wrong

3. File Creation and Setup

filePath := filepath.Join(directoryPath, fileName)
file, err := os.Create(filePath)
defer file.Close()

Here we:

  • Create the full file path by joining directory path and filename
  • Create a new file at that location
  • Use defer to ensure the file gets closed when we’re done

4. Compression and Writing

writer := zlib.NewWriter(file)
defer writer.Close()

header := fmt.Sprintf("blob %d%c", len(b.content), 0)
combined := append([]byte(header), b.content...)

_, err = writer.Write(combined)

This is where we:

  • Set up zlib compression
  • Create the header just like in New
  • Combine header and content
  • Write the compressed data to the file

The end result is a compressed file in our objects directory that:

  • Has a name based on its content hash
  • Contains the compressed header and content
  • Can be retrieved later using just the hash

Implementing Read function

The Read function is the most complex part of our blob implementation. It needs to:

  1. Build the file path from the hash
  2. Open and read the compressed file
  3. Decompress the content
  4. Parse and validate the header
  5. Verify the content matches the hash

Pseudocode

Here is the pseudocode if you want to try implementing it yourself:

# Step 1: Build file path from hash
directory = first_two_chars_of_hash
filename = remaining_chars_of_hash
file_path = join(objects_dir, directory, filename)

# Step 2: Open and decompress file
file = open_file(file_path)
decompressor = new_zlib_reader(file)
content = read_all(decompressor)

# Step 3: Validate header format
if not starts_with(content, "blob "):
    return error("Invalid header")

# Step 4: Parse header
null_index = find_null_byte(content)
if no_null_byte_found:
    return error("Invalid header format")

header = content[0:null_index]
length = parse_length_from_header(header)

# Step 5: Extract and validate content
actual_content = content[null_index + 1:]
if length != len(actual_content):
    return error("Content length mismatch")

# Step 6: Verify hash
verify_hash = calculate_hash(header + actual_content)
if verify_hash != input_hash:
    return error("Hash mismatch")

return new Blob(hash: input_hash, content: actual_content)

Golang implementation

func Read(objectsDir, hash string) (*Blob, error) {
    directory := hash[:2]
    fileName := hash[2:]
    fullFilePath := filepath.Join(objectsDir, directory, fileName)

    file, err := os.OpenFile(fullFilePath, os.O_RDONLY, 0644)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    reader, err := zlib.NewReader(file)
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    content, err := io.ReadAll(reader)
    if err != nil {
        return nil, err
    }

    if !bytes.HasPrefix(content, []byte("blob ")) {
        return nil, fmt.Errorf("Invalid blob header: doesn't start with 'blob'")
    }

    nullIndex := bytes.IndexByte(content, 0)
    if nullIndex == -1 {
        return nil, fmt.Errorf("Invalid blob header: no null byte found")
    }

    header := string(content[:nullIndex])
    var length int
    _, err = fmt.Sscanf(header, "blob %d", &length)
    if err != nil {
        return nil, err
    }

    actualContent := content[nullIndex+1:]
    if len(actualContent) != length {
        return nil, fmt.Errorf("Content length mismatch: expected %d, got %d", length, len(actualContent))
    }

    header = fmt.Sprintf("blob %d%c", len(actualContent), 0)
    combined := append([]byte(header), actualContent...)
    sumResult := sha1.Sum(combined)
    hashResult := hex.EncodeToString(sumResult[:])

    if hashResult != hash {
        return nil, fmt.Errorf("Hash mismatch, expected %s, got %s", hash, hashResult)
    }

    return &Blob{
        hash:    hash,
        content: actualContent,
    }, nil
}

As usual, we will go through each step and explain what it does:

1. File Path Construction and Opening

directory := hash[:2]
fileName := hash[2:]
fullFilePath := filepath.Join(objectsDir, directory, fileName)

file, err := os.OpenFile(fullFilePath, os.O_RDONLY, 0644)

This reconstructs the file location:

  • Takes the first two characters for directory
  • Uses remaining characters for filename
  • Opens the file in read-only mode

2. Decompression Setup

reader, err := zlib.NewReader(file)
content, err := io.ReadAll(reader)

Here we:

  • Create a zlib reader for decompression
  • Read all decompressed content into memory

3. Header Validation

if !bytes.HasPrefix(content, []byte("blob ")) {
    return nil, fmt.Errorf("Invalid blob header: doesn't start with 'blob'")
}

nullIndex := bytes.IndexByte(content, 0)
if nullIndex == -1 {
    return nil, fmt.Errorf("Invalid blob header: no null byte found")
}

We verify that:

  • Content starts with “blob ”
  • Contains a null byte separator
  • Can find where header ends
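The same two checks can be tried on their own (splitHeader is a demo helper mirroring the validation above):

```go
package main

import (
	"bytes"
	"fmt"
)

// splitHeader returns the header before the null byte and reports
// whether the record looks like a valid blob.
func splitHeader(record []byte) (string, bool) {
	if !bytes.HasPrefix(record, []byte("blob ")) {
		return "", false
	}
	i := bytes.IndexByte(record, 0)
	if i == -1 {
		return "", false
	}
	return string(record[:i]), true
}

func main() {
	header, ok := splitHeader([]byte("blob 13\x00Hello, World!"))
	fmt.Println(header, ok) // prints: blob 13 true
}
```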

4. Length Parsing

header := string(content[:nullIndex])
var length int
_, err = fmt.Sscanf(header, "blob %d", &length)

This extracts:

  • The header portion before null byte
  • Parses the content length
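The Sscanf call does the heavy lifting here; a quick standalone sketch (parseBlobHeader is our name for the demo):

```go
package main

import "fmt"

// parseBlobHeader extracts the declared content length from a header
// like "blob 13". Sscanf fails if the input doesn't match the format.
func parseBlobHeader(header string) (int, error) {
	var length int
	_, err := fmt.Sscanf(header, "blob %d", &length)
	return length, err
}

func main() {
	n, err := parseBlobHeader("blob 13")
	fmt.Println(n, err) // prints: 13 <nil>
}
```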

5. Content Validation

actualContent := content[nullIndex+1:]
if len(actualContent) != length {
    return nil, fmt.Errorf("Content length mismatch: expected %d, got %d", length, len(actualContent))
}

We verify that:

  • Content length matches what’s in header
  • Extraction worked correctly
6. Hash Verification

header = fmt.Sprintf("blob %d%c", len(actualContent), 0)
combined := append([]byte(header), actualContent...)
sumResult := sha1.Sum(combined)
hashResult := hex.EncodeToString(sumResult[:])

if hashResult != hash {
    return nil, fmt.Errorf("Hash mismatch, expected %s, got %s", hash, hashResult)
}

Finally, we:

  • Reconstruct the original data
  • Calculate its hash
  • Verify it matches the input hash

This extensive validation ensures:

  • Data integrity through compression/decompression
  • Protection against corrupted files
  • Verification of content authenticity

Testing Our Implementation

Now that we’ve implemented all our blob functionality, let’s verify everything works:

go test ./internal/blob

If you see all tests passing, congratulations! You’ve completed a crucial part of our Git implementation.

What We’ve Built

In this part, we’ve created:

  1. A Blob type that represents Git’s basic storage unit
  2. Functions to:
    • Create new blobs (New)
    • Store them on disk (Store)
    • Retrieve them later (Read)
  3. A content-addressable storage system that:
    • Uses SHA-1 hashing for unique identification
    • Compresses content to save space
    • Validates data integrity

Why This Matters

Our blob implementation is fundamental because:

  • It’s the building block for storing all file contents in Git
  • It introduces content-addressable storage, which is key to Git’s efficiency
  • It demonstrates important concepts like:
    • Content hashing
    • Data compression
    • File system organization
    • Error handling and validation

What’s Next?

In Part 3, we’ll implement the staging area (index):

  • Learn how Git tracks which files are staged for commit
  • Implement the index file format
  • Create functions to add and remove files from the index
  • Build the foundation for the git add and git rm commands

The index will build on our blob implementation, using the blob hashes to track file states and prepare for commits. It acts as the bridge between your working directory and Git’s object storage.


See you in Part 3, where we’ll bring our repository to life by implementing the staging area!