In Part 1, we laid the groundwork for our Git implementation by creating the basic repository structure.
We implemented the .gitgo directory setup with all necessary subdirectories (objects, refs, heads) and initialized the HEAD file to track our current branch.
Now let’s dive into how Git actually stores file contents using blobs.
What are Blobs?
Think of blobs as snapshots of your file contents. When you stage a file in Git, its content gets stored as a blob. When storing blobs, Git cares about neither the file name nor the file permissions: a blob stores only the raw content of your file.
For example, if you have a file hello.txt containing:
Hello, World!
When you stage this file, Git will do the following:
- Take this content
- Add a header
- Calculate SHA-1 hash
- Compress it
- Store it in the objects directory (the one we created in part 1)
The actual stored format will look something like this:
blob 13\0Hello, World!
Let's examine this:
- We put blob first so we can identify the object type when we need to
- "13" is the length in bytes of the content "Hello, World!"
- \0 is the null-byte separator between header and content
- The actual content follows
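If you want to see this format in action before writing any gitgo code, here is a minimal standalone Go sketch (illustrative only, not part of our project) that builds the header and prints the resulting hash:

package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
)

func main() {
    content := []byte("Hello, World!")
    // Header: object type, a space, the content length in bytes, then a null byte
    header := fmt.Sprintf("blob %d\x00", len(content))
    combined := append([]byte(header), content...)
    // SHA-1 produces 20 bytes, which hex-encode to 40 characters
    sum := sha1.Sum(combined)
    fmt.Println(hex.EncodeToString(sum[:]))
}

Run it twice and you get the same hash both times: hashing is deterministic, which is exactly what makes the content-addressable storage described next possible.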
Content-Addressable Storage
Personally, I think one of Git's most clever features (there are many) is how it stores these blobs. Instead of using filenames, Git uses the SHA-1 hash of the blob's content as its identifier. This means:
- The same content always gets the same hash
- Different content always gets different hashes
- You can find any content if you know its hash
Understanding blob storage
Before we dive into code, let's understand how Git organizes blobs in the objects directory:
.gitgo/objects/
├── e9/
│   └── 65047ad7c7a7de... # Our "Hello, World!" blob
├── a4/
│   └── 2f243... # Another blob
└── 8b/
    └── 137891... # Yet another blob
Now, how did we get those directories, you may ask? SHA-1 always produces a 160-bit hash, which encodes to 40 hexadecimal characters. We use the first two characters for the directory name, while the rest (38 characters) becomes the file name.
So if the hash of our blob is 07c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e, we store it on disk like this:
.gitgo/objects/
└── 07/
    └── c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e
This seems strange at first, but it exists for several reasons:
- Prevents too many files in one directory
- Makes file lookups faster
- Keeps the filesystem efficient
Instead of storing thousands of files in a single directory, which would require searching through every file to find one, Git creates 256 possible directories (from 00 to ff) and spreads files among them.
This means that when looking for a file, the filesystem only needs to search through a few dozen files in the specific directory rather than thousands in one place.
This is particularly efficient because most filesystems slow down significantly when handling directories with too many files, and smaller directories are easier to cache and search through.
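To make the mapping concrete, here is a tiny sketch of how a hash becomes an object path (reusing the example hash from above):

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    hash := "07c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e"
    // First two hex characters pick one of the 256 fan-out directories,
    // the remaining 38 characters become the file name.
    objectPath := filepath.Join(".gitgo", "objects", hash[:2], hash[2:])
    // Prints .gitgo/objects/07/c5cd5c1394b8d9e7d359ef0e3e3e7040f7c99e on Unix-like systems
    fmt.Println(objectPath)
}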
Project structure
Let's start adding files to our gitgo project!
├── go.mod
└── internal
    ├── blob // NEW PART!
    │   ├── blob.go
    │   └── blob_test.go
    ├── config // from part one
    │   └── config.go
    └── repository // from part one
        ├── repository.go
        └── repository_test.go
Let's start with some tests that will ensure our blob implementation behaves properly. We will test:
- Creating and storing blobs
  - Creating a new blob from content
  - Verifying the hash is valid (40-character SHA-1)
  - Storing the blob in the correct location (based on its hash)
  - Validating the file exists in the objects directory
- Reading blob content
  - Retrieving a stored blob using its hash
  - Verifying retrieved content matches the original
  - Ensuring data integrity through the write/read cycle
// internal/blob/blob_test.go
package blob

import (
    "bytes"
    "github.com/HalilFocic/gitgo/internal/config"
    "os"
    "path/filepath"
    "testing"
)

func TestBlobOperations(t *testing.T) {
    objectsPath := filepath.Join(config.GitDirName, "objects")

    t.Run("1.2: Create and store blob", func(t *testing.T) {
        content := []byte("test content")

        // Create new blob
        b, err := New(content)
        if err != nil {
            t.Fatalf("Failed to create blob: %v", err)
        }

        // Verify hash format
        if len(b.Hash()) != 40 {
            t.Errorf("Invalid hash length: got %d, want 40", len(b.Hash()))
        }

        // Store the blob
        err = b.Store(objectsPath)
        if err != nil {
            t.Fatalf("Failed to store blob: %v", err)
        }
        defer os.RemoveAll(config.GitDirName)

        // Check if file exists in correct location
        hash := b.Hash()
        objectPath := filepath.Join(objectsPath, hash[:2], hash[2:])
        if _, err := os.Stat(objectPath); os.IsNotExist(err) {
            t.Error("Blob file was not created in correct location")
        }
    })

    t.Run("1.3: Read blob content", func(t *testing.T) {
        content := []byte("test content")
        originalBlob, _ := New(content)
        originalBlob.Store(objectsPath)
        defer os.RemoveAll(config.GitDirName)

        // Read blob back
        readBlob, err := Read(objectsPath, originalBlob.Hash())
        if err != nil {
            t.Fatalf("Failed to read blob: %v", err)
        }

        // Verify content matches
        if !bytes.Equal(readBlob.Content(), content) {
            t.Error("Retrieved content doesn't match original")
        }
    })
}
blob.go
Here is the structure for our blob.go file with struct and function definitions:
// internal/blob/blob.go
package blob

import (
    "bytes"
    "compress/zlib"
    "crypto/sha1"
    "encoding/hex"
    "fmt"
    "io"
    "os"
    "path/filepath"
)

type Blob struct {
    hash    string
    content []byte
}

// New creates a new Blob from byte content
func New(content []byte) (*Blob, error) {}

// Hash returns the blob's hash
func (b *Blob) Hash() string {}

// Content returns the content of the blob
func (b *Blob) Content() []byte {}

// Store writes the compressed blob to the objects directory
func (b *Blob) Store(objectsDir string) error {}

// Read reads a blob from the objects directory by its hash
func Read(objectsDir, hash string) (*Blob, error) {}
Now let's start with the easy ones to warm up. We will implement Hash and Content, which are quite simple:
func (b *Blob) Hash() string {
    return b.hash
}

func (b *Blob) Content() []byte {
    return b.content
}
Implementing the New Function
The New function is the key part of our blob implementation. Let's break down what it needs to do:
- Create a header containing:
  - The word "blob" (identifies the object type)
  - A space
  - The length of the content
  - A null byte (\0)
- Combine this header with the actual content
- Calculate a SHA-1 hash of the combined data
- Create and return a new Blob struct
Pseudocode
Here is the pseudocode if you want to try implementing it yourself:
# Step 1: Create header
header = format("blob {content_length}\0")

# Step 2: Combine header and content
combined_data = header + content

# Step 3: Calculate the SHA-1 hash and hex-encode it
hash = SHA1(combined_data)
hash_string = convert_to_hex(hash)

# Step 4: Create and return the blob
return new Blob(
    hash: hash_string,
    content: content
)
Golang implementation
func New(content []byte) (*Blob, error) {
    header := fmt.Sprintf("blob %d%c", len(content), 0)
    combined := append([]byte(header), content...)
    sumResult := sha1.Sum(combined)
    hash := hex.EncodeToString(sumResult[:])
    b := Blob{
        hash:    hash,
        content: content,
    }
    return &b, nil
}
Let's go through each step and explain what it does:
1. Header creation
header := fmt.Sprintf("blob %d%c", len(content), 0)
This creates our header string. For example, if content is “Hello, World!” (13 bytes):
- Input: “Hello, World!”
- Output: “blob 13\0”
2. Combining Header and Content
combined := append([]byte(header), content...)
This joins our header and content. Using the same example:
- Input: “blob 13\0” + “Hello, World!”
- Output: "blob 13\0Hello, World!"
3. Hash Calculation
sumResult := sha1.Sum(combined)
hash := hex.EncodeToString(sumResult[:])
This generates our SHA-1 hash we talked about earlier:
- Takes all the combined bytes from header and content
- Produces a 40-character hash.
Since sha1.Sum returns its result as a fixed-size byte array, we slice it with sumResult[:] and convert it to a string using the EncodeToString function.
4. Blob Creation
b := Blob{
    hash:    hash,
    content: content,
}
return &b, nil
This one is simple: we just create and return the blob, which contains:
- The original, unmodified content
- The calculated hash
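As a quick sanity check, New can be called from anywhere inside the module. Here is a hypothetical scratch snippet (the import path comes from our go.mod):

package main

import (
    "fmt"

    "github.com/HalilFocic/gitgo/internal/blob"
)

func main() {
    b, err := blob.New([]byte("Hello, World!"))
    if err != nil {
        panic(err)
    }
    fmt.Println(len(b.Hash()))       // 40, the length of a hex-encoded SHA-1
    fmt.Println(string(b.Content())) // Hello, World!
}

Note that New only calculates the hash; nothing touches the disk until we call Store.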
Implementing the Store function
The Store function is responsible for saving our blob in Git's object format. Here's what it needs to do:
- Extract the directory and filename from the hash
  - First two characters for the directory
  - Remaining characters for the filename
- Create the directory if it doesn't exist
- Create a new file for storing the blob
- Compress the header and content using zlib
- Write the compressed data to the file
Pseudocode
Here is the pseudocode if you want to try implementing it yourself:
# Step 1: Create paths from hash
directory = first_two_chars_of_hash
filename = remaining_chars_of_hash
directory_path = join(objects_dir, directory)
# Step 2: Create directory structure
create_directory_if_not_exists(directory_path)
# Step 3: Create and open file
file_path = join(directory_path, filename)
file = create_file(file_path)
# Step 4: Setup compression and write
compressor = new_zlib_writer(file)
header = format("blob {content_length}\0")
combined_data = header + content
write_compressed(combined_data)
# Step 5: Cleanup
close_compressor
close_file
NOTE: I used zlib for my Go implementation. If you are using any other language, please find a writer that does the same (zlib/deflate support exists in most standard libraries).
Golang implementation
func (b *Blob) Store(objectsDir string) error {
    directory := b.hash[:2]
    fileName := b.hash[2:]
    directoryPath := filepath.Join(objectsDir, directory)
    err := os.MkdirAll(directoryPath, 0755)
    if err != nil {
        return err
    }

    filePath := filepath.Join(directoryPath, fileName)
    file, err := os.Create(filePath)
    if err != nil {
        return err
    }
    defer file.Close()

    writer := zlib.NewWriter(file)
    defer writer.Close()

    header := fmt.Sprintf("blob %d%c", len(b.content), 0)
    combined := append([]byte(header), b.content...)
    _, err = writer.Write(combined)
    return err
}
Let’s go through each step and explain what it does:
1. Path creation
directory := b.hash[:2]
fileName := b.hash[2:]
directoryPath := filepath.Join(objectsDir, directory)
For example, if our hash is “a1b2c3…”:
- Directory name: “a1”
- File name: “b2c3…”
- Full directory path: “.gitgo/objects/a1”
2. Directory Creation
err := os.MkdirAll(directoryPath, 0755)
This ensures our directory exists:
- Creates the directory and any missing parent directories
- Sets permissions to 0755 (rwxr-xr-x)
- Returns error if something goes wrong
3. File Creation and Setup
filePath := filepath.Join(directoryPath, fileName)
file, err := os.Create(filePath)
defer file.Close()
Here we:
- Create the full file path by joining directory path and filename
- Create a new file at that location
- Use defer to ensure the file gets closed when we’re done
4. Compression and Writing
writer := zlib.NewWriter(file)
defer writer.Close()
header := fmt.Sprintf("blob %d%c", len(b.content), 0)
combined := append([]byte(header), b.content...)
_, err = writer.Write(combined)
This is where we:
- Set up zlib compression
- Create the header just like in New
- Combine header and content
- Write the compressed data to the file
The end result is a compressed file in our objects directory that:
- Has a name based on its content hash
- Contains the compressed header and content
- Can be retrieved later using just the hash
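Here is a hypothetical snippet showing New and Store together, and where the file ends up on disk:

package main

import (
    "fmt"
    "os"
    "path/filepath"

    "github.com/HalilFocic/gitgo/internal/blob"
)

func main() {
    objectsDir := filepath.Join(".gitgo", "objects")

    b, err := blob.New([]byte("Hello, World!"))
    if err != nil {
        panic(err)
    }
    if err := b.Store(objectsDir); err != nil {
        panic(err)
    }

    // The object lands at .gitgo/objects/<first 2 chars>/<remaining 38 chars>
    path := filepath.Join(objectsDir, b.Hash()[:2], b.Hash()[2:])
    if _, err := os.Stat(path); err == nil {
        fmt.Println("blob stored at", path)
    }
}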
Implementing the Read function
The Read function is the most complex part of our blob implementation. It needs to:
- Build the file path from the hash
- Open and read the compressed file
- Decompress the content
- Parse and validate the header
- Verify the content matches the hash
Pseudocode
Here is the pseudocode if you want to try implementing it yourself:
# Step 1: Build file path from hash
directory = first_two_chars_of_hash
filename = remaining_chars_of_hash
file_path = join(objects_dir, directory, filename)

# Step 2: Open and decompress file
file = open_file(file_path)
decompressor = new_zlib_reader(file)
content = read_all(decompressor)

# Step 3: Validate header format
if not starts_with(content, "blob "):
    return error("Invalid header")

# Step 4: Parse header
null_index = find_null_byte(content)
if no_null_byte_found:
    return error("Invalid header format")
header = content[0:null_index]
length = parse_length_from_header(header)

# Step 5: Extract and validate content
actual_content = content[null_index + 1:]
if length != len(actual_content):
    return error("Content length mismatch")

# Step 6: Verify hash (the header is rebuilt including its null byte)
verify_hash = calculate_hash(header + "\0" + actual_content)
if verify_hash != input_hash:
    return error("Hash mismatch")

return new Blob(hash: input_hash, content: actual_content)
Golang implementation
func Read(objectsDir, hash string) (*Blob, error) {
    directory := hash[:2]
    fileName := hash[2:]
    fullFilePath := filepath.Join(objectsDir, directory, fileName)

    file, err := os.OpenFile(fullFilePath, os.O_RDONLY, 0644)
    if err != nil {
        return nil, err
    }
    defer file.Close()

    reader, err := zlib.NewReader(file)
    if err != nil {
        return nil, err
    }
    defer reader.Close()

    content, err := io.ReadAll(reader)
    if err != nil {
        return nil, err
    }

    if !bytes.HasPrefix(content, []byte("blob ")) {
        return nil, fmt.Errorf("Invalid blob header: doesn't start with 'blob'")
    }
    nullIndex := bytes.IndexByte(content, 0)
    if nullIndex == -1 {
        return nil, fmt.Errorf("Invalid blob header: no null byte found")
    }

    header := string(content[:nullIndex])
    var length int
    _, err = fmt.Sscanf(header, "blob %d", &length)
    if err != nil {
        return nil, err
    }

    actualContent := content[nullIndex+1:]
    if len(actualContent) != length {
        return nil, fmt.Errorf("Content length mismatch: expected %d, got %d", length, len(actualContent))
    }

    header = fmt.Sprintf("blob %d%c", len(actualContent), 0)
    combined := append([]byte(header), actualContent...)
    sumResult := sha1.Sum(combined)
    hashResult := hex.EncodeToString(sumResult[:])
    if hashResult != hash {
        return nil, fmt.Errorf("Hash mismatch, expected %s, got %s", hash, hashResult)
    }

    return &Blob{
        hash:    hash,
        content: actualContent,
    }, nil
}
As usual, we will go through each step and explain what it does:
1. File Path Construction and Opening
directory := hash[:2]
fileName := hash[2:]
fullFilePath := filepath.Join(objectsDir, directory, fileName)
file, err := os.OpenFile(fullFilePath, os.O_RDONLY, 0644)
This reconstructs the file location:
- Takes the first two characters for directory
- Uses remaining characters for filename
- Opens the file in read-only mode
2. Decompression Setup
reader, err := zlib.NewReader(file)
content, err := io.ReadAll(reader)
Here we:
- Create a zlib reader for decompression
- Read all decompressed content into memory
3. Header Validation
if !bytes.HasPrefix(content, []byte("blob ")) {
    return nil, fmt.Errorf("Invalid blob header: doesn't start with 'blob'")
}
nullIndex := bytes.IndexByte(content, 0)
if nullIndex == -1 {
    return nil, fmt.Errorf("Invalid blob header: no null byte found")
}
We verify that:
- Content starts with “blob ”
- Contains a null byte separator
- Can find where header ends
4. Length Parsing
header := string(content[:nullIndex])
var length int
_, err = fmt.Sscanf(header, "blob %d", &length)
This:
- Extracts the header portion before the null byte
- Parses the content length from it
5. Content Validation
actualContent := content[nullIndex+1:]
if len(actualContent) != length {
    return nil, fmt.Errorf("Content length mismatch: expected %d, got %d", length, len(actualContent))
}
We verify that:
- The content length matches what's in the header
- The extraction worked correctly
6. Hash Verification
header = fmt.Sprintf("blob %d%c", len(actualContent), 0)
combined := append([]byte(header), actualContent...)
sumResult := sha1.Sum(combined)
hashResult := hex.EncodeToString(sumResult[:])
if hashResult != hash {
    return nil, fmt.Errorf("Hash mismatch, expected %s, got %s", hash, hashResult)
}
Finally, we:
- Reconstruct the original data
- Calculate its hash
- Verify it matches the input hash
This extensive validation ensures:
- Data integrity through compression/decompression
- Protection against corrupted files
- Verification of content authenticity
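To tie all three functions together, here is a hypothetical end-to-end snippet of the full write/read cycle:

package main

import (
    "bytes"
    "fmt"
    "path/filepath"

    "github.com/HalilFocic/gitgo/internal/blob"
)

func main() {
    objectsDir := filepath.Join(".gitgo", "objects")

    original, err := blob.New([]byte("Hello, World!"))
    if err != nil {
        panic(err)
    }
    if err := original.Store(objectsDir); err != nil {
        panic(err)
    }

    // Read re-checks the header, the length, and the hash before returning
    restored, err := blob.Read(objectsDir, original.Hash())
    if err != nil {
        panic(err)
    }
    fmt.Println(bytes.Equal(restored.Content(), original.Content())) // true
}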
Testing Our Implementation
Now that we’ve implemented all our blob functionality, let’s verify everything works:
go test ./internal/blob
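If everything is wired up correctly, the output should look roughly like this (your timing will differ):

ok      github.com/HalilFocic/gitgo/internal/blob      0.004s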
If you see all tests passing, congratulations! You've completed a crucial part of our Git implementation.
What We've Built
In this part, we've created:
- A Blob type that represents Git's basic storage unit
- Functions to:
  - Create new blobs (New)
  - Store them on disk (Store)
  - Retrieve them later (Read)
- A content-addressable storage system that:
  - Uses SHA-1 hashing for unique identification
  - Compresses content to save space
  - Validates data integrity
Why This Matters
Our blob implementation is fundamental because:
- It’s the building block for storing all file contents in Git
- It introduces content-addressable storage, which is key to Git’s efficiency
- It demonstrates important concepts like:
  - Content hashing
  - Data compression
  - File system organization
  - Error handling and validation
What’s Next?
In Part 3, we’ll implement the staging area (index):
- Learn how Git tracks which files are staged for commit
- Implement the index file format
- Create functions to add and remove files from the index
- Build the foundation for the git add and git rm commands
The index will build on our blob implementation, using the blob hashes to track file states and prepare for commits. It acts as the bridge between your working directory and Git’s object storage.
See you in Part 3, where we’ll bring our repository to life by implementing the staging area!