
Using Git as S3

Published:  at  18:23

This post is an educational exploration into the S3 protocol and a practical guide on how one might implement it using Git as the underlying storage mechanism. Please be mindful of the terms of service and code of conduct for any Git hosting provider you choose to use with these concepts; I am not responsible for any actions taken against your accounts.

s3 who?


Repository

Available at ktunprasert/github-as-s3

Introduction

I started this project because I was curious about how difficult it would be to implement something that behaves like the S3 protocol. Since my personal SaaS with 0 users won’t have a lot of data anyway, it should be viable to store the binary data in a private GitHub repository.

PocketBase specifically offers the ability to view the list of backups and restore any one of them. The same could be done with a simple download/upload, but where’s the fun in that?

Why use Git?

Exploring Git as a backend for an S3-compatible interface offers several compelling advantages:

Why as S3?

Emulating the S3 API provides access to a rich ecosystem of existing tools and applications:

Downsides

While promising, this approach isn’t without its challenges:

Outlining the Concepts

Before we get into the nitty-gritty of the API calls, let’s talk about how we can conceptually map S3 ideas to the world of Git.

At a high level, the most natural mapping I found was:

Why Not Other Git Structures?

What if we used different orphan branches within a single repository to represent different “buckets”?

This is a different approach that would allow the GitHub token to be scoped to a single repository, with the server responsible for managing the creation and deletion of branches. If a bucket were deleted, it might still be recoverable as long as you had the local branch. There would also be less overhead in setting up and initialising a repository per bucket (an empty repository has no worktree and must be initialised with an initial commit).

While this might seem simpler from a “single repo to manage” perspective, it quickly runs into issues with fine-grained access control. If someone has access to the repository, they generally have access to all its branches. Managing permissions per-branch in a way that mirrors S3 bucket permissions would require a lot of external tooling and complexity. The risk of “access to one bucket means access to all” was something I wanted to avoid.

Sticking to the “one repository per bucket” model keeps things much tidier from a security and management standpoint, and it aligns better with how S3 typically handles per-bucket access control.
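To make the mapping concrete, here is a rough Go sketch of how I think about it under the one-repository-per-bucket model. The type names, helper and example values are mine for illustration, not taken from the actual implementation.

```go
// A rough sketch of the S3-to-Git mapping under the one-repository-per-bucket
// model. Names and example values here are illustrative only.
package main

import "fmt"

// bucket maps to a dedicated private repository.
type bucket struct {
	Owner string // GitHub user or org that owns the repository
	Repo  string // repository name == bucket name
}

// object maps to a file path inside that repository's default branch.
type object struct {
	Bucket bucket
	Key    string // S3 object key == path within the worktree
}

// cloneURL is the remote the server would clone from and push to for this bucket.
func (b bucket) cloneURL() string {
	return fmt.Sprintf("https://github.com/%s/%s.git", b.Owner, b.Repo)
}

func main() {
	o := object{Bucket: bucket{Owner: "ktunprasert", Repo: "my-bucket"}, Key: "backups/pb_backup.zip"}
	fmt.Println(o.Bucket.cloneURL(), o.Key)
}
```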

Implementation Details

When I started this, I wasn’t sure how much of the S3 API I’d need to implement. It turns out a surprisingly small subset was enough to get rclone and the AWS CLI to play ball for the core tasks of storing and retrieving files. My journey led me to focus on these key S3 API routes:

These were all I needed to implement a basic S3-compatible protocol to store, retrieve and delete files.

As you would expect, the bucket-based routes are all handled using go-github and the rest with go-git.
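To give a feel for that split, here is a sketch of how the routes might be wired up with Go 1.22’s net/http patterns. The handler names and the exact route set are my assumptions about a minimal path-style subset, not the project’s actual router; bucket-level handlers would wrap go-github calls and object-level handlers would wrap go-git.

```go
package main

import (
	"log"
	"net/http"
)

// Hypothetical handlers: bucket-level routes would wrap go-github calls
// (create/delete/list repositories), object-level routes would wrap go-git
// (read and write files in a local clone, then push).
func listBuckets(w http.ResponseWriter, r *http.Request)  {}
func createBucket(w http.ResponseWriter, r *http.Request) {}
func deleteBucket(w http.ResponseWriter, r *http.Request) {}
func listObjects(w http.ResponseWriter, r *http.Request)  {}
func putObject(w http.ResponseWriter, r *http.Request)    {}
func getObject(w http.ResponseWriter, r *http.Request)    {}
func deleteObject(w http.ResponseWriter, r *http.Request) {}

func main() {
	mux := http.NewServeMux()

	// Path-style S3 addressing: the bucket is the first path segment,
	// the object key is everything after it. This route set is an assumed
	// minimal subset, not necessarily the project's exact list.
	mux.HandleFunc("GET /{$}", listBuckets)                   // ListBuckets
	mux.HandleFunc("PUT /{bucket}", createBucket)             // CreateBucket (go-github: create repo)
	mux.HandleFunc("DELETE /{bucket}", deleteBucket)          // DeleteBucket (go-github: delete repo)
	mux.HandleFunc("GET /{bucket}", listObjects)              // ListObjects
	mux.HandleFunc("PUT /{bucket}/{key...}", putObject)       // PutObject (go-git: write, commit, push)
	mux.HandleFunc("GET /{bucket}/{key...}", getObject)       // GetObject (go-git: read from clone)
	mux.HandleFunc("DELETE /{bucket}/{key...}", deleteObject) // DeleteObject (go-git: remove, commit, push)

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```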

Approaches Explored

Figuring out how to handle the Git operations efficiently was an iterative process.

The Naive First Try: Direct and Simple

My first implementation was pretty straightforward: for every S3 operation that modified data (like PutObject), the server would go through these steps (sketched with go-git after the list):

  1. Clone the repository.
  2. Write the file.
  3. Add the file to the worktree.
  4. Commit the changes.
  5. Push to the remote.
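
Here is roughly what those five steps look like with go-git, assuming a GitHub personal access token for auth. The function name, commit author and error handling are mine; treat it as a sketch of the flow rather than the project’s actual code.

```go
// A sketch of the naive per-operation flow described above; names and
// error handling are illustrative, not taken from the project.
package gitstore

import (
	"os"
	"path/filepath"
	"time"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
	githttp "github.com/go-git/go-git/v5/plumbing/transport/http"
)

func naivePutObject(repoURL, token, key string, data []byte) error {
	auth := &githttp.BasicAuth{Username: "git", Password: token} // any non-empty username works with a token

	// 1. Clone the repository into a temp directory.
	dir, err := os.MkdirTemp("", "bucket-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	repo, err := git.PlainClone(dir, false, &git.CloneOptions{URL: repoURL, Auth: auth})
	if err != nil {
		return err
	}

	// 2. Write the file under its object key.
	path := filepath.Join(dir, key)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return err
	}

	// 3. Add the file to the worktree.
	wt, err := repo.Worktree()
	if err != nil {
		return err
	}
	if _, err := wt.Add(key); err != nil {
		return err
	}

	// 4. Commit the change.
	_, err = wt.Commit("put "+key, &git.CommitOptions{
		Author: &object.Signature{Name: "git-as-s3", Email: "s3@example.com", When: time.Now()},
	})
	if err != nil {
		return err
	}

	// 5. Push to the remote.
	return repo.Push(&git.PushOptions{Auth: auth})
}
```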

Simple, right? And it worked… for a single file transfer at a time. But as soon as I tried tools like rclone that perform multiple operations in rapid succession (the typical use case), this approach quickly showed its flaws.

origin/main
    |
    o -- C1 (Remote main)
   / \
  /   \
 /     \
o--A1   o--B1  (Clones with one new commit each)
|       |
HEAD    HEAD
(img1)  (img2)

The main headaches were:

My Saviour: Queue-Based Goroutine Worker

I landed on an approach that felt more human-centric and robust. The idea was to manage a single, local clone of the repository per “bucket” (Git repository) for the server’s runtime.

Here’s the gist:

This way, multiple files can be committed locally in rapid succession, and then pushed all at once. It elegantly solved the concurrency and temp directory issues. This was also where I really came to appreciate Go’s time.Timer for debouncing operations!
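
Below is a minimal sketch of that worker, with hypothetical commitLocally and pushToRemote callbacks standing in for the go-git plumbing. It shows the shape of the idea (one channel and one goroutine per bucket, with a time.Timer debouncing the push), not the project’s actual code.

```go
package gitstore

import (
	"log"
	"time"
)

// job is one pending write for a bucket; the fields are illustrative.
type job struct {
	key  string
	data []byte
}

// runBucketWorker serialises all writes for a single local clone ("bucket")
// and debounces pushes: every local commit resets the timer, and the push
// only fires once no new jobs have arrived for `quiet`.
func runBucketWorker(jobs <-chan job, quiet time.Duration,
	commitLocally func(job) error, pushToRemote func() error) {

	timer := time.NewTimer(quiet)
	if !timer.Stop() {
		<-timer.C // start with the timer idle; nothing to push yet
	}
	dirty := false

	for {
		select {
		case j, ok := <-jobs:
			if !ok { // channel closed: flush whatever is still pending
				if dirty {
					if err := pushToRemote(); err != nil {
						log.Printf("final push failed: %v", err)
					}
				}
				return
			}
			if err := commitLocally(j); err != nil {
				log.Printf("commit %s failed: %v", j.key, err)
				continue
			}
			dirty = true
			// Postpone the push while commits keep arriving.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(quiet)

		case <-timer.C:
			// Quiet period elapsed: push the whole batch in one go.
			// A failed push is simply retried after the next job resets the timer.
			if err := pushToRemote(); err != nil {
				log.Printf("push failed: %v", err)
			} else {
				dirty = false
			}
		}
	}
}
```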

Flow of Operations

We’d likely need to refine this if we were regularly pushing massive amounts of data (GitHub has a 2GB push limit): we could keep track of file sizes and trigger pushes more deliberately. But for typical use cases, this timed approach worked beautifully.


Caveats

Be aware of the following caveats when using this implementation:

Tool         Tested
rclone       cp, sync, delete, mkdir, purge, ls, deletefile, touch, lsd, rmdir
aws s3       mb, cp, ls, rm
pocketbase   creating and restoring backups + deleting files
pocketbase   using as file storage
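
For reference, pointing a standard client at the server is mostly a matter of overriding the endpoint and forcing path-style addressing. Here is what that might look like with the AWS SDK for Go v2; the endpoint, port, bucket name and dummy credentials are assumptions about a local setup, not values from the project.

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/credentials"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	// Region and credentials are placeholders; the server address, port and
	// bucket name below are assumptions about a local setup.
	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithRegion("us-east-1"),
		config.WithCredentialsProvider(credentials.NewStaticCredentialsProvider("dummy", "dummy", "")),
	)
	if err != nil {
		log.Fatal(err)
	}

	client := s3.NewFromConfig(cfg, func(o *s3.Options) {
		o.BaseEndpoint = aws.String("http://localhost:8080")
		o.UsePathStyle = true // path-style addressing: /{bucket}/{key}
	})

	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("my-bucket"),
		Key:    aws.String("hello.txt"),
		Body:   strings.NewReader("hello from git-as-s3"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```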

Future Potentials

This approach opens up several interesting possibilities:

Conclusion

This has been a great learning experience. I’ve discovered the versatility and power of Go’s time.Timer for managing asynchronous operations. More significantly, it’s surprisingly straightforward to implement a functional subset of the S3-compatible API. Ultimately, this project demonstrates yet another innovative, if unconventional, method for leveraging familiar tools like Git for cost-effective file storage.

Acknowledgements


