This post is an educational exploration into the S3 protocol and a practical guide on how one might implement it using Git as the underlying storage mechanism. Please be mindful of the terms of service and code of conduct for any Git hosting provider you choose to use with these concepts; I am not responsible for any actions taken against your accounts.
Repository
Available at ktunprasert/github-as-s3
Introduction
I started this project because I was curious about how difficult it would be to implement something that speaks the S3 protocol. Since my personal SaaS with 0 users won’t have a lot of data anyway, it should be viable to store the binary data in a private GitHub repository.
PocketBase specifically offers the ability to view the list of backed-up data and restore any one of them. This could be done with a simple download/upload as well, but where’s the fun in that?
Why use Git?
Exploring Git as a backend for an S3-compatible interface offers several compelling advantages:
- Cost-Effective: Many Git hosting platforms provide generous free tiers, effectively offering free storage, versioning, and bandwidth for ingress/egress.
- Rich History and Auditing: Git provides detailed logs and diff capabilities (especially for text-based files), giving you a complete history of changes.
- No Egress Fees: Unlike many traditional cloud storage solutions, you typically don’t encounter egress fees with Git hosting providers.
- Host Independence: You’re not locked into a specific S3 provider. Most Git hosting platforms (GitHub, GitLab, Bitbucket, even self-hosted Gitea) can work.
- No Expensive API Calls (usually): Interacting with Git (especially via local clones or even platform APIs for repository management) is often free or falls within generous rate limits for typical use.
Why as S3?
Emulating the S3 API provides access to a rich ecosystem of existing tools and applications:
- Powerful Tooling: Tools like rclone offer robust functionality such as `sync` for keeping directories aligned and `mkdir` for on-demand directory/bucket creation, which integrate seamlessly with an S3-compatible endpoint.
- Application Integration: Many applications, for instance PocketBase, offer built-in S3 backup capabilities, allowing for easy integration.
Downsides
While promising, this approach isn’t without its challenges:
- S3 API Complexity: The S3 protocol is extensive. Achieving full compatibility is beyond the scope of this project. This implementation focuses on a core, minimum viable subset.
- Git’s Design for Source Code, Not Large Binaries: Git was fundamentally designed for managing source code, which typically consists of text files where changes are tracked line by line. It’s not optimised for large binary files. Each version of a binary file is stored in its entirety, which can rapidly bloat the repository size, making clone, pull, and push operations increasingly slow.
- Performance of Git Emulation: Emulating Git operations, particularly with libraries like go-git, can be slower than using the native Git command-line tools.
- Latency: Performance is heavily dependent on the chosen Git hosting platform and network conditions; it’s unlikely to match the speed of dedicated object storage services.
- Storage Limitations: While free tiers are generous, there are limitations to consider. For larger files, you might need to explore solutions like git-lfs, though this can introduce its own complexities and potential costs depending on the platform.
Outlining the Concepts
Before we get into the nitty-gritty of the API calls, let’s talk about how we can conceptually map S3 ideas to the world of Git.
At a high level, the most natural mapping I found was:
- S3 Buckets = Git Repositories: This felt like the most straightforward translation. An S3 bucket is a container for objects, and a Git repository is a container for files and their history. Both can be public or private, and you can grant permissions to control access. This one-to-one mapping makes managing access and separation quite clean, although ACL support is not planned.
- S3 Objects = Files/Paths within a Repository: An object in S3, identified by a key, translates directly to a file at the corresponding path within its Git repository.
- S3 Object Versioning = Git Commits: S3 offers object versioning. Git, by its very nature, versions everything through commits. Each commit is a snapshot, so the commit history inherently provides a version history for every file (object) in the repository (bucket). This also means we always have object versioning - even if you don’t want it. A short code sketch of this mapping follows the list.
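To make that mapping concrete, here is a minimal sketch in Go. The `Bucket` type, the fixed owner, and the example names are illustrative assumptions for this post, not the project’s actual code.

```go
package main

import "fmt"

// Bucket illustrates the mapping: one S3 bucket corresponds to one Git
// repository under a fixed owner (hypothetical type, for illustration only).
type Bucket struct {
	Owner string // GitHub user or organisation
	Name  string // bucket name == repository name
}

// RepoURL returns the Git remote that backs the bucket.
func (b Bucket) RepoURL() string {
	return fmt.Sprintf("https://github.com/%s/%s.git", b.Owner, b.Name)
}

// ObjectPath maps an S3 object key to a path inside the repository.
// S3 keys are already slash-separated, so the key can be used as-is.
func ObjectPath(key string) string { return key }

func main() {
	b := Bucket{Owner: "example-user", Name: "my-bucket"}
	fmt.Println(b.RepoURL(), ObjectPath("backups/2024/db.zip"))
}
```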
Why Not Other Git Structures?
What if we used different orphan branches within a single repository to represent different “buckets”?
- The Orphan Branch Idea: Each branch could be a self-contained “bucket.”
This is a different approach that would allow the GitHub token to be scoped to a single repository. The server would then be responsible for creating and deleting branches. If a bucket were deleted, it might still be recoverable if you had the local branch. There would also be less overhead in setting up a repository and initialising it (an empty repository has no worktree and must be initialised with an initial commit).
While this might seem simpler from a “single repo to manage” perspective, it quickly runs into issues with fine-grained access control. If someone has access to the repository, they generally have access to all its branches. Managing permissions per-branch in a way that mirrors S3 bucket permissions would require a lot of external tooling and complexity. The risk of “access to one bucket means access to all” was something I wanted to avoid.
Sticking to the “one repository per bucket” model keeps things much tidier from a security and management standpoint, aligning better with how S3 bucket permissions are typically handled.
Implementation Details
When I started this, I wasn’t sure how much of the S3 API I’d need to implement. It turns out, a surprisingly small subset was enough to get `rclone` and the AWS CLI to play ball for the core tasks of storing and retrieving files. My journey led me to focus on a handful of key S3 API routes; these were all I needed to implement a basic S3-compatible protocol to store, retrieve, and delete files.
As you would expect, the bucket-level routes are all handled using go-github and the object routes with go-git.
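As a rough illustration of that split, a path-style request router might look something like the sketch below; the handler names and the port are hypothetical, not the project’s actual functions.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// Hypothetical handlers: in this post's design, bucket management routes go
// through go-github (repository management) and object routes through go-git.
func handleBucket(w http.ResponseWriter, r *http.Request, bucket string)      {}
func handleObject(w http.ResponseWriter, r *http.Request, bucket, key string) {}

// route dispatches path-style S3 requests: /<bucket> for bucket-level
// operations, /<bucket>/<key> for object-level operations.
func route(w http.ResponseWriter, r *http.Request) {
	parts := strings.SplitN(strings.TrimPrefix(r.URL.Path, "/"), "/", 2)
	bucket := parts[0]
	if len(parts) == 1 || parts[1] == "" {
		handleBucket(w, r, bucket) // bucket-level calls such as CreateBucket or ListObjects
		return
	}
	handleObject(w, r, bucket, parts[1]) // object-level calls such as GetObject or PutObject
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(route)))
}
```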
Approaches Explored
Figuring out how to handle the Git operations efficiently was an iterative process.
The Naive First Try: Direct and Simple
My first implementation was pretty straightforward: for every S3 operation that modified data (like `PutObject`), the server would run through the following steps (sketched in code after the list):
- Clone the repository.
- Write the file.
- Add the file to the worktree.
- Commit the changes.
- Push to the remote.
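A minimal sketch of that per-request flow using go-git might look like this; the repository URL, token handling, commit author, and object key are placeholders, not the project’s actual code.

```go
package main

import (
	"log"
	"os"
	"path/filepath"
	"time"

	git "github.com/go-git/go-git/v5"
	"github.com/go-git/go-git/v5/plumbing/object"
	githttp "github.com/go-git/go-git/v5/plumbing/transport/http"
)

// naivePut clones the bucket repository, writes one object, commits, and pushes.
func naivePut(repoURL, token, key string, data []byte) error {
	dir, err := os.MkdirTemp("", "bucket-*")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir) // every request pays for a full clone in temp space

	auth := &githttp.BasicAuth{Username: "git", Password: token}

	// 1. Clone the repository.
	repo, err := git.PlainClone(dir, false, &git.CloneOptions{URL: repoURL, Auth: auth})
	if err != nil {
		return err
	}

	// 2. Write the file at the object's path.
	path := filepath.Join(dir, key)
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return err
	}

	// 3. Add the file to the worktree and 4. commit the change.
	wt, err := repo.Worktree()
	if err != nil {
		return err
	}
	if _, err := wt.Add(key); err != nil {
		return err
	}
	_, err = wt.Commit("PutObject "+key, &git.CommitOptions{
		Author: &object.Signature{Name: "s3-server", Email: "s3@example.com", When: time.Now()},
	})
	if err != nil {
		return err
	}

	// 5. Push to the remote.
	return repo.Push(&git.PushOptions{Auth: auth})
}

func main() {
	err := naivePut("https://github.com/owner/bucket.git", os.Getenv("GITHUB_TOKEN"),
		"backups/example.txt", []byte("hello"))
	if err != nil {
		log.Fatal(err)
	}
}
```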
Simple, right? And it worked… for a single file transfer at a time. But as soon as I tried tools like `rclone` that perform multiple operations rapidly (typical use case), this approach quickly showed its flaws.
```
            origin/main
                |
                o -- C1   (Remote main)
               / \
              /   \
             /     \
        o--A1       o--B1   (Clones with one new commit each)
          |           |
         HEAD        HEAD
        (img1)      (img2)
```
The main headaches were:
- Temp Directory Bloat: Each operation cloning the repo meant if a repo was 100MB, three quick operations would use 300MB of temp space. Not scalable at all.
- Concurrency Nightmares: If `rclone` sent, say, three `PutObject` requests in quick succession, each would try to clone. The second clone wouldn’t see the changes from the first (which might not have even pushed yet), leading to incorrect states and potential conflicts. Handling force pushes or merges in this scenario felt like a rabbit hole I was glad to avoid.
My Saviour: Queue-Based Goroutine Worker
I landed on an approach that felt more human-centric and robust. The idea was to manage a single, local clone of the repository per “bucket” (Git repository) for the server’s runtime.
Here’s the gist:
- A dedicated, asynchronous worker is responsible for all Git push operations for a given repository.
- When a change operation (like `PutObject` or `DeleteObject`) comes in, the file is written/deleted in the local clone, and the changes are committed locally.
- Instead of pushing immediately, a signal is sent to this worker. The worker uses a `time.Timer` that resets with each new signal.
- If multiple changes come in quick succession, the timer keeps resetting. Once the changes stop for a defined timeout period, the worker performs a single `git push`.
- After the `git push`, the timer is stopped and its channel drained so that it always blocks, leaving the goroutine waiting for the next change operation(s).
This way, multiple files can be committed locally in rapid succession, and then pushed all at once. It elegantly solved the concurrency and temp directory issues. This was also where I really came to appreciate Go’s `time.Timer` for debouncing operations!
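Here is a minimal, self-contained sketch of that debounce pattern with `time.Timer`; the worker, channel, and push function are illustrative stand-ins, not the project’s actual code.

```go
package main

import (
	"log"
	"time"
)

// pushWorker debounces pushes: every signal restarts the timer, and a push
// only happens once no new signals have arrived for `delay`.
func pushWorker(signals <-chan struct{}, delay time.Duration, push func() error) {
	timer := time.NewTimer(delay)
	if !timer.Stop() {
		<-timer.C // drain so the channel blocks until the first signal arrives
	}

	for {
		select {
		case <-signals:
			// A new local commit landed: stop and drain the timer if needed,
			// then restart the countdown.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(delay)
		case <-timer.C:
			// Quiet period elapsed: push everything committed since the last push.
			if err := push(); err != nil {
				log.Printf("push failed: %v", err)
			}
			// The timer has fired, so its channel is empty and blocks again
			// until the next signal resets it.
		}
	}
}

func main() {
	signals := make(chan struct{}, 1)
	go pushWorker(signals, 2*time.Second, func() error {
		log.Println("git push") // stand-in for the real push
		return nil
	})

	// Simulate a burst of PutObject commits; only one push should follow.
	for i := 0; i < 3; i++ {
		signals <- struct{}{}
		time.Sleep(200 * time.Millisecond)
	}
	time.Sleep(3 * time.Second)
}
```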
We’d likely need to refine this if we were regularly pushing massive amounts of data (GitHub has a 2GB push limit), for example by keeping track of file sizes and triggering pushes more deliberately. But for typical use cases, this timed approach worked beautifully.
Caveats
Be aware of the following caveats when using this implementation:
- Authentication Signature Versions: Due to conflicting S3 signature versions (v2 vs. v4) and the need for plaintext GitHub tokens for the API, this system doesn’t support dynamic tokens for multiple “accounts” the way a full S3 service might. Signature v4 only transmits a one-way signature derived from the secret, which is incompatible with our need for the actual token.
- Path-Style Access: Configuration requires “path-style” bucket access (e.g., `endpoint/bucketname/object`) rather than “virtual host-style” (e.g., `bucketname.endpoint/object`). An example client configuration is sketched after this list.
- Limited Testing: The current implementation has primarily been tested with the following:
| Tool       | Operations tested                                                  | Status |
| ---------- | ------------------------------------------------------------------ | ------ |
| rclone     | cp, sync, delete, mkdir, purge, ls, deletefile, touch, lsd, rmdir   | ✅     |
| aws s3     | mb, cp, ls, rm                                                      | ✅     |
| pocketbase | creating and restoring backups + deleting files                     | ✅     |
| pocketbase | using as file storage                                               | ❓     |
- Full Clones: Currently, each server cold start involves cloning the entire repository (when prompted). A potential optimisation could be to proxy `GetObject` directly to GitHub’s blob URI for read operations per file, though this is not yet implemented.
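Tying the signature and path-style caveats together, an rclone remote for such a server might be configured roughly as below. The remote name, endpoint, and the choice to put the GitHub token in `secret_access_key` are assumptions for illustration; check the project’s README for the exact settings.

```ini
# ~/.config/rclone/rclone.conf (placeholder values)
[gh-s3]
type = s3
provider = Other
# identity is effectively unused; assumption: the plaintext GitHub token goes in the secret
access_key_id = anything
secret_access_key = <github-token>
# path-style access and v2 signatures, per the caveats above
endpoint = http://localhost:8080
force_path_style = true
v2_auth = true
```

With that in place, something like `rclone ls gh-s3:my-bucket` should list the objects in the `my-bucket` repository, subject to the testing caveats above.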
Future Potentials
This approach opens up several interesting possibilities:
- Automated TTL Clean-up: Leverage GitHub Actions (or similar CI/CD on other platforms) to implement Time-To-Live (TTL) policies for objects, automatically cleaning up old data.
- Repository Snapshots: Use Git tags or releases to create “snapshots” of your buckets, effectively creating versioned tarballs of your data.
- CI/CD Integration: The sky’s the limit when you consider integrating with GitHub Actions or other CI/CD runners for automated workflows based on your stored data.
- Bring your own Git management UI: If you are used to working with TUIs or other Git clients, you might be tempted to host one through the server itself. One setup I can imagine is running lazygit through sshx, routed to an authorised endpoint, to view and manage existing buckets for debugging or active management purposes.
Conclusion
This has been a great learning experience. I’ve discovered the versatility and power of Go’s `time.Timer` for managing asynchronous operations. More significantly, it turns out to be surprisingly straightforward to implement a functional subset of the S3 API. Ultimately, this project demonstrates yet another innovative, if unconventional, way to leverage familiar tools like Git for cost-effective file storage.
Acknowledgements
- go-git: For providing a pure Go implementation of Git, which was fundamental to interacting with repositories programmatically.
- go-github: For the Go client library to interact with the GitHub API, essential for any GitHub-specific integrations.
- bkbinary: For his insightful video on storing data on YouTube (and, by extension, other platforms not designed for it), which served as a source of inspiration for exploring unconventional storage solutions.