Repository backups are a critical component of any robust disaster recovery strategy. However, as repositories grow in size, the process of creating reliable backups becomes increasingly challenging. Our own Rails repository was taking 48 hours to back up, forcing impossible choices between backup frequency and system performance. We wanted to tackle this issue for our customers and for our own users internally.

Ultimately, we traced the issue to a 15-year-old Git function with O(N²) complexity and fixed it with an algorithmic change, dramatically reducing backup times. The result: lower costs, reduced risk, and backup strategies that actually scale with your codebase.

This turned out to be a Git scalability issue that affects anyone with large repositories. Here's how we tracked it down and fixed it.

## Backup at scale

First, let's look at the problem. As organizations scale their repositories and backups grow more complex, here are some of the challenges they can face:

- Time-prohibitive backups: For very large repositories, creating a backup can take several hours, which makes it hard to schedule regular backups.
- Resource intensity: Extended backup processes can consume substantial server resources, potentially impacting other operations.
- Backup windows: Finding adequate maintenance windows for such lengthy processes can be difficult for teams running 24/7 operations.
- Increased failure risk: Long-running processes are more susceptible to interruptions from network issues, server restarts, and system errors, which can force teams to restart the entire backup from scratch.
- Race conditions: Because a backup takes so long to create, the repository may change significantly while it runs, potentially producing an invalid backup or aborting it because objects are no longer available.

These challenges can lead to compromising on backup frequency or completeness, an unacceptable trade-off when it comes to data protection. Extended backup windows can also force customers into workarounds: some adopt external tooling, while others reduce backup frequency, resulting in inconsistent data protection strategies across organizations.

Now, let's dig into how we identified the performance bottleneck, found a resolution, and deployed it to cut backup times.

## The technical challenge

GitLab's repository backup functionality relies on the git bundle create command, which captures a complete snapshot of a repository, including all objects and references such as branches and tags. The bundle serves as a restoration point for recreating the repository in its exact state.
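As a point of reference, a full-repository bundle can be created, verified, and restored with standard Git commands along these lines (the file and directory names here are only illustrative):

```sh
# Bundle every reference (all branches and tags) plus the objects they need
# into a single file that can be stored off-site.
git bundle create repo-backup.bundle --all

# Confirm the bundle is valid and self-contained before relying on it.
git bundle verify repo-backup.bundle

# Restore by cloning directly from the bundle file.
git clone repo-backup.bundle restored-repo
```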
However, the implementation of the command suffered from poor scalability with respect to reference count, creating a performance bottleneck. As repositories accumulated more references, processing time grew quadratically. In our largest repositories, which contain millions of references, backup operations could extend beyond 48 hours.

## Root cause analysis

To identify the root cause of this performance bottleneck, we analyzed a flame graph of the command during execution. A flame graph displays the execution path of a command through its stack trace. Each bar corresponds to a function in the code, with the bar's width indicating how much time the command spent executing within that particular function.

Examining the flame graph of git bundle create running on a repository with 10,000 references shows that approximately 80% of the execution time is consumed by the object_array_remove_duplicates() function. This function was introduced to Git in commit b2a6d1c686 (bundle: allow the same ref to be given more than once, 2009-01-17).

To understand this change, it's important to know that git bundle create lets users specify which references to include in the bundle. For complete repository bundles, the --all flag packages all references.

The commit addressed a problem where users providing duplicate references on the command line (for example, git bundle create main.bundle main main) would create a bundle without properly handling the duplicated main reference. Unbundling such a bundle in a Git repository would break, because it tries to write the same ref twice. The code added to avoid this duplication uses nested for loops that iterate through all references to identify duplicates. This O(N²) algorithm becomes a significant performance bottleneck in repositories with large reference counts, consuming substantial processing time.

## The fix: From O(N²) to efficient mapping

To resolve this performance issue, we contributed an upstream fix to Git that replaces the nested loops with a map data structure. Each reference is added to the map, which automatically ensures that only a single copy of each reference is retained for processing. A simplified sketch of this change appears after the benchmark results below.

This change dramatically improves the performance of git bundle create and enables much better scalability in repositories with large reference counts. Benchmark testing on a repository with 100,000 references demonstrates a 6x performance improvement:

```
Benchmark 1: bundle (refcount = 100000, revision = master)
  Time (mean ± σ):     14.653 s ±  0.203 s    [User: 13.940 s, System: 0.762 s]
  Range (min … max):   14.237 s … 14.920 s    10 runs

Benchmark 2: bundle (refcount = 100000, revision = HEAD)
  Time (mean ± σ):      2.394 s ±  0.023 s    [User: 1.684 s, System: 0.798 s]
  Range (min … max):    2.364 s …  2.425 s    10 runs

Summary
  bundle (refcount = 100000, revision = HEAD) ran
    6.12 ± 0.10 times faster than bundle (refcount = 100000, revision = master)
```

The patch was accepted and merged into upstream Git.
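To make the algorithmic difference concrete, here is a minimal, self-contained sketch in C. It is not Git's actual code, and the function and variable names are invented for illustration; it only contrasts the old nested-loop deduplication with a hash-set-based approach of the kind the fix introduces:

```c
/*
 * Illustration only -- not Git's actual code. Function and variable names
 * are invented. The point is the contrast between deduplicating references
 * with nested loops (O(N^2)) and with a hash set (roughly O(N)).
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Old shape: for every ref, scan every ref already kept. O(N^2). */
static size_t dedup_nested_loops(const char **refs, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        int duplicate = 0;
        for (size_t j = 0; j < kept; j++) {
            if (!strcmp(refs[i], refs[j])) {
                duplicate = 1;
                break;
            }
        }
        if (!duplicate)
            refs[kept++] = refs[i];
    }
    return kept;
}

/* New shape: remember each ref name in a hash set; one lookup per ref. */
#define TABLE_SIZE (1u << 20)  /* power of two, comfortably above the ref count */

static unsigned hash_str(const char *s)
{
    unsigned h = 2166136261u;  /* FNV-1a */
    for (; *s; s++)
        h = (h ^ (unsigned char)*s) * 16777619u;
    return h;
}

static size_t dedup_hash_set(const char **refs, size_t n)
{
    const char **table = calloc(TABLE_SIZE, sizeof(*table));
    size_t kept = 0;

    if (!table)
        return 0;
    for (size_t i = 0; i < n; i++) {
        unsigned slot = hash_str(refs[i]) & (TABLE_SIZE - 1);
        while (table[slot] && strcmp(table[slot], refs[i]))
            slot = (slot + 1) & (TABLE_SIZE - 1);  /* linear probing */
        if (!table[slot]) {                        /* first occurrence: keep it */
            table[slot] = refs[i];
            refs[kept++] = refs[i];
        }
    }
    free(table);
    return kept;
}

int main(void)
{
    const char *a[] = { "refs/heads/main", "refs/tags/v1.0",
                        "refs/heads/main", "refs/heads/feature-x" };
    const char *b[] = { "refs/heads/main", "refs/tags/v1.0",
                        "refs/heads/main", "refs/heads/feature-x" };
    size_t n = sizeof(a) / sizeof(a[0]);

    /* Both approaches keep the same 3 unique refs; only the cost differs. */
    printf("nested loops kept %zu refs\n", dedup_nested_loops(a, n));
    printf("hash set kept     %zu refs\n", dedup_hash_set(b, n));
    return 0;
}
```

For N references, the nested-loop version performs on the order of N² string comparisons, while the hash-set version does roughly constant work per reference; at millions of references, that difference is what separates hours from minutes.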
At GitLab, we backported this fix to ensure our customers could benefit immediately, without waiting for the next Git release.

## The result: Dramatically decreased backup times

The performance gains from this improvement have been nothing short of transformative:

- From 48 hours to 41 minutes: Creating a backup of our largest repository (gitlab-org/gitlab) now takes just 1.4% of the original time.
- Consistent performance: The improvement scales reliably across repository sizes.
- Resource efficiency: Server load during backup operations is significantly reduced.
- Broader applicability: While backup creation sees the most dramatic improvement, all bundle-based operations that involve many references benefit.

## What this means for GitLab customers

For GitLab customers, this enhancement delivers immediate, tangible benefits and changes how organizations can approach repository backup and disaster recovery planning.

### Transformed backup strategies

- Enterprise teams can establish comprehensive nightly backup schedules without impacting development workflows or requiring extensive backup windows.
- Backups can now run seamlessly in the background instead of requiring dedicated, lengthy maintenance windows.

### Enhanced business continuity

With backup times reduced from days to minutes, organizations can significantly lower their recovery point objective (RPO). This translates to reduced business risk: in a disaster scenario, you're potentially recovering hours of work instead of days.

### Reduced operational overhead

- Less server resource consumption and shorter maintenance windows.
- Shorter backup windows mean reduced compute costs, especially in cloud environments, where extended processing time translates directly into higher bills.

### Future-proofed infrastructure

- Growing repositories no longer force difficult choices between backup frequency and system performance.
- As your codebase expands, your backup strategy can scale seamlessly alongside it.

Organizations can now implement more robust backup strategies without compromising on performance or completeness. What was once a challenging trade-off has become a straightforward operational practice.

Starting with the GitLab 18.0 release, all GitLab customers, regardless of license tier, can take full advantage of these improvements for their backup strategy and execution. No configuration changes are required.

## What's next

This breakthrough is part of our ongoing commitment to scalable, enterprise-grade Git infrastructure. While reducing backup creation time from 48 hours to 41 minutes is a significant milestone, we continue to identify and address performance bottlenecks throughout our stack.

We're particularly proud that this enhancement was contributed upstream to the Git project, benefiting not just GitLab users but the broader Git community. This collaborative approach to development ensures that improvements are thoroughly reviewed, widely tested, and available to all.

Deep infrastructure work like this is how we approach performance at GitLab. Join the GitLab 18 virtual launch event to see what other fundamental improvements we're shipping. Register today!